Download datasets into HDFS and Hive
By Aida NGOM
Jul 31, 2020
Introduction
Nowadays, the analysis of large amounts of data is becoming more and more possible thanks to Big Data technologies (Hadoop, Spark,…). This explains the explosion of data volumes and the presence of many open data sources such as those listed in the Awesome Public Datasets repository hosted on GitHub. The availability of these datasets makes it possible to carry out multiple tests on different types of relevant data. For example, in order to perform a few comparison tests between file formats (CSV, JSON,…), we have loaded different types of datasets into Hadoop HDFS and Hive.
Through this article, we found it interesting to share the script we use to load datasets directly into them. It downloads the data efficiently into HDFS and illustrates the pre-processing required to expose it in Hive. For the sake of clarity, we only import one dataset, the New York City Taxi trips. Feel free to adjust the settings to fit your targeted dataset.
Loading data into HDFS
We created a bash script to load the data into HDFS. The New York Taxi dataset is composed of multiple files organised by date. Here’s an example URL: https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2010-01.csv.
For each URL entry present in the datasets_to_download file, the script downloads the content with curl and pipes it into the HDFS put command. This way, no record is stored on the host executing the command, preventing it from saturating. This is important because this host machine is often not sized to store large amounts of data. For example, when teaching, we often ask our students to connect to an edge node, which is a Linux container, and to download a dataset, triggering more than 40 downloads in parallel.
#!/bin/bash
username=$(whoami)
hostname="cluster_hostname_or_ip"
for line in $(cat ./datasets_to_download)
do
  # The filename is the name of the data, usually present at the end of the download link
  filename=$(awk -F"/" '{print $NF}' <<< "${line}")
  dir="/user/${username}/taxi_trip/staging"
  # Stream the download straight into HDFS on the cluster host
  sshpass -f <(printf '%s\n' your_password) ssh $username@$hostname \
    "hdfs dfs -put <(curl -sS $line) $dir/$filename"
done
NOTE: the shell script must be made executable by its user before running it, for example with chmod u+x scripts.sh.
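For reference, the datasets_to_download file simply contains one download URL per line. A minimal example, where the second URL is only shown to illustrate the naming pattern:
https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2010-01.csv
https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2010-02.csv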
Sometimes you may deal with a link which doesn’t provide the name of the data. For example, the University of Illinois provides a slightly enhanced version of the New York City trip data. It contains reduced information compared to the files provided by the original website. The download URLs look like:
https://public.boxcloud.com/d/1/b1!wgmpX-s_1eUYtZ1bbQNuWD-BOpAxq8NzBk0Vg-qGOdsU9GyqlKu_iLszh_UaZey6yA8bjcz7sYVjNH3JCrii-lb6VPcotyAnlsTZJHzLPl7zvvon6W2Hh-PolWX0Iz-ZCT4LZGvhpi0p-F5_cIbuyQ3ZjW7jtVR11DMrqs1QESdUKjeoaCyCFnkVv9QkASQfF0zQKhYCdhPX8_5a7-_plD3NnxQW4WNwqXJQv4OK-J-oFfkAjcRBNpxu3-zLtl4v9QafdaSEICun4K3FIlOovwmtfhRTlzN9mywnf7e46qNbqPO8Zjbrt3GgPxp6L5ZviSec6RcgaRgu2O940ZvT_7i4u95s3XOlf3sIib8Gw8Xw8cr0juDpYM228BCSZtdBqL1pO_CFjNeVz7Fwj1R0vKCuYUEf7pM9igDUAJNClysA-RaMdBqFKphphd0m_dM8Vm2g8PhoLCEgBX9l04Qzdgwyk1ckebpE2Nb64hqHYHuO0VLiRUZJEw6dCpufKgvv5iHf6YNTuFXNuNlJKdU2cQKMIamG94xW510FYWTari-EqfbKVapYF9hKOOldE0ex4tOnVhAPPhpzRKk-g1UfVCsOqzhV38__U3OuzSNJSdq9oq1_PMw4ZMle6AlBaGfyXfz3cBwgQLutW_dXIQaibouAdyJ7n-2sXhnDyd8UeEORwkYehv64-rMmj6mZ4FreJl_YZQ_bIc54aYkYQz2A3-VzAdg66KXD6sEIr2pMsrK2l3cT9MeY2DshgVLl0_zIoZlL2GT2grLna3VFT4iw67kIFQmviYa5Uxm6lzoFsdaVocO9zOUyMhR6k5E4hWkHpFflwT9XSBY6s_Es6YaOFuCJLODIs0hKUm9bGa2CvvexqNnUr_RAZVqkwL25z7BClrq-nPcscUAYIYbE9E1g3Qdw8g1pksh4UQ_-YAuIKo3xAIqzXt0cV6ao5l63ldksBxyCH9v4wtp4BZvRm_cnaT29s7ppDEXyArxOcPW5CbpVfz8L-8H5sXzXmuDPbRASChlX4QtHTX83BpGN1p-fcj_52Ob4UGPXzxdA6KR5CZ9uZFK4GACEKiZxxbQNppcL-l3zd_iaY7w45seYmCvx6SA44xOEDDzKNVnkA4g3LP47dR4Msbi-NMJHJFXeI-1gywM1krey_fKCmjKe_dKP6j2n7WeyYlyS1kBqpQYSRC8chUcgGKCz_Guc4YaZ7n79NZP6zUOIMbWQOGwxpvVeCduKim7f0VJlNV3kgi71xwQ./download
In that case, the name of the data for each link can be defined as follows, with an incremented index appended to the filename:
#!/bin/bash
username=$(whoami)
hostname="cluster_hostname_or_ip"
i=0
for line in $(cat ./datasets_to_download)
do
  let i=i+1
  filename="trip_data_${i}.zip"
  dir="/user/${username}/taxi_trip/staging"
  sshpass -f <(printf '%s\n' your_password) ssh $username@$hostname \
    "hdfs dfs -put <(curl -sS $line) $dir/$filename"
done
If the downloaded files are compressed with gzip, they can be decompressed on the fly before their HDFS upload with the gzip -d command:
unzip_filename="${filename%.*}.csv"
hdfs dfs -put <(curl -sSL $line | gzip -d) $dir/$unzip_filename
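Inside the SSH based loop shown earlier, the same command is simply wrapped in the remote call, reusing the variables defined above:
sshpass -f <(printf '%s\n' your_password) ssh $username@$hostname \
  "hdfs dfs -put <(curl -sSL $line | gzip -d) $dir/$unzip_filename"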
Data declaration in Hive
Declaring data in Hive implies creating a schema for it. In our case, we have downloaded CSV files where records are separated by \n (the Unix line separator) and fields are separated by , (comma). Each file corresponds to a year of data and they are all stored inside the same HDFS folder. We can take advantage of Hive internals, which associate a table with an HDFS directory, not an HDFS file, and consider all the files inside this directory as the whole dataset.
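Before declaring the table, you can check that all the yearly files ended up in this single staging directory, using the path from the scripts above:
hdfs dfs -ls /user/$USER/taxi_trip/staging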
The following query creates a Hive table which combines the taxi trip data from every year. Replace the LOCATION value with the HDFS path storing your downloaded files. The table is temporary: we will use it to create a more optimized table later and drop it once it is done.
CREATE EXTERNAL TABLE taxi_trip_staging
(
medallion Numeric, hack_license Numeric,
vendor_id String, rate_code Numeric,
store_and_fwd_flag Numeric, pickup_datetime string,
dropoff_datetime string, passenger_count Numeric,
trip_time_in_secs Numeric, trip_distance Numeric,
pickup_longitude Numeric, pickup_latitude Numeric,
dropoff_longitude Numeric, dropoff_latitude Numeric
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
STORED AS textfile
LOCATION '/user/${env:USER}/taxi_trip/staging'
TBLPROPERTIES("skip.header.line.count"="1");
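At this point, a quick sanity check helps to validate that the files are correctly parsed, for example:
SELECT * FROM taxi_trip_staging LIMIT 5;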
The dataset is now available from Hive. We will create a second table in order to take advantage of Hive partitioning. The data will be partitioned by year and by month. The first part of the query creates a new partitioned table and the second part loads it with the data from the previously created non-partitioned table called taxi_trip_staging.
CREATE EXTERNAL TABLE taxi_trip
(
medallion Numeric, hack_license Numeric,
vendor_id String, rate_code Numeric,
store_and_fwd_flag Numeric, pickup_datetime string,
dropoff_datetime string, passenger_count Numeric,
trip_time_in_secs Numeric, trip_distance Numeric,
pickup_longitude Numeric, pickup_latitude Numeric,
dropoff_longitude Numeric, dropoff_latitude Numeric
)
PARTITIONED BY (year string, month string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
STORED AS textfile
LOCATION '/user/${env:USER}/taxi_trip/gold';
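-- The INSERT below relies on fully dynamic partition values; depending on your
-- Hive configuration, dynamic partitioning may need to be enabled first
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;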
-- Load the data
INSERT OVERWRITE TABLE taxi_trip PARTITION(year, month)
SELECT
medallion,hack_license, vendor_id, rate_code, store_and_fwd_flag,
pickup_datetime, dropoff_datetime, passenger_count, trip_time_in_secs,
trip_distance, pickup_longitude, pickup_latitude, dropoff_longitude,
dropoff_latitude, year(pickup_datetime), month(pickup_datetime)
FROM taxi_trip_staging;
-- Remove the temporary table (not its data)
DROP TABLE taxi_trip_staging;
The data is now accessible through Hive and you can take advantage of the benefits it offers.
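For instance, a query filtering on the partition columns only reads the files of the matching year and month subdirectories instead of scanning the whole dataset (the values below are given for illustration):
SELECT count(*)
FROM taxi_trip
WHERE year = '2010' AND month = '1';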
Conclusion
I hope the little script we shared and its recommendations will help you load data efficiently into HDFS and Hive. Memory and disk usage remain low on the host running the command. Once the data is loaded inside Hive, you can benefit from its speed and ease of use for your data processing needs (data summarization, ad-hoc queries,…).