Merging multiple files in Hadoop
By David WORMS
Jan 12, 2013
- Categories
- Hack
- Tags
- File system
- Hadoop
- HDFS
This is a command I used to concatenate the files stored in Hadoop HDFS matching a globbing expression into a single file. It relies on the “getmerge” utility of hadoop fs but, contrary to a plain “getmerge”, the final merged file doesn’t end up on the local filesystem but inside HDFS.
Here’s what it looks like:
echo '' > /tmp/test; hadoop fs -getmerge /user/hdfs/source/**/* /tmp/test & cat /tmp/test | hadoop fs -put - /user/hdfs/merged; rm /tmp/test
Here’s what happens. We start by creating a temporary file, “/tmp/test”. We run the “getmerge” command and, at the same time, its generated content is piped into the Hadoop “put” command. Notice the ”-” just after “-put”, which tells Hadoop to read its content from stdin. Finally, we remove the temporary file.
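For readability, here is the same pipeline written as sequential steps. This is only a sketch: it drops the background execution of the one-liner and reuses the same illustrative paths.
# 1. Merge every HDFS file matching the glob into a local temporary file
hadoop fs -getmerge /user/hdfs/source/**/* /tmp/test
# 2. Stream the merged content back into HDFS; "-" makes "put" read from stdin
cat /tmp/test | hadoop fs -put - /user/hdfs/merged
# 3. Remove the local temporary file
rm /tmp/test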
You can check the result of your command by comparing the size of the source directory with that of the generated file:
hadoop fs -du -s /user/hdfs/source
hadoop fs -du -s /user/hdfs/merged
You could also use a “cat” implementation (see the sketch below), but the globbing was more restrictive in my tests. In both cases, this isn’t efficient: you are downloading the content locally and even temporarily storing it. You could avoid the storage part if you have HDFS mounted locally.
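As an illustration, a “cat”-based variant might look like the following. This is a sketch only; the glob pattern is an assumption and may need adjusting to your directory layout.
# Stream the matching files directly between HDFS commands, with no local temporary file
hadoop fs -cat /user/hdfs/source/*/* | hadoop fs -put - /user/hdfs/merged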
The latest versions of HDFS will ship with concat functionality, as documented in HDFS-222.