HDFS and Hive storage - comparing file formats and compression methods
By David WORMS
Mar 13, 2012
- Categories
- Big Data
- Tags
- Business intelligence
- Hive
- ORC
- Parquet
- File Format
A few days ago, we conducted a test to compare various Hive file formats and compression methods. Among those file formats, some are native to HDFS and apply to all Hadoop users. The test suite is composed of similar Hive queries which create a table, optionally set a compression type and load the same dataset into the new table. Across the queries, we tested the “sequence file”, “text file” and “RCFILE” formats with the “default”, “bz”, “gz”, “LZO” and “Snappy” compression codecs.
Update, April 4th 2012: answer to Huchev's comment about LZ4.
Setup
The environment is a 20-node, 120 TB Hadoop cluster running Cloudera CDH3U3. The original dataset is a 1.33 GB folder containing 80 compressed, unsplittable “bz2” files. The data inside is formatted as CSV, for a total of about 125,000,000 lines.
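The import queries below read from a staging table declared over this folder. A minimal sketch of such a declaration follows; the column types, field delimiter and HDFS location are assumptions, not our exact definition:
-- External table over the original bz2 CSV files
-- (Hadoop decompresses bz2 text files transparently when reading them)
CREATE EXTERNAL TABLE staging (
  client BIGINT, ctime BIGINT, mtime BIGINT,
  code STRING, value_1 INT, value_2 INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
-- the HDFS path below is a hypothetical placeholder
LOCATION '/user/hive/staging';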
Below is an example of a Hive query importing data into a table stored in the “RCFILE” format with “LZO” compression:
-- Prepare
CREATE TABLE rc_lzo (
  client BIGINT, ctime INT, mtime INT,
  code STRING, value_1 INT, value_2 INT
) STORED AS RCFILE;
-- Compression
SET hive.exec.compress.output=true;
SET mapred.output.compression.codec=com.hadoop.compression.lzo.LzoCodec;
-- Import
INSERT OVERWRITE TABLE rc_lzo
SELECT * FROM (
  SELECT
    client, round(ctime / 1000), round(mtime / 1000),
    code, value_1, value_2
  FROM staging
) T;
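The other queries follow the same pattern: only the STORED AS clause and the compression settings change. As an illustration, here is how the “sequence file” table with the “default” codec and “block” compression could be loaded; a sketch mirroring the query above, where the table name simply matches the query label used in the results (the “record” variant only changes the compression type setting):
-- Variant: sequence file, default codec, BLOCK compression
CREATE TABLE sf_df_block (
  client BIGINT, ctime INT, mtime INT,
  code STRING, value_1 INT, value_2 INT
) STORED AS SEQUENCEFILE;
SET hive.exec.compress.output=true;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.DefaultCodec;
-- BLOCK or RECORD; only meaningful for sequence files
SET mapred.output.compression.type=BLOCK;
INSERT OVERWRITE TABLE sf_df_block
SELECT * FROM (
  SELECT
    client, round(ctime / 1000), round(mtime / 1000),
    code, value_1, value_2
  FROM staging
) T;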
Results
The table below shows the results we obtained. The query column describes the type of test: the query name starts with the file format, followed by the compression codec. Both “block” and “record” compression types are reported for the “sequence file” format with the “default” compression codec.
The “serdesf” family of queries makes use of a custom SerDe which encodes each column in a shorter form when appropriate. In our use case, the code can be stored as 1 character (1 byte) and value_2 can be represented as the difference between itself and value_1 (2 bytes). Overall, a line is stored in 16 bytes compared to 65 bytes originally.
The “bss” query uses the BinarySortableSerDe serialization, a Hive SerDe that we associated with the “sequence file” format.
The “b64” family of queries uses the base64 package present in the Hive contrib project.
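To make these three setups concrete, here is a sketch of how such tables are bound to a SerDe or file format in Hive. The custom SerDe class name and the contrib jar path are hypothetical placeholders; the BinarySortableSerDe and base64 classes are the ones shipped with Hive and its contrib project:
-- serdesf: sequence file with our custom compact SerDe (class name is a placeholder)
CREATE TABLE serdesf (
  client BIGINT, ctime INT, mtime INT,
  code STRING, value_1 INT, value_2 INT
)
ROW FORMAT SERDE 'com.example.CompactSerDe'
STORED AS SEQUENCEFILE;

-- bss: sequence file serialized with Hive's BinarySortableSerDe
CREATE TABLE bss (
  client BIGINT, ctime INT, mtime INT,
  code STRING, value_1 INT, value_2 INT
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.binarysortable.BinarySortableSerDe'
STORED AS SEQUENCEFILE;

-- b64: text encoded through the base64 input/output formats from Hive contrib
-- the jar path below is a hypothetical placeholder
ADD JAR /path/to/hive-contrib.jar;
CREATE TABLE b64 (
  client BIGINT, ctime INT, mtime INT,
  code STRING, value_1 INT, value_2 INT
)
STORED AS
  INPUTFORMAT 'org.apache.hadoop.hive.contrib.fileformat.base64.Base64TextInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.contrib.fileformat.base64.Base64TextOutputFormat';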
| query | time | size |
|---|---|---|
| sf | 2min 3s | 7.91 GB |
| sf_string | 2min 22s | 8.72 GB |
| sf_df_block | 2min 17s | 8.72 GB |
| sf_df_record | 2min 12s | 7.32 GB |
| sf_bz | 2h 43min 24s | 9.9 GB |
| sf_gz | 2min 29s | 8.72 GB |
| sf_lzo | 2min 36s | 8.80 GB |
| sf_snappy | 3min 55s | 8.23 GB |
| tf | 1min 45s | 6.44 GB |
| tf_bz | 2min 14s | 1.12 GB |
| tf_df | 2min 16s | 1.12 GB |
| tf_gz | 48s | 1.34 GB |
| tf_lzo | 1min 28s | 2.41 GB |
| tf_snappy | 1min 2s | 2.55 GB |
| rc | 1min 30s | 5.78 GB |
| rc_df | 5min 15s | 917.68 MB |
| rc_gz | 4min 36s | 917.80 MB |
| rc_snappy | 52s | 1.85 GB |
| rc_lzo | 38s | 1.67 GB |
| serdesf | 59s | 3.63 GB |
| serdesf_df | 1min 27s | 4.61 GB |
| serdesf_bz | 3h 6min 9s | 9.63 GB |
| serdesf_gz | 1min 51s | 6.02 GB |
| serdesf_snappy | 1min 35s | 4.80 GB |
| bss | 1min 25s | 5.73 GB |
| b64 | 2min 5s | 9.17 GB |
| b64_bz | 21min 15s | 1.14 GB |
| b64_df | 21min 25s | 1.14 GB |
| b64_gz | 53s | 1.62 GB |
| b64_snappy | 59s | 2.89 GB |
Some notes
We wish we had run the tests on a larger input data set with a more common file format but cluster time is a scarce resource at the moment. As a result, the file sizes are probably representative but the speed results should be interpreted with care.
The speed reflects import time only (with rather unusual files as input), not how fast those formats perform in MapReduce jobs.
As part of the test, we also tried the “block” compression type with the other codecs, but it had no effect, so we concluded that the “block” type only applies to the default codec in sequence files.
We also ran the “serdesf” test in a similar mode but using the “RCFILE” format instead of “sequence file”; the results were identical to the “rc” family of queries.
Interpretation
The “tf” query acts as a reference since it stores our data as uncompressed CSV. It is somewhat surprising that all the “sequence file” queries generate larger files. This isn't something we expected; maybe it is due to our heavy use of the integer type.
In terms of file size, the “RCFILE” format with the “default” and “gz” compression codecs achieves the best results. The “base64” SerDe using “sequence file” with “bz” and “default” compression isn't far behind. However, those “base64” queries are slow enough to be ruled out.
In terms of speed, the “RCFILE” format with the “lzo” and “snappy” codecs is very fast while preserving a good compression ratio.
About LZ4
We didn't test LZ4. From our understanding of HADOOP-7657, LZ4 support targets Hadoop versions 0.23.1 and 0.24.0 and is not backported to our Cloudera CDH3U3 release. If you are curious about LZ4, here's an interesting article comparing LZ4 to Snappy. Also of interest is the in-memory benchmark of the fastest compressors.
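For readers on a Hadoop release that ships HADOOP-7657, selecting LZ4 should follow the same pattern as the other codecs; a sketch, assuming the Lz4Codec class introduced by that ticket (untested on our cluster):
-- Same import query as above, only the codec changes
SET hive.exec.compress.output=true;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.Lz4Codec;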