HDFS and Hive storage - comparing file formats and compression methods
By David WORMS
Mar 13, 2012
- Categories
- Big Data
- Tags
- Business intelligence
- Hive
- ORC
- Parquet
- File Format
A few days ago, we conducted a test to compare various Hive file formats and compression methods. Among those file formats, some are native to HDFS and apply to all Hadoop users. The test suite is composed of similar Hive queries which create a table, optionally set a compression type and load the same dataset into the new table. Across the queries, we tested the “sequence file”, “text file” and “RCFILE” formats with the “default”, “bz”, “gz”, “LZO” and “Snappy” compression codecs.
Update, April 4th 2012: answer to Huchev's comment about LZ4.
Setup
The environment is a 20-node, 120 TB Hadoop cluster running Cloudera CDH3U3. The original dataset is a 1.33 GB folder containing 80 compressed, unsplittable “bz2” files. The data inside is formatted as CSV, for a total of about 125,000,000 lines.
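The import queries below read from a staging table declared over this folder. A minimal sketch of such a declaration follows; the column types, field delimiter and HDFS location are assumptions, not our exact definition:
-- External table over the original bz2 CSV files
-- (Hadoop decompresses bz2 text files transparently when reading them)
CREATE EXTERNAL TABLE staging (
  client BIGINT, ctime BIGINT, mtime BIGINT,
  code STRING, value_1 INT, value_2 INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
-- the HDFS path below is a hypothetical placeholder
LOCATION '/user/hive/staging';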
Below is an example of a Hive query importing data into a table stored in the “RCFILE” format with “LZO” compression:
-- Prepare
CREATE TABLE rc_lzo (
  client BIGINT, ctime INT, mtime INT,
  code STRING, value_1 INT, value_2 INT
) STORED AS RCFILE;
-- Compression
SET hive.exec.compress.output=true;
SET mapred.output.compression.codec=com.hadoop.compression.lzo.LzoCodec;
-- Import
INSERT OVERWRITE TABLE rc_lzo
SELECT * FROM (
  SELECT
    client, round(ctime / 1000), round(mtime / 1000),
    code, value_1, value_2
  FROM staging
) T;
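The other queries follow the same pattern: only the STORED AS clause and the compression settings change. As an illustration, here is how the “sequence file” table with the “default” codec and “block” compression could be loaded; a sketch mirroring the query above, where the table name simply matches the query label used in the results (the “record” variant only changes the compression type setting):
-- Variant: sequence file, default codec, BLOCK compression
CREATE TABLE sf_df_block (
  client BIGINT, ctime INT, mtime INT,
  code STRING, value_1 INT, value_2 INT
) STORED AS SEQUENCEFILE;
SET hive.exec.compress.output=true;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.DefaultCodec;
-- BLOCK or RECORD; only meaningful for sequence files
SET mapred.output.compression.type=BLOCK;
INSERT OVERWRITE TABLE sf_df_block
SELECT * FROM (
  SELECT
    client, round(ctime / 1000), round(mtime / 1000),
    code, value_1, value_2
  FROM staging
) T;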
Results
The table below shows the results we obtained. The query column describes the type of test: the query name starts with the file format, followed by the compression codec. Both “block” and “record” compression types are reported for the “sequence file” format with the “default” compression codec.
The “serdesf” family of queries makes use of a custom SerDe which encodes each column in a shorter form when appropriate. In our use case, the code can be stored as 1 character (1 byte) and value_2 can be represented as the difference between itself and value_1 (2 bytes). Overall, a line is stored in 16 bytes compared to 65 bytes originally.
The “bss” query uses the BinarySortableSerDe serialization, a Hive SerDe that we associated with the “sequence file” format.
The “b64” family of queries uses the base64 package present in the Hive contrib project.
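To make these three setups concrete, here is a sketch of how such tables are bound to a SerDe or file format in Hive. The custom SerDe class name and the contrib jar path are hypothetical placeholders; the BinarySortableSerDe and base64 classes are the ones shipped with Hive and its contrib project:
-- serdesf: sequence file with our custom compact SerDe (class name is a placeholder)
CREATE TABLE serdesf (
  client BIGINT, ctime INT, mtime INT,
  code STRING, value_1 INT, value_2 INT
)
ROW FORMAT SERDE 'com.example.CompactSerDe'
STORED AS SEQUENCEFILE;

-- bss: sequence file serialized with Hive's BinarySortableSerDe
CREATE TABLE bss (
  client BIGINT, ctime INT, mtime INT,
  code STRING, value_1 INT, value_2 INT
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.binarysortable.BinarySortableSerDe'
STORED AS SEQUENCEFILE;

-- b64: text encoded through the base64 input/output formats from Hive contrib
-- the jar path below is a hypothetical placeholder
ADD JAR /path/to/hive-contrib.jar;
CREATE TABLE b64 (
  client BIGINT, ctime INT, mtime INT,
  code STRING, value_1 INT, value_2 INT
)
STORED AS
  INPUTFORMAT 'org.apache.hadoop.hive.contrib.fileformat.base64.Base64TextInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.contrib.fileformat.base64.Base64TextOutputFormat';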
| query | time | size |
|---|---|---|
| sf | 2min 3s | 7.91 GB |
| sf_string | 2min 22s | 8.72 GB |
| sf_df_block | 2min 17s | 8.72 GB |
| sf_df_record | 2min 12s | 7.32 GB |
| sf_bz | 2h 43min 24s | 9.9 GB |
| sf_gz | 2min 29s | 8.72 GB |
| sf_lzo | 2min 36s | 8.80 GB |
| sf_snappy | 3min 55s | 8.23 GB |
| tf | 1min 45s | 6.44 GB |
| tf_bz | 2min 14s | 1.12 GB |
| tf_df | 2min 16s | 1.12 GB |
| tf_gz | 48s | 1.34 GB |
| tf_lzo | 1min 28s | 2.41 GB |
| tf_snappy | 1min 2s | 2.55 GB |
| rc | 1min 30s | 5.78 GB |
| rc_df | 5min 15s | 917.68 MB |
| rc_gz | 4min 36s | 917.80 MB |
| rc_snappy | 52s | 1.85 GB |
| rc_lzo | 38s | 1.67 GB |
| serdesf | 59s | 3.63 GB |
| serdesf_df | 1min 27s | 4.61 GB |
| serdesf_bz | 3h 6min 9s | 9.63 GB |
| serdesf_gz | 1min 51s | 6.02 GB |
| serdesf_snappy | 1min 35s | 4.80 GB |
| bss | 1min 25s | 5.73 GB |
| b64 | 2min 5s | 9.17 GB |
| b64_bz | 21min 15s | 1.14 GB |
| b64_df | 21min 25s | 1.14 GB |
| b64_gz | 53s | 1.62 GB |
| b64_snappy | 59s | 2.89 GB |
Some notes
We wish we had run the tests on a larger input data set with a more common file format but cluster time is a scarce resource at the moment. As a result, the file sizes are probably representative but the speed results should be interpreted with care.
The speed reflects import time only (with rather unusual files as input), not how fast those formats perform in MapReduce jobs.
As part of the test, we also tried the “block” compression type with the other codecs, but it had no effect, so we concluded that the “block” type only applies to the default codec in sequence files.
We also ran the “serdesf” test in a similar mode but using the “RCFILE” format instead of “sequence file”; the results were identical to the “rc” family of queries.
Interpretation
The “tf” query acts as a reference since it stores our data as uncompressed CSV. It is somewhat surprising that all the “sequence file” queries generate larger files. This isn't something we expected; maybe it is due to our heavy use of the integer type.
In terms of file size, the “RCFILE” format with the “default” and “gz” compression codecs achieves the best results. The “base64” SerDe using “sequence file” with “bz” and “default” compression isn't far behind. However, those “base64” queries are slow enough to be ruled out.
In terms of speed, the “RCFILE” format with the “lzo” and “snappy” codecs is very fast while preserving a good compression ratio.
About LZ4
We didn't test LZ4. From our understanding of HADOOP-7657, LZ4 support targets Hadoop versions 0.23.1 and 0.24.0 and is not backported to our Cloudera CDH3U3 release. If you are curious about LZ4, here's an interesting article comparing LZ4 to Snappy. Also of interest is the in-memory benchmark of the fastest compressors.
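For readers on a Hadoop release that ships HADOOP-7657, selecting LZ4 should follow the same pattern as the other codecs; a sketch, assuming the Lz4Codec class introduced by that ticket (untested on our cluster):
-- Same import query as above, only the codec changes
SET hive.exec.compress.output=true;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.Lz4Codec;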