Adaltas

HDFS and Hive storage - comparing file formats and compression methods

A few days ago, we have conducted a test in order to compare various Hive file formats and compression methods. Among those file formats, some are native to HDFS and apply to all Hadoop users. The test suite is composed of similar Hive queries which create a table, eventually set a compression type and load the same dataset into the new table. Among all the queries, we tested the “sequence file”, “text file” and “RCFILE” formats and the “default”, “bz”, “gz”, “LZO” and “Snappy” compression codecs.

April 4th 2012: Answer to Huchev comment relative to LZ4.

Setup

The environment is a 20 nodes and 120 terabytes Hadoop cluster running Cloudera CDH3U3. The original dataset is a 1.33 Go folder with 80 compressed and unsplitable “bz2” files inside. The data inside is formatted as CSV for a total of about 125,000,000 lines.

Below is an example of a Hive Query importing data using a “RCFILE” format with “LZO” compression:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
-- Prepare
CREATE TABLE rc_lzo (
    client BIGINT, ctime INT, mtime INT,
    code STRING, value_1 INT, value_2 INT
) STORED AS RCFILE;
-- Compression
SET hive.exec.compress.output=true;
SET mapred.output.compression.codec=com.hadoop.compression.lzo.LzoCodec;
-- Import
INSERT OVERWRITE TABLE rc_lzo
SELECT * FROM (
    SELECT
        client, round(ctime / 1000), round(mtime / 1000),
        code, value_1, value_2
    FROM staging
) T;

Results

The table below are the results we obtained. The query columns describe the type of test. The query name starts with the file format followed by the compression codec. Both “block” and “record” compression types have been reported for the “sequence file” format and the “default” compression codec.

The “serdesf” family of queries makes use of a custom SerDe which encode each column in an shorter size when appropriate. In our use case, the code can be stored as 1 character (1 byte), the value_2 can be represented as the diff between itself and the value_1 (2 bytes). Overall, a line is stored as 16 bytes compared to 65 bytes originally.

The “bss” query uses the BinarySortableSerDe serialization which is a custom Hive Serde that we associated to the “sequence file” format.

The “b64” family of queries use the base64 package present in the Hive contrib project.

query time size
sf 2mn 3s 7.91 Go
sf_string 2mn 22s 8.72 Go
sf_df_block 2mn 17s 8.72 Go
sf_df_record 2mn 12s 7.32 Go
sf_bz 2h 43mn 24s 9.9 Go
sf_gz 2mn 29s 8.72 Go
sf_lzo 2mn 36s 8.80 Go
sf_snappy 3mn 55s 8.23 Go
tf 1mn 45s 6.44 Go
tf_bz 2mn 14s 1.12 Go
tf_df 2mn 16s 1.12 Go
tf_gz 48s 1.34 Go
tf_lzo 1mn 28s 2.41 Go
tf_snappy 1mn 2s 2.55 Go
rc 1mn 30s 5.78 Go
rc_df 5mn 15s 917.68 Mo
rc_gz 4mn 36s 917.80 Mo
rc_snappy 52s 1.85 Go
rc_lzo 38s 1.67 Go
serdesf 59s 3.63 Go
serdesf_df 1mn 27s 4.61 Go
serdesf_bz 3h 6mn 9s 9.63 Go
serdesf_gz 1mn 51s 6.02 Go
serdesf_snappy 1mn 35s 4.80 Go
bss 1mn 25s 5.73 Go
b64 2mn 5s 9.17 Go
b64_bz 21mn 15s 1.14 Go
b64_df 21mn 25s 1.14 Go
b64_gz 53s 1.62 Go
b64_snappy 59s 2.89 Go

Some notes

We wish we had run the tests on a larger input data set with a more common file format but cluster time is a scarce resource at the moment. As a result, the file sizes are probably representative but the speed results should be interpreted with care.

The speed reflects importing time only (with not so standard files as input) and not how fast those formats are against map/reduce jobs.

As part of the test, we also tested block type of compression on other type of codecs but they had no effect so we came to the conclusion that block type only apply to the default codec in sequence files.

We tried to run the serdesf test in a similar mode but using the RCFILE format instead of sequence file but the results are identical to the rc family of queries.

Interpretation

The “tf” query act as a reference since it stores our data in an uncompressed CSV format. It is kind of awkward to see that all the “sequence file” queries generate larger file size. This isn’t something we expected but maybe is it due to our high usage of the integer type.

In term of file size, the “RCFILE” format with the “default” and “gz” compression achieve the best results. The “base64” SerDe using “sequence file” with “bz” and “default” compression aren’t so far away. However, those “base64” results are slow enough to be bypassed.

In term of speed, the “RCFILE” formats with the “lzo” and “snappy” are very fast while preserving a high compression rate.

About LZ4

We didn’t test LZ4. From our understanding of HADOOP-7657, the support for LZ4 target Hadoop version 0.23.1, 0.24.0 and is not backported to our running Cloudera CDH3U3. If you are curious about LZ4, here’s an interesing article comparing LZ4 to Snappy. Also worth of interest is the in-memory benchmark of the fastest compressors.

Comments