Posted to user@hive.apache.org by Matthias Scherer <ma...@1und1.de> on 2014/12/02 14:16:28 UTC

Merge of compressed RCFile leads to uneven file sizes

Hi All,

I am trying to merge gzip-compressed RCFile output into a single file per partition. The Hive version is 0.10, and these are the settings I use:

SET hive.exec.compress.intermediate=true;
SET mapred.compress.map.output=true;
SET mapred.map.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;

SET hive.exec.compress.output=true;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
SET mapred.output.compression.type=BLOCK;

SET hive.merge.mapfiles=true;
SET hive.merge.mapredfiles=true;
SET hive.merge.size.per.task=256000000;
SET hive.merge.smallfiles.avgsize=256000000;
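
To rule out that one of the merge settings is overridden somewhere else, the effective values can be printed in the same session before the INSERT runs. This is only a minimal sketch, assuming the Hive CLI, where SET with a key and no value echoes the current value of that key:

```
-- Print the effective merge settings (a key without "=value" only displays it)
SET hive.merge.mapfiles;
SET hive.merge.mapredfiles;
SET hive.merge.size.per.task;
SET hive.merge.smallfiles.avgsize;
```

If any of these prints a value different from the one set above, the merge job is not running with the intended configuration.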

After adding another partition with "INSERT OVERWRITE TABLE ... PARTITION (...) SELECT ...", the output of the Hive job (one MapReduce job plus one map-only merge job) looks like this:

000000_0             file         8.15 MB
000001_0             file         7.88 MB
000002_0             file         5.2 MB
...
000013_0             file         700.56 KB
000014_0             file         574.59 KB

Why is the largest file more than 10 times bigger than the smallest? Why are the files sorted by file size in descending order? And why is the result not a single file?

I also tested the same table and statement with STORED AS SEQUENCEFILE, and the result was a single output file.

Regards
Matthias