Posted to user@hive.apache.org by Matthias Scherer <ma...@1und1.de> on 2014/12/02 14:16:28 UTC
Merge of compressed RCFile leads to uneven file sizes
Hi All,
I am trying to merge gzip-compressed RCFile output into a single file per partition. The Hive version is 0.10, and these are the settings I use:
SET hive.exec.compress.intermediate=true;
SET mapred.compress.map.output=true;
SET mapred.map.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
SET hive.exec.compress.output=true;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
SET mapred.output.compression.type=BLOCK;
SET hive.merge.mapfiles=true;
SET hive.merge.mapredfiles=true;
SET hive.merge.size.per.task=256000000;
SET hive.merge.smallfiles.avgsize=256000000;
After adding another partition with "INSERT OVERWRITE TABLE ... PARTITION (...) SELECT ...", the output of the Hive job (1 MapReduce job + 1 map-only merge job) looks like this:
000000_0 file 8.15 MB
000001_0 file 7.88 MB
000002_0 file 5.2 MB
...
000013_0 file 700.56 KB
000014_0 file 574.59 KB
Why is the largest file more than 10 times bigger than the smallest? Why are the files sorted by file size in descending order? And why is the result not one single file?
I also tested the same table and statement with STORED AS SEQUENCEFILE, and the result was one single output file.
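To make the numbers concrete, here is a small Python sketch of the arithmetic behind the first question. It uses only the five file sizes listed above (the elided files are left out) and assumes binary units, i.e. 1 MB = 1024 KB; the variable names are just for illustration.

```python
# Sizes of the five files shown in the listing above; the elided ones are omitted.
# Assumption: the listing uses binary units (1 MB = 1024 KB = 1024 * 1024 bytes).
KB = 1024
MB = 1024 * KB

listed_sizes = {
    "000000_0": 8.15 * MB,
    "000001_0": 7.88 * MB,
    "000002_0": 5.2 * MB,
    "000013_0": 700.56 * KB,
    "000014_0": 574.59 * KB,
}

# Value used for both hive.merge.size.per.task and hive.merge.smallfiles.avgsize.
merge_target = 256_000_000

largest = max(listed_sizes.values())
smallest = min(listed_sizes.values())
ratio = largest / smallest

# True if every listed file is smaller than the configured merge target.
all_below_target = all(size < merge_target for size in listed_sizes.values())

print(f"largest / smallest: {ratio:.1f}")
print(f"all listed files below merge target: {all_below_target}")
```

The ratio comes out at roughly 14.5, and every listed file is far below the 256000000-byte target, which is why I expected the merge job to combine them.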
Regards
Matthias