Posted to user@hive.apache.org by Edward Capriolo <ed...@gmail.com> on 2012/01/11 21:07:26 UTC

Does anyone know if io.seqfile.compress.blocksize does anything?

Hadoop 0.20.2, Hive 0.7.X

I got convinced that installing google-snappy would be awesome, so I spent
the day it took to build and patch Snappy in. As it turned out, I did not
get good compression from Snappy: roughly 30% smaller, versus 50% smaller
from gzip. But that is another story.

I decided to start playing with:

set io.seqfile.compress.blocksize=10000000;

All the tuning blogs on the internet suggest it. (They also commonly
misname variables; http://code.google.com/p/hadoop-snappy/, for example,
writes "compression" where the actual key says "compress".)
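
For what it is worth, the variable should be real on the Hadoop side.
Going from memory of the 0.20 SequenceFile source (so treat the exact
line as approximate), the block-compressed writer picks it up straight
from the conf, with a default of about 1MB:

// inside o.a.h.io.SequenceFile.BlockCompressWriter (0.20.x), roughly:
compressionBlockSize = conf.getInt("io.seqfile.compress.blocksize", 1000000);

Once the buffered records pass that threshold it is supposed to flush
them out as one compressed block behind a sync marker.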

Check this out.
set io.seqfile.compression.type=BLOCK;
set mapred.compress.map.output=true;
set mapred.map.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
set hive.exec.compress.output=true;
set mapred.output.compress=true;
set mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
set io.seqfile.compress.blocksize=10000000;
create table act_seq_snappy_10mbblock stored as sequencefile as select *
from fracture_act where hit_date=20120106 and mid>001400 and mid<001420;
set io.seqfile.compress.blocksize=20000000;
create table act_seq_snappy_20mbblock stored as sequencefile as select *
from fracture_act where hit_date=20120106 and mid>001400 and mid<001420;
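
(A sanity check that the setting at least took inside the session:
running "set io.seqfile.compress.blocksize;" with no value should echo
the current setting back, and the job.xml link in the JobTracker web UI
should show what conf the MapReduce job really received.)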


hive> dfs -count
hdfs://rs01.hadoop.pvt:34310/user/hive/warehouse/act_seq_snappy_20mbblock
> ;
1 2 414559506
hdfs://rs01.hadoop.pvt:34310/user/hive/warehouse/act_seq_snappy_20mbblock
hive> dfs -count
hdfs://rs01.hadoop.pvt:34310/user/hive/warehouse/act_seq_snappy_10mbblock
> ;
1 2 414559506
hdfs://rs01.hadoop.pvt:34310/user/hive/warehouse/act_seq_snappy_10mbblock
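
(For anyone not used to dfs -count output: the columns are directory
count, file count, and content size in bytes. So both tables came out to
2 files totaling exactly 414,559,506 bytes.)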

How can it be that choosing two different block sizes results in exactly
the same file size?
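
Before pointing fingers, here is roughly how I would check what is
actually inside those files. This is just a sketch, written from the
0.20 javadocs and not tested, and the part-file path is made up; it
opens one output file, reports whether it is really block-compressed,
and estimates the block count by counting sync markers (in a
block-compressed file a sync precedes every block):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.util.ReflectionUtils;

public class SeqFileInspect {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Hypothetical path -- point this at a real part file.
    Path p = new Path("/user/hive/warehouse/act_seq_snappy_10mbblock/000000_0");
    FileSystem fs = p.getFileSystem(conf);
    SequenceFile.Reader reader = new SequenceFile.Reader(fs, p, conf);
    System.out.println("codec:            " + reader.getCompressionCodec());
    System.out.println("block compressed: " + reader.isBlockCompressed());

    Writable key = (Writable) ReflectionUtils.newInstance(reader.getKeyClass(), conf);
    Writable val = (Writable) ReflectionUtils.newInstance(reader.getValueClass(), conf);
    long syncs = 0;
    while (reader.next(key, val)) {
      if (reader.syncSeen()) syncs++;  // one sync per compressed block
    }
    System.out.println("approx blocks:    " + syncs);
    reader.close();
  }
}

If the 10MB and 20MB tables show the same block count, the setting never
made it to the writer.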

I also tried:
set io.seqfile.compression.type=BLOCK;
set mapred.output.compression.type=BLOCK;
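
(As far as I can tell, io.seqfile.compression.type is the key the
SequenceFile class itself reads, while mapred.output.compression.type is
the one SequenceFileOutputFormat reads, so setting both should cover it
either way.)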

I also tried gzip instead of Snappy.

Has anyone ever actually seen io.seqfile.compress.blocksize work? My
first idea was that Hive is swallowing it somehow and not passing it
along to Hadoop, but after reading all the "performance blogs" that talk
about it, I am mildly convinced the variable does nothing, since those
blogs rarely even get the variable names right.
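
The next thing I may try is cutting Hive out entirely and writing the
same kind of file straight through the SequenceFile API with two
different block sizes. Again a sketch, untested, with made-up paths and
dummy records:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.util.ReflectionUtils;

public class BlocksizeTest {
  static long write(int blocksize, Path p) throws Exception {
    Configuration conf = new Configuration();
    conf.setInt("io.seqfile.compress.blocksize", blocksize);
    FileSystem fs = p.getFileSystem(conf);
    SequenceFile.Writer w = SequenceFile.createWriter(fs, conf, p,
        Text.class, Text.class, SequenceFile.CompressionType.BLOCK,
        ReflectionUtils.newInstance(GzipCodec.class, conf));
    Text key = new Text();
    Text val = new Text();
    for (int i = 0; i < 2000000; i++) {  // dummy records
      key.set("key-" + i);
      val.set("some moderately repetitive value text " + i);
      w.append(key, val);
    }
    w.close();
    return fs.getFileStatus(p).getLen();
  }

  public static void main(String[] args) throws Exception {
    System.out.println("10MB: " + write(10000000, new Path("/tmp/bs10.seq")));
    System.out.println("20MB: " + write(20000000, new Path("/tmp/bs20.seq")));
  }
}

If those two files come out different sizes, the knob works and Hive is
eating it somewhere; if they come out identical, the performance blogs
are just cargo-culting a dead variable.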

Edward