Posted to dev@hive.apache.org by Benyi Wang <be...@gmail.com> on 2012/06/18 20:46:49 UTC

hive.merge properties with RCFile

I am trying to use the Hive merge options to merge small files into larger
files with the following query. It works well except that I cannot control
the output file size: the output files are always 256MB regardless of the
hive.merge.size.per.task and hive.merge.smallfiles.avgsize settings below.
I also tried 56MB for hive.merge.size.per.task, and the size is still 256MB.

"omniture_hit" is an uncompressed CSV file format hive table. I want to
convert it into RCFile format. The problem is that there will a lot of
small RCFiles created which are much smaller than our default block size
128M if I just simple select * and insert into the new table.

Another problem: I want to change hive.io.rcfile.record.size to 8MB to see
whether it improves the compression ratio for my data, but the result is
about the same as with 4MB. The data pattern could simply be such that it
makes no difference, as the RCFile paper suggests. But how can I verify
that my 8MB setting actually took effect?
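(For reference, the Hive CLI echoes a property's current value when SET is
issued with just the property name, which at least confirms the client-side
setting, though not what a running job actually used:)

```
set hive.io.rcfile.record.size;
```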

Thanks.

Ben

SET hive.exec.compress.output=true;
SET hive.exec.compress.intermediate=true;

set hive.merge.size.per.task=28*1024*1024;
set hive.merge.smallfiles.avgsize=100000000;
set mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;

SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
SET hive.exec.max.dynamic.partitions.pernode=10000;
SET hive.exec.max.dynamic.partitions=10000;
SET hive.exec.max.created.files=150000;

create table omniture_hit_rc like omniture_hit;

insert overwrite table omniture_hit_rc partition (local_dt) select *
from omniture_hit where local_dt>='2012-06-01';
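To check the actual file sizes the insert produced, one option is to list
the partition directory under the warehouse path (the path below assumes
the Hive default warehouse location, which may differ on your cluster):

```shell
# List the RCFiles written for one partition, with human-readable sizes.
hadoop fs -du -h /user/hive/warehouse/omniture_hit_rc/local_dt=2012-06-01
```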

Re: hive.merge properties with RCFile

Posted by Edward Capriolo <ed...@gmail.com>.
This will not work.

set hive.merge.size.per.task=28*1024*1024;

It has to be a number.
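Hive's SET stores the value as a literal string and does not evaluate
arithmetic, so "28*1024*1024" is not parsed and the default merge size is
used instead. The byte counts have to be computed outside Hive; a quick
sketch (plain Python, not part of the thread) of the literal values to
paste into the script:

```python
# Hive SET values must be literal numbers of bytes; "28*1024*1024" is
# not evaluated. Print the literal byte counts for a few common sizes.
for mb in (28, 56, 100, 256):
    print(f"set hive.merge.size.per.task={mb * 1024 * 1024};  -- {mb} MB")
```

So for 28MB the line should read: set hive.merge.size.per.task=29360128;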

On Mon, Jun 18, 2012 at 2:46 PM, Benyi Wang <be...@gmail.com> wrote:
> I am trying to use the Hive merge options to merge small files into larger
> files with the following query. It works well except that I cannot control
> the output file size: the output files are always 256MB regardless of the
> hive.merge.size.per.task and hive.merge.smallfiles.avgsize settings below.
> I also tried 56MB for hive.merge.size.per.task, and the size is still 256MB.
>
> "omniture_hit" is an uncompressed CSV-format Hive table that I want to
> convert to RCFile format. The problem is that if I simply SELECT * and
> insert into the new table, a lot of small RCFiles are created, much smaller
> than our default block size of 128MB.
>
> Another problem: I want to change hive.io.rcfile.record.size to 8MB to see
> whether it improves the compression ratio for my data, but the result is
> about the same as with 4MB. The data pattern could simply be such that it
> makes no difference, as the RCFile paper suggests. But how can I verify
> that my 8MB setting actually took effect?
>
> Thanks.
>
> Ben
>
> SET hive.exec.compress.output=true;
> SET hive.exec.compress.intermediate=true;
>
> set hive.merge.size.per.task=28*1024*1024;
> set hive.merge.smallfiles.avgsize=100000000;
> set mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
>
> SET hive.exec.dynamic.partition=true;
> SET hive.exec.dynamic.partition.mode=nonstrict;
> SET hive.exec.max.dynamic.partitions.pernode=10000;
> SET hive.exec.max.dynamic.partitions=10000;
> SET hive.exec.max.created.files=150000;
>
> create table omniture_hit_rc like omniture_hit;
>
> insert overwrite table omniture_hit_rc partition (local_dt) select *
> from omniture_hit where local_dt>='2012-06-01';