Posted to user@hive.apache.org by Mohit Gupta <su...@gmail.com> on 2011/11/30 08:18:21 UTC

CombineHiveInputFormat and Merge files not working for compressed text files

Hi All,
I am using Hive 0.7 on Amazon EMR. I need to merge a large number of small
files into a few larger files (essentially collapsing several partitions of
a table into one). Running the obvious query (an INSERT into a new
partition, selecting from all the old partitions) generates a large number
of small files in the new partition: it runs as a map-only job, so the
number of output files equals the number of mappers.
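
The query in question has roughly this shape (a sketch only: the table,
column, and partition names below are placeholders, not taken from the
original post):

```sql
-- Illustrative merge query: names are placeholders.
-- Reading many small-file partitions and writing one target partition
-- runs as a map-only job, so output files = number of mappers.
INSERT OVERWRITE TABLE events PARTITION (dt = '2011-11')
SELECT col1, col2, col3
FROM events
WHERE dt BETWEEN '2011-11-01' AND '2011-11-29';
```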

Note: The table being processed here is stored compressed on S3, with the
following settings:
set hive.exec.compress.output = true;
set mapred.output.compression.codec = org.apache.hadoop.io.compress.GzipCodec;
set io.seqfile.compression.type = BLOCK;

I found a couple of solutions on the net, but sadly neither of them works for me:
1. Merging small files
I set the following parameters:
set hive.merge.mapfiles=true;
set hive.merge.size.per.task=256000000;
set hive.merge.smallfiles.avgsize=100000000;
set hive.merge.mapredfiles=true;
set hive.merge.smallfiles.avgsize=1000000000;
set hive.merge.size.smallfiles.avgsize=1000000000;

Ideally, there should have been a follow-up merge job after the map-only
job to consolidate the small output files into a small number of larger
files. But I could see no such job.

2. Using CombineHiveInputFormat
Parameters Set:
set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
set mapred.min.split.size.per.node=1000000000;
set mapred.min.split.size.per.rack=1000000000;
set mapred.max.split.size=1000000000;

Ideally, the number of mappers created here should have been considerably
smaller than the number of input files, producing a correspondingly small
number of output files (one per mapper). Instead, I saw as many mappers as
input files.

------
Specifics:
Approx. size of each small file: 125 KB
Number of small files: >6,000
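
For scale, the figures above imply the whole input is well under a
gigabyte, so a single 1 GB split could in principle cover everything (a
rough back-of-the-envelope check using the file count and average size
quoted above):

```python
import math

# Rough sizing check: ~6,000 files of ~125 KB each.
num_files = 6000
avg_file_kb = 125

total_mb = num_files * avg_file_kb / 1024  # total input size in MB
print(f"total input ~= {total_mb:.0f} MB")  # ~= 732 MB

# With mapred.max.split.size = 1000000000 bytes (~953 MB), a combine
# input format could in principle pack all the files into one split.
max_split_mb = 1_000_000_000 / (1024 * 1024)
expected_splits = math.ceil(total_mb / max_split_mb)
print(f"expected combined splits ~= {expected_splits}")  # 1
```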

I found a couple of links saying that this merge behavior did not work for
compressed files but has since been fixed.
Any ideas on how I can fix this?

Thanks in Advance.

-- 
Best Regards,

Mohit Gupta
Software Engineer at Vdopia Inc.

Re: CombineHiveInputFormat and Merge files not working for compressed text files

Posted by Igor Tatarinov <ig...@decide.com>.
I might be wrong, but I think EMR inserts a reduce job when writing data
into S3. At least in my case, I am able to create a single output file with:

SET mapred.reduce.tasks = 1;
INSERT OVERWRITE TABLE price_history_s3
...

without using any combined input format. The number of mappers _is_
determined by the number of input files. But I think you can't use a
combined input format with Gzip files.

Perhaps you could run a separate query for each partition?
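
A per-partition pass along those lines might look like this (a sketch
only: the table, column, and partition names are placeholders, and the
DISTRIBUTE BY clause is my addition, not from this thread — it is one
common way to force a shuffle so the single reducer writes one file):

```sql
-- Sketch only: names are placeholders; one such INSERT per source partition.
SET mapred.reduce.tasks = 1;
INSERT OVERWRITE TABLE price_history_s3 PARTITION (dt = '2011-11-01')
SELECT col1, col2
FROM price_history
WHERE dt = '2011-11-01'
DISTRIBUTE BY col1;  -- forces a reduce phase, so output files = reducers
```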

igor
decide.com

