Posted to user@hive.apache.org by Mohit Gupta <su...@gmail.com> on 2011/11/30 08:18:21 UTC
CombineHiveInputFormat and Merge files not working for compressed text files
Hi All,
I am using Hive 0.7 on Amazon EMR. I need to merge a large number of small
files into a few larger files (basically merging a number of partitions of
a table into one). On running the obvious query, i.e. insert into a new
partition select * from all partitions, a large number of small files are
generated in the new partition (a map-only job, with the number of output
files equal to the number of mappers).
Note: the table being processed here is stored in compressed format on S3.
set hive.exec.compress.output = true;
set mapred.output.compression.codec =
org.apache.hadoop.io.compress.GzipCodec;
set io.seqfile.compression.type = BLOCK;
I found a couple of solutions on the net, but sadly neither of them works for me:
1. Merging small files
I set the following parameters:
set hive.merge.mapfiles=true;
set hive.merge.mapredfiles=true;
set hive.merge.size.per.task=256000000;
set hive.merge.smallfiles.avgsize=1000000000;
Ideally, there should have been a second merge job after the map-only job to
combine the small output files into a small number of files, but I could see
no such job.
2. Using CombineHiveInputFormat
Parameters Set:
set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
set mapred.min.split.size.per.node=1000000000;
set mapred.min.split.size.per.rack=1000000000;
set mapred.max.split.size=1000000000;
Ideally, the number of mappers created should have been considerably less
than the number of input files, thereby producing a small number of output
files (one per mapper). But I found the same number of mappers as input
files.
------
Specifics:
Approximate size of each small file: 125 KB
Number of small files: >6,000
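As a rough sanity check on what CombineHiveInputFormat should have done with these numbers (a back-of-the-envelope sketch in Python, using the approximate figures above):

```python
# Back-of-the-envelope: with a combining input format, the number of splits
# (and hence mappers) should be roughly total_input / max_split_size.
import math

num_files = 6000            # approximate number of small files
file_size = 125 * 1024      # ~125 KB each, in bytes
max_split = 1_000_000_000   # mapred.max.split.size from the settings above

total_input = num_files * file_size               # ~750 MB in total
expected_mappers = math.ceil(total_input / max_split)

print(total_input)        # 768000000
print(expected_mappers)   # 1 -- versus the ~6,000 mappers actually observed
```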
I found a couple of links saying that this merging did not work for
compressed files, but that it has since been fixed.
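Since the files are Gzip-compressed text, one workaround outside Hive's merge machinery (a sketch worth testing on a copy of the data first, not something verified on EMR here) is to concatenate the .gz files byte-for-byte: the gzip format permits multiple members in one stream, so the concatenation decompresses to the concatenation of the originals. A minimal Python illustration:

```python
# Sketch: concatenating gzip files yields a valid multi-member gzip stream,
# equivalent to `cat a.gz b.gz > merged.gz` at the shell.
import gzip

part1 = gzip.compress(b"row1\nrow2\n")   # stand-ins for two small .gz files
part2 = gzip.compress(b"row3\n")

merged = part1 + part2                   # byte-level concatenation

# gzip.decompress handles multi-member streams and returns all the data.
assert gzip.decompress(merged) == b"row1\nrow2\nrow3\n"
```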
Any ideas how I can fix this?
Thanks in advance.
--
Best Regards,
Mohit Gupta
Software Engineer at Vdopia Inc.
Re: CombineHiveInputFormat and Merge files not working for compressed text files
Posted by Igor Tatarinov <ig...@decide.com>.
I might be wrong, but I think EMR inserts a reduce job when writing data
into S3. At least in my case, I am able to create a single output file with:
SET mapred.reduce.tasks = 1;
INSERT OVERWRITE TABLE price_history_s3
...
without using any combined input format. The number of mappers _is_ determined
by the number of input files. But I think you can't use a combined input
format with Gzip files.
Perhaps you could run a separate query for each partition?
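A related knob for controlling the output file count is to force a reduce stage explicitly, so the number of output files equals the number of reducers. A hedged sketch (table and column names are made up, and this is not verified on EMR/Hive 0.7):

```
-- Hypothetical names: force a shuffle so output files = mapred.reduce.tasks.
SET mapred.reduce.tasks = 4;

INSERT OVERWRITE TABLE merged_table PARTITION (dt='2011-11-30')
SELECT col1, col2
FROM small_files_table
DISTRIBUTE BY col1;   -- DISTRIBUTE BY adds a shuffle, i.e. a reduce phase
```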
igor
decide.com
On Tue, Nov 29, 2011 at 11:18 PM, Mohit Gupta <success.mohit.gupta@gmail.com> wrote:
> [quoted text trimmed]