Posted to common-user@hadoop.apache.org by Harshit Sharan <hs...@gmail.com> on 2016/04/02 10:13:32 UTC

Reduce number of Hadoop mappers for large number of GZ files

Hi,

I have a use case where I have 3072 gz files over which I am building a
Hive table. Whenever I run a query over this table, it spawns 3072
mappers and takes around 44 minutes to complete. Earlier, the same data
(i.e. the same total size) was spread across 384 files, and the same
queries took only around 9 minutes.

I searched the web and found that the number of mappers is determined by
the number of "splits" of the input data. Hence, setting the parameters

  mapreduce.input.fileinputformat.split.minsize
  mapreduce.input.fileinputformat.split.maxsize

to a high value like 64 MB should cause each mapper to take up 64 MB
worth of data, even if that means processing multiple files in the same
mapper.
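
For concreteness, the settings could be applied per query like this
(the 64 MB value and table name are placeholders):

  # 64 MB = 67108864 bytes; "my_table" is just an example
  hive -e "
    SET mapreduce.input.fileinputformat.split.minsize=67108864;
    SET mapreduce.input.fileinputformat.split.maxsize=67108864;
    SELECT count(*) FROM my_table;
  "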

But this solution doesn't work in my case, since gz is a
"non-splittable" format: the files cannot be split across mappers or
joined to be processed by a single mapper.

Has anyone faced this problem too?

There can be various solutions to this, like uncompressing the gz files
and then using the above params to end up with fewer mappers, or using
higher-end EC2 instances to reduce processing time. But is there an
inherent solution in Hadoop/Hive/EMR to tackle this?
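
Along those lines, one variant I have been considering is packing the gz
files into fewer, larger ones without decompressing, since concatenated
gzip members form a single valid gzip stream. An untested sketch (paths
and the grouping pattern are placeholders):

  # Pack groups of gz files into fewer, larger archives without
  # decompressing; adjust the prefixes to get the file count you want.
  hadoop fs -mkdir -p /warehouse/my_table_merged
  for prefix in 00 01 02 03; do
    hadoop fs -cat "/warehouse/my_table/part-${prefix}*.gz" \
      | hadoop fs -put - "/warehouse/my_table_merged/part-${prefix}.gz"
  done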

Thanks in advance for any help!
-- 
Regards,
Harshit Sharan
Software Development Engineer