Posted to user@hive.apache.org by Ana Gillan <an...@gmail.com> on 2014/08/13 12:57:41 UTC

Optimising Map and Reduce numbers for Compressed Dataset

Hi,

I am currently experimenting with Hive on a large dataset and I am having
trouble optimising the jobs.

The dataset consists of a large number of fairly small gzipped files, but I
am carrying out a series of transformations on it, which means that the size
of the data processed by the mappers and reducers is significantly larger
than the input data. As a result, only a very small number of mappers and
reducers is launched, and every job takes a very long time to finish.

Hive is using CombineFileInputFormat, and I have also set
hive.hadoop.supports.splittable.combineinputformat to true because the files
are compressed. What other settings should I be looking at? Is there a way
to specify the number of map tasks?
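For concreteness, this is roughly what my session setup looks like. The
split-size and per-reducer values below are placeholders I have been
guessing at, not settings I am confident are right for this workload:

```sql
-- What I have set so far:
-- allow the combine input format to be used with compressed files
SET hive.hadoop.supports.splittable.combineinputformat=true;

-- Knobs I have been experimenting with (values are guesses on my part).
-- Maximum bytes per combined split; smaller values should, I believe,
-- produce more mappers:
SET mapreduce.input.fileinputformat.split.maxsize=67108864;   -- 64 MB?

-- Bytes of input per reducer; lowering it should give more reducers:
SET hive.exec.reducers.bytes.per.reducer=268435456;           -- 256 MB?

-- Left at -1 so Hive estimates the reducer count itself:
SET mapreduce.job.reduces=-1;
```

Since each individual gzip file cannot be split, my understanding is that
these settings mainly control how files are grouped into splits rather than
how a single file is divided, but I may be wrong about that.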

Thanks,
Ana