Posted to user@hive.apache.org by KingDavies <ki...@gmail.com> on 2014/02/03 19:20:02 UTC

Optimising mappers for number of nodes

Our platform has a 40GB raw data file that was LZO-compressed (12GB
compressed) to reduce network I/O to and from S3.
Without an index the file is unsplittable, resulting in a single map task
and poor cluster utilisation.
After indexing the file so that it is splittable, the Hive query produces
120 map tasks.
However, with the 120 tasks distributed over a small 4-node cluster, it
takes longer to process the data than when it wasn't splittable and the
processing was done by a single node (1h20min vs 17min). This was with a
fairly simple select-from-where query, without distinct, group by, or
order by.
I'd like to utilise all the nodes in the cluster to reduce query time.
What's the best way to have the data crunched in parallel, but with fewer
mappers?
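
For illustration, the query was of this shape (the table and column names
here are placeholders, not our real schema):

    SELECT col_a, col_b
    FROM events_lzo               -- hypothetical table over the indexed LZO file
    WHERE col_c = 'some_value';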

Re: Optimising mappers for number of nodes

Posted by Lefty Leverenz <le...@gmail.com>.
Actually that's mapred.max.split.size.  Hive doesn't have a configuration
parameter named "hive.max.split.size".
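
For example, from the Hive CLI, a larger target split size yields fewer
map tasks (the 512MB value below is only illustrative; tune it to your
cluster):

    -- Split size is given in bytes; larger splits => fewer mappers.
    SET mapred.max.split.size=536870912;   -- 512MB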

-- Lefty


On Mon, Feb 3, 2014 at 10:59 AM, Prasanth Jayachandran <
pjayachandran@hortonworks.com> wrote:

> Hi
>
> hive.max.split.size can be tuned to decrease the number of mappers.
> Reference: http://www.slideshare.net/ye.mikez/hive-tuning (slide number
> 38)
>
> Also, using CombineHiveInputFormat (the default input format) will combine
> multiple small files into larger splits, and hence produce fewer mappers.
>
> Thanks
> Prasanth Jayachandran

Re: Optimising mappers for number of nodes

Posted by Prasanth Jayachandran <pj...@hortonworks.com>.
Hi 

hive.max.split.size can be tuned to decrease the number of mappers. Reference: http://www.slideshare.net/ye.mikez/hive-tuning (slide number 38)

Also, using CombineHiveInputFormat (the default input format) will combine multiple small files into larger splits, and hence produce fewer mappers.
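
A minimal sketch of those settings from the Hive CLI (note the split-size
parameter is actually mapred.max.split.size, per the correction above; the
byte values are illustrative):

    -- CombineHiveInputFormat is already the default input format:
    SET hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
    -- Upper bound on the data covered by one combined split (256MB here):
    SET mapred.max.split.size=268435456;
    -- Lower bound, so the combiner doesn't emit tiny splits (128MB here):
    SET mapred.min.split.size=134217728;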

Thanks
Prasanth Jayachandran

