Posted to user@pig.apache.org by Olaf Collider <ol...@gmail.com> on 2016/05/11 17:17:38 UTC

should I set a different number of mappers?

I am using a Cloudera Hadoop cluster with only 4 nodes, but a lot of disk
space (*200TB*).

In my Pig script, I load several monthly files that are each about *200GB*
in size.

I load my data like this

data = LOAD 'mypath/data_2015*'
          USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad');

Then I FILTER the data, which removes roughly 80% of it.
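
For concreteness, the filter looks roughly like the sketch below (the field
name and value are made-up placeholders; the real condition is more involved):

-- with '-nestedLoad', each record is a single map field ($0); the field
-- name and value here are placeholders for the real condition
filtered = FILTER data BY (chararray)($0#'event_type') == 'purchase';
-- the later GROUP/aggregate that triggers the reduce phase, and the final
-- STORE, are omitted here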

I noticed that if I load about one year of data in my Pig script, Pig
creates about 15k mappers and the whole process takes approximately 3 hours
(including the reduce step).

If instead I load 2 years of data, Pig creates about 30k mappers and
basically all the nodes become unhealthy after processing for more than 15
hours.

Am I hitting some kind of bottleneck here? Or are there some default
options I should play with?

Many thanks!

Re: should I set a different number of mappers?

Posted by Rohini Palaniswamy <ro...@gmail.com>.
It does not change my answer.

You have not mentioned how many CPU cores you have, though. One thing you
should watch out for when you have beefy nodes is disk storms. Please
monitor your CPU and disk utilization; they might require some special
tuning. For example, if you have 48 cores and only 12 disks, the disks will
become a bottleneck if there is a lot of IO happening in parallel. We had to
switch to the deadline IO scheduler instead of the default cfq (
http://stackoverflow.com/questions/9338378/what-is-the-difference-between-cfq-deadline-and-noop)
for better IO performance. vm.dirty_background_ratio/dirty_background_bytes
and vm.dirty_ratio/dirty_bytes (
https://lonesysadmin.net/2013/12/22/better-linux-disk-caching-performance-vm-dirty_ratio/)
are other settings you might want to look at if you are seeing long pauses
because too much data is being cached in memory (even at 20%, with 384G of
RAM that is about 76G, which is a lot) and then flushed to disk when the
cache fills up. In our setups we had to lower these values.
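
As a very rough sketch only (the device name and the numbers below are
illustrative examples, not recommendations for your hardware):

# check and switch the IO scheduler for a data disk (sdb is an example
# device; repeat for each data disk)
cat /sys/block/sdb/queue/scheduler
echo deadline > /sys/block/sdb/queue/scheduler

# lower the dirty-page thresholds so less data accumulates in the page
# cache before being flushed; the percentages are illustrative
sysctl -w vm.dirty_background_ratio=5
sysctl -w vm.dirty_ratio=10
# (add the vm.* lines to /etc/sysctl.conf to make them persistent)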

Regards,
Rohini

On Fri, May 27, 2016 at 3:45 PM, Olaf Collider <ol...@gmail.com>
wrote:

> Hello Rohini
>
> Super helpful, thanks!
> I was able to get the exact characteristics of my cluster. Here they are:
>
> Block size 128MB, 300TB of raw data storage (100TB if you account for
> replication) and each of the 4 nodes has 384GB RAM
>
> Does that change your answer?
>
> Thanks again!!
>
> On 27 May 2016 at 17:09, Rohini Palaniswamy <ro...@gmail.com>
> wrote:
> > 15K mappers on a 4-node system will definitely crash it unless you have
> > tuned YARN (RM, NM) well. That many mappers reading data off a few disks
> > in parallel can create a disk storm, and disk can also turn out to be your
> > bottleneck. Pig creates 1 map per 128MB of data (the default value of
> > pig.maxCombinedSplitSize), so 15K mappers means you are reading about
> > 1.9 TB of data. Based on the memory capacity you have, you can reduce the
> > number of mappers. For example, if you have 44G of heap per node for
> > tasks (assuming 48G RAM with some memory taken by the node manager, data
> > node, etc.) and you are running mappers with a 1G heap
> > (mapreduce.map.java.opts) and 1.5G containers (mapreduce.map.memory.mb),
> > you can run about 117 containers in parallel (44G x 4 nodes / 1.5G per
> > container is about 117).
> >
> > set pig.maxCombinedSplitSize 10737418240
> >
> > The above setting will make each map process 10G of data, which will
> > create about 190 maps, and you should be able to run without bringing
> > down your cluster.
> >
> > Regards,
> > Rohini
> >
> > On Wed, May 11, 2016 at 10:17 AM, Olaf Collider <olaf.collider@gmail.com
> >
> > wrote:
> >
> >> I am using a Cloudera Hadoop cluster with only 4 nodes, but a lot of
> >> disk space (*200TB*).
> >>
> >> In my Pig script, I load several monthly files that are each about
> >> *200GB* in size.
> >>
> >> I load my data like this
> >>
> >> data = LOAD 'mypath/data_2015*'
> >>           USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad');
> >>
> >> Then I FILTER the data, which removes roughly 80% of it.
> >>
> >> I noticed that if I load about one year of data in my Pig script, Pig
> >> creates about 15k mappers and the whole process takes approximately 3
> >> hours (including the reduce step).
> >>
> >> If instead I load 2 years of data, Pig creates about 30k mappers and
> >> basically all the nodes become unhealthy after processing for more than
> >> 15 hours.
> >>
> >> Am I hitting some kind of bottleneck here? Or are there some default
> >> options I should play with?
> >>
> >> Many thanks!
> >>
>

Re: should I set a different number of mappers?

Posted by Olaf Collider <ol...@gmail.com>.
Hello Rohini

Super helpful, thanks!
I was able to get the exact characteristics of my cluster. Here they are:

Block size 128MB, 300TB of raw data storage (100TB if you account for
replication) and each of the 4 nodes has 384GB RAM

Does that change your answer?

Thanks again!!

On 27 May 2016 at 17:09, Rohini Palaniswamy <ro...@gmail.com> wrote:
> 15K mappers on a 4-node system will definitely crash it unless you have
> tuned YARN (RM, NM) well. That many mappers reading data off a few disks in
> parallel can create a disk storm, and disk can also turn out to be your
> bottleneck. Pig creates 1 map per 128MB of data (the default value of
> pig.maxCombinedSplitSize), so 15K mappers means you are reading about 1.9 TB
> of data. Based on the memory capacity you have, you can reduce the number of
> mappers. For example, if you have 44G of heap per node for tasks (assuming
> 48G RAM with some memory taken by the node manager, data node, etc.) and you
> are running mappers with a 1G heap (mapreduce.map.java.opts) and 1.5G
> containers (mapreduce.map.memory.mb), you can run about 117 containers in
> parallel (44G x 4 nodes / 1.5G per container is about 117).
>
> set pig.maxCombinedSplitSize 10737418240
>
> The above setting will make each map process 10G of data, which will create
> about 190 maps, and you should be able to run without bringing down your
> cluster.
>
> Regards,
> Rohini
>
> On Wed, May 11, 2016 at 10:17 AM, Olaf Collider <ol...@gmail.com>
> wrote:
>
>> I am using a Cloudera Hadoop cluster with only 4 nodes, but a lot of disk
>> space (*200TB*).
>>
>> In my Pig script, I load several monthly files that are each about *200GB*
>> in size.
>>
>> I load my data like this
>>
>> data = LOAD 'mypath/data_2015*'
>>           USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad');
>>
>> Then I FILTER the data, which removes roughly 80% of it.
>>
>> I noticed that if I load about one year of data in my Pig script, Pig
>> creates about 15k mappers and the whole process takes approximately 3 hours
>> (including the reduce step).
>>
>> If instead I load 2 years of data, Pig creates about 30k mappers and
>> basically all the nodes become unhealthy after processing for more than 15
>> hours.
>>
>> Am I hitting some kind of bottleneck here? Or are there some default
>> options I should play with?
>>
>> Many thanks!
>>

Re: should I set a different number of mappers?

Posted by Rohini Palaniswamy <ro...@gmail.com>.
15K mappers on a 4-node system will definitely crash it unless you have
tuned YARN (RM, NM) well. That many mappers reading data off a few disks in
parallel can create a disk storm, and disk can also turn out to be your
bottleneck. Pig creates 1 map per 128MB of data (the default value of
pig.maxCombinedSplitSize), so 15K mappers means you are reading about 1.9 TB
of data. Based on the memory capacity you have, you can reduce the number of
mappers. For example, if you have 44G of heap per node for tasks (assuming
48G RAM with some memory taken by the node manager, data node, etc.) and you
are running mappers with a 1G heap (mapreduce.map.java.opts) and 1.5G
containers (mapreduce.map.memory.mb), you can run about 117 containers in
parallel (44G x 4 nodes / 1.5G per container is about 117).

set pig.maxCombinedSplitSize 10737418240

The above setting will make each map process 10G of data, which will create
about 190 maps, and you should be able to run without bringing down your
cluster.
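
To make the arithmetic concrete, a sketch of how this would sit at the top of
your script (all numbers are the ones already discussed in this thread):

-- 10737418240 bytes = 10 x 1024^3 = 10G per map
set pig.maxCombinedSplitSize 10737418240;

data = LOAD 'mypath/data_2015*'
          USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad');
-- roughly (15K maps x 128M) / 10G per map = about 190 maps instead of 15K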

Regards,
Rohini

On Wed, May 11, 2016 at 10:17 AM, Olaf Collider <ol...@gmail.com>
wrote:

> I am using a Cloudera Hadoop cluster with only 4 nodes, but a lot of disk
> space (*200TB*).
>
> In my Pig script, I load several monthly files that are each about *200GB*
> in size.
>
> I load my data like this
>
> data = LOAD 'mypath/data_2015*'
>           USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad');
>
> Then I FILTER the data, which removes roughly 80% of it.
>
> I noticed that if I load about one year of data in my Pig script, Pig
> creates about 15k mappers and the whole process takes approximately 3 hours
> (including the reduce step).
>
> If instead I load 2 years of data, Pig creates about 30k mappers and
> basically all the nodes become unhealthy after processing for more than 15
> hours.
>
> Am I hitting some kind of bottleneck here? Or are there some default
> options I should play with?
>
> Many thanks!
>