Posted to user@pig.apache.org by Panshul Whisper <ou...@gmail.com> on 2013/03/06 15:28:58 UTC

too many memory spills

Hello,

I have a 9 GB file containing approximately 109.5 million records.
I execute a Pig script on this file that does the following (a minimal
sketch of the script is shown below):
1. Group by a field of the file
2. Count the number of records in every group
3. Store the result in a CSV file using the normal PigStorage(",")
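
For reference, here is a minimal sketch of that kind of script. The relation,
field, and path names are hypothetical, chosen only for illustration:

  -- load the input file (path and schema are assumed for illustration)
  records = LOAD '/data/input/records.csv' USING PigStorage(',')
            AS (user_id:chararray, event:chararray, ts:long);

  -- group by the field of interest and count the records in each group
  grouped = GROUP records BY user_id;
  counts  = FOREACH grouped GENERATE group AS user_id, COUNT(records) AS cnt;

  -- store the result as CSV
  STORE counts INTO '/data/output/counts' USING PigStorage(',');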

The job completes successfully, but the job details show a lot of memory
spills. *Out of 109.5 million records, approximately 48 million records are
spilled.*

I am executing it on a *4-node cluster, each node with a dual-core processor
and 4 GB of RAM*.

How can I minimize the number of record spills? Execution becomes very slow
once the spilling starts.

Any suggestions are welcome.

Thanking You,

-- 
Regards,
Ouch Whisper
010101010101

Re: too many memory spills

Posted by Norbert Burger <no...@gmail.com>.
I thought Todd Lipcon's Hadoop Summit presentation [1] had some good info
on this topic.

[1] http://www.slideshare.net/cloudera/mr-perf

Norbert

Re: too many memory spills

Posted by Prashant Kommireddi <pr...@gmail.com>.
You can do a few things here:

   1. Increase mapred.child.java.opts to a higher number (the default is
   200 MB). You will have to do this while making sure that
   (# of MR slots per node x mapred.child.java.opts) + 387 MB < 4 GB.
   You may want to stay under 3.5 GB, depending on what else is running on
   those nodes.
   2. Increase "mapred.job.shuffle.input.buffer.percent" so that more heap
   is available for the shuffle.
   3. Set mapred.inmem.merge.threshold to 0
   and mapred.job.reduce.input.buffer.percent to 0.8.

You will have to play around with these to see what works for your needs.
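
For example (illustrative values only, not recommendations), these properties
can be set at the top of the Pig script itself with Pig's SET command; they
can equally go into mapred-site.xml or be passed with -D on the command line:

  -- with an assumed 4 MR slots per node, 4 x 768 MB + 387 MB for the
  -- task tracker is roughly 3.4 GB, staying under the 3.5 GB target above
  set mapred.child.java.opts '-Xmx768m';

  -- give the shuffle a larger share of the reducer heap
  set mapred.job.shuffle.input.buffer.percent '0.80';

  -- keep merged map outputs in memory on the reduce side
  set mapred.inmem.merge.threshold '0';
  set mapred.job.reduce.input.buffer.percent '0.8';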

You can additionally refer to "Hadoop: The Definitive Guide" for tips on
config tuning.

Re: too many memory spills

Posted by Panshul Whisper <ou...@gmail.com>.
Hello Prashant,

I have a CDH installation, and by default the memory allocated to each task
tracker is 387 MB.
And yes, these spills are happening on both the map and the reduce side.

I still have not solved this problem...

Suggestions are welcome.

Thanking You,

Regards,


-- 
Regards,
Ouch Whisper
010101010101

Re: too many memory spills

Posted by Prashant Kommireddi <pr...@gmail.com>.
Are these spills happening on the map or the reduce side? What is the memory
allocated to each TaskTracker?
