Posted to user@pig.apache.org by Kelvin Moss <km...@yahoo.com> on 2010/05/07 10:55:26 UTC

Java heap issue



Hi all,
 
I have a Pig script that has been working completely fine. It calls a memory-intensive UDF that pulls about 600 MB of data into each mapper, yet I have still been able to process and write the results. My mapper memory is 4096 MB and my HDFS block size is 128 MB.
 
My input dataset (on a given date) is big enough to spawn about 960 mappers.
 
A = load 'input data set' ..;
B = load 'smaller data set' ..;
C = join A by key, B by key using 'replicated';
D = foreach C generate field1, MyUDF(field2) as field2;
store D into 'deleteme';
 
As you can see, this is a map-only job. My output is some 960 part files, each around 25-35 MB.
 
I run this processing for each day. I now have a requirement to merge the results of the above processing with the results from another date and store only the unique records.
 
I added the following lines:
F = load 'previous date data' ..;
G = union D, F;
H = distinct G parallel $X;
store H into 'deleteme_H';
 
When I add these steps to my process, I get "Java heap space" errors in the map phase. I even made F an empty data set, but I still get the same error. I don't understand why I am getting "Java heap" errors here. Is the solution simply to increase the mapper memory even further?
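 
If increasing it further is indeed the answer, I assume it would be done with something like the line below in the script (I am guessing at the old-style Hadoop property name here, and the -Xmx value is only illustrative):
 
set mapred.child.java.opts '-Xmx6144m';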
 
Thanks!

Re: Java heap issue

Posted by Kelvin Moss <km...@yahoo.com>.
Hi Daniel,
 
You're correct that the distinct statement is causing the issue: if I comment out the distinct, the script runs fine. However, I ran the script with the -Dpig.exec.nocombiner=true option and still got the "Java heap space" error in the mapper. Any idea why?
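
For reference, I passed the flag before the script name, roughly like this (the script name here is just a placeholder):

pig -Dpig.exec.nocombiner=true daily_merge.pig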

Thanks!

--- On Fri, 5/7/10, Daniel Dai <ji...@yahoo-inc.com> wrote:


From: Daniel Dai <ji...@yahoo-inc.com>
Subject: Re: Java heap issue
To: "pig-user@hadoop.apache.org" <pi...@hadoop.apache.org>
Date: Friday, May 7, 2010, 10:58 PM


I suspect it is because of the distinct combiner. Try the option -Dpig.exec.nocombiner=true on the command line and see if it works.

Daniel






Re: Java heap issue

Posted by Daniel Dai <ji...@yahoo-inc.com>.
I suspect it is because of the distinct combiner. Try the option -Dpig.exec.nocombiner=true on the command line and see if it works.
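
If the command line is not convenient, the same property can also be set from inside the script; a minimal sketch, assuming your Pig version honors it via SET:

set pig.exec.nocombiner 'true';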

Daniel
