Posted to user@pig.apache.org by Aniket Mokashi <am...@andrew.cmu.edu> on 2011/02/24 04:49:55 UTC

UDF problem: Java Heap space

I've written a simple UDF that parses a chararray (which looks like
...[a].....[b]...[a]...) to capture the tokens inside the brackets and return
them as a String such as a=2;b=1; and so on. The input chararrays are rarely
more than 1000 characters and never more than 100000 (I've added log.warn in
my UDF to verify this). But I still see a Java heap space error while running
this UDF (even in local mode, the job simply fails). My assumption is that
the maps and lists I use locally will be reclaimed by the GC. Am I missing
something?

Thanks,
Aniket
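
For readers who have not written a Pig UDF before, the kind of function being
described would look roughly like the sketch below. The class name, the exact
output format, and every other detail here are illustrative assumptions, not
the actual code discussed in this thread.

    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;

    import org.apache.pig.EvalFunc;
    import org.apache.pig.data.Tuple;

    // Illustrative sketch only -- not the UDF from this thread.
    public class BracketCounts extends EvalFunc<String> {
        @Override
        public String exec(Tuple input) throws IOException {
            if (input == null || input.size() == 0 || input.get(0) == null) {
                return null;
            }
            String log = (String) input.get(0);
            Map<String, Integer> counts = new HashMap<String, Integer>();
            int start = log.indexOf('[');
            while (start >= 0) {
                int end = log.indexOf(']', start + 1);
                if (end < 0) {
                    break;                             // unmatched '[': stop scanning
                }
                String key = log.substring(start + 1, end);
                Integer c = counts.get(key);
                counts.put(key, c == null ? 1 : c + 1);
                start = log.indexOf('[', end + 1);     // always advance past the last match
            }
            StringBuilder out = new StringBuilder();
            for (Map.Entry<String, Integer> e : counts.entrySet()) {
                out.append(e.getKey()).append('=').append(e.getValue()).append(';');
            }
            return out.toString();
        }
    }

In a script such a UDF would be registered and invoked roughly as
REGISTER myudfs.jar; b = FOREACH a GENERATE BracketCounts(line);
(the jar and field names are again hypothetical).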


Re: UDF problem: Java Heap space

Posted by Aniket Mokashi <am...@andrew.cmu.edu>.
Thanks everyone for helping me out. I figured out it was one of those logic
errors that lead to an infinite loop. I was relying on indexOf returning -1 to
end the loop, but because of the way the indices are reused the search never
advances past a certain point (e.g. indexOf('[', 187) keeps returning 187), so
the loop runs forever. (I should have thought about this.)
Thanks again,
Aniket
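
For reference, a guarded version of the loop quoted below might look like the
sketch that follows; the key changes are treating -1 as "not found" for both
searches and always advancing the start position past the previous match.
This is only an assumption about the fix, not the change that was actually
applied (note the original condition also used > 0, which would skip a '[' at
index 0).

    int startInd = someLog.indexOf('[');
    while (startInd >= 0) {                              // -1 means no further '['
        int endInd = someLog.indexOf(']', startInd + 1);
        if (endInd < 0) {
            break;                                       // unmatched '[': stop instead of reusing -1
        }
        cats.add(someLog.substring(startInd, endInd + 1));
        startInd = someLog.indexOf('[', endInd + 1);     // advance strictly past the last match
    }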

On Thu, February 24, 2011 7:47 pm, Aniket Mokashi wrote:
> This is a map-side UDF.
> The Pig script loads a log file and grabs the contents inside the square
> brackets: a = load; b = foreach a generate F(a); dump b;
>
> I see the following on the TaskTrackers:
> 2011-02-23 18:01:25,992 INFO org.apache.pig.impl.util.SpillableMemoryManager:
> first memory handler call - Collection threshold init = 5439488(5312K)
> used = 409337824(399743K) committed = 534118400(521600K) max = 715849728(699072K)
> 2011-02-23 18:01:26,102 INFO org.apache.pig.impl.util.SpillableMemoryManager:
> first memory handler call - Usage threshold init = 5439488(5312K)
> used = 546751088(533936K) committed = 671547392(655808K) max = 715849728(699072K)
>
> I am trying out some changes in the UDF to see if they work.
>
>
> Thanks,
> Aniket
>
>
> On Thu, February 24, 2011 7:25 pm, Daniel Dai wrote:
>
>> Hi, Aniket,
>> What is your Pig script? Is the UDF in map side or reduce side?
>>
>>
>>
>> Daniel
>>
>>
>>
>> Dmitriy Ryaboy wrote:
>>
>>
>>> That's a max of 3.3K single-character strings. Even with the Java
>>> overhead that shouldn't be more than a meg, right? None of these should
>>> make it out of young gen, assuming the list "cats" doesn't stick
>>> around outside the UDF.
>>>
>>> On Thu, Feb 24, 2011 at 3:49 PM, Aniket Mokashi
>>> <am...@andrew.cmu.edu>wrote:
>>>
>>>
>>>
>>>
>>>> Hi Jai,
>>>>
>>>>
>>>>
>>>> Thanks for your email. I suspect it's the Strings-in-a-tight-loop
>>>> issue, as you have suggested. I have a loop in my UDF that does
>>>> the following:
>>>>
>>>> while ((startInd = someLog.indexOf('[', startInd)) > 0) {
>>>>     endInd = someLog.indexOf(']', startInd);
>>>>     if (endInd > 0) {
>>>>         category = someLog.substring(startInd, endInd+1);
>>>>         cats.add(category);
>>>>     }
>>>>     startInd = endInd;
>>>> }
>>>>
>>>>
>>>> My jobs are failing in both local and MR mode. The UDF works fine for
>>>> a smaller input (a few lines). Also, I checked that the length of
>>>> someLog doesn't exceed 10000.
>>>>
>>>> Thanks,
>>>> Aniket
>>>>
>>>>
>>>>
>>>>
>>>> On Thu, February 24, 2011 3:58 am, Jai Krishna wrote:
>>>>
>>>>
>>>>
>>>>> Sharing the code would be useful, as mentioned. Also of help would
>>>>> be the heap settings that the JVM had.
>>>>>
>>>>> However, off the top of my head, one common situation (esp. in
>>>>> text processing/tokenizing) is instantiating Strings in a tight
>>>>> loop.
>>>>>
>>>>> Besides you could also exercise your UDF in a local JVM and take
>>>>> a heap dump / profile it. If your heap is less than 512M, you
>>>>> could use basic profiling via hprof/hat (see
>>>>> http://java.sun.com/developer/technicalArticles/Programming/HPROF.html).
>>>>>
>>>>>
>>>>> Thanks,
>>>>> Jai
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On 2/24/11 9:26 AM, "Dmitriy Ryaboy" <dv...@gmail.com> wrote:
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> Aniket, share the code?
>>>>> It really depends on how you create them.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> -D
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Wed, Feb 23, 2011 at 7:49 PM, Aniket Mokashi
>>>>> <am...@andrew.cmu.edu>wrote:
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>> I've written a simple UDF that parses a chararray (which looks
>>>>>> like ...[a].....[b]...[a]...) to capture the tokens inside the
>>>>>> brackets and return them as a String such as a=2;b=1; and so on.
>>>>>> The input chararrays are rarely more than 1000 characters and
>>>>>> never more than 100000 (I've added log.warn in my UDF to verify
>>>>>> this). But I still see a Java heap space error while running this
>>>>>> UDF (even in local mode, the job simply fails). My assumption is
>>>>>> that the maps and lists I use locally will be reclaimed by the
>>>>>> GC. Am I missing something?
>>>>>>
>>>>>> Thanks,
>>>>>> Aniket
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>>
>>
>>
>>
>
>
>
>



Re: UDF problem: Java Heap space

Posted by Aniket Mokashi <am...@andrew.cmu.edu>.
This is a map-side UDF.
The Pig script loads a log file and grabs the contents inside the square
brackets: a = load; b = foreach a generate F(a); dump b;

I see the following on the TaskTrackers:
2011-02-23 18:01:25,992 INFO
org.apache.pig.impl.util.SpillableMemoryManager: first memory handler call
- Collection threshold init = 5439488(5312K) used = 409337824(399743K)
committed = 534118400(521600K) max = 715849728(699072K)
2011-02-23 18:01:26,102 INFO
org.apache.pig.impl.util.SpillableMemoryManager: first memory handler
call- Usage threshold init = 5439488(5312K) used = 546751088(533936K)
committed = 671547392(655808K) max = 715849728(699072K)

I am trying out some changes in the UDF to see if they work.

Thanks,
Aniket

On Thu, February 24, 2011 7:25 pm, Daniel Dai wrote:
> Hi, Aniket,
> What is your Pig script? Is the UDF in map side or reduce side?
>
>
> Daniel
>
>
> Dmitriy Ryaboy wrote:
>
>> That's a max of 3.3K single-character strings. Even with the Java
>> overhead that shouldn't be more than a meg, right? None of these should
>> make it out of young gen, assuming the list "cats" doesn't stick around
>> outside the UDF.
>>
>> On Thu, Feb 24, 2011 at 3:49 PM, Aniket Mokashi
>> <am...@andrew.cmu.edu>wrote:
>>
>>
>>
>>> Hi Jai,
>>>
>>>
>>> Thanks for your email. I suspect it's the Strings-in-a-tight-loop
>>> issue, as you have suggested. I have a loop in my UDF that does the
>>> following:
>>>
>>> while ((startInd = someLog.indexOf('[', startInd)) > 0) {
>>>     endInd = someLog.indexOf(']', startInd);
>>>     if (endInd > 0) {
>>>         category = someLog.substring(startInd, endInd+1);
>>>         cats.add(category);
>>>     }
>>>     startInd = endInd;
>>> }
>>>
>>>
>>> My jobs are failing in both local and MR mode. The UDF works fine for
>>> a smaller input (a few lines). Also, I checked that the length of
>>> someLog doesn't exceed 10000.
>>>
>>> Thanks,
>>> Aniket
>>>
>>>
>>>
>>> On Thu, February 24, 2011 3:58 am, Jai Krishna wrote:
>>>
>>>
>>>> Sharing the code would be useful, as mentioned. Also of help would
>>>> be the heap settings that the JVM had.
>>>>
>>>> However, off the top of my head, one common situation (esp. in text
>>>>  processing/tokenizing) is instantiating Strings in a tight loop.
>>>>
>>>> Besides you could also exercise your UDF in a local JVM and take a
>>>> heap dump / profile it. If your heap is less than 512M, you could
>>>> use basic profiling via hprof/hat (see
>>>> http://java.sun.com/developer/technicalArticles/Programming/HPROF.html).
>>>>
>>>>
>>>> Thanks,
>>>> Jai
>>>>
>>>>
>>>>
>>>>
>>>> On 2/24/11 9:26 AM, "Dmitriy Ryaboy" <dv...@gmail.com> wrote:
>>>>
>>>>
>>>>
>>>> Aniket, share the code?
>>>> It really depends on how you create them.
>>>>
>>>>
>>>>
>>>> -D
>>>>
>>>>
>>>>
>>>> On Wed, Feb 23, 2011 at 7:49 PM, Aniket Mokashi
>>>> <am...@andrew.cmu.edu>wrote:
>>>>
>>>>
>>>>
>>>>
>>>>> I've written a simple UDF that parses a chararray (which looks
>>>>> like ...[a].....[b]...[a]...) to capture the tokens inside the
>>>>> brackets and return them as a String such as a=2;b=1; and so on.
>>>>> The input chararrays are rarely more than 1000 characters and never
>>>>> more than 100000 (I've added log.warn in my UDF to verify this).
>>>>> But I still see a Java heap space error while running this UDF
>>>>> (even in local mode, the job simply fails). My assumption is that
>>>>> the maps and lists I use locally will be reclaimed by the GC. Am I
>>>>> missing something?
>>>>>
>>>>> Thanks,
>>>>> Aniket
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>
>>>
>
>
>



Re: UDF problem: Java Heap space

Posted by Daniel Dai <ji...@yahoo-inc.com>.
Hi, Aniket,
What is your Pig script? Is the UDF in map side or reduce side?

Daniel

Dmitriy Ryaboy wrote:
> That's a max of 3.3K single-character strings. Even with the Java overhead
> that shouldn't be more than a meg, right?
> None of these should make it out of young gen, assuming the list "cats"
> doesn't stick around outside the UDF.
>
> On Thu, Feb 24, 2011 at 3:49 PM, Aniket Mokashi <am...@andrew.cmu.edu>wrote:
>
>   
>> Hi Jai,
>>
>> Thanks for your email. I suspect it's the Strings-in-a-tight-loop issue,
>> as you have suggested. I have a loop in my UDF that does the following:
>>
>> while ((startInd = someLog.indexOf('[', startInd)) > 0) {
>>     endInd = someLog.indexOf(']', startInd);
>>     if (endInd > 0) {
>>         category = someLog.substring(startInd, endInd+1);
>>         cats.add(category);
>>     }
>>     startInd = endInd;
>> }
>>
>> My jobs are failing in both local and MR mode. The UDF works fine for a
>> smaller input (a few lines). Also, I checked that the length of someLog
>> doesn't exceed 10000.
>>
>> Thanks,
>> Aniket
>>
>>
>> On Thu, February 24, 2011 3:58 am, Jai Krishna wrote:
>>     
>>> Sharing the code would be useful, as mentioned. Also of help would be
>>> the heap settings that the JVM had.
>>>
>>> However, off the top of my head, one common situation (esp. in text
>>> processing/tokenizing) is instantiating Strings in a tight loop.
>>>
>>> Besides you could also exercise your UDF in a local JVM and take a heap
>>> dump / profile it. If your heap is less than 512M, you could use basic
>>> profiling via hprof/hat (see
>>> http://java.sun.com/developer/technicalArticles/Programming/HPROF.html).
>>>
>>>
>>> Thanks,
>>> Jai
>>>
>>>
>>>
>>> On 2/24/11 9:26 AM, "Dmitriy Ryaboy" <dv...@gmail.com> wrote:
>>>
>>>
>>> Aniket, share the code?
>>> It really depends on how you create them.
>>>
>>>
>>> -D
>>>
>>>
>>> On Wed, Feb 23, 2011 at 7:49 PM, Aniket Mokashi
>>> <am...@andrew.cmu.edu>wrote:
>>>
>>>
>>>       
>>>> I've written a simple UDF that parses a chararray (which looks like
>>>> ...[a].....[b]...[a]...) to capture the tokens inside the brackets and
>>>> return them as a String such as a=2;b=1; and so on. The input chararrays
>>>> are rarely more than 1000 characters and never more than 100000 (I've
>>>> added log.warn in my UDF to verify this). But I still see a Java heap
>>>> space error while running this UDF (even in local mode, the job simply
>>>> fails). My assumption is that the maps and lists I use locally will be
>>>> reclaimed by the GC. Am I missing something?
>>>>
>>>> Thanks,
>>>> Aniket
>>>>
>>>>
>>>>
>>>>         
>>>       
>>
>>     


Re: UDF problem: Java Heap space

Posted by Dmitriy Ryaboy <dv...@gmail.com>.
That's a max of 3.3K single-character strings. Even with the Java overhead
that shouldn't be more than a meg, right?
None of these should make it out of young gen, assuming the list "cats"
doesn't stick around outside the UDF.
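
(Presumably the 3.3K figure comes from the 10000-character upper bound: the
shortest bracketed token, "[x]", is 3 characters, so at most about 3,300
substrings can be extracted per record; even at a few dozen bytes of String
and ArrayList overhead each, that is well under a megabyte, assuming the loop
terminates and the list is not retained across calls.)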

On Thu, Feb 24, 2011 at 3:49 PM, Aniket Mokashi <am...@andrew.cmu.edu>wrote:

> Hi Jai,
>
> Thanks for your email. I suspect it's the Strings-in-a-tight-loop issue,
> as you have suggested. I have a loop in my UDF that does the following:
>
> while ((startInd = someLog.indexOf('[', startInd)) > 0) {
>     endInd = someLog.indexOf(']', startInd);
>     if (endInd > 0) {
>         category = someLog.substring(startInd, endInd+1);
>         cats.add(category);
>     }
>     startInd = endInd;
> }
>
> My jobs are failing in both local and MR mode. The UDF works fine for a
> smaller input (a few lines). Also, I checked that the length of someLog
> doesn't exceed 10000.
>
> Thanks,
> Aniket
>
>
> On Thu, February 24, 2011 3:58 am, Jai Krishna wrote:
> > Sharing the code would be useful, as mentioned. Also of help would be
> > the heap settings that the JVM had.
> >
> > However, off the top of my head, one common situation (esp. in text
> > processing/tokenizing) is instantiating Strings in a tight loop.
> >
> > Besides you could also exercise your UDF in a local JVM and take a heap
> > dump / profile it. If your heap is less than 512M, you could use basic
> > profiling via hprof/hat (see
> > http://java.sun.com/developer/technicalArticles/Programming/HPROF.html).
> >
> >
> > Thanks,
> > Jai
> >
> >
> >
> > On 2/24/11 9:26 AM, "Dmitriy Ryaboy" <dv...@gmail.com> wrote:
> >
> >
> > Aniket, share the code?
> > It really depends on how you create them.
> >
> >
> > -D
> >
> >
> > On Wed, Feb 23, 2011 at 7:49 PM, Aniket Mokashi
> > <am...@andrew.cmu.edu>wrote:
> >
> >
> >> I've written a simple UDF that parses a chararray (which looks like
> >> ...[a].....[b]...[a]...) to capture the tokens inside the brackets and
> >> return them as a String such as a=2;b=1; and so on. The input chararrays
> >> are rarely more than 1000 characters and never more than 100000 (I've
> >> added log.warn in my UDF to verify this). But I still see a Java heap
> >> space error while running this UDF (even in local mode, the job simply
> >> fails). My assumption is that the maps and lists I use locally will be
> >> reclaimed by the GC. Am I missing something?
> >>
> >> Thanks,
> >> Aniket
> >>
> >>
> >>
> >
> >
>
>
>

Re: UDF problem: Java Heap space

Posted by Aniket Mokashi <am...@andrew.cmu.edu>.
Hi Jai,

Thanks for your email. I suspect it's the Strings-in-a-tight-loop issue,
as you have suggested. I have a loop in my UDF that does the following:

while ((startInd = someLog.indexOf('[', startInd)) > 0) {
    endInd = someLog.indexOf(']', startInd);
    if (endInd > 0) {
        category = someLog.substring(startInd, endInd+1);
        cats.add(category);
    }
    startInd = endInd;
}

My jobs are failing in both local and MR mode. The UDF works fine for a
smaller input (a few lines). Also, I checked that the length of someLog
doesn't exceed 10000.

Thanks,
Aniket


On Thu, February 24, 2011 3:58 am, Jai Krishna wrote:
> Sharing the code would be useful, as mentioned. Also of help would be
> the heap settings that the JVM had.
>
> However, off the top of my head, one common situation (esp. in text
> processing/tokenizing) is instantiating Strings in a tight loop.
>
> Besides you could also exercise your UDF in a local JVM and take a heap
> dump / profile it. If your heap is less than 512M, you could use basic
> profiling via hprof/hat (see
> http://java.sun.com/developer/technicalArticles/Programming/HPROF.html ).
>
>
> Thanks,
> Jai
>
>
>
> On 2/24/11 9:26 AM, "Dmitriy Ryaboy" <dv...@gmail.com> wrote:
>
>
> Aniket, share the code?
> It really depends on how you create them.
>
>
> -D
>
>
> On Wed, Feb 23, 2011 at 7:49 PM, Aniket Mokashi
> <am...@andrew.cmu.edu>wrote:
>
>
>> I've written a simple UDF that parses a chararray (which looks like
>> ...[a].....[b]...[a]...) to capture the tokens inside the brackets and
>> return them as a String such as a=2;b=1; and so on. The input chararrays
>> are rarely more than 1000 characters and never more than 100000 (I've
>> added log.warn in my UDF to verify this). But I still see a Java heap
>> space error while running this UDF (even in local mode, the job simply
>> fails). My assumption is that the maps and lists I use locally will be
>> reclaimed by the GC. Am I missing something?
>>
>> Thanks,
>> Aniket
>>
>>
>>
>
>



Re: UDF problem: Java Heap space

Posted by Jai Krishna <rj...@yahoo-inc.com>.
Sharing the code would be useful, as mentioned. Also of help would be the heap settings that the JVM had.

However, off the top of my head, one common situation (esp. in text processing/tokenizing) is instantiating Strings in a tight loop.

Besides you could also exercise your UDF in a local JVM and take a heap dump / profile it.
If your heap is less than 512M, you could use basic profiling via hprof/hat (see http://java.sun.com/developer/technicalArticles/Programming/HPROF.html ).

Thanks,
Jai
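
For example, assuming a small driver class (hypothetically called UdfDriver
here) that feeds a few sample lines to the UDF outside of Pig, an hprof run
could look roughly like this:

    # allocation sites with 10-deep stack traces, written out when the JVM exits
    java -agentlib:hprof=heap=sites,depth=10 UdfDriver sample.log

    # or a binary heap dump that jhat can browse afterwards
    java -agentlib:hprof=heap=dump,format=b,file=udf.hprof UdfDriver sample.log
    jhat udf.hprof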


On 2/24/11 9:26 AM, "Dmitriy Ryaboy" <dv...@gmail.com> wrote:

Aniket, share the code?
It really depends on how you create them.

-D

On Wed, Feb 23, 2011 at 7:49 PM, Aniket Mokashi <am...@andrew.cmu.edu>wrote:

> I've written a simple UDF that parses a chararray (which looks like
> ...[a].....[b]...[a]...) to capture the tokens inside the brackets and
> return them as a String such as a=2;b=1; and so on. The input chararrays
> are rarely more than 1000 characters and never more than 100000 (I've
> added log.warn in my UDF to verify this). But I still see a Java heap
> space error while running this UDF (even in local mode, the job simply
> fails). My assumption is that the maps and lists I use locally will be
> reclaimed by the GC. Am I missing something?
>
> Thanks,
> Aniket
>
>


Re: UDF problem: Java Heap space

Posted by Dmitriy Ryaboy <dv...@gmail.com>.
Aniket, share the code?
It really depends on how you create them.

-D

On Wed, Feb 23, 2011 at 7:49 PM, Aniket Mokashi <am...@andrew.cmu.edu>wrote:

> I've written a simple UDF that parses a chararray (which looks like
> ...[a].....[b]...[a]...) to capture the tokens inside the brackets and
> return them as a String such as a=2;b=1; and so on. The input chararrays
> are rarely more than 1000 characters and never more than 100000 (I've
> added log.warn in my UDF to verify this). But I still see a Java heap
> space error while running this UDF (even in local mode, the job simply
> fails). My assumption is that the maps and lists I use locally will be
> reclaimed by the GC. Am I missing something?
>
> Thanks,
> Aniket
>
>