Posted to dev@pig.apache.org by Pradeep Kamath <pr...@yahoo-inc.com> on 2008/06/09 23:31:48 UTC

Proposal for handling "GC overhead limit" errors

Hi,

 

Currently in org.apache.pig.impl.util.SpillableMemoryManager:

 

1) We use the MemoryManagement interface to get notified when the
"collection threshold" of the biggest heap pool exceeds a limit (we set
this to biggest_heap*0.5). With this in place we still see "GC overhead
limit" errors when running operations over large datasets. Observing
some runs, the notifications appear to be neither frequent enough nor
early enough to prevent memory issues, possibly because this type of
notification only occurs after a GC.

 

2) We only attempt to free up to:

long toFree = info.getUsage().getUsed() -
(long)(info.getUsage().getMax()*.5);

This is only the excess over the threshold that caused the notification,
which is not enough to prevent the handler from being called again soon.

 

3) While iterating over spillables, if the current spillable's memory
size is > gcActivationSize, we try to invoke System.gc().

 

4) We *always* invoke System.gc() after iterating over spillables.

 

Proposed changes are:

=================

1) In addition to the "collection threshold" of biggest_heap*0.5, a
"usage threshold" of biggest_heap*0.7 will be used so that we get
notified early and often, irrespective of whether garbage collection has
occurred.

 

2) We will attempt to free:

toFree = info.getUsage().getUsed() - threshold + (long)(threshold * 0.5);

where threshold is (info.getUsage().getMax() * 0.7) if the
handleNotification() method is handling a "usage threshold exceeded"
notification, and (info.getUsage().getMax() * 0.5) otherwise (the
"collection threshold exceeded" case).
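A quick worked example of this formula (the helper class and method names are made up for illustration; a 1000 MB heap keeps the numbers round):

```java
public class ToFree {
    // Proposed formula: free the excess over the threshold plus an extra
    // half-threshold cushion, so the handler is not re-triggered right away.
    static long toFree(long used, long max, boolean usageNotification) {
        double fraction = usageNotification ? 0.7 : 0.5;
        long threshold = (long) (max * fraction);
        return used - threshold + (long) (threshold * 0.5);
    }

    public static void main(String[] args) {
        long mb = 1024 * 1024;
        long max = 1000 * mb;   // 1000 MB heap, for illustration
        long used = 750 * mb;   // 750 MB currently in use
        // usage-threshold case: threshold = 700 MB -> free 50 + 350 = 400 MB
        System.out.println(toFree(used, max, true) / mb);   // prints 400
        // collection-threshold case: threshold = 500 MB -> free 250 + 250 = 500 MB
        System.out.println(toFree(used, max, false) / mb);  // prints 500
    }
}
```

Note how much more aggressive this is than the current `used - max*0.5`: in the collection case above, the old formula would free only 250 MB.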

 

3) While iterating over spillables, if the *memory freed thus far* is >
gcActivationSize, OR if we have freed sufficient memory (based on 2)
above), then we set a flag to invoke System.gc() when we exit the loop.

 

4) We will invoke System.gc() only if the flag was set in 3) above.
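Changes 3) and 4) together could look like the loop below. `Spillable` here is a stand-in with invented methods, not Pig's actual interface, and the spill bookkeeping is simplified:

```java
import java.util.ArrayList;
import java.util.List;

public class SpillLoop {
    interface Spillable {
        long getMemorySize();
        void spill();
    }

    static long gcActivationSize = 40_000_000L;

    // Returns true when System.gc() should be invoked once, after the loop.
    static boolean spillUntilFreed(List<Spillable> spillables, long toFree) {
        long freedSoFar = 0;
        boolean invokeGc = false;
        for (Spillable s : spillables) {
            if (freedSoFar >= toFree) {   // freed enough (change 2): flag GC, stop
                invokeGc = true;
                break;
            }
            freedSoFar += s.getMemorySize();
            s.spill();
            if (freedSoFar > gcActivationSize) {  // cumulative, not per-spillable
                invokeGc = true;
            }
        }
        return invokeGc;
    }

    public static void main(String[] args) {
        List<Spillable> bags = new ArrayList<>();
        for (int i = 0; i < 5; i++) {
            bags.add(new Spillable() {
                public long getMemorySize() { return 20_000_000L; }
                public void spill() { /* the real manager writes to disk here */ }
            });
        }
        boolean gc = spillUntilFreed(bags, 50_000_000L);
        System.out.println(gc);  // prints true: 60 MB freed exceeds both limits
        if (gc) {
            System.gc();         // at most one GC per handler invocation
        }
    }
}
```

The key difference from the current code is that the GC suggestion is driven by cumulative freed memory and deferred to a single call after the loop.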

 

Please provide thoughts/comments.

 

Thanks,

Pradeep


RE: Proposal for handling "GC overhead limit" errors

Posted by Olga Natkovich <ol...@yahoo-inc.com>.
+1 


RE: Proposal for handling "GC overhead limit" errors

Posted by Pradeep Kamath <pr...@yahoo-inc.com>.
The GC overhead limit error can occur even when we are not low on
memory: if memory is fragmented, the GC may spend too much time freeing
very little memory. Also, we don't want to hurt performance by invoking
the GC too often. Keeping these two points in mind, I propose that
gcActivationSize be applied to the memory freed thus far rather than to
the current Spillable's memory size, that a flag be set when this size
is reached, and that GC be invoked only once per handler invocation.

Also, I would like to use the following defaults if they seem reasonable:
    // if we freed at least this much, invoke GC
    // (default 40 MB - can be overridden by a user-supplied property)
    private static long gcActivationSize = 40000000L;

    // spill file size should be at least this much
    // (default 5 MB - can be overridden by a user-supplied property)
    private static long spillFileSizeThreshold = 5000000L;

    // fraction of biggest heap for which we want to get
    // "memory usage threshold exceeded" notifications
    private static double memoryThresholdFraction = 0.7;

    // fraction of biggest heap for which we want to get
    // "collection threshold exceeded" notifications
    private static double collectionMemoryThresholdFraction = 0.5;
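As a sketch of how such defaults might be made user-overridable, the helper below reads a system property and falls back to the default. The property keys are invented for illustration; they are not necessarily the keys Pig actually uses:

```java
public class SpillDefaults {
    // Reads a long-valued system property, falling back to the default.
    static long longProp(String key, long dflt) {
        String v = System.getProperty(key);
        return v == null ? dflt : Long.parseLong(v);
    }

    public static void main(String[] args) {
        // Hypothetical property names, shown only to illustrate the override:
        long gcActivationSize =
                longProp("pig.spill.gc.activation.size", 40000000L);
        long spillFileSizeThreshold =
                longProp("pig.spill.size.threshold", 5000000L);
        System.out.println(gcActivationSize + " " + spillFileSizeThreshold);
    }
}
```

A user would then override a default with e.g. `-Dpig.spill.gc.activation.size=80000000` on the JVM command line.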


I am currently running more tests to check if previously seen issues
with queries are now solved with these changes.

-Pradeep


Re: Proposal for handling "GC overhead limit" errors

Posted by pi song <pi...@gmail.com>.
Sorry. It's actually Long.MAX_VALUE, not Integer.


Re: Proposal for handling "GC overhead limit" errors

Posted by pi song <pi...@gmail.com>.
Pradeep,

I totally buy your biggest_heap*0.7 idea.

BUT!! I've tried this:

        for (int i = 0; i < 100000; i++) {
            StringBuilder sb = new StringBuilder();
            for (int j = 0; j < 100; j++) {
                sb.append("hodgdfdsfsddf");
            }
            System.gc();
        }

and it doesn't give me any error. So I think calling it too often is not
a problem, except that it might be slow.

gcActivationSize by default is set to Integer.MAX_VALUE. I believe most
people have never used it, so it should have nothing to do with the
current problem.

My concern about using soft/weak references for the data in a bag is
that if the granularity is too fine, we will need more space for those
additional pointers.

Pi


Re: Proposal for handling "GC overhead limit" errors

Posted by Mridul Muralidharan <mr...@yahoo-inc.com>.

Ideally, instead of using SpillableMemoryManager, it might be better to:

a) use a soft/weak reference to refer to the data in a bag/tuple
a.1) a soft reference, since it is less GC-sensitive than a weak
reference (a GC typically kicks all weak refs out, so soft refs are
sort of like a cache whose entries are not kicked out as frequently)
b) register them with a reference queue and manage the life cycle of the
referent (to spill/not spill)
c) override get/put in bag/tuple so that we load off the disk if the
referent is null (this should already be done in some way in the code
currently)


Of course, this is much more work and slightly trickier ... so if
SpillableMemoryManager can handle the requirements, it should work fine.
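The soft-reference scheme above could be sketched as follows. Everything here is illustrative: `SoftBag`, its methods, and the String "tuples" stand in for Pig's real bag/tuple classes, and the disk reload is a placeholder:

```java
import java.lang.ref.Reference;
import java.lang.ref.ReferenceQueue;
import java.lang.ref.SoftReference;
import java.util.ArrayList;
import java.util.List;

public class SoftBag {
    private final ReferenceQueue<List<String>> queue = new ReferenceQueue<>();
    private SoftReference<List<String>> contents;

    SoftBag(List<String> tuples) {
        // In the real design the data would be spilled to disk before the
        // reference can be cleared; this sketch just recreates it on demand.
        contents = new SoftReference<>(tuples, queue);
    }

    // get() reloads from disk when the collector has cleared the referent.
    List<String> get() {
        List<String> tuples = contents.get();
        if (tuples == null) {
            tuples = loadFromDisk();
            contents = new SoftReference<>(tuples, queue);
        }
        return tuples;
    }

    private List<String> loadFromDisk() {
        // Placeholder for reading a spill file back in.
        return new ArrayList<>(List.of("reloaded"));
    }

    // A background thread would poll the queue and do spill bookkeeping.
    void drainQueue() {
        Reference<? extends List<String>> ref;
        while ((ref = queue.poll()) != null) {
            System.out.println("referent cleared; contents now on disk only");
        }
    }

    public static void main(String[] args) {
        SoftBag bag = new SoftBag(new ArrayList<>(List.of("t1", "t2")));
        // Soft refs are cleared only under memory pressure, so this will
        // normally return the original two tuples (not guaranteed by spec).
        System.out.println(bag.get().size());
        bag.drainQueue();
    }
}
```

This makes the JVM itself decide when data must go, at the cost pi song notes below: one extra reference object per guarded chunk of data.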


Regards,
Mridul



RE: Proposal for handling "GC overhead limit" errors

Posted by Pradeep Kamath <pr...@yahoo-inc.com>.
I have some test numbers below in the mail, but first the discussion
items:

Going by
http://java.sun.com/docs/hotspot/gc5.0/gc_tuning_5.html#0.0.0.0.Out-of-Memory%20Exceptions%7Coutline,
I think the "GC overhead limit" exception is thrown when the GC spends
98% of its time while freeing less than 2% of the heap. The "java heap
space" error is a more direct indication that we are out of space. So
the GC should be invoked judiciously, so as not to hit the "overhead
limit".

In http://java.sun.com/j2se/1.5.0/docs/api/java/lang/System.html#gc(),
it says System.gc() "..suggests that the JVM expend effort.." - does
this mean the GC may not actually run? If we keep gcActivationSize
applying to the current spillable's memory size, we could potentially
(if the GC is actually called!) prevent the smaller nested bags in a big
bag from being spilled. However, we could invoke the GC again after
further iteration once we have freed enough memory - this double call in
quick succession within the same handler invocation could trigger an
"overhead limit" exception. Hence I would like gcActivationSize to apply
to the memory freed thus far rather than to the current Spillable's
memory size, to set a flag when this size is reached, and to invoke GC
only once per handler invocation. This is a tradeoff: either we prevent
redundant spills of smaller nested bags OR we prevent double calls of
System.gc() in the same handler invocation.

Re: Alan's concern:
Given the description of the GC overhead limit above, I am concerned
that if we invoke GC (without an activation limit) we might get the GC
into a mode where it spends 98% of its time freeing < 2% of memory and
hence cause an exception.
1) We could keep track of spill sizes between GC invocations and reduce
the dribble by invoking GC when the cumulative spill size crosses the
activation limit.
2) We could keep gcActivationSize close to spillFileSizeThreshold and
hence cause the GC to be invoked more often (again risking the "overhead
limit" if it is too close) - for example, spillFileSizeThreshold = 5 MB
and gcActivationSize = 40 MB (4% of 1 GB) - so in the worst case, we
would invoke GC after 8 small spills of 5 MB if we do 1) above.
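Option 1) above can be sketched in a few lines. The class and method names are invented for illustration; the real accounting would live in the spill manager:

```java
public class CumulativeGcGate {
    static final long GC_ACTIVATION_SIZE = 40_000_000L; // 40 MB, as proposed

    private long spilledSinceLastGc = 0;

    // Called after each spill; returns true when a GC should be suggested.
    boolean recordSpill(long bytesSpilled) {
        spilledSinceLastGc += bytesSpilled;
        if (spilledSinceLastGc >= GC_ACTIVATION_SIZE) {
            spilledSinceLastGc = 0;  // reset: at most one GC per 40 MB spilled
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        CumulativeGcGate gate = new CumulativeGcGate();
        // The worst case from 2) above: eight small 5 MB spills.
        int gcCalls = 0;
        for (int i = 0; i < 8; i++) {
            if (gate.recordSpill(5_000_000L)) {
                gcCalls++;
            }
        }
        System.out.println(gcCalls);  // prints 1: the 8th spill crosses 40 MB
    }
}
```

Because the counter persists across handler invocations, the dribble Alan describes still happens, but it is bounded: the GC is suggested after every 40 MB of cumulative spilling rather than never.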

Thoughts?

Here are test results, with run times for the new changes (only the
changes initially proposed, not the ones being discussed here):

Script run on 9 nodes, -Xmx512m (max heap size), with data containing
200 million rows:
a = load '/user/pig/tests/data/singlefile/studenttab200m'; b = group a
all; c = foreach b generate COUNT(a.$0); store c into
'/tmp/pig/bigdata_out';
new code: 1 hr, 21 mins, 1 sec
old code: 8 hrs, 26 mins, 28 secs [3 reduce attempts - the 1st attempt
had a "GC overhead limit exceeded" error, the 2nd attempt had hadoop
issues ("Lost task tracker"), the 3rd attempt succeeded]

Script run on 9 nodes, -Xmx512m (max heap size), with data containing
200 million rows:
a = load '/user/pig/tests/data/singlefile/studenttab200m'; b = group a
by $0; c = foreach b generate COUNT(a.$0), group; store c into
'/tmp/pig/bigdata_complex_out';
new code: 1 hr, 9 mins, 53 secs
old code: 1 hr, 12 mins, 25 secs

Script run on 1 node, -Xmx512m (max heap size), with data containing 20
million rows:
a = load '/user/pradeepk/studenttab20m'; b = group a all; c = foreach b
generate COUNT(a.$0); store c into '/tmp/pig/meddata_out';
new code: 28 mins, 19 secs
old code: failed after 3 reduce attempts, all with java heap space
errors

Script run on 9 nodes, -Xmx512m (max heap size), with data containing
20 million rows:
a = load '/user/pig/tests/data/singlefile/studenttab20m'; b = group a
all; c = foreach b generate COUNT(a.$0); store c into
'/tmp/pig/meddata_out';
new code: 6 mins, 37 secs
old code: 23 mins, 22 secs - the old code sometimes gives GC allocation
overhead errors

Pradeep

-----Original Message-----
From: pi song [mailto:pi.songs@gmail.com] 
Sent: Tuesday, June 10, 2008 7:54 AM
To: pig-dev@incubator.apache.org
Subject: Re: Proposal for handling "GC overhead limit" errors

GC overhead limit means OutOfMemory, right? Then we should think about
ideas to save memory. I've heard about memory compression techniques
before, but they are only good when we access the data sequentially, and
of course they incur some performance impact. My 2 cents.

On Wed, Jun 11, 2008 at 12:41 AM, pi song <pi...@gmail.com> wrote:

> Pradeep's (3) is what I thought of before, but I ended up implementing
> it the way it is because I believed disk I/O would be slow anyway.
> Adding just a gc call shouldn't cause much trouble (we are not calling
> it too often anyway). (4) will be called only once per notification, so
> again it should not be considered too expensive.
>
> That (3) bit also serves another purpose: to help reduce small spills.
> (This is what I posted before:)
> "Based on the fact that now we spill big bags first, my observation is
> that there are still cases where a big container bag is spilled and
> therefore its mContent becomes empty, but most of its inner bags'
> WeakReferences aren't cleaned up by GC yet. In such cases, if we haven't
> freed up enough memory, those inner bags will be unnecessarily spilled
> (even though all their contents were already spilled in the big bag
> spill)."
>
> Pi
>
>
> On Tue, Jun 10, 2008 at 11:06 AM, Alan Gates <ga...@yahoo-inc.com>
> wrote:
>
>> My concern with the methodology is that we can get into a dribble
>> mode. Consider the following scenario:
>>
>> 1) We get a usage threshold exceeded notification.
>> 2) We spill, but not enough to activate the garbage collector.
>> 3) Next time the jvm checks, will we still get a usage threshold
>> exceeded notification? I assume so, since the gc won't have run. But
>> at this point it's highly unlikely that we'll spill enough to activate
>> the gc. From here on out we're stuck, spilling little bits but not
>> calling the gc until the system invokes it.
>>
>> We could mitigate this some by tracking spill sizes across spills and
>> invoking the gc when we reach the threshold. This does not avoid the
>> dribble, but it does shorten it.
>>
>> I think any time we spill we should invoke the gc, to avoid the
>> dribble. Pradeep is concerned that this will cause us to invoke the gc
>> too often, which is a possible cause of the error we see. Is it
>> possible to estimate our spill size before we start spilling and
>> decide up front whether to try it or not?
>>  Alan.
>>
>>
>> Pradeep Kamath wrote:
>>
>>> Hi,
>>>
>>> Currently in org.apache.pig.impl.util.SpillableMemoryManager:
>>>
>>> 1) We use the MemoryManagement interface to get notified when the
>>> "collection threshold" exceeds a limit (we set this to
>>> biggest_heap*0.5). With this in place we are still seeing "GC overhead
>>> limit" issues when trying large dataset operations. Observing some runs,
>>> it looks like the notification is not frequent enough and early enough
>>> to prevent memory issues, possibly because this notification only occurs
>>> after GC.
>>>
>>> 2) We only attempt to free up to:
>>>
>>> long toFree = info.getUsage().getUsed() -
>>> (long)(info.getUsage().getMax()*.5);
>>>
>>> This is only the excess amount over the threshold which caused the
>>> notification, and it is not enough to keep us from being notified again
>>> soon.
>>>
>>> 3) While iterating over spillables, if the current spillable's memory size
>>> is > gcActivationSize, we try to invoke System.gc.
>>>
>>> 4) We *always* invoke System.gc() after iterating over spillables.
>>>
>>> Proposed changes are:
>>> =================
>>>
>>> 1) In addition to the "collection threshold" of biggest_heap*0.5, a "usage
>>> threshold" of biggest_heap*0.7 will be used, so we get notified early and
>>> often, irrespective of whether garbage collection has occurred.
>>>
>>> 2) We will attempt to free
>>> toFree = info.getUsage().getUsed() - threshold + (long)(threshold * 0.5);
>>> where threshold is (info.getUsage().getMax() * 0.7) if the
>>> handleNotification() method is handling a "usage threshold exceeded"
>>> notification, and (info.getUsage().getMax() * 0.5) otherwise (the
>>> "collection threshold exceeded" case).
>>>
>>> 3) While iterating over spillables, if the *memory freed thus far* is >
>>> gcActivationSize, OR if we have freed sufficient memory (based on 2)
>>> above), then we set a flag to invoke System.gc when we exit the loop.
>>>
>>> 4) We will invoke System.gc() only if the flag is set in 3) above.
>>>
>>> Please provide thoughts/comments.
>>>
>>> Thanks,
>>>
>>> Pradeep
>>>
>>
>>
>
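The two thresholds in proposed change (1) can be registered through the standard java.lang.management API. A minimal sketch, assuming the biggest heap pool supports both threshold types (the real SpillableMemoryManager's pool selection and notification wiring are more involved):

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryType;

public class ThresholdSetup {
    public static void main(String[] args) {
        // Find the biggest heap pool (typically the tenured generation)
        // that supports both threshold types.
        MemoryPoolMXBean biggest = null;
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if (pool.getType() == MemoryType.HEAP
                    && pool.isUsageThresholdSupported()
                    && pool.isCollectionUsageThresholdSupported()) {
                if (biggest == null
                        || pool.getUsage().getMax() > biggest.getUsage().getMax()) {
                    biggest = pool;
                }
            }
        }
        if (biggest == null) {
            throw new IllegalStateException("no suitable heap pool found");
        }

        long max = biggest.getUsage().getMax();
        if (max < 0) {
            max = Runtime.getRuntime().maxMemory(); // pool max can be undefined
        }

        // Existing behaviour: collection threshold fires only after a GC.
        long collectionThreshold = (long) (max * 0.5);
        // Proposed addition: usage threshold fires as soon as usage crosses
        // it, regardless of whether a GC has run.
        long usageThreshold = (long) (max * 0.7);

        biggest.setCollectionUsageThreshold(collectionThreshold);
        biggest.setUsageThreshold(usageThreshold);

        System.out.println(collectionThreshold < usageThreshold);
    }
}
```

The notifications themselves would still arrive through the MemoryMXBean's NotificationEmitter, as they do today; only the extra usage threshold is new.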

Re: Proposal for handling "GC overhead limit" errors

Posted by pi song <pi...@gmail.com>.
"GC overhead limit exceeded" essentially means OutOfMemory, right? The JVM
throws it when it spends nearly all its time in GC while recovering very
little heap. Then we should think about ideas to save memory. I've heard of
memory compression techniques before, but they are only good when we access
the data sequentially, and of course they incur some performance impact. My 2
cents.

On Wed, Jun 11, 2008 at 12:41 AM, pi song <pi...@gmail.com> wrote:

> Pradeep's (3) is what I thought before but I ended up implementing the way
> it is because I believed disk I/O should be slow anyway. Adding just a gc
> call shouldn't cause much trouble (we are not calling too often anyway). (4)
> will be called only once per each notification so again should not be
> considered too expensive.
>
> That (3) bit also serves another purpose to help reduce small spills:-
> (This is what I posted before)
> "Based on the fact that now we spill big bags first, my observation is that
> there are still cases where a big container bag is spilled and therefore its
> mContent becomes empty but most of its inner bags' WeakReferences aren't
> clean-up by GC yet. In such cases, if we haven't freed up enough memory,
> those inner bags will be unnecessarily spilled (however all their contents
> were already spilled in the big bag spill)"
>
> Pi
>
>
> On Tue, Jun 10, 2008 at 11:06 AM, Alan Gates <ga...@yahoo-inc.com> wrote:
>
>> My concern with the methodology is that we can get into a dribble mode.
>>  Consider the following scenario:
>>
>> 1) We get a usage threshold exceeded notification.
>> 2) We spill, but not enough to activate the garbage collector.
>> 3) Next time the jvm checks, will we still get a usage exceeded threshold?
>>  I assume, since the gc won't have run.  But at this point it's highly
>> unlikely that we'll spill enough to activate the gc.  From here on out we're
>> stuck, spilling little bits but not calling the gc until the system invokes
>> it.
>>
>> We could mitigate this some by tracking spill sizes across spills and
>> invoking the gc when we reach the threshold.  This does not avoid the
>> dribble, it does shorten it.
>>
>> I think any time we spill we should invoke the gc to avoid the dribble.
>>  Pradeep is concerned that this will cause us to invoke the gc too often,
>> which is a possible cause of the error we see.  Is it possible to estimate
>> our spill size before we start spilling and decide up front whether to try
>> it or not?
>>  Alan.
>>
>>
>

Re: Proposal for handling "GC overhead limit" errors

Posted by pi song <pi...@gmail.com>.
Pradeep's (3) is what I thought of before, but I ended up implementing it the
way it is because I believed disk I/O would be slow anyway. Adding just a gc
call shouldn't cause much trouble (we are not calling it too often anyway). (4)
will be called only once per notification, so again it should not be
considered too expensive.

That (3) bit also serves another purpose: it helps reduce small spills.
(This is what I posted before)
"Based on the fact that we now spill big bags first, my observation is that
there are still cases where a big container bag is spilled and therefore its
mContent becomes empty, but most of its inner bags' WeakReferences aren't
cleaned up by GC yet. In such cases, if we haven't freed up enough memory,
those inner bags will be unnecessarily spilled (even though all their contents
were already spilled in the big bag spill)"

Pi

On Tue, Jun 10, 2008 at 11:06 AM, Alan Gates <ga...@yahoo-inc.com> wrote:

> My concern with the methodology is that we can get into a dribble mode.
>  Consider the following scenario:
>
> 1) We get a usage threshold exceeded notification.
> 2) We spill, but not enough to activate the garbage collector.
> 3) Next time the jvm checks, will we still get a usage exceeded threshold?
>  I assume, since the gc won't have run.  But at this point it's highly
> unlikely that we'll spill enough to activate the gc.  From here on out we're
> stuck, spilling little bits but not calling the gc until the system invokes
> it.
>
> We could mitigate this some by tracking spill sizes across spills and
> invoking the gc when we reach the threshold.  This does not avoid the
> dribble, it does shorten it.
>
> I think any time we spill we should invoke the gc to avoid the dribble.
>  Pradeep is concerned that this will cause us to invoke the gc too often,
> which is a possible cause of the error we see.  Is it possible to estimate
> our spill size before we start spilling and decide up front whether to try
> it or not?
> Alan.
>
>
>
>
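Proposed changes (2)-(4) — free past the threshold with 50% slack, track memory freed so far, and call System.gc() at most once after the loop — can be sketched roughly as follows. The one-method Spillable here is a hypothetical stand-in for org.apache.pig.impl.util.Spillable, and all numbers are purely illustrative:

```java
import java.util.Arrays;
import java.util.List;

public class SpillLoopSketch {
    // Stand-in for Pig's Spillable, reduced to the one method this
    // sketch needs. spill() returns the number of bytes freed.
    interface Spillable {
        long spill();
    }

    static final long GC_ACTIVATION_SIZE = 40; // illustrative units

    // Items (2)-(4) of the proposal: compute how much to free (excess
    // over the threshold plus 50% of the threshold as slack), spill
    // until that much is freed, and set a flag so System.gc() runs at
    // most once, after the loop, rather than mid-iteration.
    static boolean spillUntilFreed(List<Spillable> spillables,
                                   long used, long threshold) {
        long toFree = used - threshold + (long) (threshold * 0.5);
        long freedSoFar = 0;
        boolean invokeGc = false;
        for (Spillable s : spillables) {
            if (freedSoFar >= toFree) {       // freed sufficient memory
                invokeGc = true;
                break;
            }
            freedSoFar += s.spill();
            if (freedSoFar > GC_ACTIVATION_SIZE) {
                invokeGc = true;              // enough freed to make gc worthwhile
            }
        }
        if (invokeGc) {
            System.gc();                      // single collection per notification
        }
        return invokeGc;
    }

    public static void main(String[] args) {
        // Three bags, each able to free 30 units; usage 190 against a
        // usage threshold of 140 (0.7 of a 200-unit max heap).
        List<Spillable> bags = Arrays.asList(() -> 30L, () -> 30L, () -> 30L);
        System.out.println(spillUntilFreed(bags, 190, 140));
    }
}
```

In this example the loop never reaches toFree (120 units), but the 60 units freed after two spills exceed GC_ACTIVATION_SIZE, so the flag is set and the gc runs once at the end — which is also the behaviour that lets the inner bags' WeakReferences get cleared before they are spilled unnecessarily.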

Re: Proposal for handling "GC overhead limit" errors

Posted by Alan Gates <ga...@yahoo-inc.com>.
My concern with the methodology is that we can get into a dribble mode.  
Consider the following scenario:

1) We get a usage threshold exceeded notification.
2) We spill, but not enough to activate the garbage collector.
3) Next time the jvm checks, will we still get a usage threshold exceeded 
notification?  I assume so, since the gc won't have run.  But at this point 
it's highly unlikely that we'll spill enough to activate the gc.  From 
here on out we're stuck, spilling little bits but not calling the gc 
until the system invokes it.

We could mitigate this some by tracking spill sizes across spills and 
invoking the gc when we reach the threshold.  This does not avoid the 
dribble, but it does shorten it.

I think any time we spill we should invoke the gc to avoid the dribble.  
Pradeep is concerned that this will cause us to invoke the gc too often, 
which is a possible cause of the error we see.  Is it possible to 
estimate our spill size before we start spilling and decide up front 
whether to try it or not? 

Alan.
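The mitigation described above — tracking spill sizes across spills and invoking the gc once a threshold is reached — might be sketched like this (class and method names are hypothetical, not part of the current SpillableMemoryManager):

```java
// Accumulate freed bytes across notifications and force a collection
// once the running total reaches the activation size, so a sequence of
// small spills can't dribble on forever without the gc ever running.
public class CumulativeSpillTracker {
    private long accumulated = 0;
    private final long gcActivationSize;

    public CumulativeSpillTracker(long gcActivationSize) {
        this.gcActivationSize = gcActivationSize;
    }

    // Record a spill; returns true once the cumulative total of freed
    // bytes reaches the activation size, i.e. it is now worth paying
    // for a System.gc() even though no single spill was big enough.
    public boolean recordSpill(long bytesFreed) {
        accumulated += bytesFreed;
        if (accumulated >= gcActivationSize) {
            accumulated = 0;   // reset after signalling a collection
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        CumulativeSpillTracker tracker = new CumulativeSpillTracker(100);
        System.out.println(tracker.recordSpill(40));  // small spill: no gc yet
        System.out.println(tracker.recordSpill(40));  // total 80, still below 100
        System.out.println(tracker.recordSpill(40));  // total 120 >= 100: gc now
    }
}
```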



RE: Proposal for handling "GC overhead limit" errors

Posted by Olga Natkovich <ol...@yahoo-inc.com>.
Pradeep,

Have you tested this? If so,

(1) Did the problem go away for the queries you tested?
(2) What effect did it have on the performance of the queries that run
successfully and spill?

Thanks,

Olga
 

> -----Original Message-----
> From: Pradeep Kamath [mailto:pradeepk@yahoo-inc.com] 
> Sent: Monday, June 09, 2008 2:32 PM
> To: pig-dev@incubator.apache.org
> Subject: Proposal for handling "GC overhead limit" errors
> 