You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@pig.apache.org by "Thejas M Nair (JIRA)" <ji...@apache.org> on 2010/07/23 22:12:50 UTC

[jira] Created: (PIG-1516) finalize in bag implementations causes pig to run out of memory in reduce

finalize in bag implementations causes pig to run out of memory in reduce 
--------------------------------------------------------------------------

                 Key: PIG-1516
                 URL: https://issues.apache.org/jira/browse/PIG-1516
             Project: Pig
          Issue Type: Bug
    Affects Versions: 0.7.0
            Reporter: Thejas M Nair
             Fix For: 0.8.0


*Problem:*
pig bag implementations that are subclasses of DefaultAbstractBag, have finalize methods implemented. As a result, the garbage collector moves them to a finalization queue, and the memory used is freed only after the finalization happens on it.
If the bags are not finalized fast enough, a lot of memory is consumed by the finalization queue, and pig runs out of memory. This can happen if large number of small bags are being created.

*Solution:*
The finalize function exists for the purpose of deleting the spill files that are created when the bag is too large. But if the bags are small enough, no spill files are created, and there is no use of the finalize function.
 A new class that holds a list of files will be introduced (FileList). This class will have a finalize method that deletes the files. The bags will no longer have finalize methods, and the bags will use FileList instead of ArrayList<File>.

*Possible workaround for earlier releases:*
Since the fix is going into 0.8, here is a workaround -
Disabling the combiner will reduce the number of bags getting created, as there will not be the stage of combining intermediate merge results. But I would recommend disabling it only if you have this problem as it is likely to slow down the query .
To disable combiner, set the property: -Dpig.exec.nocombiner=true


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (PIG-1516) finalize in bag implementations causes pig to run out of memory in reduce

Posted by "Thejas M Nair (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-1516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Thejas M Nair updated PIG-1516:
-------------------------------

    Status: Patch Available  (was: Open)

> finalize in bag implementations causes pig to run out of memory in reduce 
> --------------------------------------------------------------------------
>
>                 Key: PIG-1516
>                 URL: https://issues.apache.org/jira/browse/PIG-1516
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.7.0
>            Reporter: Thejas M Nair
>            Assignee: Thejas M Nair
>             Fix For: 0.8.0
>
>         Attachments: PIG-1516.2.patch, PIG-1516.patch
>
>
> *Problem:*
> pig bag implementations that are subclasses of DefaultAbstractBag, have finalize methods implemented. As a result, the garbage collector moves them to a finalization queue, and the memory used is freed only after the finalization happens on it.
> If the bags are not finalized fast enough, a lot of memory is consumed by the finalization queue, and pig runs out of memory. This can happen if large number of small bags are being created.
> *Solution:*
> The finalize function exists for the purpose of deleting the spill files that are created when the bag is too large. But if the bags are small enough, no spill files are created, and there is no use of the finalize function.
>  A new class that holds a list of files will be introduced (FileList). This class will have a finalize method that deletes the files. The bags will no longer have finalize methods, and the bags will use FileList instead of ArrayList<File>.
> *Possible workaround for earlier releases:*
> Since the fix is going into 0.8, here is a workaround -
> Disabling the combiner will reduce the number of bags getting created, as there will not be the stage of combining intermediate merge results. But I would recommend disabling it only if you have this problem as it is likely to slow down the query .
> To disable combiner, set the property: -Dpig.exec.nocombiner=true

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (PIG-1516) finalize in bag implementations causes pig to run out of memory in reduce

Posted by "Thejas M Nair (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-1516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Thejas M Nair updated PIG-1516:
-------------------------------

        Status: Resolved  (was: Patch Available)
    Resolution: Fixed

Committed to trunk.

> finalize in bag implementations causes pig to run out of memory in reduce 
> --------------------------------------------------------------------------
>
>                 Key: PIG-1516
>                 URL: https://issues.apache.org/jira/browse/PIG-1516
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.7.0
>            Reporter: Thejas M Nair
>            Assignee: Thejas M Nair
>             Fix For: 0.8.0
>
>         Attachments: PIG-1516.2.patch, PIG-1516.patch
>
>
> *Problem:*
> pig bag implementations that are subclasses of DefaultAbstractBag, have finalize methods implemented. As a result, the garbage collector moves them to a finalization queue, and the memory used is freed only after the finalization happens on it.
> If the bags are not finalized fast enough, a lot of memory is consumed by the finalization queue, and pig runs out of memory. This can happen if large number of small bags are being created.
> *Solution:*
> The finalize function exists for the purpose of deleting the spill files that are created when the bag is too large. But if the bags are small enough, no spill files are created, and there is no use of the finalize function.
>  A new class that holds a list of files will be introduced (FileList). This class will have a finalize method that deletes the files. The bags will no longer have finalize methods, and the bags will use FileList instead of ArrayList<File>.
> *Possible workaround for earlier releases:*
> Since the fix is going into 0.8, here is a workaround -
> Disabling the combiner will reduce the number of bags getting created, as there will not be the stage of combining intermediate merge results. But I would recommend disabling it only if you have this problem as it is likely to slow down the query .
> To disable combiner, set the property: -Dpig.exec.nocombiner=true

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (PIG-1516) finalize in bag implementations causes pig to run out of memory in reduce

Posted by "Thejas M Nair (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-1516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12891784#action_12891784 ] 

Thejas M Nair commented on PIG-1516:
------------------------------------

Regarding the workaround - I would recommend disabling the combiner only if other steps such as increasing the heap size or increasing the number of reducers do not help.

> finalize in bag implementations causes pig to run out of memory in reduce 
> --------------------------------------------------------------------------
>
>                 Key: PIG-1516
>                 URL: https://issues.apache.org/jira/browse/PIG-1516
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.7.0
>            Reporter: Thejas M Nair
>            Assignee: Thejas M Nair
>             Fix For: 0.8.0
>
>
> *Problem:*
> pig bag implementations that are subclasses of DefaultAbstractBag, have finalize methods implemented. As a result, the garbage collector moves them to a finalization queue, and the memory used is freed only after the finalization happens on it.
> If the bags are not finalized fast enough, a lot of memory is consumed by the finalization queue, and pig runs out of memory. This can happen if large number of small bags are being created.
> *Solution:*
> The finalize function exists for the purpose of deleting the spill files that are created when the bag is too large. But if the bags are small enough, no spill files are created, and there is no use of the finalize function.
>  A new class that holds a list of files will be introduced (FileList). This class will have a finalize method that deletes the files. The bags will no longer have finalize methods, and the bags will use FileList instead of ArrayList<File>.
> *Possible workaround for earlier releases:*
> Since the fix is going into 0.8, here is a workaround -
> Disabling the combiner will reduce the number of bags getting created, as there will not be the stage of combining intermediate merge results. But I would recommend disabling it only if you have this problem as it is likely to slow down the query .
> To disable combiner, set the property: -Dpig.exec.nocombiner=true

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (PIG-1516) finalize in bag implementations causes pig to run out of memory in reduce

Posted by "Ankur (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-1516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12892176#action_12892176 ] 

Ankur commented on PIG-1516:
----------------------------

The solution to have the finalize method AT ALL for the purpose of deleting files when object is garbage collected is NOT a good one. Generally speaking using finalizers to release non-memory resources like file handles should be avoided as it has an insidious bug. From the article on "Object finalization and Cleanup" - http://www.javaworld.com/jw-06-1998/jw-06-techniques.html

"Don't rely on finalizers to release non-memory resources"

An example of an object that breaks this rule is one that opens a file in its constructor and closes the file in its finalize() method. Although this design seems neat, tidy, and symmetrical, it potentially creates an insidious bug. A Java program generally will have only a finite number of file handles at its disposal. When all those handles are in use, the program won't be able to open any more files.  

> finalize in bag implementations causes pig to run out of memory in reduce 
> --------------------------------------------------------------------------
>
>                 Key: PIG-1516
>                 URL: https://issues.apache.org/jira/browse/PIG-1516
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.7.0
>            Reporter: Thejas M Nair
>            Assignee: Thejas M Nair
>             Fix For: 0.8.0
>
>
> *Problem:*
> pig bag implementations that are subclasses of DefaultAbstractBag, have finalize methods implemented. As a result, the garbage collector moves them to a finalization queue, and the memory used is freed only after the finalization happens on it.
> If the bags are not finalized fast enough, a lot of memory is consumed by the finalization queue, and pig runs out of memory. This can happen if large number of small bags are being created.
> *Solution:*
> The finalize function exists for the purpose of deleting the spill files that are created when the bag is too large. But if the bags are small enough, no spill files are created, and there is no use of the finalize function.
>  A new class that holds a list of files will be introduced (FileList). This class will have a finalize method that deletes the files. The bags will no longer have finalize methods, and the bags will use FileList instead of ArrayList<File>.
> *Possible workaround for earlier releases:*
> Since the fix is going into 0.8, here is a workaround -
> Disabling the combiner will reduce the number of bags getting created, as there will not be the stage of combining intermediate merge results. But I would recommend disabling it only if you have this problem as it is likely to slow down the query .
> To disable combiner, set the property: -Dpig.exec.nocombiner=true

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (PIG-1516) finalize in bag implementations causes pig to run out of memory in reduce

Posted by "Ashutosh Chauhan (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-1516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12894945#action_12894945 ] 

Ashutosh Chauhan commented on PIG-1516:
---------------------------------------

+1. Changes look good.

> finalize in bag implementations causes pig to run out of memory in reduce 
> --------------------------------------------------------------------------
>
>                 Key: PIG-1516
>                 URL: https://issues.apache.org/jira/browse/PIG-1516
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.7.0
>            Reporter: Thejas M Nair
>            Assignee: Thejas M Nair
>             Fix For: 0.8.0
>
>         Attachments: PIG-1516.2.patch, PIG-1516.patch
>
>
> *Problem:*
> pig bag implementations that are subclasses of DefaultAbstractBag, have finalize methods implemented. As a result, the garbage collector moves them to a finalization queue, and the memory used is freed only after the finalization happens on it.
> If the bags are not finalized fast enough, a lot of memory is consumed by the finalization queue, and pig runs out of memory. This can happen if large number of small bags are being created.
> *Solution:*
> The finalize function exists for the purpose of deleting the spill files that are created when the bag is too large. But if the bags are small enough, no spill files are created, and there is no use of the finalize function.
>  A new class that holds a list of files will be introduced (FileList). This class will have a finalize method that deletes the files. The bags will no longer have finalize methods, and the bags will use FileList instead of ArrayList<File>.
> *Possible workaround for earlier releases:*
> Since the fix is going into 0.8, here is a workaround -
> Disabling the combiner will reduce the number of bags getting created, as there will not be the stage of combining intermediate merge results. But I would recommend disabling it only if you have this problem as it is likely to slow down the query .
> To disable combiner, set the property: -Dpig.exec.nocombiner=true

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (PIG-1516) finalize in bag implementations causes pig to run out of memory in reduce

Posted by "Thejas M Nair (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-1516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Thejas M Nair updated PIG-1516:
-------------------------------

    Attachment: PIG-1516.patch

I haven't removed the use of finalize in this patch,  but with the patch the number of objects with finalize() method that get created should be much smaller, and if all the bags used in a query are small enough, there will not be any such objects created. 
Even in case of large bags that spill to disk, since finalize() is not a method of the bags, the tuples in the bags can be freed by GC without waiting on finalization.
This should stop queries from running out of memory because of of the wait on finalize().

The creation of spill files happens only if the bag is very large, and the processing of the tuples in those bags is likely to give enough time for the finalization thread to catch up. 

The changes - 
1. As I proposed in the solution, the finalize has been removed from DefaultAbstractBag and its subclasses, and a FileList class with a finalize is used as container for the list of spill files.
2. Removed the finalize() method in InternalCachedBag.CachedBagIterator . It was used to call close on DataInputStream. The DataInputStream contains a FileInputStream which would have non-memory resources to be freed. But the FileInputStream already has a finalize() method, so the finalize() method in InternalCachedBag.CachedBagIterator is unnecessary.
3. In the bags that have code to pre-merge files when there are large number of spill files, the files that have been merged into larger files are deleted.


Using WeakReferences as Scott suggested, we can get rid of the finalization completely. I have created a separate jira for that - PIG-1519 .


> finalize in bag implementations causes pig to run out of memory in reduce 
> --------------------------------------------------------------------------
>
>                 Key: PIG-1516
>                 URL: https://issues.apache.org/jira/browse/PIG-1516
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.7.0
>            Reporter: Thejas M Nair
>            Assignee: Thejas M Nair
>             Fix For: 0.8.0
>
>         Attachments: PIG-1516.patch
>
>
> *Problem:*
> pig bag implementations that are subclasses of DefaultAbstractBag, have finalize methods implemented. As a result, the garbage collector moves them to a finalization queue, and the memory used is freed only after the finalization happens on it.
> If the bags are not finalized fast enough, a lot of memory is consumed by the finalization queue, and pig runs out of memory. This can happen if large number of small bags are being created.
> *Solution:*
> The finalize function exists for the purpose of deleting the spill files that are created when the bag is too large. But if the bags are small enough, no spill files are created, and there is no use of the finalize function.
>  A new class that holds a list of files will be introduced (FileList). This class will have a finalize method that deletes the files. The bags will no longer have finalize methods, and the bags will use FileList instead of ArrayList<File>.
> *Possible workaround for earlier releases:*
> Since the fix is going into 0.8, here is a workaround -
> Disabling the combiner will reduce the number of bags getting created, as there will not be the stage of combining intermediate merge results. But I would recommend disabling it only if you have this problem as it is likely to slow down the query .
> To disable combiner, set the property: -Dpig.exec.nocombiner=true

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (PIG-1516) finalize in bag implementations causes pig to run out of memory in reduce

Posted by "Thejas M Nair (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-1516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12894586#action_12894586 ] 

Thejas M Nair commented on PIG-1516:
------------------------------------

The core and contrib tests pass on my machine.
The release audit warning is about javadoc html files.
Patch is ready for review.


> finalize in bag implementations causes pig to run out of memory in reduce 
> --------------------------------------------------------------------------
>
>                 Key: PIG-1516
>                 URL: https://issues.apache.org/jira/browse/PIG-1516
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.7.0
>            Reporter: Thejas M Nair
>            Assignee: Thejas M Nair
>             Fix For: 0.8.0
>
>         Attachments: PIG-1516.2.patch, PIG-1516.patch
>
>
> *Problem:*
> pig bag implementations that are subclasses of DefaultAbstractBag, have finalize methods implemented. As a result, the garbage collector moves them to a finalization queue, and the memory used is freed only after the finalization happens on it.
> If the bags are not finalized fast enough, a lot of memory is consumed by the finalization queue, and pig runs out of memory. This can happen if large number of small bags are being created.
> *Solution:*
> The finalize function exists for the purpose of deleting the spill files that are created when the bag is too large. But if the bags are small enough, no spill files are created, and there is no use of the finalize function.
>  A new class that holds a list of files will be introduced (FileList). This class will have a finalize method that deletes the files. The bags will no longer have finalize methods, and the bags will use FileList instead of ArrayList<File>.
> *Possible workaround for earlier releases:*
> Since the fix is going into 0.8, here is a workaround -
> Disabling the combiner will reduce the number of bags getting created, as there will not be the stage of combining intermediate merge results. But I would recommend disabling it only if you have this problem as it is likely to slow down the query .
> To disable combiner, set the property: -Dpig.exec.nocombiner=true

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Assigned: (PIG-1516) finalize in bag implementations causes pig to run out of memory in reduce

Posted by "Thejas M Nair (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-1516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Thejas M Nair reassigned PIG-1516:
----------------------------------

    Assignee: Thejas M Nair

> finalize in bag implementations causes pig to run out of memory in reduce 
> --------------------------------------------------------------------------
>
>                 Key: PIG-1516
>                 URL: https://issues.apache.org/jira/browse/PIG-1516
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.7.0
>            Reporter: Thejas M Nair
>            Assignee: Thejas M Nair
>             Fix For: 0.8.0
>
>
> *Problem:*
> pig bag implementations that are subclasses of DefaultAbstractBag, have finalize methods implemented. As a result, the garbage collector moves them to a finalization queue, and the memory used is freed only after the finalization happens on it.
> If the bags are not finalized fast enough, a lot of memory is consumed by the finalization queue, and pig runs out of memory. This can happen if large number of small bags are being created.
> *Solution:*
> The finalize function exists for the purpose of deleting the spill files that are created when the bag is too large. But if the bags are small enough, no spill files are created, and there is no use of the finalize function.
>  A new class that holds a list of files will be introduced (FileList). This class will have a finalize method that deletes the files. The bags will no longer have finalize methods, and the bags will use FileList instead of ArrayList<File>.
> *Possible workaround for earlier releases:*
> Since the fix is going into 0.8, here is a workaround -
> Disabling the combiner will reduce the number of bags getting created, as there will not be the stage of combining intermediate merge results. But I would recommend disabling it only if you have this problem as it is likely to slow down the query .
> To disable combiner, set the property: -Dpig.exec.nocombiner=true

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (PIG-1516) finalize in bag implementations causes pig to run out of memory in reduce

Posted by "Thejas M Nair (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-1516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Thejas M Nair updated PIG-1516:
-------------------------------

    Attachment: PIG-1516.2.patch

New patch with fix for findbugs warnings.
I have also run large queries that spill to disk to test the changes in handling of mSpillFiles. 


> finalize in bag implementations causes pig to run out of memory in reduce 
> --------------------------------------------------------------------------
>
>                 Key: PIG-1516
>                 URL: https://issues.apache.org/jira/browse/PIG-1516
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.7.0
>            Reporter: Thejas M Nair
>            Assignee: Thejas M Nair
>             Fix For: 0.8.0
>
>         Attachments: PIG-1516.2.patch, PIG-1516.patch
>
>
> *Problem:*
> pig bag implementations that are subclasses of DefaultAbstractBag, have finalize methods implemented. As a result, the garbage collector moves them to a finalization queue, and the memory used is freed only after the finalization happens on it.
> If the bags are not finalized fast enough, a lot of memory is consumed by the finalization queue, and pig runs out of memory. This can happen if large number of small bags are being created.
> *Solution:*
> The finalize function exists for the purpose of deleting the spill files that are created when the bag is too large. But if the bags are small enough, no spill files are created, and there is no use of the finalize function.
>  A new class that holds a list of files will be introduced (FileList). This class will have a finalize method that deletes the files. The bags will no longer have finalize methods, and the bags will use FileList instead of ArrayList<File>.
> *Possible workaround for earlier releases:*
> Since the fix is going into 0.8, here is a workaround -
> Disabling the combiner will reduce the number of bags getting created, as there will not be the stage of combining intermediate merge results. But I would recommend disabling it only if you have this problem as it is likely to slow down the query .
> To disable combiner, set the property: -Dpig.exec.nocombiner=true

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (PIG-1516) finalize in bag implementations causes pig to run out of memory in reduce

Posted by "Scott Carey (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-1516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12892423#action_12892423 ] 

Scott Carey commented on PIG-1516:
----------------------------------

You avoid finalize() by using a WeakReference. There is no situation that you can't substitute a weak reference for a finalizer, other than object resurrection which is a really bad idea.

finalize() should be avoided for any case that might create many objects.  Its OK to ask the GC to deal with a small number of objects that themselves don't hold many resources.  Its bad form to use finalize() in any case where throughput is high or the object is potentially large.  It will kill performance and  thrash GC.

Extend WeakReference and put the things you need to clean up in it as member variables. Those should also have a strong reference from the bag  the WeakReference should be strongly referenced from the Bag too.   When the Bag is GC'd the objects of interest will no longer have any reference to them other than the WeakReference, and the WeakReference will no longer be strongly referenced.  The WeakReference will be placed onto a Queue of your choosing, and you can then process the queue and access the data required to do any cleanup. 
 
Unlike a finalizer, the actual object is released when GC happens and does not linger.  Only the WeakReference and what it holds onto remains, and you get notified (via the queue) when the object is gone.  Therefore, you have control over your resources and do not rely on the JVM to run the finalizer. 

I have seen performance improvements of ~10x due to moving high volume finalizers to a weak reference queue implementation, along with significantly lower memory consumption.

> finalize in bag implementations causes pig to run out of memory in reduce 
> --------------------------------------------------------------------------
>
>                 Key: PIG-1516
>                 URL: https://issues.apache.org/jira/browse/PIG-1516
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.7.0
>            Reporter: Thejas M Nair
>            Assignee: Thejas M Nair
>             Fix For: 0.8.0
>
>
> *Problem:*
> pig bag implementations that are subclasses of DefaultAbstractBag, have finalize methods implemented. As a result, the garbage collector moves them to a finalization queue, and the memory used is freed only after the finalization happens on it.
> If the bags are not finalized fast enough, a lot of memory is consumed by the finalization queue, and pig runs out of memory. This can happen if large number of small bags are being created.
> *Solution:*
> The finalize function exists for the purpose of deleting the spill files that are created when the bag is too large. But if the bags are small enough, no spill files are created, and there is no use of the finalize function.
>  A new class that holds a list of files will be introduced (FileList). This class will have a finalize method that deletes the files. The bags will no longer have finalize methods, and the bags will use FileList instead of ArrayList<File>.
> *Possible workaround for earlier releases:*
> Since the fix is going into 0.8, here is a workaround -
> Disabling the combiner will reduce the number of bags getting created, as there will not be the stage of combining intermediate merge results. But I would recommend disabling it only if you have this problem as it is likely to slow down the query .
> To disable combiner, set the property: -Dpig.exec.nocombiner=true

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (PIG-1516) finalize in bag implementations causes pig to run out of memory in reduce

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-1516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12893589#action_12893589 ] 

Hadoop QA commented on PIG-1516:
--------------------------------

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12450778/PIG-1516.2.patch
  against trunk revision 980276.

    +1 @author.  The patch does not contain any @author tags.

    -1 tests included.  The patch doesn't appear to include any new or modified tests.
                        Please justify why no tests are needed for this patch.

    +1 javadoc.  The javadoc tool did not generate any warning messages.

    +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

    +1 findbugs.  The patch does not introduce any new Findbugs warnings.

    -1 release audit.  The applied patch generated 402 release audit warnings (more than the trunk's current 400 warnings).

    -1 core tests.  The patch failed core unit tests.

    -1 contrib tests.  The patch failed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/364/testReport/
Release audit warnings: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/364/artifact/trunk/patchprocess/releaseAuditDiffWarnings.txt
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/364/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/364/console

This message is automatically generated.

> finalize in bag implementations causes pig to run out of memory in reduce 
> --------------------------------------------------------------------------
>
>                 Key: PIG-1516
>                 URL: https://issues.apache.org/jira/browse/PIG-1516
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.7.0
>            Reporter: Thejas M Nair
>            Assignee: Thejas M Nair
>             Fix For: 0.8.0
>
>         Attachments: PIG-1516.2.patch, PIG-1516.patch
>
>
> *Problem:*
> pig bag implementations that are subclasses of DefaultAbstractBag, have finalize methods implemented. As a result, the garbage collector moves them to a finalization queue, and the memory used is freed only after the finalization happens on it.
> If the bags are not finalized fast enough, a lot of memory is consumed by the finalization queue, and pig runs out of memory. This can happen if large number of small bags are being created.
> *Solution:*
> The finalize function exists for the purpose of deleting the spill files that are created when the bag is too large. But if the bags are small enough, no spill files are created, and there is no use of the finalize function.
>  A new class that holds a list of files will be introduced (FileList). This class will have a finalize method that deletes the files. The bags will no longer have finalize methods, and the bags will use FileList instead of ArrayList<File>.
> *Possible workaround for earlier releases:*
> Since the fix is going into 0.8, here is a workaround -
> Disabling the combiner will reduce the number of bags getting created, as there will not be the stage of combining intermediate merge results. But I would recommend disabling it only if you have this problem as it is likely to slow down the query .
> To disable combiner, set the property: -Dpig.exec.nocombiner=true

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (PIG-1516) finalize in bag implementations causes pig to run out of memory in reduce

Posted by "Dmitriy V. Ryaboy (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-1516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12892375#action_12892375 ] 

Dmitriy V. Ryaboy commented on PIG-1516:
----------------------------------------

Another workaround for the meantime:

One can introduce a SmallBagFactory that inherits from BagFactory and produces SmallBags which implement DataBag() without a finalize, and does not implement the file spilling behavior.  SmallBagFactory would return SmallBags when bagFactory.newDefaultBag() is called. Then, provide the system properties pig.data.bag.factory.name and pig.data.bag.factory.jar in pig.properties to point to the new classes. 

Naturally, one has to be certain that databags won't need to spill to disk when doing this...


Ankur -- so what are you suggesting as a fix that avoids finalize? 

> finalize in bag implementations causes pig to run out of memory in reduce 
> --------------------------------------------------------------------------
>
>                 Key: PIG-1516
>                 URL: https://issues.apache.org/jira/browse/PIG-1516
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.7.0
>            Reporter: Thejas M Nair
>            Assignee: Thejas M Nair
>             Fix For: 0.8.0
>
>
> *Problem:*
> pig bag implementations that are subclasses of DefaultAbstractBag, have finalize methods implemented. As a result, the garbage collector moves them to a finalization queue, and the memory used is freed only after the finalization happens on it.
> If the bags are not finalized fast enough, a lot of memory is consumed by the finalization queue, and pig runs out of memory. This can happen if large number of small bags are being created.
> *Solution:*
> The finalize function exists for the purpose of deleting the spill files that are created when the bag is too large. But if the bags are small enough, no spill files are created, and there is no use of the finalize function.
>  A new class that holds a list of files will be introduced (FileList). This class will have a finalize method that deletes the files. The bags will no longer have finalize methods, and the bags will use FileList instead of ArrayList<File>.
> *Possible workaround for earlier releases:*
> Since the fix is going into 0.8, here is a workaround -
> Disabling the combiner will reduce the number of bags getting created, as there will not be the stage of combining intermediate merge results. But I would recommend disabling it only if you have this problem as it is likely to slow down the query .
> To disable combiner, set the property: -Dpig.exec.nocombiner=true

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.