Posted to dev@pig.apache.org by "Alan Gates (JIRA)" <ji...@apache.org> on 2008/01/03 18:45:34 UTC
[jira] Updated: (PIG-30) Get rid of DataBag and always use BigDataBag

     [ https://issues.apache.org/jira/browse/PIG-30?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alan Gates updated PIG-30:
--------------------------

    Attachment: bagrewrite.patch

The attached patch file contains a rewrite of DataBag in line with the proposal given in previous comments.  Highlights include:

    * DataBag has been entirely rewritten.  As part of this, the interface has been brought into line with the standard Java container interfaces (size() instead of cardinality(), and iterator() instead of content()).  cardinality() and content() have been kept for backward compatibility but are marked as deprecated.  DataBag has also become an abstract class, and the functionality to sort a bag and to apply distinct to it has been removed from DataBag itself.  This functionality is now provided by subclasses instead.

    * BigDataBag has been removed.  All data bags can now spill to disk when necessary.

    * DefaultDataBag, SortedDataBag, and DistinctDataBag have been added.  Each of these extends DataBag.

    * BagFactory has been entirely rewritten.  As part of this, its interface has changed in a way that is not backward compatible.  The caller must now specify up front which type of bag (default, sorted, distinct) is desired, and the appropriate type of bag will be provided.  In making these changes I assumed that users never call BagFactory directly, and thus that changing the interface won't break any UDFs.  If this assumption is wrong, please let me know.

    * A Spillable interface has been added.  A class that implements this interface can be asked by the system to spill its contents to disk.  DataBag implements Spillable.

    * SpillableMemoryManager has been added (courtesy of Ben).  This memory manager registers with the JVM to be called when the largest memory pool becomes more than 50% full.  It then goes through its list of Spillable objects and asks them to spill.
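
For readers who have not opened the patch, the spill mechanism in the last two items can be sketched roughly as follows.  This is an illustration only, not the code in bagrewrite.patch: the names SketchMemoryManager, registerSpillable() and spillAll() are hypothetical, and the choice to look only at heap pools and to hold the bags through weak references is an assumption.  The java.lang.management calls are the standard JVM ones.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryNotificationInfo;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryType;
import java.lang.ref.WeakReference;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import javax.management.Notification;
import javax.management.NotificationEmitter;
import javax.management.NotificationListener;

// Hypothetical stand-in for the Spillable interface described above.
interface Spillable {
    // Ask the object to write its in-memory contents to disk.
    void spill();
}

// Hypothetical sketch of a spill-on-low-memory manager (not the patch's class).
class SketchMemoryManager implements NotificationListener {
    // Weak references (an assumption) so registration does not keep bags alive.
    private final List<WeakReference<Spillable>> spillables =
            new ArrayList<WeakReference<Spillable>>();

    SketchMemoryManager() {
        // Find the largest heap pool that supports a usage threshold.
        MemoryPoolMXBean largest = null;
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if (pool.getType() != MemoryType.HEAP || !pool.isUsageThresholdSupported()) {
                continue;
            }
            long max = pool.getUsage().getMax();  // -1 when the maximum is undefined
            if (max > 0 && (largest == null || max > largest.getUsage().getMax())) {
                largest = pool;
            }
        }
        if (largest != null) {
            // Ask the JVM to notify us once the pool is more than 50% full.
            largest.setUsageThreshold(largest.getUsage().getMax() / 2);
            NotificationEmitter emitter =
                    (NotificationEmitter) ManagementFactory.getMemoryMXBean();
            emitter.addNotificationListener(this, null, null);
        }
    }

    // Add an object to the list that is asked to spill under memory pressure.
    synchronized void registerSpillable(Spillable s) {
        spillables.add(new WeakReference<Spillable>(s));
    }

    // Called by the JVM on memory notifications.
    public void handleNotification(Notification n, Object handback) {
        if (MemoryNotificationInfo.MEMORY_THRESHOLD_EXCEEDED.equals(n.getType())) {
            spillAll();
        }
    }

    // Ask every live registered object to spill; returns how many were asked.
    synchronized int spillAll() {
        int asked = 0;
        Iterator<WeakReference<Spillable>> it = spillables.iterator();
        while (it.hasNext()) {
            Spillable s = it.next().get();
            if (s == null) {
                it.remove();  // already garbage collected, drop the entry
            } else {
                s.spill();
                asked++;
            }
        }
        return asked;
    }
}
```

In this sketch a bag would call registerSpillable(this) when it is created and would write its in-memory tuples to a temporary file in spill().  How the real DataBag subclasses do this is defined by the patch, not by this example.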

> Get rid of DataBag and always use BigDataBag
> --------------------------------------------
>
>                 Key: PIG-30
>                 URL: https://issues.apache.org/jira/browse/PIG-30
>             Project: Pig
>          Issue Type: Bug
>          Components: data
>            Reporter: Benjamin Reed
>            Assignee: Alan Gates
>         Attachments: bagrewrite.patch
>
>
> We should never use DataBag directly; instead, we should always use BigDataBag. I think we already do this. The problem is that the logic in BigDataBag is hard to follow and it is made more complicated because it subclasses DataBag. We should merge these two classes together.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.