
Posted to dev@pig.apache.org by "Benjamin Reed (JIRA)" <ji...@apache.org> on 2007/11/27 16:48:43 UTC

[jira] Created: (PIG-30) Get rid of DataBag and always use BigDataBag

Get rid of DataBag and always use BigDataBag
--------------------------------------------

                 Key: PIG-30
                 URL: https://issues.apache.org/jira/browse/PIG-30
             Project: Pig
          Issue Type: Bug
          Components: data
            Reporter: Benjamin Reed
            Priority: Minor


We should never use DataBag directly; instead, we should always use BigDataBag. I think we already do this. The problem is that the logic in BigDataBag is hard to follow and it is made more complicated because it subclasses DataBag. We should merge these two classes together.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (PIG-30) Get rid of DataBag and always use BigDataBag

Posted by "Olga Natkovich (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PIG-30?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12556105#action_12556105 ] 

Olga Natkovich commented on PIG-30:
-----------------------------------

+1


[jira] Assigned: (PIG-30) Get rid of DataBag and always use BigDataBag

Posted by "Alan Gates (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/PIG-30?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alan Gates reassigned PIG-30:
-----------------------------

    Assignee: Alan Gates


[jira] Updated: (PIG-30) Get rid of DataBag and always use BigDataBag

Posted by "Alan Gates (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/PIG-30?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alan Gates updated PIG-30:
--------------------------

    Attachment: bagrewrite.patch

The attached patch file contains a rewrite of DataBag in line with the proposal given in previous comments.  Highlights include:

    * DataBag has been entirely rewritten.  As part of this, the interface has been brought into line with the standard Java container interfaces (size() instead of cardinality() and iterator() instead of content()).  cardinality() and content() have been kept for backward compatibility but are marked as deprecated.  DataBag has also become an abstract class, and the functionality to sort a bag and to apply distinct to it has been removed from it.  That functionality is now provided by subclasses instead.

    * BigDataBag has been removed.  All data bags can now spill to disk when necessary.

    * DefaultDataBag, SortedDataBag, and DistinctDataBag have been added.  Each of these extends DataBag.

    * BagFactory has been entirely rewritten.  As part of this its interface has been changed in a non-backward compatible way.  Now the caller must specify up front what type of bag (default, sorted, distinct) is desired, and the appropriate type of bag will be provided.  In making these changes I assumed that users never directly call BagFactory, and thus changing the interface won't break any UDFs.  If this assumption is wrong, please let me know.

    * Spillable interface has been added.  This interface says that an implementing class can be asked by the system to spill its contents to the disk.  DataBag implements Spillable.

    * SpillableMemoryManager has been added (courtesy of Ben).  This memory manager registers with the JVM to be called when the largest memory pool becomes more than 50% full.  It then goes through its list of Spillable objects and asks them to spill.
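
As a rough illustration of the Spillable / SpillableMemoryManager arrangement described above, here is a minimal sketch built on the JDK's java.lang.management API.  It is not the code in bagrewrite.patch: the class name SpillManagerSketch, its register(), spillAll(), and registerWithJvm() methods, and the use of weak references are all assumptions made for the example.  Only the two Spillable method names come from the proposal in this thread.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryNotificationInfo;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryType;
import java.lang.ref.WeakReference;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import javax.management.Notification;
import javax.management.NotificationEmitter;
import javax.management.NotificationListener;

// Hypothetical sketch; not the SpillableMemoryManager from the patch.
class SpillManagerSketch implements NotificationListener {

    /** Stand-in for the Spillable interface proposed in this thread. */
    interface Spillable {
        void spill();
        long getMemorySize();
    }

    // Weak references, so registering a bag does not keep it alive.
    private final List<WeakReference<Spillable>> spillables = new ArrayList<>();

    synchronized void register(Spillable s) {
        spillables.add(new WeakReference<>(s));
    }

    /** Asks every live registered object to spill; returns how many were asked. */
    synchronized int spillAll() {
        int asked = 0;
        Iterator<WeakReference<Spillable>> it = spillables.iterator();
        while (it.hasNext()) {
            Spillable s = it.next().get();
            if (s == null) {
                it.remove();          // referent was already garbage collected
            } else {
                s.spill();
                asked++;
            }
        }
        return asked;
    }

    /** Sets a 50% usage threshold on the largest heap pool and listens for it. */
    void registerWithJvm() {
        MemoryPoolMXBean biggest = null;
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if (pool.getType() == MemoryType.HEAP && pool.isUsageThresholdSupported()
                    && (biggest == null
                        || pool.getUsage().getMax() > biggest.getUsage().getMax())) {
                biggest = pool;
            }
        }
        if (biggest == null || biggest.getUsage().getMax() <= 0) {
            return;                   // no suitable pool, or its maximum is undefined
        }
        biggest.setUsageThreshold(biggest.getUsage().getMax() / 2);
        NotificationEmitter emitter =
                (NotificationEmitter) ManagementFactory.getMemoryMXBean();
        emitter.addNotificationListener(this, null, null);
    }

    @Override
    public void handleNotification(Notification n, Object handback) {
        if (MemoryNotificationInfo.MEMORY_THRESHOLD_EXCEEDED.equals(n.getType())) {
            spillAll();
        }
    }
}
```

In this sketch a bag would call register(this) when it is created, and the JVM would invoke handleNotification() once the chosen pool crosses the threshold.  Nothing is checked on add(), which is the point of the design.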


[jira] Commented: (PIG-30) Get rid of DataBag and always use BigDataBag

Posted by "Olga Natkovich (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PIG-30?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12551056 ] 

Olga Natkovich commented on PIG-30:
-----------------------------------

A couple of other issues I observed with BigDataBag:

- Should check memory availability periodically, not on every add
- Try to buffer in memory first. Currently we always write to disk after the first spill



[jira] Commented: (PIG-30) Get rid of DataBag and always use BigDataBag

Posted by "Alan Gates (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PIG-30?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12557457#action_12557457 ] 

Alan Gates commented on PIG-30:
-------------------------------

Some performance numbers based on the code before and after these changes.  I tested default bags (that is, no sorting, no distinct), distinct bags, and sorted bags.  Each test was run on the code pre- and post-patch.  Each test was run on data with 100k rows, 1m rows, and 5m rows.

Default:

pig script:

a = load './studenttab5m';
b = group a all;
c = foreach b generate group, COUNT(a.$0);
dump c;

Results:

rows    pre-patch    post-patch
100k       13.539        15.489
1m         43.002        48.191
5m        111.158       117.112

Notes:  I'm assuming the slight slowdown here is due to the introduction of locking into add() and next() in the data bags.

Distinct:

pig script:

a = load './studenttab10m';
b = group a all;
c = foreach b { c1 = distinct $1; generate group, COUNT(c1); }
dump c;

Results:

rows    pre-patch    post-patch
100k       14.927        14.134
1m         83.190        52.320
5m        744.834       216.043

Notes:  Data had about 90% distinct values, so 100k had about 90k distinct rows, etc.

Sorted:

pig script:

a = load './studenttab5m';
b = group a all;
c = foreach b { c1 = order $1 by $0; generate group, COUNT(c1); }
dump c;

Results:

rows    pre-patch    post-patch
100k       16.964        12.895
1m         51.351        51.598
5m        236.669       225.688


Re: [jira] Commented: (PIG-30) Get rid of DataBag and always use BigDataBag

Posted by Utkarsh Srivastava <ut...@yahoo-inc.com>.

Ok, all sounds good.

On Jan 7, 2008, at 8:30 AM, Alan Gates (JIRA) wrote:

>
>     [ https://issues.apache.org/jira/browse/PIG-30?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12556616#action_12556616 ]
>
> Alan Gates commented on PIG-30:
> -------------------------------
>
> Responses to Utkarsh's comments:
>
> 0.  TreeSet.add() only adds an element if it is not already present
> (see http://java.sun.com/j2se/1.5.0/docs/api/java/util/TreeSet.html#add(E)).
> This guarantees that the element already in the tree will not be
> obliterated.  That's why if that call returns false, the code goes back
> and rereads from the file it read the last element from.  This
> guarantees that we read from that file until either the file is empty
> or we find a new unique element to put in the TreeSet.
>
> 1.  Good catch, I'll add a hashCode() implementation for DataBag.
>
> 2.  They aren't quite as combinable as they first appear.  The code in
> next() is identical, and could be combined.
> DistinctDataBag.readFromTree() and SortedDataBag.readFromPriorityQ()
> create different containers and access them differently.  I could put
> just the create and access methods in each and combine the rest of the
> logic.  The addToQueue() functions in each are different and have
> different logic about how to add an element to the queue.  I can work
> on this, but it may be a bit before I get to it.

[jira] Commented: (PIG-30) Get rid of DataBag and always use BigDataBag

Posted by "Alan Gates (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PIG-30?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12556616#action_12556616 ] 

Alan Gates commented on PIG-30:
-------------------------------

Responses to Utkarsh's comments:

0.  TreeSet.add() only adds an element if it is not already present (see http://java.sun.com/j2se/1.5.0/docs/api/java/util/TreeSet.html#add(E)).  This guarantees that the element already in the tree will not be obliterated.  That's why if that call returns false, the code goes back and rereads from the file it read the last element from.  This guarantees that we read from that file until either the file is empty or we find a new unique element to put in the TreeSet.

1.  Good catch, I'll add a hashCode() implementation for DataBag.

2.  They aren't quite as combinable as they first appear.  The code in next() is identical, and could be combined.  DistinctDataBag.readFromTree() and SortedDataBag.readFromPriorityQ() create different containers and access them differently.  I could put just the create and access methods in each and combine the rest of the logic.  The addToQueue() functions in each are different and have different logic about how to add an element to the queue.   I can work on this, but it may be a bit before I get to it.
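
The behaviour described in point 0 can be sketched as follows.  This is a simplified, hypothetical stand-in for the merge in DistinctDataBag, not the patch itself: in-memory lists play the part of spill files, strings play the part of tuples, and the names DistinctMergeSketch, Entry, refill(), and mergeDistinct() are invented for the example.  It assumes each run is already sorted and contains no duplicates within itself.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.TreeSet;

// Hypothetical sketch of a distinct merge over sorted runs.
class DistinctMergeSketch {

    /** A value plus the index of the run it came from; ordered by value only. */
    static final class Entry implements Comparable<Entry> {
        final String value;
        final int source;

        Entry(String value, int source) {
            this.value = value;
            this.source = source;
        }

        @Override
        public int compareTo(Entry other) {
            return value.compareTo(other.value);
        }
    }

    /**
     * Reads from one run until it contributes a value that is not already
     * in the tree, or until the run is exhausted.
     */
    static void refill(TreeSet<Entry> heads, List<Iterator<String>> runs, int source) {
        Iterator<String> run = runs.get(source);
        while (run.hasNext()) {
            if (heads.add(new Entry(run.next(), source))) {
                return;               // a new unique element went into the tree
            }
            // add() returned false: an equal element is already present and
            // was left untouched, so keep reading from the same run.
        }
    }

    static List<String> mergeDistinct(List<List<String>> sortedRuns) {
        List<Iterator<String>> runs = new ArrayList<>();
        for (List<String> run : sortedRuns) {
            runs.add(run.iterator());
        }
        TreeSet<Entry> heads = new TreeSet<>();
        for (int i = 0; i < runs.size(); i++) {
            refill(heads, runs, i);
        }
        List<String> out = new ArrayList<>();
        while (!heads.isEmpty()) {
            Entry smallest = heads.pollFirst();
            out.add(smallest.value);
            refill(heads, runs, smallest.source);
        }
        return out;
    }
}
```

Because a rejected element causes another read from the same run, no run is dropped from the merge while it still has unread elements.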


[jira] Resolved: (PIG-30) Get rid of DataBag and always use BigDataBag

Posted by "Alan Gates (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/PIG-30?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alan Gates resolved PIG-30.
---------------------------

    Resolution: Fixed

Fix checked in, revision 609048


[jira] Commented: (PIG-30) Get rid of DataBag and always use BigDataBag

Posted by "Olga Natkovich (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PIG-30?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12556704#action_12556704 ] 

Olga Natkovich commented on PIG-30:
-----------------------------------

+1


[jira] Updated: (PIG-30) Get rid of DataBag and always use BigDataBag

Posted by "Alan Gates (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/PIG-30?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alan Gates updated PIG-30:
--------------------------

    Attachment: addhashcode.patch

In response to Utkarsh's comments, added more comments to the code and dealt with the hash code issue.  I also fixed an issue in DataBag.compareTo that I found as a result of thinking about the need for overriding hashCode().  compareTo() wasn't properly handling equals, which would have meant errors in distinct data bags.
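
The thread does not say exactly what was wrong in DataBag.compareTo(), so the following is only a hypothetical illustration of that class of bug.  It shows why a comparison that never reports equality breaks a TreeSet-based distinct.  Lists of strings stand in for bags, and CompareToSketch, BROKEN, FIXED, and distinctCount() are names invented for the example.

```java
import java.util.Comparator;
import java.util.List;
import java.util.TreeSet;

// Hypothetical illustration; not the comparison code from the patch.
class CompareToSketch {

    /** Never returns 0, so equal inputs look different to a TreeSet. */
    static final Comparator<List<String>> BROKEN =
            (a, b) -> a.size() < b.size() ? -1 : 1;

    /** Orders by size, then element by element; returns 0 for equal lists. */
    static final Comparator<List<String>> FIXED = (a, b) -> {
        if (a.size() != b.size()) {
            return Integer.compare(a.size(), b.size());
        }
        for (int i = 0; i < a.size(); i++) {
            int c = a.get(i).compareTo(b.get(i));
            if (c != 0) {
                return c;
            }
        }
        return 0;
    };

    /** Counts the elements a TreeSet keeps under the given comparator. */
    static int distinctCount(Comparator<List<String>> cmp, List<List<String>> items) {
        TreeSet<List<String>> set = new TreeSet<>(cmp);
        set.addAll(items);
        return set.size();
    }
}
```

A TreeSet decides whether an element is already present purely from the comparison result, so with BROKEN two identical lists are both kept, while FIXED collapses them to one.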


[jira] Commented: (PIG-30) Get rid of DataBag and always use BigDataBag

Posted by "Benjamin Reed (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PIG-30?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12551085 ] 

Benjamin Reed commented on PIG-30:
----------------------------------

With the Spillable interface, memory is never checked on add. Instead, a memory manager will explicitly call spill() on the data bag when it determines that memory is low.


[jira] Commented: (PIG-30) Get rid of DataBag and always use BigDataBag

Posted by "Utkarsh Srivastava (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PIG-30?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12556172#action_12556172 ] 

Utkarsh Srivastava commented on PIG-30:
---------------------------------------

Great job! This was a fairly large chunk of work.

It would be nice to have a few more comments. Specifically, one point that is currently only implicit is that bag behavior is undefined if you add() to a DataBag after opening an iterator(). Alan and I talked about this.

Other issues:

0. A TreeSet is used in DistinctDataBag while merging files, but TContainer compares only on tuple equality. Once you add a tuple equal to one already in the TreeSet but from another input, one of the inputs will get eliminated from the TreeSet and never be read again. Am I missing something?

1. HashSet<> in DistinctDataBag. For a hash set to work properly we need the hashCode() methods to work properly. Since Tuple.hashCode() calls hashCode() on all its fields, all Datums should have a hash code. DataBag doesn't have one, which implies that DistinctDataBag won't work with nested data.

2. The spill() code in DistinctDataBag and SortedDataBag is the same, except that the former always uses the default comparator whereas SortedDataBag might use a specified comparator. Can we reuse code instead of duplicating it?
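
To make issue 1 concrete, here is a small hypothetical sketch.  NoHashDatum and HashedDatum are invented stand-ins for a datum type, not Pig classes.  The first overrides equals() only, the second overrides both equals() and hashCode().

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration of why HashSet needs hashCode().
class HashCodeSketch {

    /** equals() is overridden but hashCode() is inherited from Object. */
    static final class NoHashDatum {
        final List<String> fields;

        NoHashDatum(String... fields) {
            this.fields = Arrays.asList(fields);
        }

        @Override
        public boolean equals(Object o) {
            return o instanceof NoHashDatum && fields.equals(((NoHashDatum) o).fields);
        }
    }

    /** equals() and hashCode() agree, as the Object contract requires. */
    static final class HashedDatum {
        final List<String> fields;

        HashedDatum(String... fields) {
            this.fields = Arrays.asList(fields);
        }

        @Override
        public boolean equals(Object o) {
            return o instanceof HashedDatum && fields.equals(((HashedDatum) o).fields);
        }

        @Override
        public int hashCode() {
            return fields.hashCode();
        }
    }

    /** Counts how many of the given items a HashSet keeps. */
    static int distinctCount(Object... items) {
        Set<Object> set = new HashSet<>(Arrays.asList(items));
        return set.size();
    }
}
```

Two equal HashedDatum values always collapse to one entry.  Two equal NoHashDatum values will normally both be kept, because their inherited identity hash codes almost always differ and equals() is then never consulted.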




[jira] Commented: (PIG-30) Get rid of DataBag and always use BigDataBag

Posted by "Benjamin Reed (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PIG-30?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12555628#action_12555628 ] 

Benjamin Reed commented on PIG-30:
----------------------------------

Excellent job Alan! That was a lot of work! Just a couple of small comments:

*  Do we need to expose DefaultDataBag, SortedDataBag, and DistinctDataBag? We don't want people constructing them directly, right? Maybe we should make them package-private.

* One reason to expose SortedDataBag would be to get the sort spec. Do we want to expose that?


[jira] Updated: (PIG-30) Get rid of DataBag and always use BigDataBag

Posted by "Alan Gates (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/PIG-30?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alan Gates updated PIG-30:
--------------------------

    Priority: Major  (was: Minor)

Based on bugs and complaints we are seeing from users, problems in the data bag implementation are causing a number of different issues.  I propose fixing several issues:

1) Whether or not a bag needs to be sorted or distinct is known at bag creation time.  However, we always create the bag the same way and only sort or apply distinct to the bag either when it is time to store it to disk or read from it.  It will be more efficient to subclass bag into three separate types (default, sorted, and distinct) and modify the bag factory to allow callers to create the correct type of bag up front.  Each type can then optimize its memory and disk storage.

2) The algorithm bags use to determine when to dump data to disk is not adequate.  This will be addressed in the bags by making use of the changes being done to fix http://issues.apache.org/jira/browse/PIG-40

3) When merging back files from disk, the merge algorithm does not open enough files.  Performance testing done by the hadoop team found that 100 files was about an optimal number.  We currently use 25.  As part of this fix we need to do our own performance testing and assure ourselves that 100 is at or near that inflection point for us as well.

4) During the merge phase, when tuples are read off of disk and placed in a HeapEntry, a new HeapEntry is created for each tuple.  A large number of object creations could be saved by pooling these HeapEntry objects and reusing them.  Also, HeapEntry contains a reference to an Iterator<Tuple>.  This does not appear to be used and should be removable.
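
One way to avoid the per-tuple allocation described in item 4 is sketched below.  Note that this is a simpler variant than a general object pool: it keeps exactly one HeapEntry per input run and puts the same object back into the heap after refilling it.  The code is a hypothetical illustration, not the existing merge: lists of strings stand in for spill files of tuples, and MergeSketch, fileNum, and merge() are names invented here.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical sketch of a k-way merge that reuses its heap entries.
class MergeSketch {

    /** Holds the current head of one run; mutated only while outside the heap. */
    static final class HeapEntry implements Comparable<HeapEntry> {
        String tuple;
        final int fileNum;

        HeapEntry(int fileNum) {
            this.fileNum = fileNum;
        }

        @Override
        public int compareTo(HeapEntry other) {
            return tuple.compareTo(other.tuple);
        }
    }

    static List<String> merge(List<List<String>> sortedRuns) {
        List<Iterator<String>> runs = new ArrayList<>();
        PriorityQueue<HeapEntry> heap = new PriorityQueue<>();
        for (int i = 0; i < sortedRuns.size(); i++) {
            Iterator<String> run = sortedRuns.get(i).iterator();
            runs.add(run);
            if (run.hasNext()) {
                HeapEntry entry = new HeapEntry(i);   // one allocation per run
                entry.tuple = run.next();
                heap.add(entry);
            }
        }
        List<String> out = new ArrayList<>();
        while (!heap.isEmpty()) {
            HeapEntry entry = heap.poll();
            out.add(entry.tuple);
            Iterator<String> run = runs.get(entry.fileNum);
            if (run.hasNext()) {
                entry.tuple = run.next();             // refill the same object
                heap.add(entry);                      // and put it back
            }
        }
        return out;
    }
}
```

With this arrangement the number of HeapEntry objects is bounded by the number of files being merged rather than by the number of tuples, and the entry needs only a file index rather than an iterator reference.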

To address these changes, BagFactory, BigDataBag, and DataBag will be significantly reworked.  BigDataBag will go away, with the understanding that all bags can spill to disk as necessary.  DataBag will become an abstract class.  Three new classes will be introduced:  DefaultDataBag, SortedDataBag, and DistinctDataBag, all of which will extend DataBag.

For the memory management changes related to PIG-40, it is assumed that something like the following interface will be introduced:

interface Spillable {
	/**
	 * Instructs an object to spill whatever it can to disk and release
	 * references to any data structures it spills.
	 */
	void spill();

	/**
	 * Requests that an object return an estimate of its in memory size.
	 * @return estimated in-memory size.
	 */
	long getMemorySize();
}

BagFactory's interface will change to be:

public class BagFactory {
	private static BagFactory self;

	/**
	 * Get a reference to the singleton factory.
	 */
	public static BagFactory getFactory();

	/**
	 * Get a default (unordered, not distinct) data bag.
	 */
	public DataBag newDefaultBag();

	/**
	 * Get a sorted data bag.
	 * @param spec EvalSpec that controls how the data is sorted.
	 * If null, default comparator will be used.
	 */
	public DataBag newSortedBag(EvalSpec spec);

	/**
	 * Get a distinct data bag.
	 */
	public DataBag newDistinctBag();
}

DataBag's interface will be:

public abstract class DataBag implements Writable, Spillable {
	// Container that holds the tuples.  Actual object instantiated by subclasses.
	protected Collection<Tuple> contents;

	/**
	 * Get the number of elements in the bag, both in memory and on disk.
	 */
	public abstract long size();

	/**
	 * Find out if the bag is sorted.
	 */
	public abstract boolean isSorted();

	/**
	 * Find out if the bag is distinct.
	 */
	public abstract boolean isDistinct();

	/**
	 * Get an iterator to the bag.  For default and distinct bags,
	 * no particular order is guaranteed.  For sorted bags the order
	 * is guaranteed to be sorted according
	 * to the provided comparator.
	 */
	public abstract Iterator<Tuple> content();

	/**
	 * Add a tuple to the bag.
	 * @param t tuple to add.
	 */
	public void add(Tuple t);

	/**
	 * Add contents of a bag to the bag.
	 * @param b bag to add contents of.
	 */
	public void addAll(DataBag b);

	// Do I need remove, I couldn't find it used anywhere.

	/**
	 * Return the size of memory usage.
	 */
	public long getMemorySize();

	/**
	 * Write a bag's contents to disk.  This won't change significantly
	 * from the current implementation, except that it will need to record
	 * the type of bag being written.
	 * @param out DataOutput to write data to.
	 * @throws IOException (passes it on from underlying calls).
	 */
	public void write(DataOutput out) throws IOException;

	/**
	 * Read a bag from disk.  This won't change significantly from
	 * the current implementation, except that it will need to read the
	 * bag type and use the BagFactory to create the correct type of bag.
	 * @param in DataInput to read data from.
	 * @throws IOException (passes it on from underlying calls).
	 */
	static DataBag(DataInput in) throws IOException;

	// The old databag had a markStale() call here, but it's a NOP.
	// Does it need to be preserved?

	/**
	 * Write the bag into a string.  This will not change significantly
	 * from the current implementation.
	 */
	@Override
	public String toString();
}

public class DefaultDataBag extends DataBag {
	// Will set contents to be an ArrayList.

	// A custom iterator to handle getting data from memory and/or disk
	private class DefaultDataBagIterator implements Iterator<Tuple> { ... }

	// See above for comments on these
	public long size();
	public boolean isSorted();
	public boolean isDistinct();
	public Iterator<Tuple> content();

	/**
	 * Spill contents to disk.
	 */
	public void spill();
}

public class SortedDataBag extends DataBag {
	// Will set contents to be a PriorityQueue.  Experimentation found it
	// to be faster to store this in a PriorityQueue up front rather than
	// store it in a List and then call Collections.sort() on it.

	// A custom iterator to handle getting data from memory and/or disk
	private class SortedDataBagIterator implements Iterator<Tuple> { ... }

	// See above for comments on these
	public long size();
	public boolean isSorted();
	public boolean isDistinct();
	public Iterator<Tuple> content();

	/**
	 * Spill contents to disk.
	 */
	public void spill();
}

public class DistinctDataBag extends DataBag {
	// Will set contents to be a HashSet.  A little experimentation
	// found that it was significantly faster to store distinct
	// values in a hash set and sort them before the spill rather
	// than store them in a TreeSet so that no sort is needed at spill
	// time.  This is also good because if the bag never spills we don't
	// waste time sorting it.

	// A custom iterator to handle getting data from memory and/or disk
	private class DistinctDataBagIterator implements Iterator<Tuple> { ... }

	// See above for comments on these
	public long size();
	public boolean isSorted();
	public boolean isDistinct();
	public Iterator<Tuple> content();

	/**
	 * Spill contents to disk.
	 */
	public void spill();
}


A getMemorySize() will need to be added to each of the other data types to allow the bag to make a guess at its memory usage.
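
To show how the pieces of the outline above could fit together, here is a compact, runnable sketch.  It is a simplification under stated assumptions and not the proposed implementation: it is in-memory only (no spilling, no Writable), uses String in place of Tuple and a Comparator in place of EvalSpec, and folds the factory methods into one hypothetical class, BagSketch.  What it keeps from the proposal is the idea that the caller picks the bag type up front and each type chooses its own container (ArrayList, PriorityQueue, or HashSet).

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.Comparator;
import java.util.HashSet;
import java.util.Iterator;
import java.util.NoSuchElementException;
import java.util.PriorityQueue;

// Hypothetical, in-memory-only stand-in for the proposed DataBag hierarchy.
abstract class BagSketch implements Iterable<String> {
    // Container that holds the tuples; chosen by each bag type.
    protected final Collection<String> contents;

    protected BagSketch(Collection<String> contents) {
        this.contents = contents;
    }

    public void add(String t) { contents.add(t); }
    public long size() { return contents.size(); }
    public abstract boolean isSorted();
    public abstract boolean isDistinct();

    @Override
    public Iterator<String> iterator() { return contents.iterator(); }

    /** Unordered, duplicates kept. */
    static BagSketch newDefaultBag() {
        return new BagSketch(new ArrayList<String>()) {
            public boolean isSorted() { return false; }
            public boolean isDistinct() { return false; }
        };
    }

    /** Unordered, duplicates dropped on add. */
    static BagSketch newDistinctBag() {
        return new BagSketch(new HashSet<String>()) {
            public boolean isSorted() { return false; }
            public boolean isDistinct() { return true; }
        };
    }

    /** Sorted by cmp, or by natural order when cmp is null. */
    static BagSketch newSortedBag(final Comparator<String> cmp) {
        final PriorityQueue<String> queue = new PriorityQueue<String>(11, cmp);
        return new BagSketch(queue) {
            public boolean isSorted() { return true; }
            public boolean isDistinct() { return false; }

            @Override
            public Iterator<String> iterator() {
                // A PriorityQueue's own iterator is not ordered, so drain a copy.
                final PriorityQueue<String> copy = new PriorityQueue<String>(queue);
                return new Iterator<String>() {
                    public boolean hasNext() { return !copy.isEmpty(); }
                    public String next() {
                        if (copy.isEmpty()) {
                            throw new NoSuchElementException();
                        }
                        return copy.poll();
                    }
                };
            }
        };
    }
}
```

A caller that knows it needs distinct semantics would ask for newDistinctBag() and never pay for a sort, which is the efficiency argument made in item 1 of the proposal.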


