Posted to dev@pig.apache.org by "Alan Gates (JIRA)" <ji...@apache.org> on 2007/12/12 18:16:43 UTC
[jira] Updated: (PIG-30) Get rid of DataBag and always use BigDataBag

     [ https://issues.apache.org/jira/browse/PIG-30?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alan Gates updated PIG-30:
--------------------------

    Priority: Major  (was: Minor)

Bugs and complaints from users show that problems in the data bag implementation are causing a number of different issues.  I propose fixing the
following:
1) Whether a bag needs to be sorted or distinct is known at bag creation time.  However, we always create the bag the same way and only sort or apply distinct
to the bag when it is time to store it to disk or read from it.  It will be more efficient to subclass bag into three separate types (default, sorted, and
distinct) and modify the bag factory to allow callers to create the correct type of bag up front.  Each type can then optimize its own memory and disk storage.

2) The algorithm bags use to determine when to dump data to disk is not adequate.  This will be addressed in the bags by making use of the changes being done to fix
http://issues.apache.org/jira/browse/PIG-40

3) When merging back files from disk, the merge algorithm does not open enough files.  Performance testing done by the hadoop team found that 100 files was about an
optimal number.  We currently use 25.  As part of this fix we need to do our own performance testing and assure ourselves that 100 is at or near that inflection
point for us as well.

4) During the merge phase, when tuples are read off of disk, then placed in a HeapEntry, a new HeapEntry is created for each tuple.  A large number of object
creations could be saved by pooling these HeapEntry objects and reusing them.  Also, HeapEntry contains a reference to an Iterator<Tuple>.  This does not appear to
be used and should be removable.
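
Items 3) and 4) can be illustrated together.  The following is only a sketch under stated assumptions, not the existing Pig merge code: Strings stand in for Tuples, in-memory lists stand in for spill files, and the names MergeSketch, MAX_OPEN_RUNS, merge, and mergeOnce are hypothetical.  It merges at most a fixed number of sorted runs per pass and allocates one HeapEntry per open run, refilling that entry in place instead of creating a new one for each tuple.  The entry holds only a source index, not an iterator.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

/** Hypothetical sketch: bounded fan-in merge that reuses its heap entries. */
class MergeSketch {
    // Assumed fan-in.  The proposal suggests 100, pending our own performance tests.
    static final int MAX_OPEN_RUNS = 100;

    /** One entry per open run; refilled in place rather than reallocated. */
    static final class HeapEntry implements Comparable<HeapEntry> {
        String tuple;       // stand-in for a Pig Tuple
        final int source;   // index of the run this entry reads from

        HeapEntry(int source) { this.source = source; }

        @Override
        public int compareTo(HeapEntry other) { return tuple.compareTo(other.tuple); }
    }

    /** Merges already-sorted runs, opening at most maxOpen of them per pass. */
    static List<String> merge(List<List<String>> runs, int maxOpen) {
        if (maxOpen < 2) throw new IllegalArgumentException("maxOpen must be at least 2");
        List<List<String>> pending = new ArrayList<>(runs);
        while (pending.size() > maxOpen) {
            // Too many runs for one pass: fold the first batch into one longer run.
            List<List<String>> batch = new ArrayList<>(pending.subList(0, maxOpen));
            pending.subList(0, maxOpen).clear();
            pending.add(mergeOnce(batch));
        }
        return mergeOnce(pending);
    }

    /** Single-pass k-way merge over all the given runs. */
    static List<String> mergeOnce(List<List<String>> runs) {
        List<Iterator<String>> readers = new ArrayList<>();
        PriorityQueue<HeapEntry> heap = new PriorityQueue<>();
        for (int i = 0; i < runs.size(); i++) {
            Iterator<String> reader = runs.get(i).iterator();
            readers.add(reader);
            if (reader.hasNext()) {
                HeapEntry entry = new HeapEntry(i);   // allocated once per run
                entry.tuple = reader.next();
                heap.add(entry);
            }
        }
        List<String> out = new ArrayList<>();
        while (!heap.isEmpty()) {
            HeapEntry entry = heap.poll();
            out.add(entry.tuple);
            Iterator<String> reader = readers.get(entry.source);
            if (reader.hasNext()) {
                entry.tuple = reader.next();   // refill the same object
                heap.add(entry);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<List<String>> runs = Arrays.asList(
                Arrays.asList("a", "d"), Arrays.asList("b", "e"), Arrays.asList("c", "f"));
        System.out.println(merge(runs, 2));   // prints [a, b, c, d, e, f]
    }
}
```

Refilling the polled entry is safe because the entry is outside the heap while its tuple changes.  Whether an explicit object pool would still be needed on top of this is something the performance testing would have to show.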

To address these issues, BagFactory, BigDataBag, and DataBag will be significantly reworked.  BigDataBag will go away, with the understanding that all bags can spill
to disk as necessary.  DataBag will become an abstract class.  Three new classes will be introduced:  DefaultDataBag, SortedDataBag, and DistinctDataBag, all of
which will extend DataBag.

For the memory management changes related to PIG-40, it is assumed that something like the following interface will be introduced:

interface Spillable {
	/**
	 * Instructs an object to spill whatever it can to disk and release
	 * references to any data structures it spills.
	 */
	void spill();

	/**
	 * Requests that an object return an estimate of its in memory size.
	 * @return estimated in memory size.
	 */
	long getMemorySize();
}
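
As a rough illustration of how such an interface might be driven, here is a minimal sketch.  It is not the PIG-40 design: SpillRegistry, freeMemory, InMemoryBuffer, and the 16-bytes-per-item estimate are all hypothetical names and numbers chosen for the example.  The registry holds weak references so that registration alone does not keep a bag alive, and it asks the largest objects to spill first.

```java
import java.lang.ref.WeakReference;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

interface Spillable {
    void spill();
    long getMemorySize();
}

/** Hypothetical registry that asks registered objects to spill, largest first. */
class SpillRegistry {
    private final List<WeakReference<Spillable>> registered = new ArrayList<>();

    synchronized void register(Spillable s) {
        registered.add(new WeakReference<>(s));
    }

    /** Spills until the estimated freed size reaches bytesNeeded; returns that estimate. */
    synchronized long freeMemory(long bytesNeeded) {
        List<Spillable> live = new ArrayList<>();
        for (Iterator<WeakReference<Spillable>> it = registered.iterator(); it.hasNext();) {
            Spillable s = it.next().get();
            if (s == null) {
                it.remove();          // referent was garbage collected
            } else {
                live.add(s);
            }
        }
        // Largest estimated size first.
        live.sort((a, b) -> Long.compare(b.getMemorySize(), a.getMemorySize()));
        long freed = 0;
        for (Spillable s : live) {
            if (freed >= bytesNeeded) break;
            long before = s.getMemorySize();
            s.spill();
            freed += before - s.getMemorySize();
        }
        return freed;
    }
}

/** Toy Spillable for the example; a real bag would write to disk before clearing. */
class InMemoryBuffer implements Spillable {
    private static final long ASSUMED_BYTES_PER_ITEM = 16;   // crude guess, illustration only
    private final List<String> items = new ArrayList<>();

    void add(String item) { items.add(item); }

    @Override
    public void spill() { items.clear(); }

    @Override
    public long getMemorySize() { return items.size() * ASSUMED_BYTES_PER_ITEM; }
}

class SpillDemo {
    public static void main(String[] args) {
        SpillRegistry registry = new SpillRegistry();
        InMemoryBuffer small = new InMemoryBuffer();
        InMemoryBuffer large = new InMemoryBuffer();
        for (int i = 0; i < 3; i++) small.add("s" + i);
        for (int i = 0; i < 10; i++) large.add("l" + i);
        registry.register(small);
        registry.register(large);
        System.out.println(registry.freeMemory(100));   // prints 160: only the large buffer spills
    }
}
```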

BagFactory's interface will change to be:

public class BagFactory {
	private static BagFactory self;

	/**
	 * Get a reference to the singleton factory.
	 */
	public static BagFactory getFactory();

	/**
	 * Get a default (unordered, not distinct) data bag.
	 */
	public DataBag newDefaultBag();

	/**
	 * Get a sorted data bag.
	 * @param spec EvalSpec that controls how the data is sorted.
	 * If null, default comparator will be used.
	 */
	public DataBag newSortedBag(EvalSpec spec);

	/**
	 * Get a distinct data bag.
	 */
	public DataBag newDistinctBag();
}

DataBag's interface will be:

public abstract class DataBag implements Writable, Spillable {
	// Container that holds the tuples.  The actual object is instantiated by subclasses.
	protected Collection<Tuple> contents;

	/**
	 * Get the number of elements in the bag, both in memory and on disk.
	 */
	public abstract long size();

	/**
	 * Find out if the bag is sorted.
	 */
	public abstract boolean isSorted();

	/**
	 * Find out if the bag is distinct.
	 */
	public abstract boolean isDistinct();

	/**
	 * Get an iterator to the bag.  For default and distinct bags,
	 * no particular order is guaranteed.  For sorted bags the order
	 * is guaranteed to be sorted according to the provided comparator.
	 */
	public abstract Iterator<Tuple> content();

	/**
	 * Add a tuple to the bag.
	 * @param t tuple to add.
	 */
	public void add(Tuple t);

	/**
	 * Add contents of a bag to the bag.
	 * @param b bag to add contents of.
	 */
	public void addAll(DataBag b);

	// Do I need remove()?  I couldn't find it used anywhere.

	/**
	 * Return the size of memory usage.
	 */
	public long getMemorySize();

	/**
	 * Write a bag's contents to disk.  This won't change significantly
	 * from the current implementation, except that it will need to record
	 * the type of bag being written.
	 * @param out DataOutput to write data to.
	 * @throws IOException (passes it on from underlying calls).
	 */
	public void write(DataOutput out) throws IOException;

	/**
	 * Read a bag from disk.  This won't change significantly from
	 * the current implementation, except that it will need to read the
	 * bag type and use the BagFactory to create the correct type of bag.
	 * @param in DataInput to read data from.
	 * @throws IOException (passes it on from underlying calls).
	 */
	static DataBag(DataInput in) throws IOException;

	// The old databag had a markStale() call here, but it's a NOP.
	// Does it need to be preserved?

	/**
	 * Write the bag into a string.  This will not change significantly
	 * from the current implementation.
	 */
	@Override
	public String toString();
}

public class DefaultDataBag extends DataBag {
	// Will set contents to be an ArrayList.

	// A custom iterator to handle getting data from memory and/or disk
	private class DefaultDataBagIterator implements Iterator<Tuple> { ... }

	// See above for comments on these
	public long size();
	public boolean isSorted();
	public boolean isDistinct();
	public Iterator<Tuple> content();

	/**
	 * Spill contents to disk.
	 */
	public void spill();
}
	
public class SortedDataBag extends DataBag {
	// Will set contents to be a PriorityQueue.  Experimentation found it
	// to be faster to store this in a PriorityQueue up front rather than
	// store it in a List and then call Collections.sort() on it.

	// A custom iterator to handle getting data from memory and/or disk
	private class SortedDataBagIterator implements Iterator<Tuple> { ... }

	// See above for comments on these
	public long size();
	public boolean isSorted();
	public boolean isDistinct();
	public Iterator<Tuple> content();

	/**
	 * Spill contents to disk.
	 */
	public void spill();
}
	
public class DistinctDataBag extends DataBag {
	// Will set contents to be a HashSet.  A little experimentation
	// found that it was significantly faster to store distinct
	// values in a hash set and sort them before the spill rather
	// than store them in a TreeSet so that no sort is needed at spill
	// time.  This is also good because if the bag never spills we don't
	// waste time sorting it.

	// A custom iterator to handle getting data from memory and/or disk
	private class DistinctDataBagIterator implements Iterator<Tuple> { ... }

	// See above for comments on these
	public long size();
	public boolean isSorted();
	public boolean isDistinct();
	public Iterator<Tuple> content();

	/**
	 * Spill contents to disk.
	 */
	public void spill();
}
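
To make the container choices above concrete, here is a simplified, in-memory-only sketch.  It rests on several assumptions: Strings stand in for Tuples, a Comparator stands in for EvalSpec, spilling and Writable are left out, and every Sketch* name and the sortedForSpill() method are hypothetical.  It only shows each bag type picking its backing collection at creation time, and the distinct bag deferring its sort until a spill is actually requested.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

/** Simplified stand-in for the proposed DataBag; tuples are plain Strings here. */
abstract class SketchBag {
    protected Collection<String> contents;   // concrete collection chosen by each subclass

    void add(String t) { contents.add(t); }
    long size() { return contents.size(); }
    abstract boolean isSorted();
    abstract boolean isDistinct();
    abstract Iterator<String> content();
}

class SketchDefaultBag extends SketchBag {
    SketchDefaultBag() { contents = new ArrayList<>(); }
    boolean isSorted() { return false; }
    boolean isDistinct() { return false; }
    Iterator<String> content() { return contents.iterator(); }
}

class SketchSortedBag extends SketchBag {
    private final Comparator<String> order;

    SketchSortedBag(Comparator<String> order) {
        // A null comparator falls back to natural ordering.
        this.order = (order != null) ? order : Comparator.<String>naturalOrder();
        contents = new PriorityQueue<>(11, this.order);
    }
    boolean isSorted() { return true; }
    boolean isDistinct() { return false; }
    Iterator<String> content() {
        // PriorityQueue.iterator() is not ordered, so drain a copy with poll().
        PriorityQueue<String> copy = new PriorityQueue<>(11, order);
        copy.addAll(contents);
        List<String> sorted = new ArrayList<>();
        while (!copy.isEmpty()) sorted.add(copy.poll());
        return sorted.iterator();
    }
}

class SketchDistinctBag extends SketchBag {
    SketchDistinctBag() { contents = new HashSet<>(); }
    boolean isSorted() { return false; }
    boolean isDistinct() { return true; }
    Iterator<String> content() { return contents.iterator(); }

    /** Sorts only when a spill is requested; a bag that never spills never sorts. */
    List<String> sortedForSpill() {
        List<String> run = new ArrayList<>(contents);
        Collections.sort(run);
        return run;
    }
}

class SketchBagFactory {
    private static final SketchBagFactory SELF = new SketchBagFactory();

    static SketchBagFactory getFactory() { return SELF; }
    SketchBag newDefaultBag() { return new SketchDefaultBag(); }
    SketchBag newSortedBag(Comparator<String> order) { return new SketchSortedBag(order); }
    SketchBag newDistinctBag() { return new SketchDistinctBag(); }
}
```

A caller would ask the factory for the bag type it needs, for example SketchBagFactory.getFactory().newSortedBag(null), and then use only the SketchBag methods.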


A getMemorySize() method will need to be added to each of the other data types so that a bag can estimate its own memory usage.



> Get rid of DataBag and always use BigDataBag
> --------------------------------------------
>
>                 Key: PIG-30
>                 URL: https://issues.apache.org/jira/browse/PIG-30
>             Project: Pig
>          Issue Type: Bug
>          Components: data
>            Reporter: Benjamin Reed
>            Assignee: Alan Gates
>
> We should never use DataBag directly; instead, we should always use BigDataBag. I think we already do this. The problem is that the logic in BigDataBag is hard to follow and it is made more complicated because it subclasses DataBag. We should merge these two classes together.
