Posted to dev@pig.apache.org by "Thejas M Nair (JIRA)" <ji...@apache.org> on 2009/11/12 03:04:39 UTC

[jira] Commented: (PIG-1088) change merge join and merge join indexer to work with new LoadFunc interface

    [ https://issues.apache.org/jira/browse/PIG-1088?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12776792#action_12776792 ] 

Thejas M Nair commented on PIG-1088:
------------------------------------

*Problem*: With the old load/store interface, the index created by MergeJoinIndexer consisted of tuples of join key(s), filename, and offset. With the new load/store interface, the split index is available (RecordReader.getSplitIndex) instead of the filename and offset. But there is no guarantee that split indexes follow the sorted order of the file. If more than one split contains tuples with the same join key, it is necessary to know which split needs to be read first.

*Proposal*: (thanks to Alan Gates)
We should add an interface to the list of load interfaces:

public interface LoadOrderedInput {

     WritableComparable getPosition();
}

If a load function implements this interface, it can be used in a merge join. getPosition would be called in the map phase of the sampling MR job, so each tuple in the index would hold the sort (join) key(s) followed by the returned position.
The reduce phase of the sampling MR job would then use this value when sorting the index.
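
The index ordering described above can be sketched in plain Java. This is only an illustration, not Pig code: IndexEntry, INDEX_ORDER and sortIndex are hypothetical names, the join key is assumed to be a single String, and a long stands in for the WritableComparable returned by getPosition so the sketch runs without Hadoop.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class IndexSortSketch {

    // Hypothetical stand-in for one index tuple: join key plus the getPosition() value.
    static final class IndexEntry {
        final String joinKey;
        final long position;   // a long is used only to keep the sketch Hadoop-free

        IndexEntry(String joinKey, long position) {
            this.joinKey = joinKey;
            this.position = position;
        }
    }

    // Order by join key first; entries with equal keys are ordered by position.
    static final Comparator<IndexEntry> INDEX_ORDER =
        Comparator.comparing((IndexEntry e) -> e.joinKey)
                  .thenComparingLong(e -> e.position);

    // Returns a sorted copy, as the reduce phase of the sampling job would produce.
    static List<IndexEntry> sortIndex(List<IndexEntry> entries) {
        List<IndexEntry> sorted = new ArrayList<>(entries);
        sorted.sort(INDEX_ORDER);
        return sorted;
    }

    public static void main(String[] args) {
        List<IndexEntry> index = sortIndex(Arrays.asList(
            new IndexEntry("k2", 0L),
            new IndexEntry("k1", 4096L),
            new IndexEntry("k1", 0L)));
        for (IndexEntry e : index) {
            System.out.println(e.joinKey + " @ " + e.position);
        }
    }
}
```

With two splits that both start with key k1, the entry with the smaller position sorts first, which is what tells the merge join which split to read first.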

For LoadFuncs that use FileInputFormat, getPosition can return an instance of the following class:

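public class TextInputOrder implements WritableComparable<TextInputOrder> {

	private String basename;  // basename of the file
	private long offset;      // offset at which this split starts

	// Order by file basename first, then by offset within the file.
	@Override
	public int compareTo(TextInputOrder other) {
		int rc = basename.compareTo(other.basename);
		if (rc == 0) rc = Long.compare(offset, other.offset);  // offset is a primitive long
		return rc;
	}

	// write(DataOutput) and readFields(DataInput), required by Writable, are omitted here.
}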

This means that filenames would be ordered lexicographically (which will work for names like part-00000, map-00000, bucket001 (warehouse data), etc.), and then by offset within each file.
To make it easier for authors of new LoadFuncs to implement this interface, an abstract base class will provide an implementation for load functions that use FileInputFormat.
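
To see the filename-then-offset ordering in action, here is a small self-contained sketch. FilePosition is a hypothetical plain-Java analogue of TextInputOrder (it implements Comparable rather than Hadoop's WritableComparable, and has no serialization methods), and the file names are made up for illustration.

```java
import java.util.Arrays;

public class FilePositionSketch {

    // Hypothetical analogue of TextInputOrder without the Hadoop Writable methods.
    static final class FilePosition implements Comparable<FilePosition> {
        final String basename;  // basename of the file
        final long offset;      // offset at which the split starts

        FilePosition(String basename, long offset) {
            this.basename = basename;
            this.offset = offset;
        }

        // Lexicographic order on basename, then numeric order on offset.
        @Override
        public int compareTo(FilePosition other) {
            int rc = basename.compareTo(other.basename);
            return rc != 0 ? rc : Long.compare(offset, other.offset);
        }

        @Override
        public String toString() {
            return basename + ":" + offset;
        }
    }

    // Returns a sorted copy of the given positions.
    static FilePosition[] sorted(FilePosition... positions) {
        FilePosition[] copy = positions.clone();
        Arrays.sort(copy);
        return copy;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(sorted(
            new FilePosition("part-00001", 0L),
            new FilePosition("part-00000", 64L),
            new FilePosition("part-00000", 0L))));
    }
}
```

One thing the sketch makes easy to check: the scheme relies on zero-padded names. With plain String comparison, a name such as part-10 sorts before part-2, so files named without padding would not be read in numeric order.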


> change merge join and merge join indexer to work with new LoadFunc interface
> ----------------------------------------------------------------------------
>
>                 Key: PIG-1088
>                 URL: https://issues.apache.org/jira/browse/PIG-1088
>             Project: Pig
>          Issue Type: Sub-task
>            Reporter: Thejas M Nair
>            Assignee: Thejas M Nair
>


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.