Posted to dev@pig.apache.org by "Benjamin Reed (JIRA)" <ji...@apache.org> on 2008/02/08 17:37:08 UTC

[jira] Commented: (PIG-55) Allow user control over split creation

    [ https://issues.apache.org/jira/browse/PIG-55?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12567085#action_12567085 ] 

Benjamin Reed commented on PIG-55:
----------------------------------

I just went over this with Antonio this morning. I think the functionality is very important, but there are a couple of things that bother me.

a) The biggest one is the dependence on Hadoop classes. I think that is the easiest to fix.
b) Do Split factories also need to figure out where to schedule things? It seems very platform specific. In general it seems that specifying the files used by the split will allow Pig to figure out the best way to place the processing.
c) Another issue is the binding to load functions. It's reasonable to say that LoadFunctions should know how to split, but the binding seems tight. For example, if you have a file of URLs separated by line feeds, do you want to have to write a new LoadFunction just to split it a different way (finer cuts, for example)?
d) Do you also want to put all the logic for handling compressed files in each split factory? You may want to combine split factories: one that chops at block/compression boundaries, followed by another that chops even finer or merges splits together.

I'm not sure how to address c) and d), but for a) and b) I think we can tweak your proposal slightly:

{noformat}
class FileChunk {
    long length;
    long offset;
    String filename;
}

interface Split extends Serializable {
    FileChunk[] getChunks();
}

interface SplitFactory {
    Split[] getSplits(String input);
}

{noformat}

We would include the logic you propose: check whether the LoadFunction implements SplitFactory and, if it does, use it to generate the splits. I think this is generic. FileChunks let us do placement without requiring the splits to worry about block locations or other DFS-specific details.
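A minimal sketch of that dispatch in plain Java, assuming the FileChunk/Split/SplitFactory shapes above; LoadFunc here is a bare stand-in (the real Pig interface is richer), and the fallback factory is an invented name, not an existing Pig class:

```java
import java.io.Serializable;

class FileChunk implements Serializable {
    final String filename;
    final long offset;
    final long length;
    FileChunk(String filename, long offset, long length) {
        this.filename = filename;
        this.offset = offset;
        this.length = length;
    }
}

interface Split extends Serializable {
    FileChunk[] getChunks();
}

interface SplitFactory {
    Split[] getSplits(String input);
}

// Stand-in for Pig's load-function interface (the real one is richer).
interface LoadFunc {}

class SplitDispatch {
    // If the load function knows how to split, defer to it;
    // otherwise fall back to the platform's default splitter.
    static Split[] computeSplits(LoadFunc loader, String input,
                                 SplitFactory fallback) {
        if (loader instanceof SplitFactory) {
            return ((SplitFactory) loader).getSplits(input);
        }
        return fallback.getSplits(input);
    }
}
```

The point of the instanceof check is that existing load functions keep working unchanged; only loaders that opt in by implementing SplitFactory take over split creation.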

I'm wondering about conveying splittability information for compressed files. We can split bzipped files, and soon we will be able to split certain kinds of gzipped files, so we need a clean way of conveying that information to the SplitFactory.
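One hypothetical way to convey that, sketched as an interface (none of these names exist in Pig; the suffix rule just mirrors the bzip2-splits/gzip-doesn't behavior described above):

```java
// Hypothetical sketch, not Pig API: tell a SplitFactory whether a
// compressed input may be cut, keyed on file suffix.
interface CompressionHints {
    boolean splittable(String filename);
}

class SuffixHints implements CompressionHints {
    public boolean splittable(String filename) {
        // bzip2 has block boundaries we can cut at; a plain gzip stream
        // (for now) must be read whole; uncompressed files always split.
        return !filename.endsWith(".gz");
    }
}
```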

I am a bit stuck on separating splitting from parsing. I'm not proposing the following, but rather thinking out loud:

{noformat}
A = chop 'filename' using ChopFunction();
B = load A using ParseFunction();
C = group B by $1;
store C into 'blah';
{noformat}

or simply

{noformat}
store (group (load (chop 'filename' using ChopFunction()) using ParseFunction()) by $1) into 'blah';
{noformat}

(We would need to use "chop" since "split" is already a keyword.)

> Allow user control over split creation
> --------------------------------------
>
>                 Key: PIG-55
>                 URL: https://issues.apache.org/jira/browse/PIG-55
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Charlie Groves
>         Attachments: replaceable_PigSplit.diff, replaceable_PigSplit_v2.diff
>
>
> I have a dataset in HDFS that's stored in a file per column that I'd like to access from pig.  This means I can't use LoadFunc to get at the data as it only allows the loader access to a single input stream at a time.  To handle this usage, I've broken the existing split creation code out into a few classes and interfaces, and allowed user specified load functions to be used in place of the existing code.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.