Posted to dev@pig.apache.org by "Joel Fouse (JIRA)" <ji...@apache.org> on 2013/03/23 08:29:15 UTC

[jira] [Updated] (PIG-3258) Patch to allow MultiStorage to use more than one index to generate output tree

     [ https://issues.apache.org/jira/browse/PIG-3258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joel Fouse updated PIG-3258:
----------------------------

    Attachment: MultiStorageMultiIndex.patch

Thanks; attaching now.  This includes the modifications to MultiStorage as well as TestMultiStorage.

In a nutshell, the idea is to let the second parameter, splitFieldIndex, take a value such as "2/1/3", which means: create the first-level subdirectory from the value at index 2, the second level from the value at index 1, and the third level from the value at index 3.  Along the way I also cleaned up and expanded the class-level javadoc to make it more readable and to document the new capability.
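
To make the "2/1/3" convention concrete, here is a minimal standalone sketch of how such a spec could be turned into a nested subdirectory path.  It is not taken from the attached patch: the class name SplitPathSketch, both helper methods, the sample row, and the use of zero-based field indexes are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; not the code in MultiStorageMultiIndex.patch.
public class SplitPathSketch {

    // Parse a spec such as "2/1/3" into the list [2, 1, 3].
    static List<Integer> parseIndexes(String spec) {
        List<Integer> indexes = new ArrayList<>();
        for (String part : spec.split("/")) {
            indexes.add(Integer.parseInt(part.trim()));
        }
        return indexes;
    }

    // Build the relative subdirectory by looking up each index in the row.
    // Assumption: indexes are zero-based positions in the tuple.
    static String subdirFor(List<Integer> indexes, List<String> fields) {
        StringBuilder path = new StringBuilder();
        for (int idx : indexes) {
            if (path.length() > 0) {
                path.append('/');
            }
            path.append(fields.get(idx));
        }
        return path.toString();
    }

    public static void main(String[] args) {
        // Made-up sample row; field 0 is not used by the spec below.
        List<String> row = Arrays.asList("apples", "breakfast", "Monday", "red");
        System.out.println(subdirFor(parseIndexes("2/1/3"), row));
        // prints "Monday/breakfast/red"
    }
}
```

A single-index spec such as "0" passes through the same code and yields a one-level path, so the existing single-index usage would be a special case of this scheme.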

One potential change I haven't made yet, but would like feedback on, is the resulting filename pattern.  With regular PigStorage output, the files look like /path/to/files/part-r-0001.  MultiStorage, however, currently uses the value of the field specified by splitFieldIndex both to create the subfolder and to name the file, e.g. /path/to/files/a1/a1-0001.  That's okay for a single index, but with multiple index levels and nested folders you could end up with a path like /path/to/files/Monday/breakfast/red/apples/Monday-breakfast-red-apples-0000.  Depending on how many levels the user splits on, the filename could get rather unwieldy.  Does it make sense to keep including those values in the filename, or should it (as I would prefer) behave like PigStorage and simply name the files something like "part-r-[taskid]"?  Or do the output filenames need to be unique within the context of the whole job?  That would, unfortunately, make some sense.  I'd appreciate any insight on this.
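
For comparison, the two naming options can be sketched side by side.  This is only an illustration of the patterns quoted above, not MultiStorage code: the class FileNameSketch, its methods, and the four-digit task id (copied from the examples in this comment) are assumptions.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration of the two filename patterns under discussion.
public class FileNameSketch {

    // Current style: key values joined with '-' plus the task id.
    static String keyedName(List<String> keys, int taskId) {
        return String.join("-", keys) + String.format(Locale.ROOT, "-%04d", taskId);
    }

    // Proposed PigStorage-like style: fixed prefix plus the task id.
    static String partName(int taskId) {
        return String.format(Locale.ROOT, "part-r-%04d", taskId);
    }

    // Full output path: base directory, one folder per key, then the file.
    static String fullPath(String base, List<String> keys, String fileName) {
        return base + "/" + String.join("/", keys) + "/" + fileName;
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList("Monday", "breakfast", "red", "apples");
        System.out.println(fullPath("/path/to/files", keys, keyedName(keys, 0)));
        // prints "/path/to/files/Monday/breakfast/red/apples/Monday-breakfast-red-apples-0000"
        System.out.println(fullPath("/path/to/files", keys, partName(0)));
        // prints "/path/to/files/Monday/breakfast/red/apples/part-r-0000"
    }
}
```

In this sketch the full paths remain distinct per key combination under either scheme, because the directory already encodes the key values; only the bare filename repeats across directories with the "part-r" style.  Whether a bare filename has to be unique across the whole job is the open question above, and the sketch does not settle it.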
                
> Patch to allow MultiStorage to use more than one index to generate output tree
> ------------------------------------------------------------------------------
>
>                 Key: PIG-3258
>                 URL: https://issues.apache.org/jira/browse/PIG-3258
>             Project: Pig
>          Issue Type: Improvement
>          Components: piggybank
>            Reporter: Joel Fouse
>            Priority: Minor
>              Labels: piggybank
>         Attachments: MultiStorageMultiIndex.patch
>
>
> I have made a patch to enable MultiStorage to handle multiple tuple indexes, rather than only one, for generating the output directory structure.  Before I submit it, though, I need to know if I should generate the patch from /contrib/piggybank/java where I've been compiling and unit testing, or back at the project root.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira