You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@pig.apache.org by "Charlie Groves (JIRA)" <ji...@apache.org> on 2008/01/09 04:08:33 UTC

[jira] Created: (PIG-55) Allow user control over split creation

Allow user control over split creation
--------------------------------------

                 Key: PIG-55
                 URL: https://issues.apache.org/jira/browse/PIG-55
             Project: Pig
          Issue Type: Improvement
            Reporter: Charlie Groves


I have a dataset in HDFS that's stored in a file per column that I'd like to access from pig.  This means I can't use LoadFunc to get at the data as it only allows the loader access to a single input stream at a time.  To handle this usage, I've broken the existing split creation code out into a few classes and interfaces, and allowed user specified load functions to be used in place of the existing code.



-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (PIG-55) Allow user control over split creation

Posted by "Charlie Groves (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/PIG-55?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Charlie Groves updated PIG-55:
------------------------------

    Attachment: replaceable_PigSplit.diff

> Allow user control over split creation
> --------------------------------------
>
>                 Key: PIG-55
>                 URL: https://issues.apache.org/jira/browse/PIG-55
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Charlie Groves
>         Attachments: replaceable_PigSplit.diff
>
>
> I have a dataset in HDFS that's stored in a file per column that I'd like to access from pig.  This means I can't use LoadFunc to get at the data as it only allows the loader access to a single input stream at a time.  To handle this usage, I've broken the existing split creation code out into a few classes and interfaces, and allowed user specified load functions to be used in place of the existing code.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Issue Comment Edited: (PIG-55) Allow user control over split creation

Posted by "Charlie Groves (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PIG-55?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12574311#action_12574311 ] 

groves edited comment on PIG-55 at 3/2/08 3:50 PM:
-----------------------------------------------------------

Updates my previous patch to use the generic DataStorage classes instead of Hadoop specific code.  This fixes issues a and b that you raised.  Individual backends will have to implement something to create a Chunker and hook Chunks into their processing setup like PigInputFormat and ChunkWrapper in the patch do for hadoop, but any implementations of Chunk and Chunker should be backend agnostic.  As a bonus, PigChunker in the patch implements the default file selection and LoadFunc processing that used to be in PigInputFormat, but it should be instantiable by any backend so they can pick up normal Pig processing for free.

I think c and d are outside of the scope of this patch.  Both of those problems relate to sharing code to process the actual bytes from a Chunk, and that can be built on top of this change.  This is only concerned with exposing the determination of what files to read and what code should read them to user code from pig.

I made some minor modifications to the DataStorage code to allow easier access to the properties on an individual element as its actual type.  It seemed ridiculous to turn longs into strings only to immediately turn them back into longs all over the place.

The patch passes all of the tests for me.  It's awesome to go away for a month, svn update, and have the tests take a fifth the time to run that they used to.

      was (Author: groves):
    Updates my previous patch to use the generic DataStorage classes instead of Hadoop specific code.
  
> Allow user control over split creation
> --------------------------------------
>
>                 Key: PIG-55
>                 URL: https://issues.apache.org/jira/browse/PIG-55
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Charlie Groves
>         Attachments: pig_chunker_split.patch, replaceable_PigSplit.diff, replaceable_PigSplit_v2.diff
>
>
> I have a dataset in HDFS that's stored in a file per column that I'd like to access from pig.  This means I can't use LoadFunc to get at the data as it only allows the loader access to a single input stream at a time.  To handle this usage, I've broken the existing split creation code out into a few classes and interfaces, and allowed user specified load functions to be used in place of the existing code.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (PIG-55) Allow user control over split creation

Posted by "Benjamin Reed (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PIG-55?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12583709#action_12583709 ] 

Benjamin Reed commented on PIG-55:
----------------------------------

+1 Fantastic work Charlie! If there are no objections this morning, I will commit at lunch. (In 4 hours)

> Allow user control over split creation
> --------------------------------------
>
>                 Key: PIG-55
>                 URL: https://issues.apache.org/jira/browse/PIG-55
>             Project: Pig
>          Issue Type: Improvement
>    Affects Versions: 0.0.0
>            Reporter: Charlie Groves
>             Fix For: 0.1.0
>
>         Attachments: pig_chunker_split.patch, pig_chunker_split_v2.patch, pig_chunker_split_v3.patch, pig_chunker_split_v4.patch, replaceable_PigSplit.diff, replaceable_PigSplit_v2.diff
>
>
> I have a dataset in HDFS that's stored in a file per column that I'd like to access from pig.  This means I can't use LoadFunc to get at the data as it only allows the loader access to a single input stream at a time.  To handle this usage, I've broken the existing split creation code out into a few classes and interfaces, and allowed user specified load functions to be used in place of the existing code.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (PIG-55) Allow user control over split creation

Posted by "Olga Natkovich (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PIG-55?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12584753#action_12584753 ] 

Olga Natkovich commented on PIG-55:
-----------------------------------

All tests passed. Changes committed. Charlie, thanks for contributing and for your patience!


> Allow user control over split creation
> --------------------------------------
>
>                 Key: PIG-55
>                 URL: https://issues.apache.org/jira/browse/PIG-55
>             Project: Pig
>          Issue Type: Improvement
>    Affects Versions: 0.0.0
>            Reporter: Charlie Groves
>             Fix For: 0.1.0
>
>         Attachments: pig_chunker_split.patch, pig_chunker_split_v2.patch, pig_chunker_split_v3.patch, pig_chunker_split_v4.patch, pig_chunker_split_v5.patch, pig_chunker_split_v6.patch, pig_chunker_split_v7.patch, replaceable_PigSplit.diff, replaceable_PigSplit_v2.diff
>
>
> I have a dataset in HDFS that's stored in a file per column that I'd like to access from pig.  This means I can't use LoadFunc to get at the data as it only allows the loader access to a single input stream at a time.  To handle this usage, I've broken the existing split creation code out into a few classes and interfaces, and allowed user specified load functions to be used in place of the existing code.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (PIG-55) Allow user control over split creation

Posted by "Olga Natkovich (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PIG-55?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12583952#action_12583952 ] 

Olga Natkovich commented on PIG-55:
-----------------------------------

Charlie,

All end-to-end tests passed and all but one unit test passed. I got the following error. I tried to fix it buy deriving from TestCase but that did not help. Can you please take a look.

ucdev3:~/src/pig-apache2/trunk> java -cp lib/junit-4.1.jar:pig.jar:build/test/classes/  junit.textui.TestRunner org.apache.pig.test.TestParser
.F
Time: 0.002
There was 1 failure:
1) warning(junit.framework.TestSuite$1)junit.framework.AssertionFailedError: No tests found in org.apache.pig.test.TestParser

FAILURES!!!
Tests run: 1,  Failures: 1,  Errors: 0



> Allow user control over split creation
> --------------------------------------
>
>                 Key: PIG-55
>                 URL: https://issues.apache.org/jira/browse/PIG-55
>             Project: Pig
>          Issue Type: Improvement
>    Affects Versions: 0.0.0
>            Reporter: Charlie Groves
>             Fix For: 0.1.0
>
>         Attachments: pig_chunker_split.patch, pig_chunker_split_v2.patch, pig_chunker_split_v3.patch, pig_chunker_split_v4.patch, pig_chunker_split_v5.patch, replaceable_PigSplit.diff, replaceable_PigSplit_v2.diff
>
>
> I have a dataset in HDFS that's stored in a file per column that I'd like to access from pig.  This means I can't use LoadFunc to get at the data as it only allows the loader access to a single input stream at a time.  To handle this usage, I've broken the existing split creation code out into a few classes and interfaces, and allowed user specified load functions to be used in place of the existing code.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (PIG-55) Allow user control over split creation

Posted by "Charlie Groves (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/PIG-55?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Charlie Groves updated PIG-55:
------------------------------

        Fix Version/s: 0.1.0
    Affects Version/s: 0.0.0
               Status: Patch Available  (was: Open)

> Allow user control over split creation
> --------------------------------------
>
>                 Key: PIG-55
>                 URL: https://issues.apache.org/jira/browse/PIG-55
>             Project: Pig
>          Issue Type: Improvement
>    Affects Versions: 0.0.0
>            Reporter: Charlie Groves
>             Fix For: 0.1.0
>
>         Attachments: pig_chunker_split.patch, replaceable_PigSplit.diff, replaceable_PigSplit_v2.diff
>
>
> I have a dataset in HDFS that's stored in a file per column that I'd like to access from pig.  This means I can't use LoadFunc to get at the data as it only allows the loader access to a single input stream at a time.  To handle this usage, I've broken the existing split creation code out into a few classes and interfaces, and allowed user specified load functions to be used in place of the existing code.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

Re: [jira] Commented: (PIG-55) Allow user control over split creation

Posted by Utkarsh Srivastava <ut...@yahoo-inc.com>.

Antonio, can you review this patch before it is committed? I can see  
it affecting the abstraction layer a lot.
On Jan 16, 2008, at 3:48 PM, Alan Gates (JIRA) wrote:

>
>     [ https://issues.apache.org/jira/browse/PIG-55? 
> page=com.atlassian.jira.plugin.system.issuetabpanels:comment- 
> tabpanel&focusedCommentId=12559743#action_12559743 ]
>
> Alan Gates commented on PIG-55:
> -------------------------------
>
> All the unit and end to end tests pass.  Ben, as long as you're  
> ready to give a +1, I can commit this.
>
>> Allow user control over split creation
>> --------------------------------------
>>
>>                 Key: PIG-55
>>                 URL: https://issues.apache.org/jira/browse/PIG-55
>>             Project: Pig
>>          Issue Type: Improvement
>>            Reporter: Charlie Groves
>>         Attachments: replaceable_PigSplit.diff,  
>> replaceable_PigSplit_v2.diff
>>
>>
>> I have a dataset in HDFS that's stored in a file per column that  
>> I'd like to access from pig.  This means I can't use LoadFunc to  
>> get at the data as it only allows the loader access to a single  
>> input stream at a time.  To handle this usage, I've broken the  
>> existing split creation code out into a few classes and  
>> interfaces, and allowed user specified load functions to be used  
>> in place of the existing code.
>
> -- 
> This message is automatically generated by JIRA.
> -
> You can reply to this email to add a comment to the issue online.
>

[jira] Commented: (PIG-55) Allow user control over split creation

Posted by "Alan Gates (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PIG-55?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12559743#action_12559743 ] 

Alan Gates commented on PIG-55:
-------------------------------

All the unit and end to end tests pass.  Ben, as long as you're ready to give a +1, I can commit this.

> Allow user control over split creation
> --------------------------------------
>
>                 Key: PIG-55
>                 URL: https://issues.apache.org/jira/browse/PIG-55
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Charlie Groves
>         Attachments: replaceable_PigSplit.diff, replaceable_PigSplit_v2.diff
>
>
> I have a dataset in HDFS that's stored in a file per column that I'd like to access from pig.  This means I can't use LoadFunc to get at the data as it only allows the loader access to a single input stream at a time.  To handle this usage, I've broken the existing split creation code out into a few classes and interfaces, and allowed user specified load functions to be used in place of the existing code.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (PIG-55) Allow user control over split creation

Posted by "Charlie Groves (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PIG-55?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12583954#action_12583954 ] 

Charlie Groves commented on PIG-55:
-----------------------------------

I think you're using the Junit 3 test runner and it's a Junit 4 test.  It passes for me in ant with ant -Dtestcase=TestParser test or with java -classpath lib/junit-4.1.jar:pig.jar:build/test/classes/ org.junit.runner.JUnitCore org.apache.pig.test.TestParser straight from the commandline

> Allow user control over split creation
> --------------------------------------
>
>                 Key: PIG-55
>                 URL: https://issues.apache.org/jira/browse/PIG-55
>             Project: Pig
>          Issue Type: Improvement
>    Affects Versions: 0.0.0
>            Reporter: Charlie Groves
>             Fix For: 0.1.0
>
>         Attachments: pig_chunker_split.patch, pig_chunker_split_v2.patch, pig_chunker_split_v3.patch, pig_chunker_split_v4.patch, pig_chunker_split_v5.patch, replaceable_PigSplit.diff, replaceable_PigSplit_v2.diff
>
>
> I have a dataset in HDFS that's stored in a file per column that I'd like to access from pig.  This means I can't use LoadFunc to get at the data as it only allows the loader access to a single input stream at a time.  To handle this usage, I've broken the existing split creation code out into a few classes and interfaces, and allowed user specified load functions to be used in place of the existing code.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (PIG-55) Allow user control over split creation

Posted by "Olga Natkovich (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PIG-55?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12584247#action_12584247 ] 

Olga Natkovich commented on PIG-55:
-----------------------------------

Charlie, I think it would be good to stick with JUnit 3 for now.

I don't necessarily see the need to change the existing tests if they are passing since eventually we would move to the newer version.

> Allow user control over split creation
> --------------------------------------
>
>                 Key: PIG-55
>                 URL: https://issues.apache.org/jira/browse/PIG-55
>             Project: Pig
>          Issue Type: Improvement
>    Affects Versions: 0.0.0
>            Reporter: Charlie Groves
>             Fix For: 0.1.0
>
>         Attachments: pig_chunker_split.patch, pig_chunker_split_v2.patch, pig_chunker_split_v3.patch, pig_chunker_split_v4.patch, pig_chunker_split_v5.patch, replaceable_PigSplit.diff, replaceable_PigSplit_v2.diff
>
>
> I have a dataset in HDFS that's stored in a file per column that I'd like to access from pig.  This means I can't use LoadFunc to get at the data as it only allows the loader access to a single input stream at a time.  To handle this usage, I've broken the existing split creation code out into a few classes and interfaces, and allowed user specified load functions to be used in place of the existing code.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (PIG-55) Allow user control over split creation

Posted by "Charlie Groves (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PIG-55?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12584242#action_12584242 ] 

Charlie Groves commented on PIG-55:
-----------------------------------

Ahh, I didn't realize it was failing from ant for you as well.  Ant didn't get support for JUnit 4 tests until 1.7, so if you're running an earlier version, its runner won't pick them up.  I assumed JUnit 4 style tests were acceptable for pig since several tests are using the @Test, @Before and @After annotations, but it looks all of those tests are also using the JUnit 3 style TestCase extension and method naming conventions.  If we don't want to require Ant 1.7, I'm fine with converting the test to be JUnit 3 style.  However, if we're going to do that, I'd like to fix the existing tests so they don't use both JUnit 3 and 4 styles.

> Allow user control over split creation
> --------------------------------------
>
>                 Key: PIG-55
>                 URL: https://issues.apache.org/jira/browse/PIG-55
>             Project: Pig
>          Issue Type: Improvement
>    Affects Versions: 0.0.0
>            Reporter: Charlie Groves
>             Fix For: 0.1.0
>
>         Attachments: pig_chunker_split.patch, pig_chunker_split_v2.patch, pig_chunker_split_v3.patch, pig_chunker_split_v4.patch, pig_chunker_split_v5.patch, replaceable_PigSplit.diff, replaceable_PigSplit_v2.diff
>
>
> I have a dataset in HDFS that's stored in a file per column that I'd like to access from pig.  This means I can't use LoadFunc to get at the data as it only allows the loader access to a single input stream at a time.  To handle this usage, I've broken the existing split creation code out into a few classes and interfaces, and allowed user specified load functions to be used in place of the existing code.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (PIG-55) Allow user control over split creation

Posted by "Charlie Groves (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/PIG-55?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Charlie Groves updated PIG-55:
------------------------------

    Attachment: pig_chunker_split_v3.patch

pig_chunker_split_v3.patch adds a simple test for the new functionality.  I've also renamed the new interfaces from Chunk and Chunker to Slice and Slicer.  I like the new names better since slice is actually a verb, so Slicer makes more sense as a noun than Chunker.

I've also made a couple changes to file handling in pig to allow the Slicer to fully control the loading process.  I moved the file existence check on the location passed to a LOAD statement from the parser to the new default loading code, PigSlicer.  I don't think it makes much sense for the parser to be checking on the filesystem, and this allows the creation of Slicers that don't
have any files on the filesystem like the one I made for the test.  Code using the existing LOAD mechanisms will still get an exception and fail because of the lack of file existence before the map reduce job truly starts.  I also removed the absolutizing of files in  MapreducePlanCompiler.  There's actually no need for it to be done there as DataStorage will correctly absolutize a location whenever it's given a relative path, and it doesn't make any sense to prepend a file path to a location that isn't actually on the file system, so the decision to absolutize needs to be left up to the Slicer.



> Allow user control over split creation
> --------------------------------------
>
>                 Key: PIG-55
>                 URL: https://issues.apache.org/jira/browse/PIG-55
>             Project: Pig
>          Issue Type: Improvement
>    Affects Versions: 0.0.0
>            Reporter: Charlie Groves
>             Fix For: 0.1.0
>
>         Attachments: pig_chunker_split.patch, pig_chunker_split_v2.patch, pig_chunker_split_v3.patch, replaceable_PigSplit.diff, replaceable_PigSplit_v2.diff
>
>
> I have a dataset in HDFS that's stored in a file per column that I'd like to access from pig.  This means I can't use LoadFunc to get at the data as it only allows the loader access to a single input stream at a time.  To handle this usage, I've broken the existing split creation code out into a few classes and interfaces, and allowed user specified load functions to be used in place of the existing code.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (PIG-55) Allow user control over split creation

Posted by "Olga Natkovich (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PIG-55?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12584330#action_12584330 ] 

Olga Natkovich commented on PIG-55:
-----------------------------------

I tried to apply the patch after committing the streaming milestone and unfortunately that patch does not apply. I tried to merge things manually but I am not sure if the resulting code is correct.

Charlie, could you regenerate one more time. Sorry for this and thanks for your patients. I promise yours is the next patch that is going in :).

> Allow user control over split creation
> --------------------------------------
>
>                 Key: PIG-55
>                 URL: https://issues.apache.org/jira/browse/PIG-55
>             Project: Pig
>          Issue Type: Improvement
>    Affects Versions: 0.0.0
>            Reporter: Charlie Groves
>             Fix For: 0.1.0
>
>         Attachments: pig_chunker_split.patch, pig_chunker_split_v2.patch, pig_chunker_split_v3.patch, pig_chunker_split_v4.patch, pig_chunker_split_v5.patch, pig_chunker_split_v6.patch, replaceable_PigSplit.diff, replaceable_PigSplit_v2.diff
>
>
> I have a dataset in HDFS that's stored in a file per column that I'd like to access from pig.  This means I can't use LoadFunc to get at the data as it only allows the loader access to a single input stream at a time.  To handle this usage, I've broken the existing split creation code out into a few classes and interfaces, and allowed user specified load functions to be used in place of the existing code.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (PIG-55) Allow user control over split creation

Posted by "Olga Natkovich (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PIG-55?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12584223#action_12584223 ] 

Olga Natkovich commented on PIG-55:
-----------------------------------

Ben volunteered to sort our the unit test issue. Thanks, Ben. He will be committing the patch once he is done

> Allow user control over split creation
> --------------------------------------
>
>                 Key: PIG-55
>                 URL: https://issues.apache.org/jira/browse/PIG-55
>             Project: Pig
>          Issue Type: Improvement
>    Affects Versions: 0.0.0
>            Reporter: Charlie Groves
>             Fix For: 0.1.0
>
>         Attachments: pig_chunker_split.patch, pig_chunker_split_v2.patch, pig_chunker_split_v3.patch, pig_chunker_split_v4.patch, pig_chunker_split_v5.patch, replaceable_PigSplit.diff, replaceable_PigSplit_v2.diff
>
>
> I have a dataset in HDFS that's stored in a file per column that I'd like to access from pig.  This means I can't use LoadFunc to get at the data as it only allows the loader access to a single input stream at a time.  To handle this usage, I've broken the existing split creation code out into a few classes and interfaces, and allowed user specified load functions to be used in place of the existing code.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (PIG-55) Allow user control over split creation

Posted by "Charlie Groves (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PIG-55?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12584233#action_12584233 ] 

Charlie Groves commented on PIG-55:
-----------------------------------

The unit test issue is really straightforward.  In your command, you're using junit.textui.TestRunner to run the tests.  It's specifically for run subclasses of TestCase and other JUnit 3 style unit tests.  It fails because TestParser is a JUnit 4 style test and it can't find any of the old style test cases in it.  You need to switch to the new runner, org.junit.runner.JUnitCore, to run tests that just use annotations and don't follow the JUnit 3 naming and subclassing conventions.  If you use that, the test will pass.  It also runs and passes as part of ant test, so there's no need to run it directly anyway.

> Allow user control over split creation
> --------------------------------------
>
>                 Key: PIG-55
>                 URL: https://issues.apache.org/jira/browse/PIG-55
>             Project: Pig
>          Issue Type: Improvement
>    Affects Versions: 0.0.0
>            Reporter: Charlie Groves
>             Fix For: 0.1.0
>
>         Attachments: pig_chunker_split.patch, pig_chunker_split_v2.patch, pig_chunker_split_v3.patch, pig_chunker_split_v4.patch, pig_chunker_split_v5.patch, replaceable_PigSplit.diff, replaceable_PigSplit_v2.diff
>
>
> I have a dataset in HDFS that's stored in a file per column that I'd like to access from pig.  This means I can't use LoadFunc to get at the data as it only allows the loader access to a single input stream at a time.  To handle this usage, I've broken the existing split creation code out into a few classes and interfaces, and allowed user specified load functions to be used in place of the existing code.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (PIG-55) Allow user control over split creation

Posted by "Charlie Groves (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/PIG-55?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Charlie Groves updated PIG-55:
------------------------------

    Attachment: pig_chunker_split_v4.patch

pig_chunker_split_v4.patch adds a validate method to Slicers that is called when the Slicer is parsed to allow it to throw an exception if a location isn't parsable or if it doesn't exist on the dfs.  It also adds a new test case, TestParser, that checks that an exception is raised at parse time by the default slicer if the location it's given doesn't exist.  TestParser looks a little slim with a single test method in it, but it didn't seem like the test fit in with any of the other existing test cases.

All the tests pass for me with the patch in place.

> Allow user control over split creation
> --------------------------------------
>
>                 Key: PIG-55
>                 URL: https://issues.apache.org/jira/browse/PIG-55
>             Project: Pig
>          Issue Type: Improvement
>    Affects Versions: 0.0.0
>            Reporter: Charlie Groves
>             Fix For: 0.1.0
>
>         Attachments: pig_chunker_split.patch, pig_chunker_split_v2.patch, pig_chunker_split_v3.patch, pig_chunker_split_v4.patch, replaceable_PigSplit.diff, replaceable_PigSplit_v2.diff
>
>
> I have a dataset in HDFS that's stored in a file per column that I'd like to access from pig.  This means I can't use LoadFunc to get at the data as it only allows the loader access to a single input stream at a time.  To handle this usage, I've broken the existing split creation code out into a few classes and interfaces, and allowed user specified load functions to be used in place of the existing code.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (PIG-55) Allow user control over split creation

Posted by "Charlie Groves (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PIG-55?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12559177#action_12559177 ] 

Charlie Groves commented on PIG-55:
-----------------------------------

The openDFS removal was accidental.  I should be able to add it back by adding something to get at the JobConf from PigSplitWrapper

I don't like that split returns a RecordReader where the first field is unused either, but the dependency on Hadoop is locked in deeper than that.  PigSplitFactory needs to take a JobConf so its subclasses can get at the right HDFS to lookup the files in location it needs to split.  PigSplit itself extends InputSplit, another hadoop class, so if we're removing any references to hadoop, we'd need to make an interface like InputSplit that exposes getLength and getLocations since those things can't be figured out externally from the split.  We'd also need to have some concept like Writable so the split can be sent over the wire.  The RecordReader interface returned by the split has the same problem:  getPos, close, and getProgress need to be handled by user code and can't be inferred by pig.  I feel like the complexity added to make interfaces that are really similar to hadoop's is worse than the loss of generality from using hadoop's interfaces, especially when the outmost layer of code, PigSplitFactory, is going to need access to one hadoop class no matter what.

> Allow user control over split creation
> --------------------------------------
>
>                 Key: PIG-55
>                 URL: https://issues.apache.org/jira/browse/PIG-55
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Charlie Groves
>         Attachments: replaceable_PigSplit.diff, replaceable_PigSplit_v2.diff
>
>
> I have a dataset in HDFS that's stored in a file per column that I'd like to access from pig.  This means I can't use LoadFunc to get at the data as it only allows the loader access to a single input stream at a time.  To handle this usage, I've broken the existing split creation code out into a few classes and interfaces, and allowed user specified load functions to be used in place of the existing code.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (PIG-55) Allow user control over split creation

Posted by "Charlie Groves (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PIG-55?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12583957#action_12583957 ] 

Charlie Groves commented on PIG-55:
-----------------------------------

Right, and JUnit 4 includes the JUnit 3 runner which you're using.

> Allow user control over split creation
> --------------------------------------
>
>                 Key: PIG-55
>                 URL: https://issues.apache.org/jira/browse/PIG-55
>             Project: Pig
>          Issue Type: Improvement
>    Affects Versions: 0.0.0
>            Reporter: Charlie Groves
>             Fix For: 0.1.0
>
>         Attachments: pig_chunker_split.patch, pig_chunker_split_v2.patch, pig_chunker_split_v3.patch, pig_chunker_split_v4.patch, pig_chunker_split_v5.patch, replaceable_PigSplit.diff, replaceable_PigSplit_v2.diff
>
>
> I have a dataset in HDFS that's stored in a file per column that I'd like to access from pig.  This means I can't use LoadFunc to get at the data as it only allows the loader access to a single input stream at a time.  To handle this usage, I've broken the existing split creation code out into a few classes and interfaces, and allowed user specified load functions to be used in place of the existing code.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (PIG-55) Allow user control over split creation

Posted by "Benjamin Reed (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PIG-55?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12562159#action_12562159 ] 

Benjamin Reed commented on PIG-55:
----------------------------------

Ah you are correct. I missed that you pass whole directories to the split factory rather than just files. Excellent.

I'm hoping PIG-32 lands soon. (Pigs aren't good at flying :P ) It's things are starting to block on it.

> Allow user control over split creation
> --------------------------------------
>
>                 Key: PIG-55
>                 URL: https://issues.apache.org/jira/browse/PIG-55
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Charlie Groves
>         Attachments: replaceable_PigSplit.diff, replaceable_PigSplit_v2.diff
>
>
> I have a dataset in HDFS that's stored in a file per column that I'd like to access from pig.  This means I can't use LoadFunc to get at the data as it only allows the loader access to a single input stream at a time.  To handle this usage, I've broken the existing split creation code out into a few classes and interfaces, and allowed user specified load functions to be used in place of the existing code.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (PIG-55) Allow user control over split creation

Posted by "Olga Natkovich (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PIG-55?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12583754#action_12583754 ] 

Olga Natkovich commented on PIG-55:
-----------------------------------

I tried to apply the patch but several files got rejected.

Charlie, could you regenerate the patch from the latest trunk. Thanks.

> Allow user control over split creation
> --------------------------------------
>
>                 Key: PIG-55
>                 URL: https://issues.apache.org/jira/browse/PIG-55
>             Project: Pig
>          Issue Type: Improvement
>    Affects Versions: 0.0.0
>            Reporter: Charlie Groves
>             Fix For: 0.1.0
>
>         Attachments: pig_chunker_split.patch, pig_chunker_split_v2.patch, pig_chunker_split_v3.patch, pig_chunker_split_v4.patch, replaceable_PigSplit.diff, replaceable_PigSplit_v2.diff
>
>
> I have a dataset in HDFS that's stored in a file per column that I'd like to access from pig.  This means I can't use LoadFunc to get at the data as it only allows the loader access to a single input stream at a time.  To handle this usage, I've broken the existing split creation code out into a few classes and interfaces, and allowed user specified load functions to be used in place of the existing code.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (PIG-55) Allow user control over split creation

Posted by "Olga Natkovich (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/PIG-55?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Olga Natkovich updated PIG-55:
------------------------------

    Resolution: Fixed
        Status: Resolved  (was: Patch Available)

> Allow user control over split creation
> --------------------------------------
>
>                 Key: PIG-55
>                 URL: https://issues.apache.org/jira/browse/PIG-55
>             Project: Pig
>          Issue Type: Improvement
>    Affects Versions: 0.0.0
>            Reporter: Charlie Groves
>             Fix For: 0.1.0
>
>         Attachments: pig_chunker_split.patch, pig_chunker_split_v2.patch, pig_chunker_split_v3.patch, pig_chunker_split_v4.patch, pig_chunker_split_v5.patch, pig_chunker_split_v6.patch, pig_chunker_split_v7.patch, replaceable_PigSplit.diff, replaceable_PigSplit_v2.diff
>
>
> I have a dataset in HDFS that's stored in a file per column that I'd like to access from pig.  This means I can't use LoadFunc to get at the data as it only allows the loader access to a single input stream at a time.  To handle this usage, I've broken the existing split creation code out into a few classes and interfaces, and allowed user specified load functions to be used in place of the existing code.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (PIG-55) Allow user control over split creation

Posted by "Charlie Groves (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/PIG-55?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Charlie Groves updated PIG-55:
------------------------------

    Attachment: replaceable_PigSplit_v2.diff

This version of the patch moves PigSplit and PigSplitFactory into the main pig package, and greatly reduces the size of the interfaces.  Now neither interface references code from in the impl packages.

Having now looked at EvalSpec, I realize it's probably not the right thing to expose to allow access to the fields used from a particular input, so I'm going to leave that part of my use case out of this patch.

> Allow user control over split creation
> --------------------------------------
>
>                 Key: PIG-55
>                 URL: https://issues.apache.org/jira/browse/PIG-55
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Charlie Groves
>         Attachments: replaceable_PigSplit.diff, replaceable_PigSplit_v2.diff
>
>
> I have a dataset in HDFS that's stored in a file per column that I'd like to access from pig.  This means I can't use LoadFunc to get at the data as it only allows the loader access to a single input stream at a time.  To handle this usage, I've broken the existing split creation code out into a few classes and interfaces, and allowed user specified load functions to be used in place of the existing code.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (PIG-55) Allow user control over split creation

Posted by "Benjamin Reed (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PIG-55?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12582514#action_12582514 ] 

Benjamin Reed commented on PIG-55:
----------------------------------

+1 Looks great Charlie. I believe the patches to FileLocalizer and QueryParser.jjt are inadvertent correct? I can remove them before I commit, but I wanted to make sure there wasn't some dependency I was missing.

> Allow user control over split creation
> --------------------------------------
>
>                 Key: PIG-55
>                 URL: https://issues.apache.org/jira/browse/PIG-55
>             Project: Pig
>          Issue Type: Improvement
>    Affects Versions: 0.0.0
>            Reporter: Charlie Groves
>             Fix For: 0.1.0
>
>         Attachments: pig_chunker_split.patch, pig_chunker_split_v2.patch, pig_chunker_split_v3.patch, replaceable_PigSplit.diff, replaceable_PigSplit_v2.diff
>
>
> I have a dataset in HDFS that's stored in a file per column that I'd like to access from pig.  This means I can't use LoadFunc to get at the data as it only allows the loader access to a single input stream at a time.  To handle this usage, I've broken the existing split creation code out into a few classes and interfaces, and allowed user specified load functions to be used in place of the existing code.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (PIG-55) Allow user control over split creation

Posted by "Benjamin Reed (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PIG-55?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12561856#action_12561856 ] 

Benjamin Reed commented on PIG-55:
----------------------------------

Sorry, to take so long to comment. I was hoping to take a swipe at this, but I haven't been able to get the time.

We cannot expose Hadoop classes in Pig. There are other backends that Pig runs on and we don't want to pull all of Hadoop with us.

Antonio has a generalized file access layer PIG-32 that we should incorporate with. PigSplit is an internal class specific to Hadoop, so we shouldn't expose that.

At a higher level, there is something else I would like to be able to do as well: multi file splits. The notion that a split never spans a file is problematic when files are small. It seems like we should be more flexible in that area. We also need fileless splits for load functions that generate tuples "from thin air".

> Allow user control over split creation
> --------------------------------------
>
>                 Key: PIG-55
>                 URL: https://issues.apache.org/jira/browse/PIG-55
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Charlie Groves
>         Attachments: replaceable_PigSplit.diff, replaceable_PigSplit_v2.diff
>
>
> I have a dataset in HDFS that's stored in a file per column that I'd like to access from pig.  This means I can't use LoadFunc to get at the data as it only allows the loader access to a single input stream at a time.  To handle this usage, I've broken the existing split creation code out into a few classes and interfaces, and allowed user specified load functions to be used in place of the existing code.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (PIG-55) Allow user control over split creation

Posted by "Sam Pullara (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PIG-55?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12557334#action_12557334 ] 

Sam Pullara commented on PIG-55:
--------------------------------

+1 for adding something like this to the main line.

> Allow user control over split creation
> --------------------------------------
>
>                 Key: PIG-55
>                 URL: https://issues.apache.org/jira/browse/PIG-55
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Charlie Groves
>         Attachments: replaceable_PigSplit.diff
>
>
> I have a dataset in HDFS that's stored in a file per column that I'd like to access from pig.  This means I can't use LoadFunc to get at the data as it only allows the loader access to a single input stream at a time.  To handle this usage, I've broken the existing split creation code out into a few classes and interfaces, and allowed user specified load functions to be used in place of the existing code.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (PIG-55) Allow user control over split creation

Posted by "Benjamin Reed (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PIG-55?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12559104#action_12559104 ] 

Benjamin Reed commented on PIG-55:
----------------------------------

This looks really good. Let me see if I can get the test cases run against it.

The openDFSFile removal seems unrelated. Correct? Or do you really mean to remove it?

The other nit I have is that the split returns RecordReader<Text, Tuple>. There are two issues with this:

1) The first field is not used.
2) RecordReader and Text locks in a dependency on Hadoop. It would be nice if the split interface could still be valid if Pig was on top of another system. Iterator<Tuple> seems more general and precise. What do you think? (Granted we would have to wrap this general iterator into a Hadoop RecordRecord for Hadoop, but at least it gives a nice interface to the programmer.)



> Allow user control over split creation
> --------------------------------------
>
>                 Key: PIG-55
>                 URL: https://issues.apache.org/jira/browse/PIG-55
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Charlie Groves
>         Attachments: replaceable_PigSplit.diff, replaceable_PigSplit_v2.diff
>
>
> I have a dataset in HDFS that's stored in a file per column that I'd like to access from pig.  This means I can't use LoadFunc to get at the data as it only allows the loader access to a single input stream at a time.  To handle this usage, I've broken the existing split creation code out into a few classes and interfaces, and allowed user specified load functions to be used in place of the existing code.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (PIG-55) Allow user control over split creation

Posted by "Olga Natkovich (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PIG-55?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12583743#action_12583743 ] 

Olga Natkovich commented on PIG-55:
-----------------------------------

Ben, please, hold off on the commit. Let me apply the patch and run end-to-end test before you do the commit

> Allow user control over split creation
> --------------------------------------
>
>                 Key: PIG-55
>                 URL: https://issues.apache.org/jira/browse/PIG-55
>             Project: Pig
>          Issue Type: Improvement
>    Affects Versions: 0.0.0
>            Reporter: Charlie Groves
>             Fix For: 0.1.0
>
>         Attachments: pig_chunker_split.patch, pig_chunker_split_v2.patch, pig_chunker_split_v3.patch, pig_chunker_split_v4.patch, replaceable_PigSplit.diff, replaceable_PigSplit_v2.diff
>
>
> I have a dataset in HDFS that's stored in a file per column that I'd like to access from pig.  This means I can't use LoadFunc to get at the data as it only allows the loader access to a single input stream at a time.  To handle this usage, I've broken the existing split creation code out into a few classes and interfaces, and allowed user specified load functions to be used in place of the existing code.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (PIG-55) Allow user control over split creation

Posted by "Charlie Groves (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/PIG-55?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Charlie Groves updated PIG-55:
------------------------------

    Attachment: pig_chunker_split.patch

Updates my previous patch to use the generic DataStorage classes instead of Hadoop specific code.

> Allow user control over split creation
> --------------------------------------
>
>                 Key: PIG-55
>                 URL: https://issues.apache.org/jira/browse/PIG-55
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Charlie Groves
>         Attachments: pig_chunker_split.patch, replaceable_PigSplit.diff, replaceable_PigSplit_v2.diff
>
>
> I have a dataset in HDFS that's stored in a file per column that I'd like to access from pig.  This means I can't use LoadFunc to get at the data as it only allows the loader access to a single input stream at a time.  To handle this usage, I've broken the existing split creation code out into a few classes and interfaces, and allowed user specified load functions to be used in place of the existing code.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (PIG-55) Allow user control over split creation

Posted by "Charlie Groves (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PIG-55?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12577253#action_12577253 ] 

Charlie Groves commented on PIG-55:
-----------------------------------

bq. 1) PigContext isn't part of the public API, so it would probably be best if the LoadFunc was passed to the Chunk rather than relying on the Chunk class to construct if needed.

This is probably clearer in the version of the patch with Javadoc, but there is no requirement that a Chunk use LoadFuncs internally.  The whole reason I'm writing this is to get around the requirement of using a single stream to read tuples in LoadFunc.  For user implementations of Chunker, I expect them to create their own Chunks as well that do Tuple creation in next.  They won't touch LoadFunc at all, and in any case, the class name string usually used to create a LoadFunc is being used to create a Chunker, so there won't be a LoadFunc available automatically.  This isn't to say that Chunks can't use LoadFuncs, just that they aren't expected to.

> Allow user control over split creation
> --------------------------------------
>
>                 Key: PIG-55
>                 URL: https://issues.apache.org/jira/browse/PIG-55
>             Project: Pig
>          Issue Type: Improvement
>    Affects Versions: 0.0.0
>            Reporter: Charlie Groves
>             Fix For: 0.1.0
>
>         Attachments: pig_chunker_split.patch, pig_chunker_split_v2.patch, replaceable_PigSplit.diff, replaceable_PigSplit_v2.diff
>
>
> I have a dataset in HDFS that's stored in a file per column that I'd like to access from pig.  This means I can't use LoadFunc to get at the data as it only allows the loader access to a single input stream at a time.  To handle this usage, I've broken the existing split creation code out into a few classes and interfaces, and allowed user specified load functions to be used in place of the existing code.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (PIG-55) Allow user control over split creation

Posted by "Charlie Groves (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PIG-55?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12561865#action_12561865 ] 

Charlie Groves commented on PIG-55:
-----------------------------------

So should I just wait for PIG-32 to land before doing anything more with this?  How far out is that?

I don't see how the patch as it stands prohibits multi-file splits.  The PigSplits returned by PigSplitFactory can be configured however the factory likes, so a single split can do part of a file, a whole file, many files or invisible files as seen fit by the factory.  

> Allow user control over split creation
> --------------------------------------
>
>                 Key: PIG-55
>                 URL: https://issues.apache.org/jira/browse/PIG-55
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Charlie Groves
>         Attachments: replaceable_PigSplit.diff, replaceable_PigSplit_v2.diff
>
>
> I have a dataset in HDFS that's stored in a file per column that I'd like to access from pig.  This means I can't use LoadFunc to get at the data as it only allows the loader access to a single input stream at a time.  To handle this usage, I've broken the existing split creation code out into a few classes and interfaces, and allowed user specified load functions to be used in place of the existing code.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (PIG-55) Allow user control over split creation

Posted by "Olga Natkovich (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PIG-55?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12583956#action_12583956 ] 

Olga Natkovich commented on PIG-55:
-----------------------------------

Cahrlie, as you can see in my command, I have junit-4.1jar in my classpath

> Allow user control over split creation
> --------------------------------------
>
>                 Key: PIG-55
>                 URL: https://issues.apache.org/jira/browse/PIG-55
>             Project: Pig
>          Issue Type: Improvement
>    Affects Versions: 0.0.0
>            Reporter: Charlie Groves
>             Fix For: 0.1.0
>
>         Attachments: pig_chunker_split.patch, pig_chunker_split_v2.patch, pig_chunker_split_v3.patch, pig_chunker_split_v4.patch, pig_chunker_split_v5.patch, replaceable_PigSplit.diff, replaceable_PigSplit_v2.diff
>
>
> I have a dataset in HDFS that's stored in a file per column that I'd like to access from pig.  This means I can't use LoadFunc to get at the data as it only allows the loader access to a single input stream at a time.  To handle this usage, I've broken the existing split creation code out into a few classes and interfaces, and allowed user specified load functions to be used in place of the existing code.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (PIG-55) Allow user control over split creation

Posted by "Charlie Groves (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/PIG-55?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Charlie Groves updated PIG-55:
------------------------------

    Attachment: pig_chunker_split_v2.patch

Looks like I uploaded an earlier version of the patch without Javadoc.  This version removes some superfluous cleanup I did and adds Javadoc for Chunk and Chunker.  I still need to go back and add some unit tests for Chunk and Chunker though.

> Allow user control over split creation
> --------------------------------------
>
>                 Key: PIG-55
>                 URL: https://issues.apache.org/jira/browse/PIG-55
>             Project: Pig
>          Issue Type: Improvement
>    Affects Versions: 0.0.0
>            Reporter: Charlie Groves
>             Fix For: 0.1.0
>
>         Attachments: pig_chunker_split.patch, pig_chunker_split_v2.patch, replaceable_PigSplit.diff, replaceable_PigSplit_v2.diff
>
>
> I have a dataset in HDFS that's stored in a file per column that I'd like to access from pig.  This means I can't use LoadFunc to get at the data as it only allows the loader access to a single input stream at a time.  To handle this usage, I've broken the existing split creation code out into a few classes and interfaces, and allowed user specified load functions to be used in place of the existing code.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (PIG-55) Allow user control over split creation

Posted by "Olga Natkovich (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PIG-55?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12584719#action_12584719 ] 

Olga Natkovich commented on PIG-55:
-----------------------------------

I successfully applied the patch and running tests now. Will commit as soon as all the tests pass

> Allow user control over split creation
> --------------------------------------
>
>                 Key: PIG-55
>                 URL: https://issues.apache.org/jira/browse/PIG-55
>             Project: Pig
>          Issue Type: Improvement
>    Affects Versions: 0.0.0
>            Reporter: Charlie Groves
>             Fix For: 0.1.0
>
>         Attachments: pig_chunker_split.patch, pig_chunker_split_v2.patch, pig_chunker_split_v3.patch, pig_chunker_split_v4.patch, pig_chunker_split_v5.patch, pig_chunker_split_v6.patch, pig_chunker_split_v7.patch, replaceable_PigSplit.diff, replaceable_PigSplit_v2.diff
>
>
> I have a dataset in HDFS that's stored in a file per column that I'd like to access from pig.  This means I can't use LoadFunc to get at the data as it only allows the loader access to a single input stream at a time.  To handle this usage, I've broken the existing split creation code out into a few classes and interfaces, and allowed user specified load functions to be used in place of the existing code.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (PIG-55) Allow user control over split creation

Posted by "Benjamin Reed (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PIG-55?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12567085#action_12567085 ] 

Benjamin Reed commented on PIG-55:
----------------------------------

I just went over this with Antonio this morning. I think the functionality is very important, but there are a couple of things that bother me.

a) The biggest one is the dependence on Hadoop classes. I think that is the easiest to fix.
b) Do Split factories also need to figure out where to schedule things? It seems very platform specific. In general it seems that specifying the files used by the split will allow Pig to figure out the best way to place the processing.
c) Another one is the binding to load functions. It's reasonable to say that LoadFunctions should know how to split, but the binding seems tight. For example, if you have a set of URLs in a file separated with line feeds do you want to have to write a new LoadFunction just so that you can split it in a different way (maybe finer cuts for example)?
d)Do you also want to put all the logic to handle compressed files in each split factory? Potentially you may want to combine splits together: one chops at block/compression boundaries. Followed by another split that chops even finer or perhaps puts splits together.

I'm not sure how to address c) and d), but for a) and b) I think we can tweak your proposal slightly:

{noformat}
class FileChunk {
    long length;
    long offset;
    String filename;
}

interface Split implements Serializable {
    FileChunk[] getChunks();
}

interface SplitFactory {
    Split[] getSplits(String input);
}

{noformat}

We would include the logic that you propose to check the LoadFunction to see if it implements SplitFactory and use it to generate the splits if it does. I think this is generic. The FileChunks lets us do placement without requiring the splits to worry about block locations or DFS specific stuff.

I'm wondering about conveying splittable information about compressed files. We can split bzipped files and soon we will be able to do some kinds of gzipped files, so we need a nice way of conveying that information to the SplitFactory.

I am a bit stuck on separating splitting from parsing. I'm not proposing the following, but rather thinking out loud:

{noformat}
A = chop 'filename' using ChopFunction();
B = load A using ParseFunction();
C = group B by $1;
store C into 'blah';
{noformat}

or simply
store (group (load (chop 'filename' using ChopFunction()) using ParseFunction) by $1) into 'blah';

(We would need to use "chop" since "split" is already a keyword.

> Allow user control over split creation
> --------------------------------------
>
>                 Key: PIG-55
>                 URL: https://issues.apache.org/jira/browse/PIG-55
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Charlie Groves
>         Attachments: replaceable_PigSplit.diff, replaceable_PigSplit_v2.diff
>
>
> I have a dataset in HDFS that's stored in a file per column that I'd like to access from pig.  This means I can't use LoadFunc to get at the data as it only allows the loader access to a single input stream at a time.  To handle this usage, I've broken the existing split creation code out into a few classes and interfaces, and allowed user specified load functions to be used in place of the existing code.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (PIG-55) Allow user control over split creation

Posted by "Benjamin Reed (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PIG-55?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12577123#action_12577123 ] 

Benjamin Reed commented on PIG-55:
----------------------------------

Great work Charlie! I like it. Couple of details:

1) PigContext isn't part of the public API, so it would probably be best if the LoadFunc was passed to the Chunk rather than relying on the Chunk class to construct if needed.

2) It would be nice if the compressed handling could be done outside the Chunk class so that programmers don't have to boiler plate it. (I'm not sure there is a nice way to do it, so I'm fine with blowing this off for now.)

3) Javadoc is needed for the Chunk and Chunker classes. The interaction between the LoadFunc and the Chunk/Chunker classes needs to be well documented.

4) You should put in a test case for a user defined Chunker and Chunk class. (When InputSplits were first put into Hadoop, it worked for the builtin classes but failed for user defined Splits).

Alan can you check this out? I'd like to commit this soon. I don't think it should effect your pipeline work too much.

> Allow user control over split creation
> --------------------------------------
>
>                 Key: PIG-55
>                 URL: https://issues.apache.org/jira/browse/PIG-55
>             Project: Pig
>          Issue Type: Improvement
>    Affects Versions: 0.0.0
>            Reporter: Charlie Groves
>             Fix For: 0.1.0
>
>         Attachments: pig_chunker_split.patch, replaceable_PigSplit.diff, replaceable_PigSplit_v2.diff
>
>
> I have a dataset in HDFS that's stored in a file per column that I'd like to access from pig.  This means I can't use LoadFunc to get at the data as it only allows the loader access to a single input stream at a time.  To handle this usage, I've broken the existing split creation code out into a few classes and interfaces, and allowed user specified load functions to be used in place of the existing code.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (PIG-55) Allow user control over split creation

Posted by "Charlie Groves (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/PIG-55?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Charlie Groves updated PIG-55:
------------------------------

    Attachment: pig_chunker_split_v6.patch

Alright, I'm uploading pig_chunker_split_v6.patch which converts TestParser to a Junit 3 style test.

There are several reasons to removal the JUnit 4 annotations from the existing tests:
# Since they're not being used to run the tests, it can't be assumed that they're actually on all the tests and we can just switch JUnit 4 on and use them.  They're definitely only on a mishmash of the existing tests, so to make that switch, you're going to have to go through and check the annotations on all of the tests anyway.
# If you're going to use the annotations, its better to rename the methods and add the annotations than sticking with the stilted test prefix method names.
# It confuses people outside the project like me into writing JUnit 4 style tests.  In PIG-145 I've got a fair number of JUnit 4 tests that I need to convert now, some of which use features like the test parameterization which is going to be more significant to retrofit.
# It's just pointless noise in the code until they're actually doing something.

> Allow user control over split creation
> --------------------------------------
>
>                 Key: PIG-55
>                 URL: https://issues.apache.org/jira/browse/PIG-55
>             Project: Pig
>          Issue Type: Improvement
>    Affects Versions: 0.0.0
>            Reporter: Charlie Groves
>             Fix For: 0.1.0
>
>         Attachments: pig_chunker_split.patch, pig_chunker_split_v2.patch, pig_chunker_split_v3.patch, pig_chunker_split_v4.patch, pig_chunker_split_v5.patch, pig_chunker_split_v6.patch, replaceable_PigSplit.diff, replaceable_PigSplit_v2.diff
>
>
> I have a dataset in HDFS that's stored in a file per column that I'd like to access from pig.  This means I can't use LoadFunc to get at the data as it only allows the loader access to a single input stream at a time.  To handle this usage, I've broken the existing split creation code out into a few classes and interfaces, and allowed user specified load functions to be used in place of the existing code.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (PIG-55) Allow user control over split creation

Posted by "Charlie Groves (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/PIG-55?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Charlie Groves updated PIG-55:
------------------------------

    Attachment: pig_chunker_split_v7.patch

I'm uploading pig_chunker_split_v7.patch which is up to date with trunk as of r643987 and passes all the tests.  When the new SPLIT BY FILE syntax is used, I inform PigSlicer that it shouldn't split files which is the same as its behavior in PigInputFormat prior to this patch.  Other slicers aren't told of the splittable value.

> Allow user control over split creation
> --------------------------------------
>
>                 Key: PIG-55
>                 URL: https://issues.apache.org/jira/browse/PIG-55
>             Project: Pig
>          Issue Type: Improvement
>    Affects Versions: 0.0.0
>            Reporter: Charlie Groves
>             Fix For: 0.1.0
>
>         Attachments: pig_chunker_split.patch, pig_chunker_split_v2.patch, pig_chunker_split_v3.patch, pig_chunker_split_v4.patch, pig_chunker_split_v5.patch, pig_chunker_split_v6.patch, pig_chunker_split_v7.patch, replaceable_PigSplit.diff, replaceable_PigSplit_v2.diff
>
>
> I have a dataset in HDFS that's stored in a file per column that I'd like to access from pig.  This means I can't use LoadFunc to get at the data as it only allows the loader access to a single input stream at a time.  To handle this usage, I've broken the existing split creation code out into a few classes and interfaces, and allowed user specified load functions to be used in place of the existing code.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (PIG-55) Allow user control over split creation

Posted by "Charlie Groves (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/PIG-55?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Charlie Groves updated PIG-55:
------------------------------

    Attachment: pig_chunker_split_v5.patch

Sorry about that, forgot to update before adding to the patch.  This version compiles against and passes all tests on trunk as of r643141.

I'm a little surprised to see that all instances of String concatenation using + have been replaced with StringBuilder.  That's so much harder to read and Java compiles + concatenation with StringBuffer.  While StringBuffer is synchronized unlike StringBuilder, the JVM is smart enough to realize when a variable is only accessible from a single thread and optimize most of the locking penalty away for anything that's heavily accessed.  I would expect to see StringBuilder used in an inner loop, not in exception String construction.

> Allow user control over split creation
> --------------------------------------
>
>                 Key: PIG-55
>                 URL: https://issues.apache.org/jira/browse/PIG-55
>             Project: Pig
>          Issue Type: Improvement
>    Affects Versions: 0.0.0
>            Reporter: Charlie Groves
>             Fix For: 0.1.0
>
>         Attachments: pig_chunker_split.patch, pig_chunker_split_v2.patch, pig_chunker_split_v3.patch, pig_chunker_split_v4.patch, pig_chunker_split_v5.patch, replaceable_PigSplit.diff, replaceable_PigSplit_v2.diff
>
>
> I have a dataset in HDFS that's stored in a file per column that I'd like to access from pig.  This means I can't use LoadFunc to get at the data as it only allows the loader access to a single input stream at a time.  To handle this usage, I've broken the existing split creation code out into a few classes and interfaces, and allowed user specified load functions to be used in place of the existing code.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (PIG-55) Allow user control over split creation

Posted by "Olga Natkovich (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PIG-55?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12584240#action_12584240 ] 

Olga Natkovich commented on PIG-55:
-----------------------------------

I just use our current build.xml for running tests and it fails. If we do want to reansition to JUnit 4 style, the build.xml need to be changed to reflect that.

> Allow user control over split creation
> --------------------------------------
>
>                 Key: PIG-55
>                 URL: https://issues.apache.org/jira/browse/PIG-55
>             Project: Pig
>          Issue Type: Improvement
>    Affects Versions: 0.0.0
>            Reporter: Charlie Groves
>             Fix For: 0.1.0
>
>         Attachments: pig_chunker_split.patch, pig_chunker_split_v2.patch, pig_chunker_split_v3.patch, pig_chunker_split_v4.patch, pig_chunker_split_v5.patch, replaceable_PigSplit.diff, replaceable_PigSplit_v2.diff
>
>
> I have a dataset in HDFS that's stored in a file per column that I'd like to access from pig.  This means I can't use LoadFunc to get at the data as it only allows the loader access to a single input stream at a time.  To handle this usage, I've broken the existing split creation code out into a few classes and interfaces, and allowed user specified load functions to be used in place of the existing code.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (PIG-55) Allow user control over split creation

Posted by "Olga Natkovich (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/PIG-55?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Olga Natkovich updated PIG-55:
------------------------------

    Patch Info:   (was: [Patch Available])

cleared "patch available" flags since this patch needs to be updated

> Allow user control over split creation
> --------------------------------------
>
>                 Key: PIG-55
>                 URL: https://issues.apache.org/jira/browse/PIG-55
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Charlie Groves
>         Attachments: replaceable_PigSplit.diff, replaceable_PigSplit_v2.diff
>
>
> I have a dataset in HDFS that's stored in a file per column that I'd like to access from pig.  This means I can't use LoadFunc to get at the data as it only allows the loader access to a single input stream at a time.  To handle this usage, I've broken the existing split creation code out into a few classes and interfaces, and allowed user specified load functions to be used in place of the existing code.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (PIG-55) Allow user control over split creation

Posted by "Charlie Groves (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PIG-55?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12582521#action_12582521 ] 

Charlie Groves commented on PIG-55:
-----------------------------------

No, the changes to FileLocalizer and QueryParser are both on purpose.  Check out my previous comment.  Since the slicer might not have any representation on the file system for its location, it doesn't make sense for the parser to check on the file's existence.

> Allow user control over split creation
> --------------------------------------
>
>                 Key: PIG-55
>                 URL: https://issues.apache.org/jira/browse/PIG-55
>             Project: Pig
>          Issue Type: Improvement
>    Affects Versions: 0.0.0
>            Reporter: Charlie Groves
>             Fix For: 0.1.0
>
>         Attachments: pig_chunker_split.patch, pig_chunker_split_v2.patch, pig_chunker_split_v3.patch, replaceable_PigSplit.diff, replaceable_PigSplit_v2.diff
>
>
> I have a dataset in HDFS that's stored in a file per column that I'd like to access from pig.  This means I can't use LoadFunc to get at the data as it only allows the loader access to a single input stream at a time.  To handle this usage, I've broken the existing split creation code out into a few classes and interfaces, and allowed user specified load functions to be used in place of the existing code.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (PIG-55) Allow user control over split creation

Posted by "Benjamin Reed (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PIG-55?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12557453#action_12557453 ] 

Benjamin Reed commented on PIG-55:
----------------------------------

I like the idea, I think the PigSplitFactory interface (it should not be under impl btw) gets too much information. I think the programmer should just get input and desired number of splits. We should then build splits for Hadoop from that information. PigSplit is just a glue class (and a rather ugly one at that).

> Allow user control over split creation
> --------------------------------------
>
>                 Key: PIG-55
>                 URL: https://issues.apache.org/jira/browse/PIG-55
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Charlie Groves
>         Attachments: replaceable_PigSplit.diff
>
>
> I have a dataset in HDFS that's stored in a file per column that I'd like to access from pig.  This means I can't use LoadFunc to get at the data as it only allows the loader access to a single input stream at a time.  To handle this usage, I've broken the existing split creation code out into a few classes and interfaces, and allowed user specified load functions to be used in place of the existing code.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (PIG-55) Allow user control over split creation

Posted by "Charlie Groves (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PIG-55?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12557479#action_12557479 ] 

Charlie Groves commented on PIG-55:
-----------------------------------

Ahh, I left it under impl because it's something that only matters to the mapreduce portions of pig, but I guess all user implementable interfaces are going to be in the top level pig package?

I can see what you're saying about PigSplitFactory getting too much information, but my next step after this was going to be to figure out how to expose the actual fields used on the loaded values to the split factory.  Since my data is broken out by columns, if I know the accessed fields, I can only load the data necessary for those fields which will be a huge speedup.  I was thinking I could extract that data from groupbySpec and evalSpec.  Is there a better way to do this?

Regardless of that, the JobConf can be accessed from the PigContext, and the index value doesn't bear any relevance outside of pig's internals, so I can drop those parameters.  I can also remove the getEvalSpec, getGroupbySpec and getIndex methods from the PigSplit interface and handle that internally without encumbering user created splits.  However, the PigSplit interface can't go away altogether because the PigSplitFactory has to be able to return the actual splits so they can handle the getLength and getLocations methods appropriately for the hdfs files they're loading, and so they can create the actual RecordReader method with makeReader.  Since that's particular to the style of loading the split factory is implementing, there's no way to do it generically from pig.

Another patch forthcoming along these lines.

> Allow user control over split creation
> --------------------------------------
>
>                 Key: PIG-55
>                 URL: https://issues.apache.org/jira/browse/PIG-55
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Charlie Groves
>         Attachments: replaceable_PigSplit.diff
>
>
> I have a dataset in HDFS that's stored in a file per column that I'd like to access from pig.  This means I can't use LoadFunc to get at the data as it only allows the loader access to a single input stream at a time.  To handle this usage, I've broken the existing split creation code out into a few classes and interfaces, and allowed user specified load functions to be used in place of the existing code.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.