Posted to dev@pig.apache.org by "Alex Rovner (Created) (JIRA)" <ji...@apache.org> on 2012/01/09 19:12:39 UTC

[jira] [Created] (PIG-2462) getWrappedSplit is incorrectly returning the first split instead of the current split.

getWrappedSplit is incorrectly returning the first split instead of the current split.
--------------------------------------------------------------------------------------

                 Key: PIG-2462
                 URL: https://issues.apache.org/jira/browse/PIG-2462
             Project: Pig
          Issue Type: Bug
            Reporter: Alex Rovner
             Fix For: 0.9.1, 0.11


If your loader needs to know which file is currently being read (say, for schema information), Pig currently provides this ability by calling prepareToRead every time a new split is read. This is critical for CombinedInputFormat, as each mapper can read more than one file. For the load function to know which file is currently being read, it should call getWrappedSplit() to get that information. However, getWrappedSplit() always returns the first split in the list. Code from PigSplit.java:

    /**
     * This methods returns the actual InputSplit (as returned by the 
     * {@link InputFormat}) which this class is wrapping.
     * @return the wrappedSplit
     */
    public InputSplit getWrappedSplit() {
        return wrappedSplits[0];
    }


Furthermore, in PigRecordReader.java the splitIndex is never incremented when changing from split to split. So even if getWrappedSplit() were changed to return wrappedSplits[splitIndex], it would still return the wrong split.

This can be fixed by changing PigRecordReader to increment PigSplit.splitIndex every time the split changes in the following code:


    /**
     * Get the record reader for the next chunk in this CombineFileSplit.
     */
    protected boolean initNextRecordReader() throws IOException, InterruptedException {

        if (curReader != null) {
            curReader.close();
            curReader = null;
            if (idx > 0) {
                progress += pigSplit.getLength(idx-1);    // done processing so far
            }
        }

        // if all chunks have been processed, nothing more to do.
        if (idx == pigSplit.getNumPaths()) {
            return false;
        }

        // get a record reader for the idx-th chunk
        try {
          

            curReader =  inputformat.createRecordReader(pigSplit.getWrappedSplit(idx), context);
            LOG.info("Current split being processed "+pigSplit.getWrappedSplit(idx));

            if (idx > 0) {
                // initialize() for the first RecordReader will be called by MapTask;
                // we're responsible for initializing subsequent RecordReaders.
                curReader.initialize(pigSplit.getWrappedSplit(idx), context);
                pigSplit.get
                loadfunc.prepareToRead(curReader, pigSplit);
            }
        } catch (Exception e) {
            throw new RuntimeException (e);
        }
        idx++;
        return true;
    }
}
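
The behaviour described above can be reproduced in isolation. The following is only a minimal sketch, not Pig code: CombinedSplit is a hypothetical stand-in for PigSplit that keeps file names instead of real InputSplit objects, and filesSeenByLoader is an invented helper that mimics a loader asking the split which file is current while the reader walks through the chunks.

```java
import java.util.ArrayList;
import java.util.List;

class WrappedSplitBugDemo {

    // Hypothetical stand-in for PigSplit: wraps several per-file splits.
    static class CombinedSplit {
        private final String[] wrappedSplits;

        CombinedSplit(String... wrappedSplits) {
            this.wrappedSplits = wrappedSplits;
        }

        int getNumPaths() {
            return wrappedSplits.length;
        }

        // Mirrors the reported behaviour: always the first wrapped split.
        String getWrappedSplit() {
            return wrappedSplits[0];
        }

        // Indexed accessor, as used by the record reader.
        String getWrappedSplit(int idx) {
            return wrappedSplits[idx];
        }
    }

    // Simulates the reader advancing through every chunk while the loader
    // asks the split, via the no-argument accessor, which file it is on.
    static List<String> filesSeenByLoader(CombinedSplit split) {
        List<String> seen = new ArrayList<>();
        for (int idx = 0; idx < split.getNumPaths(); idx++) {
            // The reader would open split.getWrappedSplit(idx) here,
            // but the loader only has the no-argument accessor.
            seen.add(split.getWrappedSplit());
        }
        return seen;
    }

    public static void main(String[] args) {
        CombinedSplit split = new CombinedSplit("a.txt", "b.txt", "c.txt");
        // Prints [a.txt, a.txt, a.txt]: the loader never sees b.txt or c.txt.
        System.out.println(filesSeenByLoader(split));
    }
}
```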


--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (PIG-2462) getWrappedSplit is incorrectly returning the first split instead of the current split.

Posted by "Yulia Tolskaya (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-2462?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13189419#comment-13189419 ] 

Yulia Tolskaya commented on PIG-2462:
-------------------------------------

This bug also exists in Pig 0.8.
                
> getWrappedSplit is incorrectly returning the first split instead of the current split.
> --------------------------------------------------------------------------------------
>
>                 Key: PIG-2462
>                 URL: https://issues.apache.org/jira/browse/PIG-2462
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.9.1, 0.11
>            Reporter: Alex Rovner
>            Assignee: Alex Rovner
>             Fix For: 0.9.2, 0.10, 0.11
>
>         Attachments: PIG-2462-2.patch, PIG-2462-2_0.9.patch, split_fix_take2.patch, splitsfix.patch
>
>


        

[jira] [Updated] (PIG-2462) getWrappedSplit is incorrectly returning the first split instead of the current split.

Posted by "Alex Rovner (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-2462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alex Rovner updated PIG-2462:
-----------------------------

    Status: Patch Available  (was: Open)
    
> getWrappedSplit is incorrectly returning the first split instead of the current split.
> --------------------------------------------------------------------------------------
>
>                 Key: PIG-2462
>                 URL: https://issues.apache.org/jira/browse/PIG-2462
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.9.1, 0.11
>            Reporter: Alex Rovner
>             Fix For: 0.11
>
>         Attachments: splitsfix.patch
>
>


        

[jira] [Commented] (PIG-2462) getWrappedSplit is incorrectly returning the first split instead of the current split.

Posted by "Daniel Dai (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-2462?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13183716#comment-13183716 ] 

Daniel Dai commented on PIG-2462:
---------------------------------

Thanks Aniket. I just realized the patch tries to reuse splitIndex. We should add a new variable idx to PigSplit that is set by PigRecordReader.initNextRecordReader and consumed by PigSplit.getWrappedSplit to track the current InputSplit. It has nothing to do with PigSplit.splitIndex.
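
One way to picture this proposal is the sketch below. It is not the attached patch: CombinedSplit and Loader are hypothetical stand-ins for PigSplit and LoadFunc, currentIdx and setCurrentIdx are assumed names for the new variable and its setter, and readAll is a simplified loop standing in for PigRecordReader.initNextRecordReader (it calls prepareToRead for every chunk, including the first).

```java
import java.util.ArrayList;
import java.util.List;

class CurrentSplitFixSketch {

    // Hypothetical stand-in for PigSplit.
    static class CombinedSplit {
        private final String[] wrappedSplits;
        private final int splitIndex; // position of this split in the job; not touched by the fix
        private int currentIdx = 0;   // assumed name: index of the wrapped split being read

        CombinedSplit(int splitIndex, String... wrappedSplits) {
            this.splitIndex = splitIndex;
            this.wrappedSplits = wrappedSplits;
        }

        int getNumPaths() {
            return wrappedSplits.length;
        }

        int getSplitIndex() {
            return splitIndex;
        }

        // Set by the reader each time it moves to the next chunk.
        void setCurrentIdx(int idx) {
            this.currentIdx = idx;
        }

        // Consumes the tracked index instead of the hard-coded 0.
        String getWrappedSplit() {
            return wrappedSplits[currentIdx];
        }
    }

    // Hypothetical stand-in for the prepareToRead callback of a load function.
    interface Loader {
        void prepareToRead(CombinedSplit split);
    }

    // Simplified stand-in for the reader loop: record the current chunk
    // on the split before handing the split to the loader.
    static void readAll(CombinedSplit split, Loader loader) {
        for (int idx = 0; idx < split.getNumPaths(); idx++) {
            split.setCurrentIdx(idx);
            loader.prepareToRead(split);
        }
    }

    static List<String> filesSeenByLoader(CombinedSplit split) {
        List<String> seen = new ArrayList<>();
        readAll(split, s -> seen.add(s.getWrappedSplit()));
        return seen;
    }

    public static void main(String[] args) {
        CombinedSplit split = new CombinedSplit(7, "a.txt", "b.txt", "c.txt");
        System.out.println(filesSeenByLoader(split)); // [a.txt, b.txt, c.txt]
        System.out.println(split.getSplitIndex());    // 7, unchanged
    }
}
```

Keeping the tracked index separate from splitIndex means code that relies on splitIndex to identify the PigSplit within the job is unaffected.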
                
> getWrappedSplit is incorrectly returning the first split instead of the current split.
> --------------------------------------------------------------------------------------
>
>                 Key: PIG-2462
>                 URL: https://issues.apache.org/jira/browse/PIG-2462
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.9.1, 0.11
>            Reporter: Alex Rovner
>             Fix For: 0.11
>
>         Attachments: splitsfix.patch
>
>


        

[jira] [Commented] (PIG-2462) getWrappedSplit is incorrectly returning the first split instead of the current split.

Posted by "Daniel Dai (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-2462?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13183164#comment-13183164 ] 

Daniel Dai commented on PIG-2462:
---------------------------------

splitIndex is the index of the PigSplit; idx keeps track of the current InputSplit within the PigSplit. I feel changing wrappedSplits[0] to wrappedSplits[idx] should be enough.
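
The two levels of indexing can be illustrated with a toy model. This is only a sketch under assumptions: the job's splits are represented as a plain array of arrays of file names, and describe is an invented helper, not part of Pig.

```java
import java.util.ArrayList;
import java.util.List;

class SplitIndexDemo {

    // Outer index: which combined split (one per mapper) we are in.
    // Inner index: which wrapped split inside that combined split is current.
    static List<String> describe(String[][] pigSplits) {
        List<String> lines = new ArrayList<>();
        for (int splitIndex = 0; splitIndex < pigSplits.length; splitIndex++) {
            String[] wrappedSplits = pigSplits[splitIndex];
            for (int idx = 0; idx < wrappedSplits.length; idx++) {
                lines.add("splitIndex=" + splitIndex + " idx=" + idx + " -> " + wrappedSplits[idx]);
            }
        }
        return lines;
    }

    public static void main(String[] args) {
        String[][] pigSplits = { { "a.txt", "b.txt" }, { "c.txt" } };
        for (String line : describe(pigSplits)) {
            System.out.println(line);
        }
    }
}
```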
                
> getWrappedSplit is incorrectly returning the first split instead of the current split.
> --------------------------------------------------------------------------------------
>
>                 Key: PIG-2462
>                 URL: https://issues.apache.org/jira/browse/PIG-2462
>             Project: Pig
>          Issue Type: Bug
>            Reporter: Alex Rovner
>             Fix For: 0.9.1, 0.11
>
>


        

[jira] [Updated] (PIG-2462) getWrappedSplit is incorrectly returning the first split instead of the current split.

Posted by "Daniel Dai (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-2462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Daniel Dai updated PIG-2462:
----------------------------

    Attachment:     (was: PIG-2462-2_0.9.patch)
    
> getWrappedSplit is incorrectly returning the first split instead of the current split.
> --------------------------------------------------------------------------------------
>
>                 Key: PIG-2462
>                 URL: https://issues.apache.org/jira/browse/PIG-2462
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.9.1, 0.11
>            Reporter: Alex Rovner
>             Fix For: 0.11
>
>         Attachments: PIG-2462-2.patch, PIG-2462-2_0.9.patch, split_fix_take2.patch, splitsfix.patch
>
>


        

[jira] [Updated] (PIG-2462) getWrappedSplit is incorrectly returning the first split instead of the current split.

Posted by "Daniel Dai (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-2462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Daniel Dai updated PIG-2462:
----------------------------

    Attachment: PIG-2462-2_0.9.patch

PIG-2462-2_0.9.patch is the same patch for the 0.9 branch.
                
> getWrappedSplit is incorrectly returning the first split instead of the current split.
> --------------------------------------------------------------------------------------
>
>                 Key: PIG-2462
>                 URL: https://issues.apache.org/jira/browse/PIG-2462
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.9.1, 0.11
>            Reporter: Alex Rovner
>             Fix For: 0.11
>
>         Attachments: PIG-2462-2.patch, PIG-2462-2_0.9.patch, split_fix_take2.patch, splitsfix.patch
>
>


        

[jira] [Updated] (PIG-2462) getWrappedSplit is incorrectly returning the first split instead of the current split.

Posted by "Daniel Dai (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-2462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Daniel Dai updated PIG-2462:
----------------------------

    Attachment: PIG-2462-2.patch

I may have been overthinking this. We don't need an InputFormat in the test; we only need a LoadFunc. I attached a patch with a test case.
                
> getWrappedSplit is incorrectly returning the first split instead of the current split.
> --------------------------------------------------------------------------------------
>
>                 Key: PIG-2462
>                 URL: https://issues.apache.org/jira/browse/PIG-2462
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.9.1, 0.11
>            Reporter: Alex Rovner
>             Fix For: 0.11
>
>         Attachments: PIG-2462-2.patch, split_fix_take2.patch, splitsfix.patch
>
>


        

[jira] [Updated] (PIG-2462) getWrappedSplit is incorrectly returning the first split instead of the current split.

Posted by "Daniel Dai (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-2462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Daniel Dai updated PIG-2462:
----------------------------

    Attachment: PIG-2462-2_0.9.patch
    
> getWrappedSplit is incorrectly returning the first split instead of the current split.
> --------------------------------------------------------------------------------------
>
>                 Key: PIG-2462
>                 URL: https://issues.apache.org/jira/browse/PIG-2462
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.9.1, 0.11
>            Reporter: Alex Rovner
>             Fix For: 0.11
>
>         Attachments: PIG-2462-2.patch, PIG-2462-2_0.9.patch, split_fix_take2.patch, splitsfix.patch
>
>
> If your loader needs to know which file is currently being read (say, for schema information), Pig currently provides this ability by calling prepareToRead every time a new split is read. This is critical for CombinedInputFormat, as each mapper can read more than one file. For the load function to know which file is currently being read, it should call getWrappedSplit() to get that information. However, getWrappedSplit() always returns the first split in the list. Code from PigSplit.java:
>     /**
>      * This methods returns the actual InputSplit (as returned by the 
>      * {@link InputFormat}) which this class is wrapping.
>      * @return the wrappedSplit
>      */
>     public InputSplit getWrappedSplit() {
>         return wrappedSplits[0];
>     }
> Furthermore, in PigRecordReader.java the splitIndex is never incremented when changing from split to split. So in fact, even if getWrappedSplit() wold be changed to return wrappedSplits[splitIndex]; it would still return the incorrect index. 
> This can be fixed by changing PigRecordReader to increment PigSplit.splitIndex everytime the split chagnes in the following code:
>     /**
>      * Get the record reader for the next chunk in this CombineFileSplit.
>      */
>     protected boolean initNextRecordReader() throws IOException, InterruptedException {
>         if (curReader != null) {
>             curReader.close();
>             curReader = null;
>             if (idx > 0) {
>                 progress += pigSplit.getLength(idx-1);    // done processing so far
>             }
>         }
>         // if all chunks have been processed, nothing more to do.
>         if (idx == pigSplit.getNumPaths()) {
>             return false;
>         }
>         // get a record reader for the idx-th chunk
>         try {
>           
>             curReader =  inputformat.createRecordReader(pigSplit.getWrappedSplit(idx), context);
>             LOG.info("Current split being processed "+pigSplit.getWrappedSplit(idx));
>             if (idx > 0) {
>                 // initialize() for the first RecordReader will be called by MapTask;
>                 // we're responsible for initializing subsequent RecordReaders.
>                 curReader.initialize(pigSplit.getWrappedSplit(idx), context);
>                 pigSplit.get
>                 loadfunc.prepareToRead(curReader, pigSplit);
>             }
>         } catch (Exception e) {
>             throw new RuntimeException (e);
>         }
>         idx++;
>         return true;
>     }
> }


        

[jira] [Updated] (PIG-2462) getWrappedSplit is incorrectly returning the first split instead of the current split.

Posted by "Daniel Dai (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-2462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Daniel Dai updated PIG-2462:
----------------------------

    Attachment:     (was: PIG-2462-2_0.9.patch)
    
> getWrappedSplit is incorrectly returning the first split instead of the current split.
> --------------------------------------------------------------------------------------
>
>                 Key: PIG-2462
>                 URL: https://issues.apache.org/jira/browse/PIG-2462
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.9.1, 0.11
>            Reporter: Alex Rovner
>             Fix For: 0.11
>
>         Attachments: PIG-2462-2.patch, PIG-2462-2_0.9.patch, split_fix_take2.patch, splitsfix.patch
>
>


        

[jira] [Updated] (PIG-2462) getWrappedSplit is incorrectly returning the first split instead of the current split.

Posted by "Daniel Dai (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-2462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Daniel Dai updated PIG-2462:
----------------------------

    Attachment:     (was: PIG-2462-2.patch)
    
> getWrappedSplit is incorrectly returning the first split instead of the current split.
> --------------------------------------------------------------------------------------
>
>                 Key: PIG-2462
>                 URL: https://issues.apache.org/jira/browse/PIG-2462
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.9.1, 0.11
>            Reporter: Alex Rovner
>             Fix For: 0.11
>
>         Attachments: PIG-2462-2.patch, PIG-2462-2_0.9.patch, split_fix_take2.patch, splitsfix.patch
>
>


        

[jira] [Updated] (PIG-2462) getWrappedSplit is incorrectly returning the first split instead of the current split.

Posted by "Daniel Dai (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-2462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Daniel Dai updated PIG-2462:
----------------------------

    Attachment: PIG-2462-2_0.9.patch
    
> getWrappedSplit is incorrectly returning the first split instead of the current split.
> --------------------------------------------------------------------------------------
>
>                 Key: PIG-2462
>                 URL: https://issues.apache.org/jira/browse/PIG-2462
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.9.1, 0.11
>            Reporter: Alex Rovner
>             Fix For: 0.11
>
>         Attachments: PIG-2462-2.patch, PIG-2462-2_0.9.patch, split_fix_take2.patch, splitsfix.patch
>
>


        

[jira] [Commented] (PIG-2462) getWrappedSplit is incorrectly returning the first split instead of the current split.

Posted by "Daniel Dai (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-2462?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13184386#comment-13184386 ] 

Daniel Dai commented on PIG-2462:
---------------------------------

bq. splitIndex within the PigInputFormat tracks the current PigSplit, correct?
Yes.
bq. What does splitIndex within the PigSplit track? (From my understanding it should track the current wrapped InputSplit)
It is how a PigSplit identifies itself.
bq. There is also inputIndex within PigSplit. Wouldn't that track the InputSplit index?
If a MapReduce job needs more than one input (e.g., for "join a, b" we have two inputs, a and b, in the same map), inputIndex tracks which input the split belongs to.
                
> getWrappedSplit is incorrectly returning the first split instead of the current split.
> --------------------------------------------------------------------------------------
>
>                 Key: PIG-2462
>                 URL: https://issues.apache.org/jira/browse/PIG-2462
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.9.1, 0.11
>            Reporter: Alex Rovner
>             Fix For: 0.11
>
>         Attachments: splitsfix.patch
>
>


        

[jira] [Updated] (PIG-2462) getWrappedSplit is incorrectly returning the first split instead of the current split.

Posted by "Alex Rovner (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-2462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alex Rovner updated PIG-2462:
-----------------------------

    Attachment:     (was: split_fix_take2.patch)
    
> getWrappedSplit is incorrectly returning the first split instead of the current split.
> --------------------------------------------------------------------------------------
>
>                 Key: PIG-2462
>                 URL: https://issues.apache.org/jira/browse/PIG-2462
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.9.1, 0.11
>            Reporter: Alex Rovner
>             Fix For: 0.11
>
>         Attachments: split_fix_take2.patch, splitsfix.patch
>
>


        

[jira] [Updated] (PIG-2462) getWrappedSplit is incorrectly returning the first split instead of the current split.

Posted by "Daniel Dai (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-2462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Daniel Dai updated PIG-2462:
----------------------------

    Attachment: PIG-2462-2_0.9.patch
    
> getWrappedSplit is incorrectly returning the first split instead of the current split.
> --------------------------------------------------------------------------------------
>
>                 Key: PIG-2462
>                 URL: https://issues.apache.org/jira/browse/PIG-2462
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.9.1, 0.11
>            Reporter: Alex Rovner
>             Fix For: 0.11
>
>         Attachments: PIG-2462-2_0.9.patch, split_fix_take2.patch, splitsfix.patch
>
>


        

[jira] [Commented] (PIG-2462) getWrappedSplit is incorrectly returning the first split instead of the current split.

Posted by "Prashant Kommireddi (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-2462?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13183514#comment-13183514 ] 

Prashant Kommireddi commented on PIG-2462:
------------------------------------------

How would this change affect MergeJoinIndexer.java, which uses pigSplit.getWrappedSplit()?
                
> getWrappedSplit is incorrectly returning the first split instead of the current split.
> --------------------------------------------------------------------------------------
>
>                 Key: PIG-2462
>                 URL: https://issues.apache.org/jira/browse/PIG-2462
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.9.1, 0.11
>            Reporter: Alex Rovner
>             Fix For: 0.11
>
>         Attachments: splitsfix.patch
>
>


        

[jira] [Commented] (PIG-2462) getWrappedSplit is incorrectly returning the first split instead of the current split.

Posted by "Daniel Dai (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-2462?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13184301#comment-13184301 ] 

Daniel Dai commented on PIG-2462:
---------------------------------

Hi, Alex,
We have two levels of split, PigSplit and InputSplit; a PigSplit wraps several InputSplits. In PigInputFormat, we combine multiple InputSplits into one PigSplit. splitIndex tracks the current PigSplit, and idx tracks the current InputSplit within the PigSplit.
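
To make the two levels concrete, here is a minimal, self-contained sketch. It is not Pig's code: FileChunk, CombinedSplit and pathsSeenByReader are hypothetical stand-ins for InputSplit, PigSplit and the record reader loop, and only the names getWrappedSplit, getNumPaths, splitIndex and idx are borrowed from the thread. It assumes the outer split keeps a fixed splitIndex while the reader advances its own idx over the wrapped splits.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class TwoLevelSplitSketch {

    /** Hypothetical stand-in for a Hadoop InputSplit: it only remembers its file. */
    static final class FileChunk {
        final String path;
        FileChunk(String path) { this.path = path; }
    }

    /** Hypothetical stand-in for PigSplit: one outer split wrapping several chunks. */
    static final class CombinedSplit {
        private final int splitIndex;          // identifies this outer split
        private final List<FileChunk> wrapped; // the inner splits it combines

        CombinedSplit(int splitIndex, List<FileChunk> wrapped) {
            this.splitIndex = splitIndex;
            this.wrapped = wrapped;
        }

        int getSplitIndex() { return splitIndex; }
        int getNumPaths() { return wrapped.size(); }

        /** Behaviour reported in the issue: always the first inner split. */
        FileChunk getWrappedSplit() { return wrapped.get(0); }

        /** Indexed lookup, as the reader in the quoted code uses it. */
        FileChunk getWrappedSplit(int idx) { return wrapped.get(idx); }
    }

    /** Walks the inner splits the way a record reader would, advancing idx only. */
    static List<String> pathsSeenByReader(CombinedSplit split) {
        List<String> seen = new ArrayList<>();
        for (int idx = 0; idx < split.getNumPaths(); idx++) {
            seen.add(split.getWrappedSplit(idx).path);
        }
        return seen;
    }

    public static void main(String[] args) {
        CombinedSplit split = new CombinedSplit(3,
                Arrays.asList(new FileChunk("part-0"), new FileChunk("part-1")));
        System.out.println(pathsSeenByReader(split));     // both files, in order
        System.out.println(split.getWrappedSplit().path); // always the first file
        System.out.println(split.getSplitIndex());        // unchanged while idx moves
    }
}
```

In this model the no-argument getWrappedSplit() cannot tell a loader which file is current, because the only thing that changes between files is idx, which lives in the reader rather than in the split.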
                
> getWrappedSplit is incorrectly returning the first split instead of the current split.
> --------------------------------------------------------------------------------------
>
>                 Key: PIG-2462
>                 URL: https://issues.apache.org/jira/browse/PIG-2462
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.9.1, 0.11
>            Reporter: Alex Rovner
>             Fix For: 0.11
>
>         Attachments: splitsfix.patch
>
>

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (PIG-2462) getWrappedSplit is incorrectly returning the first split instead of the current split.

Posted by "Daniel Dai (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-2462?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13185145#comment-13185145 ] 

Daniel Dai commented on PIG-2462:
---------------------------------

Patch looks good. A test is a little complex, but possible. We need to add a test case.
                
> getWrappedSplit is incorrectly returning the first split instead of the current split.
> --------------------------------------------------------------------------------------
>
>                 Key: PIG-2462
>                 URL: https://issues.apache.org/jira/browse/PIG-2462
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.9.1, 0.11
>            Reporter: Alex Rovner
>             Fix For: 0.11
>
>         Attachments: split_fix_take2.patch, splitsfix.patch
>
>


        

[jira] [Commented] (PIG-2462) getWrappedSplit is incorrectly returning the first split instead of the current split.

Posted by "Daniel Dai (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-2462?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13189477#comment-13189477 ] 

Daniel Dai commented on PIG-2462:
---------------------------------

Yes; however, we don't have plans for another 0.8 release. Can you apply the patch to the 0.8 branch and build it yourself?
                
> getWrappedSplit is incorrectly returning the first split instead of the current split.
> --------------------------------------------------------------------------------------
>
>                 Key: PIG-2462
>                 URL: https://issues.apache.org/jira/browse/PIG-2462
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.9.1, 0.11
>            Reporter: Alex Rovner
>            Assignee: Alex Rovner
>             Fix For: 0.9.2, 0.10, 0.11
>
>         Attachments: PIG-2462-2.patch, PIG-2462-2_0.9.patch, split_fix_take2.patch, splitsfix.patch
>
>


        

[jira] [Commented] (PIG-2462) getWrappedSplit is incorrectly returning the first split instead of the current split.

Posted by "Alex Rovner (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-2462?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13184370#comment-13184370 ] 

Alex Rovner commented on PIG-2462:
----------------------------------

Thanks Daniel for the info. 

Some questions:
splitIndex within PigInputFormat tracks the current PigSplit, correct?
What does splitIndex within PigSplit track? (From my understanding, it should track the current wrapped InputSplit.)
There is also inputIndex within PigSplit. Wouldn't that track the InputSplit index?

Finally, do we need to introduce an "idx" in PigSplit, or would my patch suffice?
                
> getWrappedSplit is incorrectly returning the first split instead of the current split.
> --------------------------------------------------------------------------------------
>
>                 Key: PIG-2462
>                 URL: https://issues.apache.org/jira/browse/PIG-2462
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.9.1, 0.11
>            Reporter: Alex Rovner
>             Fix For: 0.11
>
>         Attachments: splitsfix.patch
>
>


        

[jira] [Updated] (PIG-2462) getWrappedSplit is incorrectly returning the first split instead of the current split.

Posted by "Alex Rovner (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-2462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alex Rovner updated PIG-2462:
-----------------------------

    Attachment: split_fix_take2.patch
    
> getWrappedSplit is incorrectly returning the first split instead of the current split.
> --------------------------------------------------------------------------------------
>
>                 Key: PIG-2462
>                 URL: https://issues.apache.org/jira/browse/PIG-2462
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.9.1, 0.11
>            Reporter: Alex Rovner
>             Fix For: 0.11
>
>         Attachments: split_fix_take2.patch, splitsfix.patch
>
>


        

[jira] [Commented] (PIG-2462) getWrappedSplit is incorrectly returning the first split instead of the current split.

Posted by "Prashant Kommireddi (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-2462?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13183533#comment-13183533 ] 

Prashant Kommireddi commented on PIG-2462:
------------------------------------------

I did not find anything wrong either by looking at the code, just making sure.
                
> getWrappedSplit is incorrectly returning the first split instead of the current split.
> --------------------------------------------------------------------------------------
>
>                 Key: PIG-2462
>                 URL: https://issues.apache.org/jira/browse/PIG-2462
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.9.1, 0.11
>            Reporter: Alex Rovner
>             Fix For: 0.11
>
>         Attachments: splitsfix.patch
>
>


        

[jira] [Commented] (PIG-2462) getWrappedSplit is incorrectly returning the first split instead of the current split.

Posted by "Aniket Mokashi (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-2462?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13183688#comment-13183688 ] 

Aniket Mokashi commented on PIG-2462:
-------------------------------------

splitIndex's definition is being changed here (it's not newly added).
splitIndex is used for keeping track of the PigSplit itself. While reading records with PigRecordReader, should we really change this index?

From the MergeJoinIndexer code, consider a case where each PigSplit is associated with one wrapped split and we have a couple of PigSplits. Before we do wrapperTuple.set(keysCnt+1, pigSplit.getSplitIndex()); on line 179, we do loader.getNext(). I am not very sure about this, but further down the stack this might hit PigRecordReader.initNextRecordReader, which would reset the splitIndex on the PigSplit to 0 for every PigSplit.

It would be safer to keep the idx on PigSplit as a separate variable and copy it down from PigRecordReader as needed.

Thoughts?
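
To make the suggestion concrete, here is a minimal sketch of the separate-variable idea. All names are hypothetical stand-ins (not the actual PigSplit or PigRecordReader, and not the attached patch), and strings stand in for the wrapped InputSplits. The reader copies its own idx down to the split, and splitIndex is left untouched.

```java
// Hypothetical stand-in for PigSplit; strings stand in for the wrapped InputSplits.
final class SplitWithCursor {
    private final String[] wrappedSplits;
    private final int splitIndex;  // identifies this combined split; the reader never touches it
    private int currentIdx = 0;    // hypothetical field: wrapped split currently being read

    SplitWithCursor(int splitIndex, String... wrappedSplits) {
        this.splitIndex = splitIndex;
        this.wrappedSplits = wrappedSplits;
    }

    int getSplitIndex() { return splitIndex; }
    int getNumPaths() { return wrappedSplits.length; }
    String getWrappedSplit(int idx) { return wrappedSplits[idx]; }

    // Returns the wrapped split currently being read instead of always the first one.
    String getWrappedSplit() { return wrappedSplits[currentIdx]; }

    void setCurrentIdx(int idx) {
        if (idx < 0 || idx >= wrappedSplits.length) {
            throw new IndexOutOfBoundsException("idx=" + idx);
        }
        currentIdx = idx;
    }
}

// Hypothetical stand-in for the record reader.
final class CursorReader {
    private final SplitWithCursor split;
    private int idx = 0;  // next wrapped split to open

    CursorReader(SplitWithCursor split) { this.split = split; }

    // Moves to the next wrapped split; returns false when none are left.
    boolean initNext() {
        if (idx == split.getNumPaths()) {
            return false;
        }
        split.setCurrentIdx(idx);  // copy the reader's position down to the split
        idx++;
        return true;
    }
}
```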
                
> getWrappedSplit is incorrectly returning the first split instead of the current split.
> --------------------------------------------------------------------------------------
>
>                 Key: PIG-2462
>                 URL: https://issues.apache.org/jira/browse/PIG-2462
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.9.1, 0.11
>            Reporter: Alex Rovner
>             Fix For: 0.11
>
>         Attachments: splitsfix.patch
>
>


        

[jira] [Commented] (PIG-2462) getWrappedSplit is incorrectly returning the first split instead of the current split.

Posted by "Alex Rovner (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-2462?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13182722#comment-13182722 ] 

Alex Rovner commented on PIG-2462:
----------------------------------

I am attempting to make a patch. Proposed fixes:

Add pigSplit.setSplitIndex(idx) before curReader.initialize(pigSplit.getWrappedSplit(idx), context);

Change return wrappedSplits[0]; to return wrappedSplits[splitIndex]; in PigSplit.getWrappedSplit();
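
A minimal sketch of what these two changes amount to. The class is a hypothetical stand-in rather than the real PigSplit, strings stand in for the wrapped InputSplits, and this is not the attached patch. Note that in this sketch the one field splitIndex ends up serving both as the split's own index and as the position within the wrapped splits.

```java
// Hypothetical stand-in for PigSplit with the two proposed changes applied.
final class ReusedIndexSplit {
    private final String[] wrappedSplits;
    private int splitIndex;  // reused here as the position within wrappedSplits

    ReusedIndexSplit(String... wrappedSplits) {
        this.wrappedSplits = wrappedSplits;
    }

    void setSplitIndex(int splitIndex) { this.splitIndex = splitIndex; }
    int getSplitIndex() { return splitIndex; }
    int getNumPaths() { return wrappedSplits.length; }
    String getWrappedSplit(int idx) { return wrappedSplits[idx]; }

    // Proposed change: previously this returned wrappedSplits[0].
    String getWrappedSplit() { return wrappedSplits[splitIndex]; }
}

class ReusedIndexDemo {
    public static void main(String[] args) {
        ReusedIndexSplit split = new ReusedIndexSplit("part-0", "part-1");
        for (int idx = 0; idx < split.getNumPaths(); idx++) {
            // Proposed change: record the position before the reader for this chunk is initialized.
            split.setSplitIndex(idx);
            System.out.println("current wrapped split: " + split.getWrappedSplit());
        }
    }
}
```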

                
> getWrappedSplit is incorrectly returning the first split instead of the current split.
> --------------------------------------------------------------------------------------
>
>                 Key: PIG-2462
>                 URL: https://issues.apache.org/jira/browse/PIG-2462
>             Project: Pig
>          Issue Type: Bug
>            Reporter: Alex Rovner
>             Fix For: 0.9.1, 0.11
>
>


        

[jira] [Updated] (PIG-2462) getWrappedSplit is incorrectly returning the first split instead of the current split.

Posted by "Daniel Dai (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-2462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Daniel Dai updated PIG-2462:
----------------------------

    Attachment: PIG-2462-2.patch
    
> getWrappedSplit is incorrectly returning the first split instead of the current split.
> --------------------------------------------------------------------------------------
>
>                 Key: PIG-2462
>                 URL: https://issues.apache.org/jira/browse/PIG-2462
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.9.1, 0.11
>            Reporter: Alex Rovner
>             Fix For: 0.11
>
>         Attachments: PIG-2462-2.patch, PIG-2462-2_0.9.patch, split_fix_take2.patch, splitsfix.patch
>
>
> If your loader needs information regarding what file is currently is being read (lets say for schema information), currently provides this ability by calling prepareToRead every time we read a new split. This is critical for ComibinedInputFormat as each mapper can read more then one file. In order for the load function to know what file we are currently reading, it should call getWrappedSplit() to get that information. How ever, getWrappedSplit always returns the first split in the list. Code from PigSplit.java:
>     /**
>      * This methods returns the actual InputSplit (as returned by the 
>      * {@link InputFormat}) which this class is wrapping.
>      * @return the wrappedSplit
>      */
>     public InputSplit getWrappedSplit() {
>         return wrappedSplits[0];
>     }
> Furthermore, in PigRecordReader.java the splitIndex is never incremented when changing from split to split. So in fact, even if getWrappedSplit() wold be changed to return wrappedSplits[splitIndex]; it would still return the incorrect index. 
> This can be fixed by changing PigRecordReader to increment PigSplit.splitIndex everytime the split chagnes in the following code:
>     /**
>      * Get the record reader for the next chunk in this CombineFileSplit.
>      */
>     protected boolean initNextRecordReader() throws IOException, InterruptedException {
>         if (curReader != null) {
>             curReader.close();
>             curReader = null;
>             if (idx > 0) {
>                 progress += pigSplit.getLength(idx-1);    // done processing so far
>             }
>         }
>         // if all chunks have been processed, nothing more to do.
>         if (idx == pigSplit.getNumPaths()) {
>             return false;
>         }
>         // get a record reader for the idx-th chunk
>         try {
>           
>             curReader =  inputformat.createRecordReader(pigSplit.getWrappedSplit(idx), context);
>             LOG.info("Current split being processed "+pigSplit.getWrappedSplit(idx));
>             if (idx > 0) {
>                 // initialize() for the first RecordReader will be called by MapTask;
>                 // we're responsible for initializing subsequent RecordReaders.
>                 curReader.initialize(pigSplit.getWrappedSplit(idx), context);
>                 loadfunc.prepareToRead(curReader, pigSplit);
>             }
>         } catch (Exception e) {
>             throw new RuntimeException (e);
>         }
>         idx++;
>         return true;
>     }
> }

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (PIG-2462) getWrappedSplit is incorrectly returning the first split instead of the current split.

Posted by "Alex Rovner (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-2462?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13184177#comment-13184177 ] 

Alex Rovner commented on PIG-2462:
----------------------------------

One way to avoid this issue is to disable the combined input format in your Pig jobs.
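
A minimal sketch of that workaround in Java, assuming the switch is the `pig.splitCombination` property and that the resulting `Properties` object is handed to whatever launches the job (that step is not shown). The class and method names are hypothetical.

```java
import java.util.Properties;

// Hypothetical helper; only the property name is meant to correspond to Pig.
class SplitCombinationWorkaround {
    // Returns a copy of the given job properties with split combination
    // switched off (assumed property name: pig.splitCombination), so that
    // each PigSplit wraps a single underlying split.
    static Properties withoutSplitCombination(Properties base) {
        Properties props = new Properties();
        props.putAll(base);
        props.setProperty("pig.splitCombination", "false");
        return props;
    }
}
```

The trade-off is that inputs made of many small files will then run with more map tasks.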

I guess I am a bit confused about the comments on splitIndex, as I am not very familiar with Pig's code base. Is splitIndex used elsewhere, and not really meant to track the index of the current PigSplit that we are reading? If so, I can certainly change the patch to include another variable "idx", as suggested, to keep track of this value.

However, judging from the PigInputFormat.getPigSplits code:
    for (int i = 0; i < combinedSplits.size(); i++)
        pigSplits.add(createPigSplit(combinedSplits.get(i), inputIndex, targetOps, i, conf));

it seems like the intention was to use splitIndex to track the current split?

                
> getWrappedSplit is incorrectly returning the first split instead of the current split.
> --------------------------------------------------------------------------------------
>
>                 Key: PIG-2462
>                 URL: https://issues.apache.org/jira/browse/PIG-2462
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.9.1, 0.11
>            Reporter: Alex Rovner
>             Fix For: 0.11
>
>         Attachments: splitsfix.patch
>
>


        

[jira] [Updated] (PIG-2462) getWrappedSplit is incorrectly returning the first split instead of the current split.

Posted by "Alex Rovner (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-2462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alex Rovner updated PIG-2462:
-----------------------------

    Attachment: split_fix_take2.patch

Attached the changes based on the comments.
                
> getWrappedSplit is incorrectly returning the first split instead of the current split.
> --------------------------------------------------------------------------------------
>
>                 Key: PIG-2462
>                 URL: https://issues.apache.org/jira/browse/PIG-2462
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.9.1, 0.11
>            Reporter: Alex Rovner
>             Fix For: 0.11
>
>         Attachments: split_fix_take2.patch, splitsfix.patch
>
>


        

[jira] [Updated] (PIG-2462) getWrappedSplit is incorrectly returning the first split instead of the current split.

Posted by "Alex Rovner (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-2462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alex Rovner updated PIG-2462:
-----------------------------

    Affects Version/s: 0.11
                       0.9.1
        Fix Version/s:     (was: 0.9.1)
    
> getWrappedSplit is incorrectly returning the first split instead of the current split.
> --------------------------------------------------------------------------------------
>
>                 Key: PIG-2462
>                 URL: https://issues.apache.org/jira/browse/PIG-2462
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.9.1, 0.11
>            Reporter: Alex Rovner
>             Fix For: 0.11
>
>         Attachments: splitsfix.patch
>
>


        

[jira] [Updated] (PIG-2462) getWrappedSplit is incorrectly returning the first split instead of the current split.

Posted by "Alex Rovner (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-2462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alex Rovner updated PIG-2462:
-----------------------------

    Attachment: splitsfix.patch

Attaching patch generated by git format-patch. I couldn't verify all unit tests since there is currently an issue with them in trunk. Most of the unit tests passed and I have verified this patch with my loader which had the mentioned issue.
                
> getWrappedSplit is incorrectly returning the first split instead of the current split.
> --------------------------------------------------------------------------------------
>
>                 Key: PIG-2462
>                 URL: https://issues.apache.org/jira/browse/PIG-2462
>             Project: Pig
>          Issue Type: Bug
>            Reporter: Alex Rovner
>             Fix For: 0.9.1, 0.11
>
>         Attachments: splitsfix.patch
>
>


        

[jira] [Updated] (PIG-2462) getWrappedSplit is incorrectly returning the first split instead of the current split.

Posted by "Daniel Dai (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-2462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Daniel Dai updated PIG-2462:
----------------------------

    Attachment:     (was: PIG-2462-2_0.9.patch)
    
> getWrappedSplit is incorrectly returning the first split instead of the current split.
> --------------------------------------------------------------------------------------
>
>                 Key: PIG-2462
>                 URL: https://issues.apache.org/jira/browse/PIG-2462
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.9.1, 0.11
>            Reporter: Alex Rovner
>             Fix For: 0.11
>
>         Attachments: PIG-2462-2.patch, split_fix_take2.patch, splitsfix.patch
>
>


        

[jira] [Commented] (PIG-2462) getWrappedSplit is incorrectly returning the first split instead of the current split.

Posted by "Alex Rovner (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-2462?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13183276#comment-13183276 ] 

Alex Rovner commented on PIG-2462:
----------------------------------

Daniel, there is no member idx in PigSplit.java. The index that's supposed to be tracked is splitIndex. Furthermore, with the combined input format, PigRecordReader does not increment the value of splitIndex when switching from one split to the next, even though it increments and uses an index of its own internally. Therefore, if we just change wrappedSplits[0] to wrappedSplits[splitIndex], we will still have this issue. I have verified that splitIndex is not modified anywhere except through the constructor in PigSplit.java.

I have made the needed code changes and have verified them with my loader. Now the log messages from my loader correspond to the log messages from PigRecordReader (originally that was not the case).
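
The point about the two indices can be sketched as follows. This is a simplified stand-in, not Pig's code: `CombinedSplitSketch`, `ReaderSketch`, `currentIdx` and `setCurrentIdx` are hypothetical names, and plain strings stand in for `InputSplit` objects. It assumes the reader pushes its own position into the split before the loader looks at it.

```java
import java.util.ArrayList;
import java.util.List;

// Stand-in for a split that wraps several underlying splits (hypothetical).
class CombinedSplitSketch {
    private final String[] wrappedSplits; // stand-ins for InputSplit objects
    private final int splitIndex;         // fixed at construction, never advanced
    private int currentIdx = 0;           // wrapped split currently being read

    CombinedSplitSketch(String[] wrappedSplits, int splitIndex) {
        this.wrappedSplits = wrappedSplits;
        this.splitIndex = splitIndex;
    }

    int getSplitIndex() { return splitIndex; }
    int getNumPaths() { return wrappedSplits.length; }
    String getWrappedSplit(int idx) { return wrappedSplits[idx]; }

    // Returns the split being read now, not always the first one.
    String getWrappedSplit() { return wrappedSplits[currentIdx]; }

    void setCurrentIdx(int idx) { currentIdx = idx; }
}

// Stand-in for the record reader that walks the wrapped splits (hypothetical).
class ReaderSketch {
    private final CombinedSplitSketch split;
    private int idx = 0;                  // reader's own position
    final List<String> seenByLoader = new ArrayList<>();

    ReaderSketch(CombinedSplitSketch split) { this.split = split; }

    boolean initNextRecordReader() {
        if (idx == split.getNumPaths()) {
            return false;                 // all wrapped splits processed
        }
        split.setCurrentIdx(idx);         // publish the reader's position to the split
        // What a loader would observe from getWrappedSplit() in prepareToRead:
        seenByLoader.add(split.getWrappedSplit());
        idx++;
        return true;
    }

    // Walks every wrapped split and returns what the loader observed.
    static List<String> readAll(String... files) {
        ReaderSketch reader = new ReaderSketch(new CombinedSplitSketch(files, 0));
        while (reader.initNextRecordReader()) { }
        return reader.seenByLoader;
    }
}
```

Without the `setCurrentIdx` call, every entry in `seenByLoader` would be the first file, which matches the behavior described in the report.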
                
> getWrappedSplit is incorrectly returning the first split instead of the current split.
> --------------------------------------------------------------------------------------
>
>                 Key: PIG-2462
>                 URL: https://issues.apache.org/jira/browse/PIG-2462
>             Project: Pig
>          Issue Type: Bug
>            Reporter: Alex Rovner
>             Fix For: 0.9.1, 0.11
>
>         Attachments: splitsfix.patch
>
>


        

[jira] [Commented] (PIG-2462) getWrappedSplit is incorrectly returning the first split instead of the current split.

Posted by "Alex Rovner (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-2462?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13185171#comment-13185171 ] 

Alex Rovner commented on PIG-2462:
----------------------------------

Is it possible to use CombinedInputFormat in PigUnit?
Any existing test you can point me to as an example?
                
> getWrappedSplit is incorrectly returning the first split instead of the current split.
> --------------------------------------------------------------------------------------
>
>                 Key: PIG-2462
>                 URL: https://issues.apache.org/jira/browse/PIG-2462
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.9.1, 0.11
>            Reporter: Alex Rovner
>             Fix For: 0.11
>
>         Attachments: split_fix_take2.patch, splitsfix.patch
>
>


        

[jira] [Commented] (PIG-2462) getWrappedSplit is incorrectly returning the first split instead of the current split.

Posted by "Aniket Mokashi (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-2462?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13183748#comment-13183748 ] 

Aniket Mokashi commented on PIG-2462:
-------------------------------------

Yes. I am just wondering if there is a way to work around this (to avoid backporting it in Pig). I think the pig-user list might be a better place to discuss it. Thanks for your comments.
                
> getWrappedSplit is incorrectly returning the first split instead of the current split.
> --------------------------------------------------------------------------------------
>
>                 Key: PIG-2462
>                 URL: https://issues.apache.org/jira/browse/PIG-2462
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.9.1, 0.11
>            Reporter: Alex Rovner
>             Fix For: 0.11
>
>         Attachments: splitsfix.patch
>
>
> If your loader needs information regarding what file is currently is being read (lets say for schema information), currently provides this ability by calling prepareToRead every time we read a new split. This is critical for ComibinedInputFormat as each mapper can read more then one file. In order for the load function to know what file we are currently reading, it should call getWrappedSplit() to get that information. How ever, getWrappedSplit always returns the first split in the list. Code from PigSplit.java:
>     /**
>      * This methods returns the actual InputSplit (as returned by the 
>      * {@link InputFormat}) which this class is wrapping.
>      * @return the wrappedSplit
>      */
>     public InputSplit getWrappedSplit() {
>         return wrappedSplits[0];
>     }
> Furthermore, in PigRecordReader.java the splitIndex is never incremented when changing from split to split. So in fact, even if getWrappedSplit() wold be changed to return wrappedSplits[splitIndex]; it would still return the incorrect index. 
> This can be fixed by changing PigRecordReader to increment PigSplit.splitIndex everytime the split chagnes in the following code:
>     /**
>      * Get the record reader for the next chunk in this CombineFileSplit.
>      */
>     protected boolean initNextRecordReader() throws IOException, InterruptedException {
>         if (curReader != null) {
>             curReader.close();
>             curReader = null;
>             if (idx > 0) {
>                 progress += pigSplit.getLength(idx-1);    // done processing so far
>             }
>         }
>         // if all chunks have been processed, nothing more to do.
>         if (idx == pigSplit.getNumPaths()) {
>             return false;
>         }
>         // get a record reader for the idx-th chunk
>         try {
>           
>             curReader =  inputformat.createRecordReader(pigSplit.getWrappedSplit(idx), context);
>             LOG.info("Current split being processed "+pigSplit.getWrappedSplit(idx));
>             if (idx > 0) {
>                 // initialize() for the first RecordReader will be called by MapTask;
>                 // we're responsible for initializing subsequent RecordReaders.
>                 curReader.initialize(pigSplit.getWrappedSplit(idx), context);
>                 pigSplit.get
>                 loadfunc.prepareToRead(curReader, pigSplit);
>             }
>         } catch (Exception e) {
>             throw new RuntimeException (e);
>         }
>         idx++;
>         return true;
>     }
> }
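
The fix described above can be sketched in isolation. This is a minimal stand-alone model, not the attached patch: plain strings stand in for InputSplit, WrappingSplit and WrappingReader are hypothetical stand-ins for PigSplit and PigRecordReader, and the currentIdx field and setCurrentIdx setter are assumed names. Unlike the quoted method, the sketch notifies the loader for every split, including the first, to keep the example short.

```java
import java.util.ArrayList;
import java.util.List;

public class CurrentSplitSketch {

    /** Hypothetical stand-in for PigSplit: wraps several underlying splits. */
    static class WrappingSplit {
        private final String[] wrappedSplits;
        private int currentIdx = 0; // assumed field: index of the split being read

        WrappingSplit(String... wrappedSplits) {
            this.wrappedSplits = wrappedSplits;
        }

        int getNumPaths() {
            return wrappedSplits.length;
        }

        /** Returns the split being read now, rather than always element 0. */
        String getWrappedSplit() {
            return wrappedSplits[currentIdx];
        }

        void setCurrentIdx(int idx) {
            this.currentIdx = idx;
        }
    }

    /** Hypothetical stand-in for PigRecordReader: walks the wrapped splits in order. */
    static class WrappingReader {
        private final WrappingSplit split;
        private int idx = 0;
        final List<String> seenByLoader = new ArrayList<>();

        WrappingReader(WrappingSplit split) {
            this.split = split;
        }

        boolean initNextRecordReader() {
            // If all chunks have been processed, there is nothing more to do.
            if (idx == split.getNumPaths()) {
                return false;
            }
            // Tell the wrapping split which chunk is current before the loader looks.
            split.setCurrentIdx(idx);
            // Stands in for loadfunc.prepareToRead(curReader, pigSplit).
            seenByLoader.add(split.getWrappedSplit());
            idx++;
            return true;
        }
    }

    /** Reads every wrapped split and returns what the loader saw at each step. */
    static List<String> readAll(String... files) {
        WrappingReader reader = new WrappingReader(new WrappingSplit(files));
        while (reader.initNextRecordReader()) {
            // Records would be consumed here.
        }
        return reader.seenByLoader;
    }

    public static void main(String[] args) {
        System.out.println(readAll("part-0", "part-1", "part-2"));
    }
}
```

With the index kept in sync, the loader sees each file in turn. If getWrappedSplit() returned wrappedSplits[0] instead, every entry in the list would be the first file.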

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Updated] (PIG-2462) getWrappedSplit is incorrectly returning the first split instead of the current split.

Posted by "Daniel Dai (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-2462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Daniel Dai updated PIG-2462:
----------------------------

    Attachment:     (was: PIG-2462-2.patch)
    
> getWrappedSplit is incorrectly returning the first split instead of the current split.
> --------------------------------------------------------------------------------------
>
>                 Key: PIG-2462
>                 URL: https://issues.apache.org/jira/browse/PIG-2462
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.9.1, 0.11
>            Reporter: Alex Rovner
>             Fix For: 0.11
>
>         Attachments: PIG-2462-2_0.9.patch, split_fix_take2.patch, splitsfix.patch
>
>


        

[jira] [Updated] (PIG-2462) getWrappedSplit is incorrectly returning the first split instead of the current split.

Posted by "Daniel Dai (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-2462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Daniel Dai updated PIG-2462:
----------------------------

       Resolution: Fixed
    Fix Version/s: 0.10
                   0.9.2
         Assignee: Alex Rovner
     Hadoop Flags: Reviewed
           Status: Resolved  (was: Patch Available)

+1 for patch.

Unit tests pass.

test-patch:
     [exec] -1 overall.  
     [exec] 
     [exec]     +1 @author.  The patch does not contain any @author tags.
     [exec] 
     [exec]     +1 tests included.  The patch appears to include 3 new or modified tests.
     [exec] 
     [exec]     +1 javadoc.  The javadoc tool did not generate any warning messages.
     [exec] 
     [exec]     +1 javac.  The applied patch does not increase the total number of javac compiler warnings.
     [exec] 
     [exec]     +1 findbugs.  The patch does not introduce any new Findbugs warnings.
     [exec] 
     [exec]     -1 release audit.  The applied patch generated 510 release audit warnings (more than the trunk's current 502 warnings).

All new files have the Apache header; ignore the release audit warning.

Patch committed to 0.9/0.10/trunk.

Thanks Alex!
                
> getWrappedSplit is incorrectly returning the first split instead of the current split.
> --------------------------------------------------------------------------------------
>
>                 Key: PIG-2462
>                 URL: https://issues.apache.org/jira/browse/PIG-2462
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.9.1, 0.11
>            Reporter: Alex Rovner
>            Assignee: Alex Rovner
>             Fix For: 0.9.2, 0.10, 0.11
>
>         Attachments: PIG-2462-2.patch, PIG-2462-2_0.9.patch, split_fix_take2.patch, splitsfix.patch
>
>


        

[jira] [Updated] (PIG-2462) getWrappedSplit is incorrectly returning the first split instead of the current split.

Posted by "Daniel Dai (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-2462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Daniel Dai updated PIG-2462:
----------------------------

    Attachment:     (was: PIG-2462-2_0.9.patch)
    
> getWrappedSplit is incorrectly returning the first split instead of the current split.
> --------------------------------------------------------------------------------------
>
>                 Key: PIG-2462
>                 URL: https://issues.apache.org/jira/browse/PIG-2462
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.9.1, 0.11
>            Reporter: Alex Rovner
>             Fix For: 0.11
>
>         Attachments: PIG-2462-2.patch, split_fix_take2.patch, splitsfix.patch
>
>


        

[jira] [Commented] (PIG-2462) getWrappedSplit is incorrectly returning the first split instead of the current split.

Posted by "Aniket Mokashi (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-2462?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13183723#comment-13183723 ] 

Aniket Mokashi commented on PIG-2462:
-------------------------------------

Hi Daniel,
Is there a way to work around this issue elegantly? Basically, information on the split needs to be available to the loadfunc. I can think of getting this at the createRecordReader level, on the inputformat returned by getInputFormat. But how do I pass it down elegantly to the loadfunc from there?
Can you suggest an idea?
Thanks,
Aniket
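
One possible direction for the question above, sketched without the Hadoop or Pig classes: the loader's own record reader remembers the split it was initialized with, and the loader asks the reader for it in prepareToRead instead of relying on getWrappedSplit(). Everything here is a hypothetical stand-in (Reader, SplitAwareReader, Loader, getCurrentSplit), with strings in place of InputSplit. It also assumes that the reader handed to prepareToRead is the instance created by the loader's input format and that it has already been initialized, which matches the quoted initNextRecordReader code for idx > 0 but is not confirmed here for the first split.

```java
import java.util.ArrayList;
import java.util.List;

public class SplitAwareReaderSketch {

    /** Hypothetical stand-in for a Hadoop RecordReader. */
    interface Reader {
        void initialize(String split);
    }

    /** Reader created by the loader's own input format; it remembers its split. */
    static class SplitAwareReader implements Reader {
        private String currentSplit;

        @Override
        public void initialize(String split) {
            this.currentSplit = split;
        }

        String getCurrentSplit() {
            return currentSplit;
        }
    }

    /** Hypothetical stand-in for a LoadFunc. */
    static class Loader {
        String currentFile;

        void prepareToRead(Reader reader, String[] wrappedSplits) {
            if (reader instanceof SplitAwareReader) {
                // Ask the reader which split it was initialized with.
                currentFile = ((SplitAwareReader) reader).getCurrentSplit();
            } else {
                // Fallback: what an unfixed getWrappedSplit() would report.
                currentFile = wrappedSplits[0];
            }
        }
    }

    /** Simulates one mapper reading several combined splits in order. */
    static List<String> filesSeen(String... splits) {
        Loader loader = new Loader();
        List<String> seen = new ArrayList<>();
        for (String split : splits) {
            SplitAwareReader reader = new SplitAwareReader(); // createRecordReader
            reader.initialize(split);
            loader.prepareToRead(reader, splits);
            seen.add(loader.currentFile);
        }
        return seen;
    }

    public static void main(String[] args) {
        System.out.println(filesSeen("part-0", "part-1"));
    }
}
```

The cast ties the loader to its own reader type, which is acceptable only when the loader also supplies the input format. Whether this is cleaner than backporting the patch is a judgment call.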
                
> getWrappedSplit is incorrectly returning the first split instead of the current split.
> --------------------------------------------------------------------------------------
>
>                 Key: PIG-2462
>                 URL: https://issues.apache.org/jira/browse/PIG-2462
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.9.1, 0.11
>            Reporter: Alex Rovner
>             Fix For: 0.11
>
>         Attachments: splitsfix.patch
>
>


        

[jira] [Updated] (PIG-2462) getWrappedSplit is incorrectly returning the first split instead of the current split.

Posted by "Alex Rovner (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-2462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alex Rovner updated PIG-2462:
-----------------------------

    Patch Info: Patch Available
    
> getWrappedSplit is incorrectly returning the first split instead of the current split.
> --------------------------------------------------------------------------------------
>
>                 Key: PIG-2462
>                 URL: https://issues.apache.org/jira/browse/PIG-2462
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.9.1, 0.11
>            Reporter: Alex Rovner
>             Fix For: 0.11
>
>         Attachments: splitsfix.patch
>
>


        

[jira] [Commented] (PIG-2462) getWrappedSplit is incorrectly returning the first split instead of the current split.

Posted by "Daniel Dai (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-2462?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13183478#comment-13183478 ] 

Daniel Dai commented on PIG-2462:
---------------------------------

Yes, you are right. That's why you pass the index from PigRecordReader to PigSplit. Your approach looks right. One exception is the name "splitIndex": we usually use it to refer to the index of the PigSplit itself, not the index of the InputSplit inside the PigSplit. It's better to use "idx" to make it less confusing.
                
> getWrappedSplit is incorrectly returning the first split instead of the current split.
> --------------------------------------------------------------------------------------
>
>                 Key: PIG-2462
>                 URL: https://issues.apache.org/jira/browse/PIG-2462
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.9.1, 0.11
>            Reporter: Alex Rovner
>             Fix For: 0.11
>
>         Attachments: splitsfix.patch
>
>


        

[jira] [Updated] (PIG-2462) getWrappedSplit is incorrectly returning the first split instead of the current split.

Posted by "Daniel Dai (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-2462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Daniel Dai updated PIG-2462:
----------------------------

    Attachment: PIG-2462-2_0.9.patch
    
> getWrappedSplit is incorrectly returning the first split instead of the current split.
> --------------------------------------------------------------------------------------
>
>                 Key: PIG-2462
>                 URL: https://issues.apache.org/jira/browse/PIG-2462
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.9.1, 0.11
>            Reporter: Alex Rovner
>             Fix For: 0.11
>
>         Attachments: PIG-2462-2.patch, PIG-2462-2_0.9.patch, split_fix_take2.patch, splitsfix.patch
>
>


        

[jira] [Updated] (PIG-2462) getWrappedSplit is incorrectly returning the first split instead of the current split.

Posted by "Daniel Dai (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-2462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Daniel Dai updated PIG-2462:
----------------------------

    Attachment: PIG-2462-2.patch
    
> getWrappedSplit is incorrectly returning the first split instead of the current split.
> --------------------------------------------------------------------------------------
>
>                 Key: PIG-2462
>                 URL: https://issues.apache.org/jira/browse/PIG-2462
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.9.1, 0.11
>            Reporter: Alex Rovner
>             Fix For: 0.11
>
>         Attachments: PIG-2462-2.patch, PIG-2462-2_0.9.patch, split_fix_take2.patch, splitsfix.patch
>
>

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Updated] (PIG-2462) getWrappedSplit is incorrectly returning the first split instead of the current split.

Posted by "Alex Rovner (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-2462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alex Rovner updated PIG-2462:
-----------------------------

    Summary: getWrappedSplit is incorrectly returning the first split instead of the current split.  (was: getWrappedSplit is incorrectly returning the fist split instead of the current split.)
    
> getWrappedSplit is incorrectly returning the first split instead of the current split.
> --------------------------------------------------------------------------------------
>
>                 Key: PIG-2462
>                 URL: https://issues.apache.org/jira/browse/PIG-2462
>             Project: Pig
>          Issue Type: Bug
>            Reporter: Alex Rovner
>             Fix For: 0.9.1, 0.11
>
>


        

[jira] [Commented] (PIG-2462) getWrappedSplit is incorrectly returning the first split instead of the current split.

Posted by "Daniel Dai (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-2462?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13183741#comment-13183741 ] 

Daniel Dai commented on PIG-2462:
---------------------------------

Correct me if I'm wrong, but I thought this is exactly the issue we want to solve. LoadFunc is passed a PigSplit in prepareToRead, and we want users to call pigSplit.getWrappedSplit() to get split-specific information. The problem in current Pig is that pigSplit.getWrappedSplit() always returns split #0. That is why we have this Jira to fix it.
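
To make the behaviour concrete, here is a minimal, self-contained Java sketch of the problem and of the proposed fix. It does not use Pig's real classes: ModelSplit is a simplified stand-in for PigSplit, the splits are plain strings rather than InputSplit objects, and setCurrentIdx and getWrappedSplitBuggy are hypothetical names chosen for illustration only. The loop in seenByLoader plays the role of the record reader moving from one wrapped split to the next and records what a loader would get back inside prepareToRead.

```java
import java.util.ArrayList;
import java.util.List;

/** Simplified stand-in for PigSplit: wraps several underlying splits. */
class ModelSplit {
    private final String[] wrappedSplits; // stand-ins for InputSplit objects
    private int splitIndex = 0;           // index of the split currently being read

    ModelSplit(String... wrappedSplits) {
        this.wrappedSplits = wrappedSplits;
    }

    int getNumPaths() {
        return wrappedSplits.length;
    }

    /** Behaviour described in the issue: always the first split. */
    String getWrappedSplitBuggy() {
        return wrappedSplits[0];
    }

    /** Proposed behaviour: the split at the current index. */
    String getWrappedSplit() {
        return wrappedSplits[splitIndex];
    }

    /** Hypothetical setter the record reader would call on each split change. */
    void setCurrentIdx(int idx) {
        this.splitIndex = idx;
    }
}

class SplitIndexSketch {
    /**
     * Walks over every wrapped split the way a combining record reader would,
     * and collects what a loader would see at each step.
     */
    static List<String> seenByLoader(ModelSplit split, boolean fixed) {
        List<String> seen = new ArrayList<>();
        for (int idx = 0; idx < split.getNumPaths(); idx++) {
            if (fixed) {
                // The fix: keep the split's index in sync with the reader.
                split.setCurrentIdx(idx);
                seen.add(split.getWrappedSplit());
            } else {
                seen.add(split.getWrappedSplitBuggy());
            }
        }
        return seen;
    }

    public static void main(String[] args) {
        ModelSplit split = new ModelSplit("part-0", "part-1", "part-2");
        System.out.println(seenByLoader(split, false)); // [part-0, part-0, part-0]
        System.out.println(seenByLoader(split, true));  // [part-0, part-1, part-2]
    }
}
```

In this model both halves of the fix are needed: returning wrappedSplits[splitIndex] alone changes nothing unless the reader also updates the index whenever it advances to the next split.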
                
> getWrappedSplit is incorrectly returning the first split instead of the current split.
> --------------------------------------------------------------------------------------
>
>                 Key: PIG-2462
>                 URL: https://issues.apache.org/jira/browse/PIG-2462
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.9.1, 0.11
>            Reporter: Alex Rovner
>             Fix For: 0.11
>
>         Attachments: splitsfix.patch
>
>


        

[jira] [Commented] (PIG-2462) getWrappedSplit is incorrectly returning the first split instead of the current split.

Posted by "Daniel Dai (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-2462?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13185305#comment-13185305 ] 

Daniel Dai commented on PIG-2462:
---------------------------------

You will need to write an InputFormat and a LoadFunc, and use PigUnit to invoke that LoadFunc. Unfortunately, I cannot find a sample with a custom InputFormat in the existing tests.
                
> getWrappedSplit is incorrectly returning the first split instead of the current split.
> --------------------------------------------------------------------------------------
>
>                 Key: PIG-2462
>                 URL: https://issues.apache.org/jira/browse/PIG-2462
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.9.1, 0.11
>            Reporter: Alex Rovner
>             Fix For: 0.11
>
>         Attachments: split_fix_take2.patch, splitsfix.patch
>
>


        

[jira] [Commented] (PIG-2462) getWrappedSplit is incorrectly returning the first split instead of the current split.

Posted by "Daniel Dai (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-2462?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13183528#comment-13183528 ] 

Daniel Dai commented on PIG-2462:
---------------------------------

We need to test. But I feel that returning the right split instead of split 0 should be the right way. Just browsing through the code, I didn't find anything wrong with the change.
                
> getWrappedSplit is incorrectly returning the first split instead of the current split.
> --------------------------------------------------------------------------------------
>
>                 Key: PIG-2462
>                 URL: https://issues.apache.org/jira/browse/PIG-2462
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.9.1, 0.11
>            Reporter: Alex Rovner
>             Fix For: 0.11
>
>         Attachments: splitsfix.patch
>
>
