Posted to user@pig.apache.org by Alex Rovner <al...@gmail.com> on 2012/01/06 22:50:30 UTC

getWrappedSplit() is incorrectly returning the first split

Ran into this today. Using trunk (0.11)

If you are using a custom loader and are trying to get input split
information in prepareToRead(), getWrappedSplit() returns the first
split instead of the current one.

Checking the code confirms the suspicion:

PigSplit.java:

    public InputSplit getWrappedSplit() {
        return wrappedSplits[0];
    }

Should be:
    public InputSplit getWrappedSplit() {
        return wrappedSplits[splitIndex];
    }
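The difference between the two versions can be sketched with toy classes (these are illustrative stand-ins, not Pig's actual PigSplit, InputSplit, or PigRecordReader):

```java
// Toy stand-ins that mirror PigSplit's wrappedSplits array, to show why a
// hard-coded [0] breaks combined splits.
class FakePigSplit {
    private final String[] wrappedSplits; // stand-in for InputSplit[]
    private int splitIndex = 0;           // which wrapped split is being read

    FakePigSplit(String... wrappedSplits) { this.wrappedSplits = wrappedSplits; }

    // Buggy version: ignores splitIndex, so every caller sees the first split.
    String getWrappedSplitBuggy() { return wrappedSplits[0]; }

    // Fixed version: returns the split currently being read.
    String getWrappedSplit() { return wrappedSplits[splitIndex]; }

    // In real Pig, the record reader would need to advance this index as it
    // moves on to the next wrapped split of a combined input.
    void nextSplit() { splitIndex++; }
}

class WrappedSplitDemo {
    public static void main(String[] args) {
        FakePigSplit split = new FakePigSplit("150-r-00005", "150-r-00006");
        split.nextSplit(); // reader has moved on to the second file
        System.out.println(split.getWrappedSplitBuggy()); // prints 150-r-00005 (stale)
        System.out.println(split.getWrappedSplit());      // prints 150-r-00006 (current)
    }
}
```

With the fixed version, a loader that asks for the wrapped split during the second file actually gets the second file.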


The side effect is that when Pig is using CombinedInputFormat, any
attempt to retrieve the current split incorrectly returns the first
file in the list instead of the one currently being read. I have also
confirmed this by adding a log statement in prepareToRead():

    @Override
    public void prepareToRead(@SuppressWarnings("rawtypes") RecordReader reader,
            PigSplit split) throws IOException {
        String path = ((FileSplit) split.getWrappedSplit(split.getSplitIndex()))
                .getPath().toString();
        partitions = getPartitions(table, path);
        log.info("Preparing to read: " + path);
        this.reader = reader;
    }

2012-01-06 16:27:24,165 INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigRecordReader: Current split being processed hdfs://tuamotu:9000/user/hive/warehouse/cobra_client_consumer_cag/client_tid=3/cag_tid=150/150-r-00005:0+6187085
2012-01-06 16:27:24,180 INFO com.hadoop.compression.lzo.GPLNativeCodeLoader: Loaded native gpl library
2012-01-06 16:27:24,183 INFO com.hadoop.compression.lzo.LzoCodec: Successfully loaded & initialized native-lzo library [hadoop-lzo rev 2dd49ec41018ba4141b20edf28dbb43c0c07f373]
2012-01-06 16:27:24,189 INFO com.proclivitysystems.etl.pig.udf.loaders.HiveLoader: Preparing to read: hdfs://tuamotu:9000/user/hive/warehouse/cobra_client_consumer_cag/client_tid=3/cag_tid=150/150-r-00005
2012-01-06 16:27:28,053 INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigRecordReader: Current split being processed hdfs://tuamotu:9000/user/hive/warehouse/cobra_client_consumer_cag/client_tid=3/cag_tid=150/150-r-00006:0+6181475
2012-01-06 16:27:28,056 INFO com.proclivitysystems.etl.pig.udf.loaders.HiveLoader: Preparing to read: hdfs://tuamotu:9000/user/hive/warehouse/cobra_client_consumer_cag/client_tid=3/cag_tid=150/150-r-00005


Notice how Pig correctly reports the split it is processing, but my
"info" statement always reports the first input split instead of the
current one.

Bug? Jira? Patch?

Thanks
Alex R

Re: getWrappedSplit() is incorrectly returning the first split

Posted by Alex Rovner <al...@gmail.com>.
Aniket,

Thanks for pointing out the related Jira item. I do not see any harm in
returning the correct splitIndex to the user instead of [0]. When
combined input is disabled, it will always be 0. When it is enabled, it
should return the correct index (which it does not currently, but will
once my patch is applied). I have attached the patch to the Jira. I
would appreciate it if someone could apply it and run the full suite of
tests on it. I have run as many tests as I can on my end.

Thanks
Alex

On Tue, Jan 10, 2012 at 1:54 AM, Aniket Mokashi <an...@gmail.com> wrote:

> The change was added as part of PIG-1518. It has release notes-
>
> "This change will not cause any backward compatibility issue except if a
> loader implementation makes use of the PigSplit object passed through the
> prepareToRead method where a rebuild of the loader might be necessary as
> PigSplit's definition has been modified. However, currently we know of no
> external use of the object.
>
> This change also requires the loader to be stateless across the invocations
> to the prepareToRead method. That is, the method should reset any internal
> states that are not affected by the RecordReader argument.
> Otherwise, this feature should be disabled.
>
> It looks like returning 0th split was done deliberately. Comments?
>
> Thanks,
> Aniket
>
> On Mon, Jan 9, 2012 at 9:10 PM, Alex Rovner <al...@gmail.com> wrote:
>
> > I have already created the patch and tested with some of my jobs. I ran
> > into unit tests failure issues though as well. I can attach the patch to
> > Jira tomorrow anyways to be applied once things are straightened out.
> >
> > Alex R
> >
> > On Mon, Jan 9, 2012 at 8:07 PM, Jonathan Coveney <jc...@gmail.com> wrote:
> >
> > > If it is affecting production jobs, I see no reason why we can't put the
> > > fix into 0.9.2, though I sense that a vote will be coming soon for a 0.9.2
> > > release, so a fix would have to come soon. The issues running the tests
> > > brought up in Bill's thread will have to be fixed before we can, though. I
> > > have a patch that's completely stopped because I can't develop any new
> > > tests, and so on.
> > >
> > > 2012/1/9 Prashant Kommireddi <pr...@gmail.com>
> > >
> > > > Is this critical enough to make it back into 0.9.1?
> > > >
> > > > -Prashant
> > > >
> > > > On Mon, Jan 9, 2012 at 4:44 PM, Aniket Mokashi <an...@gmail.com> wrote:
> > > >
> > > > > Thanks so much for finding this out.
> > > > >
> > > > > I was using
> > > > >
> > > > >     @Override
> > > > >     public void prepareToRead(@SuppressWarnings("rawtypes") RecordReader reader,
> > > > >             PigSplit split) throws IOException {
> > > > >         this.in = reader;
> > > > >         partValues = ((DataovenSplit) split.getWrappedSplit())
> > > > >                 .getPartitionInfo().getPartitionValues();
> > > > >     }
> > > > >
> > > > > in my loader, which behaves like HCatalog for delimited text in Hive.
> > > > > That returns the same partValues for every split. I hacked around it
> > > > > with something else, but I think I must have hit this case. I will
> > > > > confirm. Thanks again for reporting this.
> > > > >
> > > > > Thanks,
> > > > >
> > > > > Aniket
> > > > >
> > > > > On Mon, Jan 9, 2012 at 11:06 AM, Daniel Dai <daijy@hortonworks.com> wrote:
> > > > >
> > > > > > Yes, please. Thanks!
> > > > > >
> > > > > > On Mon, Jan 9, 2012 at 10:48 AM, Alex Rovner <alexrovner@gmail.com> wrote:
> > > > > >
> > > > > > > Jira opened.
> > > > > > >
> > > > > > > I can attempt to submit a patch, as this seems like a fairly
> > > > > > > straightforward fix.
> > > > > > >
> > > > > > > https://issues.apache.org/jira/browse/PIG-2462
> > > > > > >
> > > > > > > Thanks
> > > > > > > Alex R
> > > > > > >
> > > > > > > On Sat, Jan 7, 2012 at 6:14 PM, Daniel Dai <daijy@hortonworks.com> wrote:
> > > > > > >
> > > > > > > > Sounds like a bug. I guess no one ever relied on specific split
> > > > > > > > info before. Please open a Jira.
> > > > > > > >
> > > > > > > > Daniel
> > > > > > > >
> > > > > > > > On Fri, Jan 6, 2012 at 10:21 PM, Alex Rovner <alexrovner@gmail.com> wrote:
> > > > > > > >
> > > > > > > > > Additionally, it looks like PigRecordReader is not incrementing
> > > > > > > > > the index in the PigSplit when dealing with CombinedInputFormat,
> > > > > > > > > so the index will be incorrect in either case.

Re: getWrappedSplit() is incorrectly returning the first split

Posted by Daniel Dai <da...@hortonworks.com>.
Thanks, Aniket, for pointing out PIG-1518.

I don't fully understand the intent of it, but after PIG-1518,
LoadFunc.prepareToRead can be invoked several times, each time on a
different split. A loader that assumes prepareToRead is called only once
per LoadFunc instance may break. PIG-1518 also changed the definition of
PigSplit, so a LoadFunc.prepareToRead that makes use of PigSplit may
break as well.
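To illustrate that statelessness requirement (with invented stand-in types, not Pig's real API): a loader satisfies it if every piece of split-scoped state is rebuilt inside prepareToRead rather than initialized once.

```java
// Stand-in so the sketch is self-contained; in real code this would be
// Pig's RecordReader/PigSplit pair.
class StubReader {
    final String path;
    StubReader(String path) { this.path = path; }
}

class StatelessLoader {
    private StubReader reader;   // replaced on every call
    private String currentPath;  // split-scoped state, reset on every call

    // After PIG-1518 this may be called once per wrapped split, so it must
    // rebuild all per-split state and keep nothing from the previous call.
    void prepareToRead(StubReader reader) {
        this.reader = reader;
        this.currentPath = reader.path; // recomputed, never cached across calls
    }

    String currentPath() { return currentPath; }
}

class StatelessLoaderDemo {
    public static void main(String[] args) {
        StatelessLoader loader = new StatelessLoader();
        loader.prepareToRead(new StubReader("150-r-00005"));
        loader.prepareToRead(new StubReader("150-r-00006"));
        System.out.println(loader.currentPath()); // prints 150-r-00006
    }
}
```

A loader that instead computed currentPath once in its constructor, or only on the first prepareToRead call, would keep reporting the first split, which is exactly the symptom seen in this thread.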

Daniel

On Mon, Jan 9, 2012 at 10:54 PM, Aniket Mokashi <an...@gmail.com> wrote:

> The change was added as part of PIG-1518. It has release notes-
>
> "This change will not cause any backward compatibility issue except if a
> loader implementation makes use of the PigSplit object passed through the
> prepareToRead method where a rebuild of the loader might be necessary as
> PigSplit's definition has been modified. However, currently we know of no
> external use of the object.
>
> This change also requires the loader to be stateless across the invocations
> to the prepareToRead method. That is, the method should reset any internal
> states that are not affected by the RecordReader argument.
> Otherwise, this feature should be disabled.
>
> It looks like returning 0th split was done deliberately. Comments?
>
> Thanks,
> Aniket

Re: getWrappedSplit() is incorrectly returning the first split

Posted by Aniket Mokashi <an...@gmail.com>.
The change was added as part of PIG-1518. It has release notes:

"This change will not cause any backward compatibility issue except if a
loader implementation makes use of the PigSplit object passed through the
prepareToRead method where a rebuild of the loader might be necessary as
PigSplit's definition has been modified. However, currently we know of no
external use of the object.

This change also requires the loader to be stateless across the invocations
to the prepareToRead method. That is, the method should reset any internal
states that are not affected by the RecordReader argument.
Otherwise, this feature should be disabled.

It looks like returning the 0th split was done deliberately. Comments?

Thanks,
Aniket

On Mon, Jan 9, 2012 at 9:10 PM, Alex Rovner <al...@gmail.com> wrote:

> I have already created the patch and tested with some of my jobs. I ran
> into unit tests failure issues though as well. I can attach the patch to
> Jira tomorrow anyways to be applied once things are straightened out.
>
> Alex R



-- 
"...:::Aniket:::... Quetzalco@tl"

Re: getWrappedSplit() is incorrectly returning the first split

Posted by Alex Rovner <al...@gmail.com>.
I have already created the patch and tested it with some of my jobs,
though I ran into unit test failures as well. I can attach the patch to
the Jira tomorrow anyway, to be applied once things are straightened out.

Alex R

On Mon, Jan 9, 2012 at 8:07 PM, Jonathan Coveney <jc...@gmail.com> wrote:

> If it is affecting production jobs, I see no reason why we can't put the
> fix into 0.9.2, though I sense that a vote will be coming soon for a 0.9.2
> release, so a fix would have to come soon. The issues running the tests
> brought up in Bill's thread will have to be fixed before we can, though. I
> have a patch that's completely stopped because I can't develop any new
> tests, and so on.
> > current
> > > > > split
> > > > > > > > when pig is using CombinedInputFormat it incorrectly always
> > > returns
> > > > > the
> > > > > > > > first file in the list instead of the current one that its
> > > > reading. I
> > > > > > > have
> > > > > > > > also confirmed it by outputing a log statement in the
> > > > > prepareToRead():
> > > > > > > >
> > > > > > > >     @Override
> > > > > > > >     public void prepareToRead(@SuppressWarnings("rawtypes")
> > > > > > RecordReader
> > > > > > > > reader, PigSplit split)
> > > > > > > >             throws IOException {
> > > > > > > >         String path =
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> ((FileSplit)split.getWrappedSplit(split.getSplitIndex())).getPath().toString();
> > > > > > > >         partitions = getPartitions(table, path);
> > > > > > > >         log.info("Preparing to read: " + path);
> > > > > > > >         this.reader = reader;
> > > > > > > >     }
> > > > > > > >
> > > > > > > > 2012-01-06 16:27:24,165 INFO
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigRecordReader:
> > > > > > > Current split being processed
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> hdfs://tuamotu:9000/user/hive/warehouse/cobra_client_consumer_cag/client_tid=3/cag_tid=150/150-r-00005:0+61870852012-01-06
> > > > > > > 16:27:24,180 INFO
> com.hadoop.compression.lzo.GPLNativeCodeLoader:
> > > > > Loaded
> > > > > > > native gpl library2012-01-06 16:27:24,183 INFO
> > > > > > > com.hadoop.compression.lzo.LzoCodec: Successfully loaded &
> > > > initialized
> > > > > > > native-lzo library [hadoop-lzo rev
> > > > > > > 2dd49ec41018ba4141b20edf28dbb43c0c07f373]2012-01-06
> 16:27:24,189
> > > INFO
> > > > > > > com.proclivitysystems.etl.pig.udf.loaders.HiveLoader: Preparing
> > to
> > > > > read:
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> hdfs://tuamotu:9000/user/hive/warehouse/cobra_client_consumer_cag/client_tid=3/cag_tid=150/150-r-000052012-01-06
> > > > > > > 16:27:28,053 INFO
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigRecordReader:
> > > > > > > Current split being processed
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> hdfs://tuamotu:9000/user/hive/warehouse/cobra_client_consumer_cag/client_tid=3/cag_tid=150/150-r-00006:0+61814752012-01-06
> > > > > > > 16:27:28,056 INFO
> > > > com.proclivitysystems.etl.pig.udf.loaders.HiveLoader:
> > > > > > > Preparing to read:
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> hdfs://tuamotu:9000/user/hive/warehouse/cobra_client_consumer_cag/client_tid=3/cag_tid=150/150-r-00005
> > > > > > > >
> > > > > > > >
> > > > > > > > Notice how the pig is correctly reporting the split but my
> > "info"
> > > > > > > > statement is always reporting the first input split vs
> current.
> > > > > > > >
> > > > > > > > Bug? Jira? Patch?
> > > > > > > >
> > > > > > > > Thanks
> > > > > > > > Alex R
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> > >
> > >
> > > --
> > > "...:::Aniket:::... Quetzalco@tl"
> > >
> >
>

Re: getWrappedSplit() is incorrectly returning the first split

Posted by Jonathan Coveney <jc...@gmail.com>.
If it is affecting production jobs, I see no reason why we can't put the
fix into 0.9.2, though I sense that a vote will be coming soon for a 0.9.2
release, so a fix would have to come soon. The issues with running the tests
brought up in Bill's thread will have to be fixed before we can, though. I
have a patch that's completely stopped because I can't develop any new tests,
and so on.


Re: getWrappedSplit() is incorrectly returning the first split

Posted by Prashant Kommireddi <pr...@gmail.com>.
Is this critical enough to make it back into 0.9.1?

-Prashant


Re: getWrappedSplit() is incorrectly returning the first split

Posted by Aniket Mokashi <an...@gmail.com>.
Thanks so much for finding this out.

I was using

@Override
public void prepareToRead(@SuppressWarnings("rawtypes") RecordReader reader,
        PigSplit split) throws IOException {
    this.in = reader;
    partValues = ((DataovenSplit) split.getWrappedSplit())
            .getPartitionInfo().getPartitionValues();
}


in my loader, which behaves like HCatalog for delimited text in Hive. It
returned the same partValues for every split. I worked around it with
something else, but I think I must have hit this case. I will confirm.
Thanks again for reporting this.

Thanks,

Aniket




-- 
"...:::Aniket:::... Quetzalco@tl"

Re: getWrappedSplit() is incorrectly returning the first split

Posted by Daniel Dai <da...@hortonworks.com>.
Yes, please. Thanks!


Re: getWrappedSplit() is incorrectly returning the first split

Posted by Alex Rovner <al...@gmail.com>.
Jira opened.

I can attempt to submit a patch, as this seems like a fairly straightforward
fix.

https://issues.apache.org/jira/browse/PIG-2462


Thanks
Alex R


Re: getWrappedSplit() is incorrectly returning the first split

Posted by Daniel Dai <da...@hortonworks.com>.
Sounds like a bug. I guess no one has ever relied on specific split info before.
Please open a Jira.

Daniel

On Fri, Jan 6, 2012 at 10:21 PM, Alex Rovner <al...@gmail.com> wrote:

> Additionally it looks like PigRecordReader is not incrementing the index in
> the PigSplit when dealing with CombinedInputFormat thus the index will be
> incorrect in either case.
>
> On Fri, Jan 6, 2012 at 4:50 PM, Alex Rovner <al...@gmail.com> wrote:
>
> > Ran into this today. Using trunk (0.11)
> >
> > If you are using a custom loader and are trying to get input split
> > information In prepareToRead(), getWrappedSplit() is providing the fist
> > split instead of current.
> >
> > Checking the code confirms the suspicion:
> >
> > PigSplit.java:
> >
> >     public InputSplit getWrappedSplit() {
> >         return wrappedSplits[0];
> >     }
> >
> > Should be:
> >     public InputSplit getWrappedSplit() {
> >         return wrappedSplits[splitIndex];
> >     }
> >
> >
> > The side effect is that if you are trying to retrieve the current split
> > when pig is using CombinedInputFormat it incorrectly always returns the
> > first file in the list instead of the current one that its reading. I
> have
> > also confirmed it by outputing a log statement in the prepareToRead():
> >
> >     @Override
> >     public void prepareToRead(@SuppressWarnings("rawtypes") RecordReader
> > reader, PigSplit split)
> >             throws IOException {
> >         String path =
> >
> ((FileSplit)split.getWrappedSplit(split.getSplitIndex())).getPath().toString();
> >         partitions = getPartitions(table, path);
> >         log.info("Preparing to read: " + path);
> >         this.reader = reader;
> >     }
> >
> > 2012-01-06 16:27:24,165 INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigRecordReader: Current split being processed hdfs://tuamotu:9000/user/hive/warehouse/cobra_client_consumer_cag/client_tid=3/cag_tid=150/150-r-00005:0+6187085
> > 2012-01-06 16:27:24,180 INFO com.hadoop.compression.lzo.GPLNativeCodeLoader: Loaded native gpl library
> > 2012-01-06 16:27:24,183 INFO com.hadoop.compression.lzo.LzoCodec: Successfully loaded & initialized native-lzo library [hadoop-lzo rev 2dd49ec41018ba4141b20edf28dbb43c0c07f373]
> > 2012-01-06 16:27:24,189 INFO com.proclivitysystems.etl.pig.udf.loaders.HiveLoader: Preparing to read: hdfs://tuamotu:9000/user/hive/warehouse/cobra_client_consumer_cag/client_tid=3/cag_tid=150/150-r-00005
> > 2012-01-06 16:27:28,053 INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigRecordReader: Current split being processed hdfs://tuamotu:9000/user/hive/warehouse/cobra_client_consumer_cag/client_tid=3/cag_tid=150/150-r-00006:0+6181475
> > 2012-01-06 16:27:28,056 INFO com.proclivitysystems.etl.pig.udf.loaders.HiveLoader: Preparing to read: hdfs://tuamotu:9000/user/hive/warehouse/cobra_client_consumer_cag/client_tid=3/cag_tid=150/150-r-00005
> >
> >
> > Notice how Pig is correctly reporting the split, but my "info"
> > statement is always reporting the first input split instead of the current one.
> >
> > Bug? Jira? Patch?
> >
> > Thanks
> > Alex R
> >
>

Re: getWrappedSplit() is incorrectly returning the first split

Posted by Alex Rovner <al...@gmail.com>.
Additionally, it looks like PigRecordReader is not incrementing the index in
the PigSplit when dealing with CombinedInputFormat, so the index will be
incorrect in either case.
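
The two problems can be illustrated with a minimal standalone sketch. These are plain-Java stand-ins, not the actual Pig classes: FakeSplit, CombinedSplit, and advance() are hypothetical names modeling only the indexing behavior described above.

```java
// Minimal model of the bug: a combined split wraps several underlying
// splits, and the accessor must use the current index, not hard-code 0.
public class SplitIndexDemo {

    // Stand-in for an underlying file split
    static class FakeSplit {
        final String path;
        FakeSplit(String path) { this.path = path; }
    }

    // Stand-in for PigSplit, reduced to the fields that matter here
    static class CombinedSplit {
        private final FakeSplit[] wrappedSplits;
        private int splitIndex = 0;

        CombinedSplit(FakeSplit... splits) { this.wrappedSplits = splits; }

        // Buggy version: always returns the first wrapped split
        FakeSplit getWrappedSplitBuggy() { return wrappedSplits[0]; }

        // Fixed version: returns the split currently being read
        FakeSplit getWrappedSplit() { return wrappedSplits[splitIndex]; }

        // The record reader must call something like this as it moves
        // to the next underlying split of a combined input format;
        // the second reported problem is that this never happens.
        void advance() { splitIndex++; }

        int getSplitIndex() { return splitIndex; }
    }

    public static void main(String[] args) {
        CombinedSplit split = new CombinedSplit(
                new FakeSplit("150-r-00005"), new FakeSplit("150-r-00006"));
        split.advance(); // the reader has moved on to the second file

        System.out.println("buggy: " + split.getWrappedSplitBuggy().path);
        System.out.println("fixed: " + split.getWrappedSplit().path);
    }
}
```

With both fixes in place, a loader that asks for the current split while the second file of a combined split is being read sees 150-r-00006 rather than 150-r-00005, matching what PigRecordReader logs.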

On Fri, Jan 6, 2012 at 4:50 PM, Alex Rovner <al...@gmail.com> wrote:

> Ran into this today. Using trunk (0.11)
>
> If you are using a custom loader and are trying to get input split
> information in prepareToRead(), getWrappedSplit() is providing the first
> split instead of the current one.
>
> Checking the code confirms the suspicion:
>
> PigSplit.java:
>
>     public InputSplit getWrappedSplit() {
>         return wrappedSplits[0];
>     }
>
> Should be:
>     public InputSplit getWrappedSplit() {
>         return wrappedSplits[splitIndex];
>     }
>
>
> The side effect is that if you are trying to retrieve the current split
> when pig is using CombinedInputFormat, it incorrectly always returns the
> first file in the list instead of the current one that it is reading. I have
> also confirmed it by outputting a log statement in prepareToRead():
>
>     @Override
>     public void prepareToRead(@SuppressWarnings("rawtypes") RecordReader
> reader, PigSplit split)
>             throws IOException {
>         String path =
> ((FileSplit)split.getWrappedSplit(split.getSplitIndex())).getPath().toString();
>         partitions = getPartitions(table, path);
>         log.info("Preparing to read: " + path);
>         this.reader = reader;
>     }
>
> 2012-01-06 16:27:24,165 INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigRecordReader: Current split being processed hdfs://tuamotu:9000/user/hive/warehouse/cobra_client_consumer_cag/client_tid=3/cag_tid=150/150-r-00005:0+6187085
> 2012-01-06 16:27:24,180 INFO com.hadoop.compression.lzo.GPLNativeCodeLoader: Loaded native gpl library
> 2012-01-06 16:27:24,183 INFO com.hadoop.compression.lzo.LzoCodec: Successfully loaded & initialized native-lzo library [hadoop-lzo rev 2dd49ec41018ba4141b20edf28dbb43c0c07f373]
> 2012-01-06 16:27:24,189 INFO com.proclivitysystems.etl.pig.udf.loaders.HiveLoader: Preparing to read: hdfs://tuamotu:9000/user/hive/warehouse/cobra_client_consumer_cag/client_tid=3/cag_tid=150/150-r-00005
> 2012-01-06 16:27:28,053 INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigRecordReader: Current split being processed hdfs://tuamotu:9000/user/hive/warehouse/cobra_client_consumer_cag/client_tid=3/cag_tid=150/150-r-00006:0+6181475
> 2012-01-06 16:27:28,056 INFO com.proclivitysystems.etl.pig.udf.loaders.HiveLoader: Preparing to read: hdfs://tuamotu:9000/user/hive/warehouse/cobra_client_consumer_cag/client_tid=3/cag_tid=150/150-r-00005
>
>
> Notice how Pig is correctly reporting the split, but my "info"
> statement is always reporting the first input split instead of the current one.
>
> Bug? Jira? Patch?
>
> Thanks
> Alex R
>