Posted to user@pig.apache.org by Michael Dalton <mw...@gmail.com> on 2010/04/07 09:45:47 UTC

Re: Bug in nested foreach with ORDER after grouping with multiple keys

I have identified the source of the bug: the secondary key optimizations
introduced in PIG-1038. If you run Pig with -Dpig.exec.nosecondarykey=true
then you get the correct result. I will try to get a patch together.
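
For anyone who wants to try the workaround, it is just a JVM property on the Pig
command line (the script name breakme.pig here is only a placeholder for your
own script):

```shell
# Disable the PIG-1038 secondary-key optimization for this run only.
# 'breakme.pig' is a placeholder, not a real file in the Pig distribution.
pig -Dpig.exec.nosecondarykey=true breakme.pig
```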

Best regards,

Mike

On Wed, Apr 7, 2010 at 12:08 AM, Michael Dalton <mw...@gmail.com> wrote:

> Hi,
>
> I've hit a somewhat obscure bug in the scripts I'm writing, caused by the
> combination of a few factors: grouping on multiple columns, PARALLEL > 1 for
> the grouping, and a nested FOREACH body after the group that sorts with ORDER.
> Removing any one of these factors (e.g. setting PARALLEL to 1, or changing the
> ORDER to a dummy FILTER) makes the bug disappear. The end result is that the
> final GROUP/ORDER runs with the wrong group key, producing incorrect output.
>
> I have a tiny input file that generates this behavior:
> http://pastebin.com/UQZkug8Y
> Here is a script showing the behavior in question:
>   log = LOAD '/tmp/breakme.txt' USING PigStorage(':')
>         AS (userid:int, email:chararray, subject:chararray, msgid:long);
>   group_email = GROUP log BY (userid, email) PARALLEL 10;
>   email_count = FOREACH group_email GENERATE group.userid, COUNT(log) AS count, group.email;
>   group_user = GROUP email_count BY userid PARALLEL 10;
>   top_for_user = FOREACH group_user {
>     sorted_count = ORDER email_count BY count DESC;
>     GENERATE group, sorted_count;
>   };
>   DUMP top_for_user;
>
> The expected output is that each (userid, sorted_list) pair occurs exactly
> once, with the list sorted in descending order by count. Instead, many
> (userid, partial_fragment_of_sorted_list) pairs appear for the same userid.
> Interestingly, each individual 'count' field is correct. If I had to hazard a
> guess, perhaps the composite key (userid, email) from the first GROUP is being
> reused, or multiple operations are being pushed into the same reducer despite
> requiring different ordering/grouping.
>
> Here is the (incorrect) output from the above script:
> (100,{(100,1L,b@hotmail.com),(100,1L,d@hotmail.com),(100,1L,abc123@hotmail.com),(100,1L,test123@hotmail.com),(100,1L,d@hotmail.co),(100,1L,there@foo.com)})
> (100,{(100,2L,hello@foo.com)})
> (101,{(101,1L,jaksld@jkalf.com),(101,1L,jakaslf@jlkasfds@.com),(101,1L,jaksld@jklaf.com)})
>
> Note how there are two entries for userid 100, which should be
> impossible. Here is the output if I change GROUP email_count BY userid
> PARALLEL 10 to use PARALLEL 1 instead. This produces the correct/expected
> result:
> (100,{(100,2L,hello@foo.com),(100,1L,b@hotmail.com),(100,1L,there@foo.com),(100,1L,d@hotmail.com),(100,1L,abc123@hotmail.com),(100,1L,d@hotmail.co),(100,1L,test123@hotmail.com)})
> (101,{(101,1L,jaksld@jkalf.com),(101,1L,jakaslf@jlkasfds@.com),(101,1L,jaksld@jklaf.com)})
>
> Let me know if there's anything I can do to further help/fix this issue.
>
> Best regards,
>
> Mike
>
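
For anyone following along without a cluster, here is a minimal Python sketch of
what the script above is supposed to compute (the sample rows below are invented
for illustration, not the pastebin data). It makes clear why each userid can
appear at most once in the final output:

```python
from collections import Counter, defaultdict

# Invented sample rows (userid, email) -- the real input is in the pastebin link.
rows = [
    (100, "hello@foo.com"), (100, "hello@foo.com"),
    (100, "b@hotmail.com"),
    (101, "jaksld@jkalf.com"),
]

# GROUP log BY (userid, email) followed by COUNT(log)
per_email = Counter(rows)

# GROUP email_count BY userid, then the nested ORDER ... BY count DESC
by_user = defaultdict(list)
for (userid, email), count in per_email.items():
    by_user[userid].append((userid, count, email))
for userid in by_user:
    by_user[userid].sort(key=lambda t: t[1], reverse=True)

# One row per userid, carrying that user's whole sorted bag
for userid in sorted(by_user):
    print((userid, by_user[userid]))
```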

Re: Bug in nested foreach with ORDER after grouping with multiple keys

Posted by Michael Dalton <mw...@gmail.com>.
Thanks Ashutosh,

I can confirm this issue was resolved by upgrading to the latest stable
Hadoop build, 0.20.2. The cause was definitely MAPREDUCE-565.

Best regards,

Mike

On Wed, Apr 7, 2010 at 9:06 AM, Ashutosh Chauhan <ashutosh.chauhan@gmail.com> wrote:


Re: Bug in nested foreach with ORDER after grouping with multiple keys

Posted by Ashutosh Chauhan <as...@gmail.com>.
Hi Mike,

Glad you debugged the issue. Once you have tried it on the upgraded
Hadoop version, can you let us know whether that resolved your problem?
It seems the issue occurs on Hadoop 0.20 and is fixed in Hadoop 0.20.1.

Ashutosh

On Wed, Apr 7, 2010 at 05:19, Michael Dalton <mw...@gmail.com> wrote:

Re: Bug in nested foreach with ORDER after grouping with multiple keys

Posted by Michael Dalton <mw...@gmail.com>.
I can confirm that the Partitioner isn't being respected:
SecondaryKeyPartitioner is ignored. This is due to
https://issues.apache.org/jira/browse/MAPREDUCE-565. This is not a bug in
Pig; it was an issue in Hadoop. I just need to upgrade Hadoop to resolve
MAPREDUCE-565.
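
To sketch why an ignored partitioner splits a group (a toy model in Python, not
Hadoop's actual hashing): with the secondary-key optimization, the map output
key is the composite (userid, email), so the custom partitioner must hash only
the userid part to keep all of a user's records on one reducer. If that
partitioner is silently dropped and the whole composite key is hashed instead,
one userid scatters across reducers, which matches the fragmented output
reported earlier in the thread:

```python
import zlib

NUM_REDUCERS = 10  # matches PARALLEL 10 in the script

# A few composite map-output keys, all for the same user
keys = [(100, "hello@foo.com"), (100, "b@hotmail.com"), (100, "there@foo.com")]

def partition(s):
    # Deterministic stand-in for a hash partitioner (crc32, not Hadoop's hashCode)
    return zlib.crc32(s.encode()) % NUM_REDUCERS

# Correct behavior: hash only the userid -> a single reducer for user 100
on_userid = {partition(str(uid)) for uid, email in keys}

# Broken behavior: the custom partitioner is ignored and the full
# composite key is hashed -> the same user can land on several reducers
on_composite = {partition(f"{uid}:{email}") for uid, email in keys}

print("reducers when partitioning on userid:      ", on_userid)
print("reducers when partitioning on (uid, email):", on_composite)
```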

Best regards,

Mike

On Wed, Apr 7, 2010 at 12:45 AM, Michael Dalton <mw...@gmail.com> wrote:
