Posted to user@pig.apache.org by chuang liu <li...@gmail.com> on 2009/12/18 01:46:14 UTC
confusing top N results
Hi:
We tried to get the top N results after a group-by and sort, and got different
results depending on whether we also store the full sorted results. Here is a
skeleton of our Pig script.
raw_data = Load '<input_files>' AS (f1, f2, ..., fn);
grouped = group raw_data by (f1, f2);
data = foreach grouped generate FLATTEN(group), SUM(raw_data.fk) as value;
ordered = order data by value DESC parallel 10;
topn = limit ordered 10;
store ordered into 'outputdir/full';
store topn into 'outputdir/topn';
With the statement 'store ordered ...', the top N results are incorrect, but
without it, the results are correct. Has anyone seen this before? I
know a similar bug was fixed in the multi-query release. We are on Pig
0.4 and Hadoop 0.20.1.
Thanks.
Chuang
RE: confusing top N results
Posted by Richard Ding <rd...@yahoo-inc.com>.
Can you try using 'parallel 1' instead of 'parallel 10' in your script?
Thanks,
-Richard
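For reference, the suggested change touches only the `order` line of the posted skeleton; everything else stays as it was. It is a diagnostic to see whether the number of sort reducers matters, not a confirmed fix, and the fragment assumes the `data` alias from the original script is already defined.

```pig
-- Same statements as in the original script, but with a single reducer for the sort.
ordered = order data by value DESC parallel 1;
topn = limit ordered 10;
```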
-----Original Message-----
From: chuang liu [mailto:liuchuangyj@gmail.com]
Sent: Monday, December 21, 2009 2:40 PM
To: pig-user@hadoop.apache.org
Subject: Re: confusing top N results
Thanks for your reply. The problem happened for both ASC and DESC.
I have not tried the latest trunk yet. Right now, to get around the problem,
I store the full list first, and then load the full list, and limit records
to top N.
Chuang
Re: confusing top N results
Posted by chuang liu <li...@gmail.com>.
Thanks for your reply. The problem happened for both ASC and DESC.
I have not tried the latest trunk yet. Right now, to get around the problem,
I store the full list first, and then load the full list, and limit records
to top N.
Chuang
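A minimal sketch of this workaround as a second script, run after 'outputdir/full' has been written. The alias names `full_list` and `resorted`, the three-column schema, and the `double` type are assumptions, not part of the original script. The extra `order` is also an assumption: Pig does not promise that a freshly loaded relation keeps the order of the files on disk, so the sketch sorts again before applying the limit.

```pig
-- Second script: load the already-stored full list.
-- Assumed schema: the two group keys plus the summed value.
full_list = load 'outputdir/full' as (f1, f2, value:double);

-- Re-sort (an assumption, see above), then keep the first 10 rows.
resorted = order full_list by value desc;
topn = limit resorted 10;

store topn into 'outputdir/topn';
```

Because the limit now lives in its own script with a single store, it no longer shares a plan with 'store ordered'.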
On Fri, Dec 18, 2009 at 11:06 AM, Jianyong Dai <ji...@yahoo-inc.com>wrote:
> Have you tried latest trunk code? There are couple of bug fixes for limited
> sort lately. Also do you only see this in order by DESC?
Re: confusing top N results
Posted by Jianyong Dai <ji...@yahoo-inc.com>.
Have you tried the latest trunk code? There have been a couple of bug fixes for
limited sort lately. Also, do you see this only with order by DESC?
chuang liu wrote:
> the full ordered results are always correct. I only had problem with the top
> N results .
Re: confusing top N results
Posted by chuang liu <li...@gmail.com>.
The full ordered results are always correct. I only had a problem with the top
N results.
Chuang
On Thu, Dec 17, 2009 at 5:11 PM, Zaki Rahaman <za...@gmail.com>wrote:
> What multi-query release are you referring to? With multiquery execution on
> you should get the right results. I would check the logical and physical
> execution plan. As another test if you run with only store ordered see if
> the set of output files produced is correct.
Re: confusing top N results
Posted by Zaki Rahaman <za...@gmail.com>.
Which multi-query release are you referring to? With multiquery
execution on, you should get the right results. I would check the
logical and physical execution plans. As another test, run with
only 'store ordered' and see if the set of output files produced is correct.
Sent from my iPhone
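One way to inspect the plans mentioned above is Pig's `explain` command, which prints the logical, physical, and MapReduce plans for an alias. A sketch, assuming the aliases from the script in the original post are already defined in the Grunt session:

```pig
-- Print the plans Pig builds for each stored alias and compare them.
explain ordered;
explain topn;
```

To compare against a run with multi-query execution turned off, Pig releases that include the feature should accept the `-M` (`-no_multiquery`) command-line option.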
RE: confusing top N results
Posted by Richard Ding <rd...@yahoo-inc.com>.
PIG-1169.
Thanks,
-Richard
-----Original Message-----
From: Mridul Muralidharan [mailto:mridulm@yahoo-inc.com]
Sent: Tuesday, December 22, 2009 1:41 AM
To: pig-user@hadoop.apache.org
Subject: Re: confusing top N results
I have a feeling this is related to another issue I have seen;
the root cause is probably the same.
Someone attributed it to a diamond optimization that Pig tries to do here.
The bottom line for me was that if there is a suffix tree between limit and
store, or there is a split after the limit, things fail.
Your example might be an additional data point for the devs in fixing
this issue (I don't know if there is a JIRA on this).
Regards,
Mridul
Re: confusing top N results
Posted by Mridul Muralidharan <mr...@yahoo-inc.com>.
I have a feeling this is related to another issue I have seen;
the root cause is probably the same.
Someone attributed it to a diamond optimization that Pig tries to do here.
The bottom line for me was that if there is a suffix tree between limit and
store, or there is a split after the limit, things fail.
Your example might be an additional data point for the devs in fixing
this issue (I don't know if there is a JIRA on this).
Regards,
Mridul