Posted to user@pig.apache.org by chuang liu <li...@gmail.com> on 2009/12/18 01:46:14 UTC

confusing top N results

Hi:

We tried to get the top N results after a group-by and sort, and got
different results depending on whether we also store the full sorted
results. Here is a skeleton of our pig script.

  raw_data = Load '<input_files>' AS (f1, f2, ..., fn);
  grouped = group raw_data by (f1, f2);
  data = foreach grouped generate FLATTEN(group), SUM(raw_data.fk) as value;
  ordered = order data by value DESC parallel 10;
  topn = limit ordered 10;
  store ordered into 'outputdir/full';
  store topn into 'outputdir/topn';

With the statement 'store ordered ...', the top N results are incorrect, but
without it, the results are correct. Has anyone seen this before? I know a
similar bug has been fixed in the multi-query release. We are on Pig 0.4 and
Hadoop 0.20.1.

Thanks.

Chuang

RE: confusing top N results

Posted by Richard Ding <rd...@yahoo-inc.com>.
Can you try using 'parallel 1' instead of 'parallel 10' in your script? 
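
A minimal sketch of that change against the script from the original post:
only the parallel value on the order statement differs. With a single
reducer the sort writes one output partition, so the limit presumably has
nothing to merge across partitions; that reading is an assumption, not
something confirmed in this thread.

  -- was: ordered = order data by value DESC parallel 10;
  ordered = order data by value DESC parallel 1;
  topn = limit ordered 10;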

Thanks,
-Richard
-----Original Message-----
From: chuang liu [mailto:liuchuangyj@gmail.com] 
Sent: Monday, December 21, 2009 2:40 PM
To: pig-user@hadoop.apache.org
Subject: Re: confusing top N results

Thanks for your reply.  The problem happened for both ASC and DESC.

I have not tried the latest trunk yet. Right now, to get around the problem,
I store the full list first, and then load the full list, and limit records
to top N.

Chuang


Re: confusing top N results

Posted by chuang liu <li...@gmail.com>.
Thanks for your reply.  The problem happened for both ASC and DESC.

I have not tried the latest trunk yet. Right now, to get around the problem,
I store the full list first, and then load the full list, and limit records
to top N.
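
The workaround above, sketched as two separate runs. The aliases
full_sorted and resorted, the field names in the second load, and the
value:double type are assumptions for illustration (an untyped field summed
with SUM comes back as a double); '<input_files>' and the field list are the
placeholders from the original skeleton. The second run re-sorts before
limiting, which goes beyond what is described above, because a plain load is
not guaranteed to preserve the stored order.

  -- Run 1: compute and store only the full sorted list.
  raw_data = load '<input_files>' as (f1, f2, ..., fn);
  grouped = group raw_data by (f1, f2);
  data = foreach grouped generate FLATTEN(group), SUM(raw_data.fk) as value;
  ordered = order data by value DESC parallel 10;
  store ordered into 'outputdir/full';

  -- Run 2 (a separate script, started after run 1 has finished):
  -- reload the stored list and take the top N from it.
  full_sorted = load 'outputdir/full' as (f1, f2, value:double);
  resorted = order full_sorted by value DESC;
  topn = limit resorted 10;
  store topn into 'outputdir/topn';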

Chuang

On Fri, Dec 18, 2009 at 11:06 AM, Jianyong Dai <ji...@yahoo-inc.com>wrote:

> Have you tried latest trunk code? There are couple of bug fixes for limited
> sort lately. Also do you only see this in order by DESC?

Re: confusing top N results

Posted by Jianyong Dai <ji...@yahoo-inc.com>.
Have you tried the latest trunk code? There have been a couple of bug fixes
for limited sort lately. Also, do you only see this with order by DESC?

chuang liu wrote:
> the full ordered results are always correct. I only had problem with the top
> N results .


Re: confusing top N results

Posted by chuang liu <li...@gmail.com>.
The full ordered results are always correct. I only had a problem with the
top N results.

Chuang

On Thu, Dec 17, 2009 at 5:11 PM, Zaki Rahaman <za...@gmail.com>wrote:

> What multi-query release are you referring to? With multiquery execution on
> you should get the right results. I would check the logical and physical
> execution plan. As another test if you run with only store ordered see if
> the set of output files produced is correct.

Re: confusing top N results

Posted by Zaki Rahaman <za...@gmail.com>.
What multi-query release are you referring to? With multi-query
execution on, you should get the right results. I would check the
logical and physical execution plans. As another test, run with only
'store ordered' and see whether the set of output files produced is correct.
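
Both checks can be sketched in Pig as follows; this is an illustration using
the aliases from the original script, not output from the poster's cluster.
The explain statement prints the logical, physical, and map-reduce plans for
an alias.

  -- Inspect the plans Pig builds for the limited relation.
  explain topn;

  -- Separate test run: keep only the first store and drop
  -- "store topn ...", then inspect the part files under outputdir/full.
  store ordered into 'outputdir/full';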


RE: confusing top N results

Posted by Richard Ding <rd...@yahoo-inc.com>.
PIG-1169.

Thanks,
-Richard
-----Original Message-----
From: Mridul Muralidharan [mailto:mridulm@yahoo-inc.com] 
Sent: Tuesday, December 22, 2009 1:41 AM
To: pig-user@hadoop.apache.org
Subject: Re: confusing top N results


I have a feeling this is related to another issue I have seen;
the root cause is probably the same.
Someone attributed it to a diamond optimization that Pig tries to do here.

The bottom line for me was: if there is a suffix tree between the limit and
the store, or a split after the limit, things fail.


Your example below might be an additional data point for the devs in fixing
this issue (I don't know if there is a JIRA on this).

Regards,
Mridul



Re: confusing top N results

Posted by Mridul Muralidharan <mr...@yahoo-inc.com>.
I have a feeling this is related to another issue I have seen;
the root cause is probably the same.
Someone attributed it to a diamond optimization that Pig tries to do here.

The bottom line for me was: if there is a suffix tree between the limit and
the store, or a split after the limit, things fail.


Your example below might be an additional data point for the devs in fixing
this issue (I don't know if there is a JIRA on this).

Regards,
Mridul

chuang liu wrote:
> Hi:
> 
> We tried to get top N results after a groupby and sort, and got different
> results with or without storing the full sorted results. Here is a skeleton
> of our pig script.
> 
>   raw_data = Load '<input_files>' AS (f1, f2, ..., fn);
>   grouped = group raw_data by (f1, f2);
>   data = foreach grouped generate FLATTEN(group), SUM(raw_data.fk) as value;
>   ordered = order data by value DESC parallel 10;
>   topn = limit ordered 10;
>   store ordered into 'outputdir/full';
>   store topn into 'outputdir/topn';
> 
> With the statement 'store ordered ...', top N results are incorrect, but
> without the statement, results are correct. Has anyone seen this before? I
> know a similar bug has been fixed in the multi-query release. We are on pig
> .4 and hadoop .20.1.
> 
> Thanks.
> 
> Chuang