Posted to user@pig.apache.org by ugo jardonnet <ug...@gmail.com> on 2011/04/26 15:11:26 UTC

TOP ordering

Hi. I am looking for a way to get the results of TOP in sorted order. Is it
possible?

Example:

A = LOAD 'datatest' USING PigStorage(';') as (first: chararray, second: int);
D = GROUP A BY first;
topResults = FOREACH D {
        result = TOP(3, 1, A); -- top 3 tuples by column 1 (second)
        GENERATE flatten(result); -- unordered
};
dump topResults;

best,

Re: TOP ordering

Posted by ugo jardonnet <ug...@gmail.com>.
Hmm.
In fact, TOP doesn't order its results. I was looking for a way to do this
from Pig.
The problem is that TOP returns a bag, which cannot be ordered. And of course
after the foreach it's too late.


Re: TOP ordering

Posted by Sven Krasser <kr...@gmail.com>.
At a glance it could be this: The first field in D.A is of type chararray,
but TOP orders based on long.
-Sven

-- 
http://sites.google.com/site/krasser/

Re: TOP ordering

Posted by ugo jardonnet <ug...@gmail.com>.
2011/4/26 Dmitriy Ryaboy <dv...@gmail.com>

> This may be helpful in understanding what happens when you do a group-by:
> http://squarecog.wordpress.com/2010/05/11/group-operator-in-apache-pig/

Thank you very much.

> Also, are you sure TOP doesn't give you items in order? It's a bag, but the
> implementation is such that flattening it should give you things in proper
> order (I think -- haven't tried).

I took a look at the implementation of TOP. The output bag is built by
iterating over a priority_queue, which is not guaranteed to return "the
elements in any particular order".

Thank you both for clearing things up about foreach.
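
To illustrate the point about the priority queue, here is a minimal sketch in
Python. It uses the standard heapq module as a stand-in for the queue inside
TOP; that substitution is an assumption made for illustration, and the
function names are hypothetical. It only shows that a heap keeps the n largest
values without keeping them sorted, so an explicit sort is still needed.

```python
import heapq

def top_n_unordered(values, n):
    """Keep the n largest values in a min-heap and return the raw heap layout."""
    heap = []
    for v in values:
        if len(heap) < n:
            heapq.heappush(heap, v)
        elif v > heap[0]:
            # The smallest kept value sits at heap[0]; replace it.
            heapq.heapreplace(heap, v)
    # Only heap[0] is guaranteed to be the minimum; the rest is not sorted.
    return list(heap)

def top_n_sorted(values, n):
    """Same selection, followed by an explicit descending sort."""
    return sorted(top_n_unordered(values, n), reverse=True)

scores = [10, 30, 20, 5]
raw = top_n_unordered(scores, 3)
ordered = top_n_sorted(scores, 3)
print(raw)
print(ordered)
```

The raw list holds the right three values, but only the explicitly sorted
list comes out largest-first.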

Re: TOP ordering

Posted by Dmitriy Ryaboy <dv...@gmail.com>.
This may be helpful in understanding what happens when you do a group-by:
http://squarecog.wordpress.com/2010/05/11/group-operator-in-apache-pig/

Also, are you sure TOP doesn't give you items in order? It's a bag, but the
implementation is such that flattening it should give you things in proper
order (I think -- haven't tried).

D


Re: TOP ordering

Posted by Alan Gates <ga...@yahoo-inc.com>.
A has changed. Outside the foreach, A is a relation (all the records
you loaded). Inside the foreach, A is a bag created by the group by.
So what this does is order the bag A by the field second, and then
take the top 3 records. Actually, given that order by goes from least
to greatest, this will give the bottom 3 records. You'll need to
change it to 'srtd = order A by second desc;' to get the top 3.

Alan.
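
The per-group behaviour described above can be sketched in Python. This is an
analogy under stated assumptions, not how Pig executes the script: the
function name and sample records are hypothetical, and the groups are visited
in sorted key order only to make the output deterministic. Each group gets its
own bag, which is sorted descending on the second field before the first 3
tuples are kept.

```python
from collections import defaultdict

def top_n_per_group(records, n=3):
    """Group (first, second) tuples by first; keep the n largest per group."""
    groups = defaultdict(list)
    for first, second in records:
        # Like GROUP A BY first: each key gets its own bag of tuples.
        groups[first].append((first, second))

    result = []
    for key in sorted(groups):  # assumption: sorted only for stable output
        bag = groups[key]
        # Like 'order A by second desc' applied to this group's bag only.
        srtd = sorted(bag, key=lambda t: t[1], reverse=True)
        # Like 'limit srtd 3', then flatten into the output.
        result.extend(srtd[:n])
    return result

records = [("a", 4), ("a", 9), ("a", 1), ("a", 7), ("b", 2), ("b", 5)]
top3 = top_n_per_group(records)
print(top3)
```

Dropping reverse=True gives the ascending order that returns the bottom 3
instead.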



Re: TOP ordering

Posted by ugo jardonnet <ug...@gmail.com>.
2011/4/26 Alan Gates <ga...@yahoo-inc.com>

> topResults = foreach D {
>        srtd = order A by second;
>        top3 = limit srtd 3;
>        generate flatten(top3);
> };
>
> Alan.

Thank you Alan. It works perfectly.

I realize I didn't really understand the mechanism behind foreach.
Reading this piece of code, I would have expected each top3 to be the same.
I suppose A is filtered by D at the beginning of the loop?

Re: TOP ordering

Posted by Alan Gates <ga...@yahoo-inc.com>.
topResults = foreach D {
	srtd = order A by second;
	top3 = limit srtd 3;
	generate flatten(top3);
};

Alan.