Posted to user@pig.apache.org by "Chan, Tim" <tc...@edmunds.com> on 2012/04/16 22:31:03 UTC

ordering tuple after grouping

Given data:

(1, 55, abc)
(2, 23, asd)
(1, 85, xyz)
(1, 2, aaa)


I would like to group on $0 and then have the tuples within each group ordered by $1. Is this possible?

The output should look like this:

(1, {(1,2,aaa),(1,55,abc),(1,85,xyz)})
(2, {(2,23,asd)})


Then I would like to keep the first tuple for every group.

For example:

(1,2,aaa)
(2,23,asd)



Re: ordering tuple after grouping

Posted by Gianmarco De Francisci Morales <gd...@apache.org>.
I see, I hadn't understood your suggestion.
You meant replacing both ORDER and LIMIT with TOP.
Makes sense, thanks.

Cheers,
--
Gianmarco




Re: ordering tuple after grouping

Posted by Dmitriy Ryaboy <dv...@gmail.com>.
TOP doesn't need to sort the whole relation; it can be computed in a streaming fashion over any collection (O(n log k), where k << n). Plus it's algebraic (associative), since the top 10 of a set is the top 10 of all the top 10s of a covering collection of subsets.
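
The two properties described above can be illustrated outside Pig. What follows is only a minimal Python sketch, not Pig's TOP implementation: the helper name top_k and the split of the sample rows into two partitions are assumptions made up for illustration. It keeps the rows with the largest value in a column; heapq.nsmallest would be the analogue for the smallest.

```python
import heapq

def top_k(rows, k, column):
    """Return the k rows with the largest value in `column`.

    heapq.nlargest makes a single pass and, in the general case, keeps
    at most k candidates in a heap, so the cost is roughly O(n log k)
    instead of the O(n log n) of a full sort.
    """
    return heapq.nlargest(k, rows, key=lambda row: row[column])

# Illustrative rows, split into two hypothetical partitions.
partition_a = [(1, 55, "abc"), (1, 85, "xyz")]
partition_b = [(1, 2, "aaa"), (1, 23, "asd")]

# Top 2 computed over the whole collection at once.
direct = top_k(partition_a + partition_b, 2, column=1)

# Top 2 computed from the per-partition top 2s.
partials = top_k(partition_a, 2, column=1) + top_k(partition_b, 2, column=1)
merged = top_k(partials, 2, column=1)

print(direct == merged)
```

Because the partial results can be merged this way, the work can be spread over subsets of the data and combined afterwards, which is the point of the "top 10 of all the top 10s" remark.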


Re: ordering tuple after grouping

Posted by Gianmarco De Francisci Morales <gd...@apache.org>.
Hi Dmitriy,

Can you explain what the difference in the execution plan is?
And if there is a performance difference, shouldn't we try to fix it?

Cheers,
--
Gianmarco




Re: ordering tuple after grouping

Posted by Dmitriy Ryaboy <dv...@gmail.com>.
This works, but isn't the most efficient thing in the world.
Try using the TOP UDF instead.
http://pig.apache.org/docs/r0.8.1/api/org/apache/pig/builtin/TOP.html


Re: ordering tuple after grouping

Posted by Russell Jurney <ru...@gmail.com>.
Or even:

ordered = foreach (group data by $0) { sorted = order data by $1; first = limit sorted 1; generate flatten(first); }


Russell Jurney http://datasyndrome.com


RE: ordering tuple after grouping

Posted by "Chan, Tim" <tc...@edmunds.com>.
Dear Gianmarco,

It works great! Thanks.

Tim


Re: ordering tuple after grouping

Posted by Gianmarco De Francisci Morales <gd...@apache.org>.
Sure,
use a nested foreach.

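grouped = group data by $0;
ordered = foreach grouped {
  sorted = order data by $1;   -- sort the tuples of each group by the second field
  first = limit sorted 1;      -- keep only the first tuple of the sorted bag
  generate flatten(first);     -- flatten so each group yields a plain tuple, not a bag
}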

Beware, untested code.

Cheers,
--
Gianmarco
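
For readers who want to check the intended result without a Pig installation, the group / order / limit logic above can be mimicked in plain Python. This is only a sketch of the semantics, not of how Pig executes the script; the function name first_per_group is made up for illustration, and the input is the sample data from the original question.

```python
from collections import defaultdict

# Sample data from the original question.
data = [(1, 55, "abc"), (2, 23, "asd"), (1, 85, "xyz"), (1, 2, "aaa")]

def first_per_group(rows):
    """Group rows on field 0, order each group by field 1, keep the first row."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[0]].append(row)

    result = []
    for key in sorted(groups):
        # Equivalent of "order data by $1" followed by "limit sorted 1".
        ordered = sorted(groups[key], key=lambda row: row[1])
        result.append(ordered[0])
    return result

print(first_per_group(data))
# prints [(1, 2, 'aaa'), (2, 23, 'asd')]
```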


