Posted to user@pig.apache.org by Marco Cadetg <ma...@zattoo.com> on 2013/03/18 11:23:28 UTC

nested order limit by percentage of overall records

Hi there,

I would like to do something very similar to a nested foreach using
order by and then limit, but with the limit set relative to the
total number of records in each group.

users = load 'users' as (userid:chararray, money:long, region:chararray);
grouped_region = group users by region;
top_10_percent = foreach grouped_region {
            sorted = order users by money desc;
            -- e.g. for the top 10% it would be total users/10 in that region.
            top    = limit sorted $UKNOWN_HOWTO_SET;
            generate group, flatten(top);
};
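
For reference, the per-region computation being asked for can be sketched outside Pig. This is a minimal Python sketch, not Pig Latin. The function name top_fraction_per_region, the in-memory list of tuples, and the choice to round down while keeping at least one record are all assumptions made for illustration.

```python
from collections import defaultdict


def top_fraction_per_region(users, percent=10):
    """Return, per region, the richest `percent` percent of that region's users.

    `users` is an iterable of (userid, money, region) tuples, mirroring the
    schema of the load statement. Hypothetical helper, for illustration only.
    """
    by_region = defaultdict(list)
    for userid, money, region in users:
        by_region[region].append((userid, money, region))

    result = {}
    for region, rows in by_region.items():
        # Equivalent of "order users by money desc" in the nested foreach.
        rows.sort(key=lambda row: row[1], reverse=True)
        # The limit depends on the group size: users in the region times the
        # percentage, rounded down, but never fewer than one record.
        limit = max(1, len(rows) * percent // 100)
        result[region] = rows[:limit]
    return result
```

The point of the sketch is that the limit is derived from the size of each group rather than being a constant, which is the part the placeholder in the Pig script stands for.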

Thanks a lot for any help on this.

Cheers,
-Marco

Re: nested order limit by percentage of overall records

Posted by Marco Cadetg <ma...@zattoo.com>.
Actually, what I was looking for isn't distributed quantiles. I was
looking for the share that the top x% hold. E.g. in my example it could be
that the top 10% of the users hold 50% of the total money.

So it looks like I'll need to come up with a UDF which delivers this.
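
The calculation such a UDF would need to perform can be sketched as follows. This is a plain Python sketch under stated assumptions, not an existing Pig or DataFu function. The name top_share, the integer percent parameter, and the rule of rounding the group size down (with a minimum of one entry) are hypothetical choices.

```python
def top_share(amounts, percent=10):
    """Share of the total held by the richest `percent` percent of entries.

    `amounts` is a sequence of non-negative numbers, e.g. the money values
    of all users in one region. Returns a float between 0.0 and 1.0.
    """
    ordered = sorted(amounts, reverse=True)
    total = sum(ordered)
    if not ordered or total == 0:
        # Nothing to share out: report 0.0 instead of dividing by zero.
        return 0.0
    # Number of entries that make up the top group (assumed rounding rule).
    top_n = max(1, len(ordered) * percent // 100)
    return sum(ordered[:top_n]) / float(total)
```

Applied once per region, this would give the figure described above, for example a result of 0.5 when the top 10% of users hold half of the money.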

Cheers,
-Marco
On 19 Mar 2013 00:23, "Mike Sukmanowsky" <mi...@parsely.com> wrote:

> Distributed quantiles aren't an easy problem to solve (as you can see from
> LinkedIn's source) but perhaps in time they'll be brought into the core
> functions. It wasn't until 0.11.0 that date/time functions became
> built-ins; before that you had to use a combination of Piggybank and
> custom UDFs.

Re: nested order limit by percentage of overall records

Posted by Mike Sukmanowsky <mi...@parsely.com>.
Distributed quantiles aren't an easy problem to solve (as you can see from
LinkedIn's source) but perhaps in time they'll be brought into the core
functions. It wasn't until 0.11.0 that date/time functions became
built-ins; before that you had to use a combination of Piggybank and
custom UDFs.


On Mon, Mar 18, 2013 at 5:13 PM, Marco Cadetg <ma...@zattoo.com> wrote:

> Thanks a lot Mike. This seems to be what I'm looking for ;)
>
> I'm a bit disappointed that what I wanted to achieve isn't possible without
> using any UDF.
>
> Cheers,
> -Marco



-- 
Mike Sukmanowsky

Product Lead, http://parse.ly
989 Avenue of the Americas, 3rd Floor
New York, NY  10018
p: +1 (416) 953-4248
e: mike@parsely.com

Re: nested order limit by percentage of overall records

Posted by Marco Cadetg <ma...@zattoo.com>.
Thanks a lot Mike. This seems to be what I'm looking for ;)

I'm a bit disappointed that what I wanted to achieve isn't possible without
using any UDF.

Cheers,
-Marco


On Mon, Mar 18, 2013 at 9:40 PM, Mike Sukmanowsky <mi...@parsely.com> wrote:

> You should check out the quantile libraries in LinkedIn's DataFu UDFs:
> https://github.com/linkedin/datafu specifically
>
> https://github.com/linkedin/datafu/blob/master/src/java/datafu/pig/stats/Quantile.java for
> relatively small inputs, and
>
> https://github.com/linkedin/datafu/blob/master/src/java/datafu/pig/stats/StreamingQuantile.java for
> larger inputs.
>
> You can use this to receive the top x% for any given field and then you can
> use that within a filter

Re: nested order limit by percentage of overall records

Posted by Mike Sukmanowsky <mi...@parsely.com>.
You should check out the quantile libraries in LinkedIn's DataFu UDFs:
https://github.com/linkedin/datafu specifically
https://github.com/linkedin/datafu/blob/master/src/java/datafu/pig/stats/Quantile.java for
relatively small inputs, and
https://github.com/linkedin/datafu/blob/master/src/java/datafu/pig/stats/StreamingQuantile.java for
larger inputs.

You can use this to get the cutoff value for the top x% of any given field,
and then use that cutoff within a filter.
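
The cutoff-then-filter idea can be illustrated with a short Python sketch. It is not the DataFu implementation: the nearest-rank rule, the function names quantile_cutoff and filter_top, and the (userid, money) row layout are assumptions chosen to keep the example small.

```python
def quantile_cutoff(values, percent):
    """Nearest-rank percentile of a non-empty collection of numbers.

    `percent` is an integer from 1 to 100. This simple rule is an assumption
    and may differ from the interpolation a quantile library uses.
    """
    ordered = sorted(values)
    # Integer ceiling of percent * n / 100, kept within 1..n.
    rank = max(1, (percent * len(ordered) + 99) // 100)
    return ordered[rank - 1]


def filter_top(rows, percent=90, key=lambda row: row[1]):
    """Keep the rows whose key is at or above the `percent` percentile."""
    cutoff = quantile_cutoff([key(row) for row in rows], percent)
    # Rows equal to the cutoff are kept, so slightly more than the
    # nominal top share can pass the filter.
    return [row for row in rows if key(row) >= cutoff]
```

Because the filter compares against a value rather than counting records, ties at the cutoff all pass, which is one way this approach differs from an order followed by a limit.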

