Posted to user@pig.apache.org by Dave Viner <da...@vinertech.com> on 2010/06/29 17:14:55 UTC

UDF for generating top xx % of results?

Is there a UDF for generating the top X % of results?  For example, in a log
parsing context, it might be the set of search queries that represent the
top 80% of all queries.

I see in the piggybank that there is a TOP function, but that only takes the
top *number* of results, rather than a percentile.

Thanks
Dave Viner

Re: UDF for generating top xx % of results?

Posted by Thejas Nair <te...@yahoo-inc.com>.


On 6/30/10 9:02 AM, "hc busy" <hc...@gmail.com> wrote:

> @Thejas  I had thought that limit is distributed and does not guarantee you
> get the results in order?
> 

As mentioned under section on limit here -
http://hadoop.apache.org/pig/docs/r0.7.0/piglatin_ref2.html#LIMIT
" There is no guarantee which tuples will be returned, and the tuples that
are returned can change from one run to the next. A particular set of tuples
can be requested using the ORDER operator followed by LIMIT. "


The query -
Set default_parallel 10;
L = load 'x';
O = order L by $0;
LIM = limit O 100;

will result in 3 MR jobs (see the explain output for details):
1st MR - the sampling job for the order-by, which determines the distribution
of the sort key and decides how to partition the data for ordering.
2nd MR - orders the data. Each reduce task outputs only its first 100
records.
3rd MR - applies the final limit. The map reads the data with the sort key as
the key, and a single reduce task reads the first 100 records.
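The determinism point above can be illustrated outside Pig. This is a small Python sketch (not Pig, hypothetical data): a plain LIMIT may return the first N tuples from whichever partition happens to arrive first, while limiting after a total order always returns the same N smallest keys.

```python
# Sketch of why LIMIT alone is nondeterministic but ORDER ... + LIMIT is not.

def plain_limit(partitions, n):
    """Take the first n tuples encountered, in whatever partition order."""
    out = []
    for part in partitions:
        for t in part:
            if len(out) == n:
                return out
            out.append(t)
    return out

def order_then_limit(partitions, n):
    """Impose a total order on the sort key, then take the first n."""
    merged = sorted(t for part in partitions for t in part)
    return merged[:n]

parts_a = [[5, 1], [9, 2]]
parts_b = [[9, 2], [5, 1]]   # same data, different partitioning order

plain_limit(parts_a, 2)       # -> [5, 1]
plain_limit(parts_b, 2)       # -> [9, 2]  (different result, same data)
order_then_limit(parts_a, 2)  # -> [1, 2]
order_then_limit(parts_b, 2)  # -> [1, 2]  (same result either way)
```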



 


Re: UDF for generating top xx % of results?

Posted by hc busy <hc...@gmail.com>.
Ahh, I see, I seem to have misread the question. If it's the top xx% of
entries, then certainly sort and then limit.

@Thejas  I had thought that limit is distributed and does not guarantee you
get the results in order?

@Dave
here's what to do after you get min + (max-min)*0.95 into V, which now has
just one row.

P = group TABLE all;
U = foreach P generate MIN(TABLE.x) as min, MAX(TABLE.x) as max;
V = foreach U generate min + (max-min)*0.95 as v;
D = cross TABLE, V;
E = filter D by TABLE::x <= V::v;

We should really start a wiki page to collect these common calculations in
Pig... I'd imagine a lot of people out there have needed to do this at one
time or another.

This is also why, imnsho, I'd like to see cross allowed as a nested op inside
foreach, precisely for this type of application.
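The recipe above can be sketched locally. This is a Python illustration (not Pig, hypothetical counts) of the two passes: pass 1 computes the cutoff min + (max - min) * 0.95, pass 2 filters against it. One caveat worth noting: this is a cutoff on the value *range*, which only coincides with a true 95th percentile of the distribution when values are roughly uniformly spread.

```python
# Two-pass range-based cutoff, as in the Pig snippet above.
counts = [1, 2, 2, 3, 50, 80, 100]

# Pass 1: one scan for min and max, then the cutoff.
lo, hi = min(counts), max(counts)
cutoff = lo + (hi - lo) * 0.95      # 1 + 99 * 0.95 = 95.05

# Pass 2: filter (or split) against the cutoff.
below = [c for c in counts if c <= cutoff]   # everything up to the cutoff
top = [c for c in counts if c > cutoff]      # the tail above it -> [100]
```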


On Tue, Jun 29, 2010 at 6:42 PM, Dave Viner <da...@pobox.com> wrote:

> Ok.  I think I see how it's working.
>
> As a side note to this conversation, the wiki at
> http://wiki.apache.org/pig/PigStreamingFunctionalSpec indicates that what
> I
> was looking for *should* be possible with this code:
>
> A = load 'data1';
> B = group A all;
> C = foreach B generate COUNT(B);
> store C into 'count';
> D = load 'data2';
> E = foreach D generate $1/GetScalar(C);
>
>
> However, GetScalar is no longer present in 0.7.0.
>
> Thanks for your help.
>
> Dave Viner
>
>
>
> On Tue, Jun 29, 2010 at 5:18 PM, Thejas Nair <te...@yahoo-inc.com> wrote:
>
> >
> >
> >
> > On 6/29/10 4:22 PM, "Dave Viner" <da...@pobox.com> wrote:
> >
> > > Hi Thejas,
> > >
> > > Thanks for this suggestion.  Just to make sure I understand the
> strategy
> > > here:
> > >
> > > Pig query 1 outputs 2 files (or directories w/ files in them):
> > > "file.ordered" which contains the data set (query, count) in ascending
> > order
> > > of count; and "total_rows" which contains 1 line with the total number
> of
> > > rows.
> > >
> > > Then, I need a small script that reads the value of "total_rows",
> > multiplies
> > > that value by the desired percentile (0.95 in the example), and invokes
> > pig
> > > a second time, passing the "TOP_ROWS" value as a define on the command
> > line.
> > >  The second pig query outputs the desired file in "output" which
> contains
> > > only those rows which have a count above the given threshold.
> > >
> >
> >
> > Yes, you are right. But the "output" contains 0.95 * total_num_rows rows;
> > it is not all rows that have a value of count above the given threshold. So
> > it is possible that a few rows that are not in the output have the same
> > value of count as the last result in "output".
> >  For example, if the input has 100 rows with 1 as the count value, you will
> > get any 95 rows in the output.
> >
> > -Thejas
> >
> >
> >
> > > Is that correct?
> > >
> > > Thanks
> > > Dave Viner
> > >
> > >
> > >
> > > On Tue, Jun 29, 2010 at 4:11 PM, Thejas Nair <te...@yahoo-inc.com>
> > wrote:
> > >
> > >> This pig query will not give you the "top xx% of results". I.e., if xx
> > >> = 50% and results = 200, you may not get 100 rows in the output.
> > >> If you really want the "top xx% of results", you need to do something
> > >> like the following -
> > >> (if you just want rows that have count < (max(count) - min(count))*0.95,
> > >> you can store and use the value of V the way TOTAL_ROWS is stored in
> > >> this example)
> > >>
> > >> Pig query 1
> > >> -----------
> > >> L = load 'file' as (query, count); -- assuming that count is already
> > there
> > >> for simplicity
> > >> O = order L by count; -- order it by column that defines 'top'
> > >> G = group L by 'all';
> > >> TOTAL_ROWS = foreach G generate COUNT(L);
> > >> store O into 'file.ordered';
> > >>
> > >>  -- at present, there is no way to use the 'TOTAL_ROWS' directly to
> > filter
> > >> the results
> > >> store TOTAL_ROWS into 'total_rows.txt';
> > >>
> > >>
> > >>
> > >> Pig query 2
> > >> -----------
> > >> -- assuming this query is invoked with -param TOP_ROWS=X*0.95, where X
> > is
> > >> the result stored in  total_rows.txt
> > >> L = load 'file.ordered';
> > >> LIM = limit L $TOP_ROWS;
> > >> Store LIM into 'output';
> > >>
> > >>
> > >>
> > >> On 6/29/10 3:24 PM, "Dave Viner" <da...@pobox.com> wrote:
> > >>
> > >>> Actually, I've gotten the first half of the code to work now.
> Here's
> > >> how
> > >>> it looks:
> > >>>
> > >>>
> > >>> X = LOAD 'samples/excite-small.log' USING PigStorage('\t') AS
> > >>> (user:chararray, time:long, query:chararray);
> > >>> Z = FILTER X BY query is not null;
> > >>> A = GROUP Z BY query;
> > >>> B = FOREACH A GENERATE group as query:chararray, COUNT(Z) as
> > count:long;
> > >>> C = GROUP B ALL;
> > >>> U = FOREACH C GENERATE MIN(B.count) as min:long, MAX(B.count) as
> > >> max:long;
> > >>> V = FOREACH U GENERATE min + (max-min)*0.95;
> > >>>
> > >>> V is appropriately set... but I can't perform the final step of
> > actually
> > >>> filtering the values by that count.
> > >>>
> > >>> The simple approach:
> > >>> TOP = FILTER B BY count > V;
> > >>>
> > >>> doesn't work... ERROR 1000: Error during parsing. Invalid alias: V in
> > >>> {query: chararray,count: long}
> > >>>
> > >>> Same with SPLIT.
> > >>>
> > >>> How do I filter or split on the value of V?
> > >>>
> > >>> Dave Viner
> > >>>
> > >>> On Tue, Jun 29, 2010 at 3:00 PM, Dave Viner <da...@pobox.com>
> > wrote:
> > >>>
> > >>>> I don't quite understand this pig latin.  The piggybank
> > >>>> function org.apache.pig.piggybank.evaluation.math.MIN takes 2
> > parameters
> > >>>> which are compared.  Here's the sample I'm trying using the tutorial
> > >>>> excitelog as a sample.
> > >>>>
> > >>>> X = LOAD 'samples/excite-small.log' USING PigStorage('\t') AS
> > >>>> (user:chararray, time:long, query:chararray);
> > >>>> Z = FILTER X BY query is not null;
> > >>>> A = GROUP Z BY query;
> > >>>> B = FOREACH A GENERATE group as query:chararray, COUNT(Z) as
> > count:long;
> > >>>> U = FOREACH B GENERATE *,
> > >>>>     MIN(count) as min:long,
> > >>>>     MAX(count) as max:long;
> > >>>>
> > >>>> This doesn't seem to work at all.  It dies with this error:
> > >>>> ERROR 1022: Type mismatch merging schema prefix. Field Schema:
> double.
> > >>>> Other Field Schema: min: long
> > >>>>
> > >>>> Changing the min:long and max:long to doubles (as suggested by the
> > error
> > >>>> message), causes this error:
> > >>>> ERROR 1045: Could not infer the matching function for
> > >>>> org.apache.pig.builtin.MIN as multiple or none of them fit. Please
> use
> > >> an
> > >>>> explicit cast.
> > >>>>
> > >>>> What am I missing in using the sample code you've provided?  I can't
> > >> seem
> > >>>> to get it to work...
> > >>>>
> > >>>> Thanks for your help.
> > >>>> Dave Viner
> > >>>>
> > >>>>
> > >>>> On Tue, Jun 29, 2010 at 10:17 AM, hc busy <hc...@gmail.com>
> wrote:
> > >>>>
> >>>>> That's what I tried to say in my last email. I don't believe you can
> >>>>> calculate the percentiles exactly in just one pass. Writing out the
> >>>>> pig for the two-pass algorithm should be easy enough.
> > >>>>>
> > >>>>> P = group TABLE all;
> >>>>> U = foreach P generate MIN(TABLE.x) as min, MAX(TABLE.x) as max;
> > >>>>> V = foreach U generate min + (max-min)*0.95;
> > >>>>>
> >>>>> would give you the 95th percentile cutoff, and you just filter or
> >>>>> split by V.
> > >>>>>
> > >>>>>
> > >>>>> On Tue, Jun 29, 2010 at 10:03 AM, Dave Viner <da...@pobox.com>
> > >> wrote:
> > >>>>>
> > >>>>>> How would I calculate the percentile in one pass?  In order to
> > >> calculate
> > >>>>>> the
> > >>>>>> percentile for each item, I need to know the total count.  How do
> I
> > >> get
> > >>>>> the
> > >>>>>> total count, and then calculate each item's percentile in one
> pass?
> > >>>>>>
> > >>>>>> I don't mind doing multiple passes - I am just not sure how to
> make
> > >> the
> > >>>>>> calculation.
> > >>>>>>
> > >>>>>> Thanks
> > >>>>>> Dave Viner
> > >>>>>>
> > >>>>>>
> > >>>>>> On Tue, Jun 29, 2010 at 9:59 AM, hc busy <hc...@gmail.com>
> wrote:
> > >>>>>>
> > >>>>>>> I think it's impossible to do this within one M/R. You will want
> to
> > >>>>>>> implement it in two M/R in Pig, because you have to calculate the
> > >>>>>>> percentile
> > >>>>>>> in pass 1, and then perform the filter in pass 2.
> > >>>>>>>
> > >>>>>>>
> > >>>>>>> On Tue, Jun 29, 2010 at 8:14 AM, Dave Viner <da...@vinertech.com>
> > >>>>> wrote:
> > >>>>>>>
> > >>>>>>>> Is there a UDF for generating the top X % of results?  For
> > example,
> > >>>>> in
> > >>>>>> a
> > >>>>>>>> log
> > >>>>>>>> parsing context, it might be the set of search queries that
> > >>>>> represent
> > >>>>>> the
> > >>>>>>>> top 80% of all queries.
> > >>>>>>>>
> > >>>>>>>> I see in the piggybank that there is a TOP function, but that
> > >>>>>>>> only takes the top *number* of results, rather than a percentile.
> > >>>>>>>>
> > >>>>>>>> Thanks
> > >>>>>>>> Dave Viner
> > >>>>>>>>
> > >>>>>>>
> > >>>>>>
> > >>>>>
> > >>>>
> > >>>>
> > >>
> > >>
> >
> >
>

Re: UDF for generating top xx % of results?

Posted by Dave Viner <da...@pobox.com>.
Ok.  I think I see how it's working.

As a side note to this conversation, the wiki at
http://wiki.apache.org/pig/PigStreamingFunctionalSpec indicates that what I
was looking for *should* be possible with this code:

A = load 'data1';
B = group A all;
C = foreach B generate COUNT(B);
store C into 'count';
D = load 'data2';
E = foreach D generate $1/GetScalar(C);


However, GetScalar is no longer present in 0.7.0.
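For what it's worth, the intent of that GetScalar snippet can be sketched locally in Python (hypothetical data, not a Pig API): compute a single scalar (the row count of one dataset) and use it inside a per-row expression over another dataset.

```python
# Local sketch of "E = foreach D generate $1/GetScalar(C);" -- divide each
# value in one dataset by a scalar computed from another dataset.

data1 = [('a',), ('b',), ('c',), ('d',)]   # stands in for 'data1'
data2 = [('x', 1.0), ('y', 3.0)]           # stands in for 'data2'

total = len(data1)                          # the COUNT over the grouped-all bag
shares = [val / total for _, val in data2]  # each row's $1 divided by the scalar
# shares -> [0.25, 0.75]
```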

Thanks for your help.

Dave Viner



On Tue, Jun 29, 2010 at 5:18 PM, Thejas Nair <te...@yahoo-inc.com> wrote:

>
>
> Yes, you are right. But the "output" contains 0.95 * total_num_rows rows; it
> is not all rows that have a value of count above the given threshold. So it
> is possible that a few rows that are not in the output have the same value of
> count as the last result in "output".
>  For example, if the input has 100 rows with 1 as the count value, you will
> get any 95 rows in the output.
>
> -Thejas

Re: UDF for generating top xx % of results?

Posted by Thejas Nair <te...@yahoo-inc.com>.


On 6/29/10 4:22 PM, "Dave Viner" <da...@pobox.com> wrote:

> Hi Thejas,
> 
> Thanks for this suggestion.  Just to make sure I understand the strategy
> here:
> 
> Pig query 1 outputs 2 files (or directories w/ files in them):
> "file.ordered" which contains the data set (query, count) in ascending order
> of count; and "total_rows" which contains 1 line with the total number of
> rows.
> 
> Then, I need a small script that reads the value of "total_rows", multiplies
> that value by the desired percentile (0.95 in the example), and invokes pig
> a second time, passing the "TOP_ROWS" value as a define on the command line.
>  The second pig query outputs the desired file in "output" which contains
> only those rows which have a count above the given threshold.
> 


Yes, you are right. But the "output" contains 0.95 * total_num_rows rows; it is
not all rows that have a value of count above the given threshold. So it is
possible that a few rows that are not in the output have the same value of
count as the last result in "output".
 For example, if the input has 100 rows with 1 as the count value, you will get
any 95 rows in the output.

-Thejas



> Is that correct?
> 
> Thanks
> Dave Viner


Re: UDF for generating top xx % of results?

Posted by Dave Viner <da...@pobox.com>.
Hi Thejas,

Thanks for this suggestion.  Just to make sure I understand the strategy
here:

Pig query 1 outputs 2 files (or directories w/ files in them):
"file.ordered" which contains the data set (query, count) in ascending order
of count; and "total_rows" which contains 1 line with the total number of
rows.

Then, I need a small script that reads the value of "total_rows", multiplies
that value by the desired percentile (0.95 in the example), and invokes pig
a second time, passing the "TOP_ROWS" value as a define on the command line.
 The second pig query outputs the desired file in "output" which contains
only those rows which have a count above the given threshold.

Is that correct?

Thanks
Dave Viner
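The two-query strategy described above can be sketched locally in Python (hypothetical data, not Pig). One subtlety: to keep the *top* rows, the sort here is descending by count, whereas the quoted Pig query orders ascending.

```python
# Sketch of the two-pass plan: query 1 orders by count and records the total
# row count; a driver script computes TOP_ROWS; query 2 limits the ordered data.
import math

rows = [('q1', 7), ('q2', 1), ('q3', 4), ('q4', 9)]   # (query, count)

# "Query 1": order by count and compute TOTAL_ROWS.
ordered = sorted(rows, key=lambda r: r[1], reverse=True)  # highest counts first
total_rows = len(rows)

# The driver script between the two queries: TOP_ROWS = total * fraction.
top_rows = math.ceil(total_rows * 0.5)    # top 50% here, for a small example

# "Query 2": LIMIT the ordered data.
result = ordered[:top_rows]
# result -> [('q4', 9), ('q1', 7)]
```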



On Tue, Jun 29, 2010 at 4:11 PM, Thejas Nair <te...@yahoo-inc.com> wrote:

> This pig query will not give you the "top xx% of results". I.e., if xx = 50%
> and results = 200, you may not get 100 rows in the output.
> If you really want the "top xx% of results", you need to do something like the
> following -
> (if you just want rows that have count < (max(count) - min(count))*0.95,
> you can store and use the value of V the way TOTAL_ROWS is stored in this
> example)
>
> Pig query 1
> -----------
> L = load 'file' as (query, count); -- assuming that count is already there
> for simplicity
> O = order L by count; -- order it by column that defines 'top'
> G = group L by 'all';
> TOTAL_ROWS = foreach G generate COUNT(L);
> store O into 'file.ordered';
>
>  -- at present, there is no way to use the 'TOTAL_ROWS' directly to filter
> the results
> store TOTAL_ROWS into 'total_rows.txt';
>
>
>
> Pig query 2
> -----------
> -- assuming this query is invoked with -param TOP_ROWS=X*0.95, where X is
> the result stored in  total_rows.txt
> L = load 'file.ordered';
> LIM = limit L $TOP_ROWS;
> Store LIM into 'output';

Re: UDF for generating top xx % of results?

Posted by Dave Viner <da...@pobox.com>.
Hi Aniket,

Is it possible to pass an alias to a UDF?  In other words, could I do:

TOP = FILTER B BY myCompare(count, V)

or something similar?

I don't quite understand your idea of using STORE.  If I do:

STORE V into 'threshold';
XX = LOAD 'threshold' AS (threshold:long);

How can I filter the count by XX?  Seems like I'm back in the same
situation.

Do you mean doing the STORE(), then directly modifying the files themselves
before doing the LOAD() so that it contains one row for each value of B,
with the threshold value V appended to the end?

Dave Viner

On Tue, Jun 29, 2010 at 3:57 PM, Aniket Mokashi <am...@andrew.cmu.edu>wrote:

> Hi Dave,
>
> TOP = FILTER B BY count > V; would not work right now with pig. Alias can
> be
> casted into scalars after https://issues.apache.org/jira/browse/PIG-1434 is
> added. As far as I know, there is no simple way of using values generated
> by
> mapreduce jobs directly into Pig jobs, you will need UDF to do that.
> One dirty approach is to store this value in a location with a store and
> then later use it to generate a new file with $0 and V and filter with $0 >
> $1.
>
> Thanks,
> Aniket
>
> -----Original Message-----
> From: daveviner@gmail.com [mailto:daveviner@gmail.com] On Behalf Of Dave
> Viner
> Sent: Tuesday, June 29, 2010 3:24 PM
> To: pig-user@hadoop.apache.org
> Subject: Re: UDF for generating top xx % of results?
>
> Actually, I've gotten the first half of the code to work now.   Here's how
> it looks:
>
>
> X = LOAD 'samples/excite-small.log' USING PigStorage('\t') AS
> (user:chararray, time:long, query:chararray);
> Z = FILTER X BY query is not null;
> A = GROUP Z BY query;
> B = FOREACH A GENERATE group as query:chararray, COUNT(Z) as count:long;
> C = GROUP B ALL;
> U = FOREACH C GENERATE MIN(B.count) as min:long, MAX(B.count) as max:long;
> V = FOREACH U GENERATE min + (max-min)*0.95;
>
> V is appropriately set... but I can't perform the final step of actually
> filtering the values by that count.
>
> The simple approach:
> TOP = FILTER B BY count > V;
>
> doesn't work... ERROR 1000: Error during parsing. Invalid alias: V in
> {query: chararray,count: long}
>
> Same with SPLIT.
>
> How do I filter or split on the value of V?
>
> Dave Viner
>
> On Tue, Jun 29, 2010 at 3:00 PM, Dave Viner <da...@pobox.com> wrote:
>
> > I don't quite understand this pig latin.  The piggybank
> > function org.apache.pig.piggybank.evaluation.math.MIN takes 2 parameters
> > which are compared.  Here's the sample I'm trying using the tutorial
> > excitelog as a sample.
> >
> > X = LOAD 'samples/excite-small.log' USING PigStorage('\t') AS
> > (user:chararray, time:long, query:chararray);
> > Z = FILTER X BY query is not null;
> > A = GROUP Z BY query;
> > B = FOREACH A GENERATE group as query:chararray, COUNT(Z) as count:long;
> > U = FOREACH B GENERATE *,
> >     MIN(count) as min:long,
> >     MAX(count) as max:long;
> >
> > This doesn't seem to work at all.  It dies with this error:
> > ERROR 1022: Type mismatch merging schema prefix. Field Schema: double.
> > Other Field Schema: min: long
> >
> > Changing the min:long and max:long to doubles (as suggested by the error
> > message), causes this error:
> > ERROR 1045: Could not infer the matching function for
> > org.apache.pig.builtin.MIN as multiple or none of them fit. Please use an
> > explicit cast.
> >
> > What am I missing in using the sample code you've provided?  I can't seem
> > to get it to work...
> >
> > Thanks for your help.
> > Dave Viner
> >
> >
> > On Tue, Jun 29, 2010 at 10:17 AM, hc busy <hc...@gmail.com> wrote:
> >
> >> That's what I tried to say in my last email.I don't believe you can
> >> calculate exactly the percentiles in just one pass. Writing out the pig
> >> for
> >> two pass algorithm should be easy enough..
> >>
> >> P = group TABLE all;
> >> U = foreach P generate MIN(x) as min, MAX(x) as max;
> >> V = foreach U generate min + (max-min)*0.95;
> >>
> >> would give you the 95th percentile cutoff, and u just filter or split by
> >> V.
> >>
> >>
> >> On Tue, Jun 29, 2010 at 10:03 AM, Dave Viner <da...@pobox.com>
> wrote:
> >>
> >> > How would I calculate the percentile in one pass?  In order to
> calculate
> >> > the
> >> > percentile for each item, I need to know the total count.  How do I
> get
> >> the
> >> > total count, and then calculate each item's percentile in one pass?
> >> >
> >> > I don't mind doing multiple passes - I am just not sure how to make
> the
> >> > calculation.
> >> >
> >> > Thanks
> >> > Dave Viner
> >> >
> >> >
> >> > On Tue, Jun 29, 2010 at 9:59 AM, hc busy <hc...@gmail.com> wrote:
> >> >
> >> > > I think it's impossible to do this within one M/R. You will want to
> >> > > implement it in two M/R in Pig, because you have to calculate the
> >> > > percentile
> >> > > in pass 1, and then perform the filter in pass 2.
> >> > >
> >> > >
> >> > > On Tue, Jun 29, 2010 at 8:14 AM, Dave Viner <da...@vinertech.com>
> >> wrote:
> >> > >
> >> > > > Is there a UDF for generating the top X % of results?  For
> example,
> >> in
> >> > a
> >> > > > log
> >> > > > parsing context, it might be the set of search queries that
> >> represent
> >> > the
> >> > > > top 80% of all queries.
> >> > > >
> >> > > > I see in the piggybank that there is a TOP function, but that only
> >> > takes
> >> > > > the
> >> > > > top *number* of results, rather a percentile.
> >> > > >
> >> > > > Thanks
> >> > > > Dave Viner
> >> > > >
> >> > >
> >> >
> >>
> >
> >
>
>

RE: UDF for generating top xx % of results?

Posted by Aniket Mokashi <am...@andrew.cmu.edu>.
Hi Dave,

TOP = FILTER B BY count > V; would not work right now with Pig. An alias can be
cast into a scalar after https://issues.apache.org/jira/browse/PIG-1434 is
added. As far as I know, there is no simple way of using values generated by
MapReduce jobs directly in Pig jobs; you will need a UDF to do that.
One dirty approach is to STORE this value to a location and then later use it
to generate a new file pairing each $0 with V, and filter with $0 > $1.
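That "dirty approach" amounts to broadcasting the stored scalar against every row and then filtering on the pair. A minimal sketch of that shape outside Pig (function and data names are illustrative, not from the thread):

```python
# Pair the stored threshold V with every (query, count) row, the way a cross
# with a one-row relation would in Pig, then filter on ($0, $1) with $0 > $1.

def broadcast_and_filter(rows, stored_threshold):
    # Each row becomes (count, threshold), mirroring the "$0 and V" file.
    paired = [(count, stored_threshold) for _, count in rows]
    return [row for row, (c, t) in zip(rows, paired) if c > t]

rows = [("a", 10), ("b", 3), ("c", 7)]
print(broadcast_and_filter(rows, 5))  # keeps ("a", 10) and ("c", 7)
```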

Thanks,
Aniket

-----Original Message-----
From: daveviner@gmail.com [mailto:daveviner@gmail.com] On Behalf Of Dave
Viner
Sent: Tuesday, June 29, 2010 3:24 PM
To: pig-user@hadoop.apache.org
Subject: Re: UDF for generating top xx % of results?

Actually, I've gotten the first half of the code to work now.   Here's how
it looks:


X = LOAD 'samples/excite-small.log' USING PigStorage('\t') AS
(user:chararray, time:long, query:chararray);
Z = FILTER X BY query is not null;
A = GROUP Z BY query;
B = FOREACH A GENERATE group as query:chararray, COUNT(Z) as count:long;
C = GROUP B ALL;
U = FOREACH C GENERATE MIN(B.count) as min:long, MAX(B.count) as max:long;
V = FOREACH U GENERATE min + (max-min)*0.95;

V is appropriately set... but I can't perform the final step of actually
filtering the values by that count.

The simple approach:
TOP = FILTER B BY count > V;

doesn't work... ERROR 1000: Error during parsing. Invalid alias: V in
{query: chararray,count: long}

Same with SPLIT.

How do I filter or split on the value of V?

Dave Viner

On Tue, Jun 29, 2010 at 3:00 PM, Dave Viner <da...@pobox.com> wrote:

> I don't quite understand this pig latin.  The piggybank
> function org.apache.pig.piggybank.evaluation.math.MIN takes 2 parameters
> which are compared.  Here's the sample I'm trying using the tutorial
> excitelog as a sample.
>
> X = LOAD 'samples/excite-small.log' USING PigStorage('\t') AS
> (user:chararray, time:long, query:chararray);
> Z = FILTER X BY query is not null;
> A = GROUP Z BY query;
> B = FOREACH A GENERATE group as query:chararray, COUNT(Z) as count:long;
> U = FOREACH B GENERATE *,
>     MIN(count) as min:long,
>     MAX(count) as max:long;
>
> This doesn't seem to work at all.  It dies with this error:
> ERROR 1022: Type mismatch merging schema prefix. Field Schema: double.
> Other Field Schema: min: long
>
> Changing the min:long and max:long to doubles (as suggested by the error
> message), causes this error:
> ERROR 1045: Could not infer the matching function for
> org.apache.pig.builtin.MIN as multiple or none of them fit. Please use an
> explicit cast.
>
> What am I missing in using the sample code you've provided?  I can't seem
> to get it to work...
>
> Thanks for your help.
> Dave Viner
>
>
> On Tue, Jun 29, 2010 at 10:17 AM, hc busy <hc...@gmail.com> wrote:
>
>> That's what I tried to say in my last email.I don't believe you can
>> calculate exactly the percentiles in just one pass. Writing out the pig
>> for
>> two pass algorithm should be easy enough..
>>
>> P = group TABLE all;
>> U = foreach P generate MIN(x) as min, MAX(x) as max;
>> V = foreach U generate min + (max-min)*0.95;
>>
>> would give you the 95th percentile cutoff, and u just filter or split by
>> V.
>>
>>
>> On Tue, Jun 29, 2010 at 10:03 AM, Dave Viner <da...@pobox.com> wrote:
>>
>> > How would I calculate the percentile in one pass?  In order to
calculate
>> > the
>> > percentile for each item, I need to know the total count.  How do I get
>> the
>> > total count, and then calculate each item's percentile in one pass?
>> >
>> > I don't mind doing multiple passes - I am just not sure how to make the
>> > calculation.
>> >
>> > Thanks
>> > Dave Viner
>> >
>> >
>> > On Tue, Jun 29, 2010 at 9:59 AM, hc busy <hc...@gmail.com> wrote:
>> >
>> > > I think it's impossible to do this within one M/R. You will want to
>> > > implement it in two M/R in Pig, because you have to calculate the
>> > > percentile
>> > > in pass 1, and then perform the filter in pass 2.
>> > >
>> > >
>> > > On Tue, Jun 29, 2010 at 8:14 AM, Dave Viner <da...@vinertech.com>
>> wrote:
>> > >
>> > > > Is there a UDF for generating the top X % of results?  For example,
>> in
>> > a
>> > > > log
>> > > > parsing context, it might be the set of search queries that
>> represent
>> > the
>> > > > top 80% of all queries.
>> > > >
>> > > > I see in the piggybank that there is a TOP function, but that only
>> > takes
>> > > > the
>> > > > top *number* of results, rather a percentile.
>> > > >
>> > > > Thanks
>> > > > Dave Viner
>> > > >
>> > >
>> >
>>
>
>


Re: UDF for generating top xx % of results?

Posted by Thejas Nair <te...@yahoo-inc.com>.
This Pig query will not give you the "top xx% of results". I.e., if xx = 50%
and results = 200, you may not get 100 rows in output.
If you really want the "top xx% of results", you need to do something like the
following -
(if you just want rows with count > min(count) + (max(count) - min(count))*0.95,
you can store and use the value of V the way TOTAL_ROWS is stored in this
example)

Pig query 1
-----------
L = load 'file' as (query, count); -- assuming that count is already there
for simplicity 
O = order L by count desc; -- order descending by the column that defines 'top'
G = group L all;
TOTAL_ROWS = foreach G generate COUNT(L);
store O into 'file.ordered';

 -- at present, there is no way to use the 'TOTAL_ROWS' directly to filter
the results 
store TOTAL_ROWS into 'total_rows.txt';



Pig query 2
-----------
-- assuming this query is invoked with -param TOP_ROWS=X*0.95, where X is
the result stored in total_rows.txt
L = load 'file.ordered';
LIM = limit L $TOP_ROWS;
Store LIM into 'output';
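The shape of the two queries above can be checked outside Pig: order by count, count the rows, keep the first N*xx of them. A hedged sketch (ranked descending here so that "top" means the largest counts; names and data are illustrative):

```python
# Pass 1: order the records and count them (TOTAL_ROWS).
# Pass 2: keep the first total * fraction rows ($TOP_ROWS), i.e. "limit".

def top_fraction_of_rows(records, fraction):
    ordered = sorted(records, key=lambda r: r[1], reverse=True)  # order by count
    top_rows = int(len(ordered) * fraction)                      # TOTAL_ROWS * xx
    return ordered[:top_rows]                                    # limit

queries = [("a", 50), ("b", 30), ("c", 10), ("d", 5)]
print(top_fraction_of_rows(queries, 0.5))  # exactly 2 of the 4 rows
```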



On 6/29/10 3:24 PM, "Dave Viner" <da...@pobox.com> wrote:

> Actually, I've gotten the first half of the code to work now.   Here's how
> it looks:
> 
> 
> X = LOAD 'samples/excite-small.log' USING PigStorage('\t') AS
> (user:chararray, time:long, query:chararray);
> Z = FILTER X BY query is not null;
> A = GROUP Z BY query;
> B = FOREACH A GENERATE group as query:chararray, COUNT(Z) as count:long;
> C = GROUP B ALL;
> U = FOREACH C GENERATE MIN(B.count) as min:long, MAX(B.count) as max:long;
> V = FOREACH U GENERATE min + (max-min)*0.95;
> 
> V is appropriately set... but I can't perform the final step of actually
> filtering the values by that count.
> 
> The simple approach:
> TOP = FILTER B BY count > V;
> 
> doesn't work... ERROR 1000: Error during parsing. Invalid alias: V in
> {query: chararray,count: long}
> 
> Same with SPLIT.
> 
> How do I filter or split on the value of V?
> 
> Dave Viner
> 
> On Tue, Jun 29, 2010 at 3:00 PM, Dave Viner <da...@pobox.com> wrote:
> 
>> I don't quite understand this pig latin.  The piggybank
>> function org.apache.pig.piggybank.evaluation.math.MIN takes 2 parameters
>> which are compared.  Here's the sample I'm trying using the tutorial
>> excitelog as a sample.
>> 
>> X = LOAD 'samples/excite-small.log' USING PigStorage('\t') AS
>> (user:chararray, time:long, query:chararray);
>> Z = FILTER X BY query is not null;
>> A = GROUP Z BY query;
>> B = FOREACH A GENERATE group as query:chararray, COUNT(Z) as count:long;
>> U = FOREACH B GENERATE *,
>>     MIN(count) as min:long,
>>     MAX(count) as max:long;
>> 
>> This doesn't seem to work at all.  It dies with this error:
>> ERROR 1022: Type mismatch merging schema prefix. Field Schema: double.
>> Other Field Schema: min: long
>> 
>> Changing the min:long and max:long to doubles (as suggested by the error
>> message), causes this error:
>> ERROR 1045: Could not infer the matching function for
>> org.apache.pig.builtin.MIN as multiple or none of them fit. Please use an
>> explicit cast.
>> 
>> What am I missing in using the sample code you've provided?  I can't seem
>> to get it to work...
>> 
>> Thanks for your help.
>> Dave Viner
>> 
>> 
>> On Tue, Jun 29, 2010 at 10:17 AM, hc busy <hc...@gmail.com> wrote:
>> 
>>> That's what I tried to say in my last email.I don't believe you can
>>> calculate exactly the percentiles in just one pass. Writing out the pig
>>> for
>>> two pass algorithm should be easy enough..
>>> 
>>> P = group TABLE all;
>>> U = foreach P generate MIN(x) as min, MAX(x) as max;
>>> V = foreach U generate min + (max-min)*0.95;
>>> 
>>> would give you the 95th percentile cutoff, and u just filter or split by
>>> V.
>>> 
>>> 
>>> On Tue, Jun 29, 2010 at 10:03 AM, Dave Viner <da...@pobox.com> wrote:
>>> 
>>>> How would I calculate the percentile in one pass?  In order to calculate
>>>> the
>>>> percentile for each item, I need to know the total count.  How do I get
>>> the
>>>> total count, and then calculate each item's percentile in one pass?
>>>> 
>>>> I don't mind doing multiple passes - I am just not sure how to make the
>>>> calculation.
>>>> 
>>>> Thanks
>>>> Dave Viner
>>>> 
>>>> 
>>>> On Tue, Jun 29, 2010 at 9:59 AM, hc busy <hc...@gmail.com> wrote:
>>>> 
>>>>> I think it's impossible to do this within one M/R. You will want to
>>>>> implement it in two M/R in Pig, because you have to calculate the
>>>>> percentile
>>>>> in pass 1, and then perform the filter in pass 2.
>>>>> 
>>>>> 
>>>>> On Tue, Jun 29, 2010 at 8:14 AM, Dave Viner <da...@vinertech.com>
>>> wrote:
>>>>> 
>>>>>> Is there a UDF for generating the top X % of results?  For example,
>>> in
>>>> a
>>>>>> log
>>>>>> parsing context, it might be the set of search queries that
>>> represent
>>>> the
>>>>>> top 80% of all queries.
>>>>>> 
>>>>>> I see in the piggybank that there is a TOP function, but that only
>>>> takes
>>>>>> the
>>>>>> top *number* of results, rather a percentile.
>>>>>> 
>>>>>> Thanks
>>>>>> Dave Viner
>>>>>> 
>>>>> 
>>>> 
>>> 
>> 
>> 


Re: UDF for generating top xx % of results?

Posted by Dave Viner <da...@pobox.com>.
Actually, I've gotten the first half of the code to work now. Here's how
it looks:


X = LOAD 'samples/excite-small.log' USING PigStorage('\t') AS
(user:chararray, time:long, query:chararray);
Z = FILTER X BY query is not null;
A = GROUP Z BY query;
B = FOREACH A GENERATE group as query:chararray, COUNT(Z) as count:long;
C = GROUP B ALL;
U = FOREACH C GENERATE MIN(B.count) as min:long, MAX(B.count) as max:long;
V = FOREACH U GENERATE min + (max-min)*0.95;

V is appropriately set... but I can't perform the final step of actually
filtering the values by that count.

The simple approach:
TOP = FILTER B BY count > V;

doesn't work... ERROR 1000: Error during parsing. Invalid alias: V in
{query: chararray,count: long}

Same with SPLIT.

How do I filter or split on the value of V?

Dave Viner

On Tue, Jun 29, 2010 at 3:00 PM, Dave Viner <da...@pobox.com> wrote:

> I don't quite understand this pig latin.  The piggybank
> function org.apache.pig.piggybank.evaluation.math.MIN takes 2 parameters
> which are compared.  Here's the sample I'm trying using the tutorial
> excitelog as a sample.
>
> X = LOAD 'samples/excite-small.log' USING PigStorage('\t') AS
> (user:chararray, time:long, query:chararray);
> Z = FILTER X BY query is not null;
> A = GROUP Z BY query;
> B = FOREACH A GENERATE group as query:chararray, COUNT(Z) as count:long;
> U = FOREACH B GENERATE *,
>     MIN(count) as min:long,
>     MAX(count) as max:long;
>
> This doesn't seem to work at all.  It dies with this error:
> ERROR 1022: Type mismatch merging schema prefix. Field Schema: double.
> Other Field Schema: min: long
>
> Changing the min:long and max:long to doubles (as suggested by the error
> message), causes this error:
> ERROR 1045: Could not infer the matching function for
> org.apache.pig.builtin.MIN as multiple or none of them fit. Please use an
> explicit cast.
>
> What am I missing in using the sample code you've provided?  I can't seem
> to get it to work...
>
> Thanks for your help.
> Dave Viner
>
>
> On Tue, Jun 29, 2010 at 10:17 AM, hc busy <hc...@gmail.com> wrote:
>
>> That's what I tried to say in my last email.I don't believe you can
>> calculate exactly the percentiles in just one pass. Writing out the pig
>> for
>> two pass algorithm should be easy enough..
>>
>> P = group TABLE all;
>> U = foreach P generate MIN(x) as min, MAX(x) as max;
>> V = foreach U generate min + (max-min)*0.95;
>>
>> would give you the 95th percentile cutoff, and u just filter or split by
>> V.
>>
>>
>> On Tue, Jun 29, 2010 at 10:03 AM, Dave Viner <da...@pobox.com> wrote:
>>
>> > How would I calculate the percentile in one pass?  In order to calculate
>> > the
>> > percentile for each item, I need to know the total count.  How do I get
>> the
>> > total count, and then calculate each item's percentile in one pass?
>> >
>> > I don't mind doing multiple passes - I am just not sure how to make the
>> > calculation.
>> >
>> > Thanks
>> > Dave Viner
>> >
>> >
>> > On Tue, Jun 29, 2010 at 9:59 AM, hc busy <hc...@gmail.com> wrote:
>> >
>> > > I think it's impossible to do this within one M/R. You will want to
>> > > implement it in two M/R in Pig, because you have to calculate the
>> > > percentile
>> > > in pass 1, and then perform the filter in pass 2.
>> > >
>> > >
>> > > On Tue, Jun 29, 2010 at 8:14 AM, Dave Viner <da...@vinertech.com>
>> wrote:
>> > >
>> > > > Is there a UDF for generating the top X % of results?  For example,
>> in
>> > a
>> > > > log
>> > > > parsing context, it might be the set of search queries that
>> represent
>> > the
>> > > > top 80% of all queries.
>> > > >
>> > > > I see in the piggybank that there is a TOP function, but that only
>> > takes
>> > > > the
>> > > > top *number* of results, rather a percentile.
>> > > >
>> > > > Thanks
>> > > > Dave Viner
>> > > >
>> > >
>> >
>>
>
>

Re: UDF for generating top xx % of results?

Posted by Dave Viner <da...@pobox.com>.
I don't quite understand this Pig Latin.  The piggybank
function org.apache.pig.piggybank.evaluation.math.MIN takes 2 parameters
which are compared.  Here's the sample I'm trying, using the tutorial
excite log as a sample.

X = LOAD 'samples/excite-small.log' USING PigStorage('\t') AS
(user:chararray, time:long, query:chararray);
Z = FILTER X BY query is not null;
A = GROUP Z BY query;
B = FOREACH A GENERATE group as query:chararray, COUNT(Z) as count:long;
U = FOREACH B GENERATE *,
    MIN(count) as min:long,
    MAX(count) as max:long;

This doesn't seem to work at all.  It dies with this error:
ERROR 1022: Type mismatch merging schema prefix. Field Schema: double. Other
Field Schema: min: long

Changing the min:long and max:long to doubles (as suggested by the error
message), causes this error:
ERROR 1045: Could not infer the matching function for
org.apache.pig.builtin.MIN as multiple or none of them fit. Please use an
explicit cast.

What am I missing in using the sample code you've provided?  I can't seem to
get it to work...

Thanks for your help.
Dave Viner


On Tue, Jun 29, 2010 at 10:17 AM, hc busy <hc...@gmail.com> wrote:

> That's what I tried to say in my last email.I don't believe you can
> calculate exactly the percentiles in just one pass. Writing out the pig for
> two pass algorithm should be easy enough..
>
> P = group TABLE all;
> U = foreach P generate MIN(x) as min, MAX(x) as max;
> V = foreach U generate min + (max-min)*0.95;
>
> would give you the 95th percentile cutoff, and u just filter or split by V.
>
>
> On Tue, Jun 29, 2010 at 10:03 AM, Dave Viner <da...@pobox.com> wrote:
>
> > How would I calculate the percentile in one pass?  In order to calculate
> > the
> > percentile for each item, I need to know the total count.  How do I get
> the
> > total count, and then calculate each item's percentile in one pass?
> >
> > I don't mind doing multiple passes - I am just not sure how to make the
> > calculation.
> >
> > Thanks
> > Dave Viner
> >
> >
> > On Tue, Jun 29, 2010 at 9:59 AM, hc busy <hc...@gmail.com> wrote:
> >
> > > I think it's impossible to do this within one M/R. You will want to
> > > implement it in two M/R in Pig, because you have to calculate the
> > > percentile
> > > in pass 1, and then perform the filter in pass 2.
> > >
> > >
> > > On Tue, Jun 29, 2010 at 8:14 AM, Dave Viner <da...@vinertech.com>
> wrote:
> > >
> > > > Is there a UDF for generating the top X % of results?  For example,
> in
> > a
> > > > log
> > > > parsing context, it might be the set of search queries that represent
> > the
> > > > top 80% of all queries.
> > > >
> > > > I see in the piggybank that there is a TOP function, but that only
> > takes
> > > > the
> > > > top *number* of results, rather a percentile.
> > > >
> > > > Thanks
> > > > Dave Viner
> > > >
> > >
> >
>

Re: UDF for generating top xx % of results?

Posted by hc busy <hc...@gmail.com>.
That's what I tried to say in my last email. I don't believe you can
calculate the percentiles exactly in just one pass. Writing out the Pig for the
two-pass algorithm should be easy enough.

P = group TABLE all;
U = foreach P generate MIN(x) as min, MAX(x) as max;
V = foreach U generate min + (max-min)*0.95;

would give you the 95th percentile cutoff, and you just filter or split by V.
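Note that this V is a value-based cutoff, a point 95% of the way from min to max, rather than a rank-based 95th percentile; with skewed counts the two can differ a lot. A small sketch of the arithmetic (data illustrative):

```python
# V = min + (max - min) * p keeps rows whose value exceeds a point p of the
# way from the smallest to the largest value, not the top (1-p) fraction of rows.

def cutoff_filter(counts, p):
    lo, hi = min(counts), max(counts)
    threshold = lo + (hi - lo) * p  # the V computed above
    return [c for c in counts if c > threshold]

counts = [1, 2, 3, 4, 100]
print(cutoff_filter(counts, 0.95))  # only 100 exceeds 1 + 99 * 0.95 = 95.05
```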


On Tue, Jun 29, 2010 at 10:03 AM, Dave Viner <da...@pobox.com> wrote:

> How would I calculate the percentile in one pass?  In order to calculate
> the
> percentile for each item, I need to know the total count.  How do I get the
> total count, and then calculate each item's percentile in one pass?
>
> I don't mind doing multiple passes - I am just not sure how to make the
> calculation.
>
> Thanks
> Dave Viner
>
>
> On Tue, Jun 29, 2010 at 9:59 AM, hc busy <hc...@gmail.com> wrote:
>
> > I think it's impossible to do this within one M/R. You will want to
> > implement it in two M/R in Pig, because you have to calculate the
> > percentile
> > in pass 1, and then perform the filter in pass 2.
> >
> >
> > On Tue, Jun 29, 2010 at 8:14 AM, Dave Viner <da...@vinertech.com> wrote:
> >
> > > Is there a UDF for generating the top X % of results?  For example, in
> a
> > > log
> > > parsing context, it might be the set of search queries that represent
> the
> > > top 80% of all queries.
> > >
> > > I see in the piggybank that there is a TOP function, but that only
> takes
> > > the
> > > top *number* of results, rather a percentile.
> > >
> > > Thanks
> > > Dave Viner
> > >
> >
>

Re: UDF for generating top xx % of results?

Posted by Dave Viner <da...@pobox.com>.
How would I calculate the percentile in one pass?  In order to calculate the
percentile for each item, I need to know the total count.  How do I get the
total count, and then calculate each item's percentile in one pass?

I don't mind doing multiple passes - I am just not sure how to make the
calculation.

Thanks
Dave Viner
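One way to make the two-pass calculation concrete: pass 1 sums the total query volume; pass 2 walks the queries in descending order of count and keeps them until the requested share of the total is covered. A hedged sketch (names and data illustrative, not a Pig UDF):

```python
# Pass 1: total volume across all queries.
# Pass 2: cumulative sum over counts sorted descending; stop at `share` of total.

def top_share(query_counts, share):
    total = sum(c for _, c in query_counts)  # pass 1
    kept, covered = [], 0
    for q, c in sorted(query_counts, key=lambda x: -x[1]):  # pass 2
        if covered >= share * total:
            break
        kept.append(q)
        covered += c
    return kept

data = [("pig", 60), ("hadoop", 25), ("udf", 10), ("misc", 5)]
print(top_share(data, 0.80))  # "pig" and "hadoop" cover 85% of the volume
```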


On Tue, Jun 29, 2010 at 9:59 AM, hc busy <hc...@gmail.com> wrote:

> I think it's impossible to do this within one M/R. You will want to
> implement it in two M/R in Pig, because you have to calculate the
> percentile
> in pass 1, and then perform the filter in pass 2.
>
>
> On Tue, Jun 29, 2010 at 8:14 AM, Dave Viner <da...@vinertech.com> wrote:
>
> > Is there a UDF for generating the top X % of results?  For example, in a
> > log
> > parsing context, it might be the set of search queries that represent the
> > top 80% of all queries.
> >
> > I see in the piggybank that there is a TOP function, but that only takes
> > the
> > top *number* of results, rather a percentile.
> >
> > Thanks
> > Dave Viner
> >
>

Re: UDF for generating top xx % of results?

Posted by hc busy <hc...@gmail.com>.
I think it's impossible to do this within one M/R. You will want to
implement it in two M/R in Pig, because you have to calculate the percentile
in pass 1, and then perform the filter in pass 2.


On Tue, Jun 29, 2010 at 8:14 AM, Dave Viner <da...@vinertech.com> wrote:

> Is there a UDF for generating the top X % of results?  For example, in a
> log
> parsing context, it might be the set of search queries that represent the
> top 80% of all queries.
>
> I see in the piggybank that there is a TOP function, but that only takes
> the
> top *number* of results, rather a percentile.
>
> Thanks
> Dave Viner
>