Posted to user@pig.apache.org by Mark Stetzer <st...@gmail.com> on 2010/10/08 21:47:31 UTC

Percentage Counts?

I'm trying to count N-gram occurrences as a percentage of total
tuples, and I'm running into a problem that I assume has a simple
solution I'm not thinking of.  My script basically looks like:

log = LOAD blah AS (session_id:chararray, text:chararray...);
ngramed = FOREACH log GENERATE flatten(
org.apache.pig.tutorial.NGramGenerator(text) ) AS ngram;
grpd = GROUP ngramed BY ngram;
freq = FOREACH grpd GENERATE group AS ngram, COUNT(ngramed) AS count,
COUNT(ngramed) / X AS percent;
STORE freq INTO 'ngrams';

I'm trying to figure out how I can calculate X so that it represents
the total number of tuples in log.  I could "GROUP log ALL" and get a
count of that, but how do I reference it in my FOREACH statement?

Thanks for any help anyone can provide.

-Mark

Re: Percentage Counts?

Posted by Mark Stetzer <st...@gmail.com>.
I was afraid I'd have to do a join on a constant (using Pig 0.6 at the
moment).  That works wonderfully.  Thanks!

On Fri, Oct 8, 2010 at 6:15 PM, Dmitriy Ryaboy <dv...@gmail.com> wrote:
> In Pig 0.8, you can generate a one-row relation and later refer to it as a
> scalar:
>
> counts = foreach (group ngramed all) generate COUNT(ngramed) as total;
>
> percents = foreach grpd generate group as ngram, COUNT(ngramed) as count,
> (double) COUNT(ngramed) / (double) counts.total as percent;
>
> In earlier versions, the solution is to do a replicated join on a constant
> (ugly, I know):
> counts = foreach (group ngramed all) generate COUNT(ngramed) as total;
> grpd = join grpd by 1, counts by 1 using "replicated";
> percents = foreach grpd generate grpd::group as ngram, COUNT(grpd::ngramed)
> as count, (double) COUNT(grpd::ngramed) / (double) counts::total as percent;
>
> Untested, may break :)

Re: Percentage Counts?

Posted by Dmitriy Ryaboy <dv...@gmail.com>.
In Pig 0.8, you can generate a one-row relation and later refer to it as a
scalar:

counts = foreach (group ngramed all) generate COUNT(ngramed) as total;

percents = foreach grpd generate group as ngram, COUNT(ngramed) as count,
(double) COUNT(ngramed) / (double) counts.total as percent;
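
Put together with the original script, the scalar approach might look like
the sketch below for Pig 0.8 or later. The jar path, the input path, the
two-column schema, and the alias names all_ngrams, totals and cnt are
placeholders chosen here, not taken from the thread. The count is cast to
double because COUNT returns a long, and dividing one long by another would
truncate the result.

```pig
-- Sketch only: jar path, input path and schema below are placeholders.
REGISTER tutorial.jar;

log     = LOAD 'input_path' AS (session_id:chararray, text:chararray);
ngramed = FOREACH log GENERATE
              FLATTEN(org.apache.pig.tutorial.NGramGenerator(text)) AS ngram;

-- One-row relation holding the total number of n-grams.
all_ngrams = GROUP ngramed ALL;
totals     = FOREACH all_ngrams GENERATE COUNT(ngramed) AS total;

-- Per-n-gram counts; totals.total is read as a scalar (Pig 0.8+).
grpd = GROUP ngramed BY ngram;
freq = FOREACH grpd GENERATE
           group AS ngram,
           COUNT(ngramed) AS cnt,
           (double) COUNT(ngramed) / (double) totals.total AS percent;

STORE freq INTO 'ngrams';
```

This divides by the total number of n-grams. To divide by the number of
tuples in log instead, as the original question asked, build totals from
GROUP log ALL and COUNT(log). The result is a fraction between 0 and 1;
multiply by 100 if a true percentage is wanted.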

In earlier versions, the solution is to do a replicated join on a constant
(ugly, I know):
counts = foreach (group ngramed all) generate COUNT(ngramed) as total;
grpd = join grpd by 1, counts by 1 using "replicated";
percents = foreach grpd generate grpd::group as ngram, COUNT(grpd::ngramed)
as count, (double) COUNT(grpd::ngramed) / (double) counts::total as percent;

Untested, may break :)
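
For versions without scalar support, one variation is to compute the
per-n-gram counts first and then attach the one-row total with CROSS, so that
only small (ngram, count) tuples are combined with the total rather than
whole bags. This swaps CROSS in for the replicated join on a constant; it is
a sketch with placeholder paths and alias names (per_ngram, all_ngrams,
totals, crossed, cnt are chosen here), and it has not been run against
Pig 0.6.

```pig
-- Sketch only: jar path, input path and schema below are placeholders.
REGISTER tutorial.jar;

log     = LOAD 'input_path' AS (session_id:chararray, text:chararray);
ngramed = FOREACH log GENERATE
              FLATTEN(org.apache.pig.tutorial.NGramGenerator(text)) AS ngram;

-- Count each n-gram before combining with the total.
grpd      = GROUP ngramed BY ngram;
per_ngram = FOREACH grpd GENERATE group AS ngram, COUNT(ngramed) AS cnt;

-- One-row relation holding the total number of n-grams.
all_ngrams = GROUP ngramed ALL;
totals     = FOREACH all_ngrams GENERATE COUNT(ngramed) AS total;

-- CROSS with a one-row relation appends the total to every row.
crossed = CROSS per_ngram, totals;
freq    = FOREACH crossed GENERATE
              per_ngram::ngram AS ngram,
              per_ngram::cnt   AS cnt,
              (double) per_ngram::cnt / (double) totals::total AS percent;

STORE freq INTO 'ngrams';
```

Because totals has exactly one row, the cross product has the same number of
rows as per_ngram.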

