You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@pig.apache.org by Mark <st...@gmail.com> on 2012/05/11 02:23:48 UTC

Please help with grouped count

We have logs in the following format

us, foo
us, foo
fr, fizz
us, bar
fr, baz
fr, fizz
us, foo
fr, fizz

Where the first column is a country and the second column is a search term.

How in the world can I output the country followed by the top terms in 
order of occurrence... ie:

us, (foo, bar)      # Top term for 'us' is foo then bar then ...
fr, (fizz, baz)      # Top term for 'fr' is fizz then baz then ...

Thanks



Re: Please help with grouped count

Posted by Jonathan Coveney <jc...@gmail.com>.
You would have to create one, yeah.

2012/5/11 Mark <st...@gmail.com>

> Not entirely following you on the Order UDF.. doesn one exist or are you
> proposing I create this myself?
>
> Thanks.
>
>
> On 5/11/12 10:49 AM, Jonathan Coveney wrote:
>
>> The nesting and the non-nested are indeed the same. To limit, you could
>> do:
>>
>> b = group a by (country, search_term);
>> c = foreach b generate flatten(group) as (country, search_term), COUNT(a)
>> as ct;
>> d = foreach (group c by country) generate ORDER_UDF(TOP(b, 2, 10)); --I
>> forget the order of commands, check documentation
>>
>> the ORDER_UDF doesn't exist. It's important to remember that bags have no
>> order, so if you want a TUPLE in term order, you need to take in a bag,
>> and
>> then get the elements, sort them, and put them in a Tuple. If you need
>> help
>> with that, less us know.
>>
>> 2012/5/11 Mark<static.void.dev@gmail.com**>
>>
>>  Also, using your example, how could I limit the number of terms per
>>> country?
>>>
>>>
>>> On 5/11/12 9:47 AM, Mark wrote:
>>>
>>>  Thank you so much, that's pretty much what I was going for but with a
>>>> slightly different output.
>>>>
>>>> Just to be clear... are these equivalent?
>>>>
>>>> b = foreach (group a by (country, search_term)) generate flatten(group)
>>>> as
>>>> (country, search_term), COUNT(a) as ct;
>>>>
>>>>
>>>> b = group a by (country, search_term);
>>>> c = foreach b generate flatten(group) as (country, search_term),
>>>> COUNT(a)
>>>> as ct;
>>>>
>>>> I'm guessing so... I didn't know you could combine/nest these
>>>> statements.
>>>>
>>>>
>>>> After experimenting with your example I'm pretty sure I understand
>>>> everything that's going on. I can work with this format but I was
>>>> wondering
>>>> how would I massage this into something like:
>>>>
>>>> (country1, top term1, topterm2, topterm3, ...)
>>>> (country2, top term1, topterm2, topterm3, ...)
>>>> (country3, top term1, topterm2, topterm3, ...)
>>>>
>>>> Maybe it has to be something like this:
>>>>
>>>> (country1, (top term1, topterm2, topterm3, ...))
>>>>
>>>> So one row per country with the first value being the country and the
>>>> following values the top terms in order? Is this even possible with Pig?
>>>>
>>>> Thanks for the clarification.
>>>>
>>>>
>>>> On 5/10/12 5:32 PM, Jonathan Coveney wrote:
>>>>
>>>>  a = load 'log' as (country:chararray, search_term:chararray);
>>>>> b = foreach (group a by (country, search_term)) generate flatten(group)
>>>>> as
>>>>> (country, search_term), COUNT(a) as ct;
>>>>> c = order b by country asc, ct desc;
>>>>>
>>>>> It sort of depends what format you want the output in, though. Note: if
>>>>> you
>>>>> know that the number of search terms is low you could do this in memory
>>>>> and
>>>>> do it in one m/r job, but this version will be scalable.
>>>>>
>>>>> If this solution doesn't make sense, I can help explain it. It's
>>>>> important
>>>>> to know what format you want the output in. This would give you every
>>>>> country (in ascending alphabetical order), and then the search term and
>>>>> count starting with the highest.
>>>>>
>>>>> 2012/5/10 Mark<static.void.dev@gmail.com****>
>>>>>
>>>>>  We have logs in the following format
>>>>>
>>>>>> us, foo
>>>>>> us, foo
>>>>>> fr, fizz
>>>>>> us, bar
>>>>>> fr, baz
>>>>>> fr, fizz
>>>>>> us, foo
>>>>>> fr, fizz
>>>>>>
>>>>>> Where the first column is a country and the second column is a search
>>>>>> term.
>>>>>>
>>>>>> How in the world can I output the country followed by the top terms in
>>>>>> order of occurrence... ie:
>>>>>>
>>>>>> us, (foo, bar)      # Top term for 'us' is foo then bar then ...
>>>>>> fr, (fizz, baz)      # Top term for 'fr' is fizz then baz then ...
>>>>>>
>>>>>> Thanks
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>

Re: Please help with grouped count

Posted by Mark <st...@gmail.com>.
Not entirely following you on the Order UDF.. doesn one exist or are you 
proposing I create this myself?

Thanks.

On 5/11/12 10:49 AM, Jonathan Coveney wrote:
> The nesting and the non-nested are indeed the same. To limit, you could do:
>
> b = group a by (country, search_term);
> c = foreach b generate flatten(group) as (country, search_term), COUNT(a)
> as ct;
> d = foreach (group c by country) generate ORDER_UDF(TOP(b, 2, 10)); --I
> forget the order of commands, check documentation
>
> the ORDER_UDF doesn't exist. It's important to remember that bags have no
> order, so if you want a TUPLE in term order, you need to take in a bag, and
> then get the elements, sort them, and put them in a Tuple. If you need help
> with that, less us know.
>
> 2012/5/11 Mark<st...@gmail.com>
>
>> Also, using your example, how could I limit the number of terms per
>> country?
>>
>>
>> On 5/11/12 9:47 AM, Mark wrote:
>>
>>> Thank you so much, that's pretty much what I was going for but with a
>>> slightly different output.
>>>
>>> Just to be clear... are these equivalent?
>>>
>>> b = foreach (group a by (country, search_term)) generate flatten(group) as
>>> (country, search_term), COUNT(a) as ct;
>>>
>>>
>>> b = group a by (country, search_term);
>>> c = foreach b generate flatten(group) as (country, search_term), COUNT(a)
>>> as ct;
>>>
>>> I'm guessing so... I didn't know you could combine/nest these statements.
>>>
>>>
>>> After experimenting with your example I'm pretty sure I understand
>>> everything that's going on. I can work with this format but I was wondering
>>> how would I massage this into something like:
>>>
>>> (country1, top term1, topterm2, topterm3, ...)
>>> (country2, top term1, topterm2, topterm3, ...)
>>> (country3, top term1, topterm2, topterm3, ...)
>>>
>>> Maybe it has to be something like this:
>>>
>>> (country1, (top term1, topterm2, topterm3, ...))
>>>
>>> So one row per country with the first value being the country and the
>>> following values the top terms in order? Is this even possible with Pig?
>>>
>>> Thanks for the clarification.
>>>
>>>
>>> On 5/10/12 5:32 PM, Jonathan Coveney wrote:
>>>
>>>> a = load 'log' as (country:chararray, search_term:chararray);
>>>> b = foreach (group a by (country, search_term)) generate flatten(group)
>>>> as
>>>> (country, search_term), COUNT(a) as ct;
>>>> c = order b by country asc, ct desc;
>>>>
>>>> It sort of depends what format you want the output in, though. Note: if
>>>> you
>>>> know that the number of search terms is low you could do this in memory
>>>> and
>>>> do it in one m/r job, but this version will be scalable.
>>>>
>>>> If this solution doesn't make sense, I can help explain it. It's
>>>> important
>>>> to know what format you want the output in. This would give you every
>>>> country (in ascending alphabetical order), and then the search term and
>>>> count starting with the highest.
>>>>
>>>> 2012/5/10 Mark<static.void.dev@gmail.com**>
>>>>
>>>>   We have logs in the following format
>>>>> us, foo
>>>>> us, foo
>>>>> fr, fizz
>>>>> us, bar
>>>>> fr, baz
>>>>> fr, fizz
>>>>> us, foo
>>>>> fr, fizz
>>>>>
>>>>> Where the first column is a country and the second column is a search
>>>>> term.
>>>>>
>>>>> How in the world can I output the country followed by the top terms in
>>>>> order of occurrence... ie:
>>>>>
>>>>> us, (foo, bar)      # Top term for 'us' is foo then bar then ...
>>>>> fr, (fizz, baz)      # Top term for 'fr' is fizz then baz then ...
>>>>>
>>>>> Thanks
>>>>>
>>>>>
>>>>>
>>>>>

Re: Please help with grouped count

Posted by Jonathan Coveney <jc...@gmail.com>.
The nesting and the non-nested are indeed the same. To limit, you could do:

b = group a by (country, search_term);
c = foreach b generate flatten(group) as (country, search_term), COUNT(a)
as ct;
d = foreach (group c by country) generate ORDER_UDF(TOP(b, 2, 10)); --I
forget the order of commands, check documentation

the ORDER_UDF doesn't exist. It's important to remember that bags have no
order, so if you want a TUPLE in term order, you need to take in a bag, and
then get the elements, sort them, and put them in a Tuple. If you need help
with that, less us know.

2012/5/11 Mark <st...@gmail.com>

> Also, using your example, how could I limit the number of terms per
> country?
>
>
> On 5/11/12 9:47 AM, Mark wrote:
>
>> Thank you so much, that's pretty much what I was going for but with a
>> slightly different output.
>>
>> Just to be clear... are these equivalent?
>>
>> b = foreach (group a by (country, search_term)) generate flatten(group) as
>> (country, search_term), COUNT(a) as ct;
>>
>>
>> b = group a by (country, search_term);
>> c = foreach b generate flatten(group) as (country, search_term), COUNT(a)
>> as ct;
>>
>> I'm guessing so... I didn't know you could combine/nest these statements.
>>
>>
>> After experimenting with your example I'm pretty sure I understand
>> everything that's going on. I can work with this format but I was wondering
>> how would I massage this into something like:
>>
>> (country1, top term1, topterm2, topterm3, ...)
>> (country2, top term1, topterm2, topterm3, ...)
>> (country3, top term1, topterm2, topterm3, ...)
>>
>> Maybe it has to be something like this:
>>
>> (country1, (top term1, topterm2, topterm3, ...))
>>
>> So one row per country with the first value being the country and the
>> following values the top terms in order? Is this even possible with Pig?
>>
>> Thanks for the clarification.
>>
>>
>> On 5/10/12 5:32 PM, Jonathan Coveney wrote:
>>
>>> a = load 'log' as (country:chararray, search_term:chararray);
>>> b = foreach (group a by (country, search_term)) generate flatten(group)
>>> as
>>> (country, search_term), COUNT(a) as ct;
>>> c = order b by country asc, ct desc;
>>>
>>> It sort of depends what format you want the output in, though. Note: if
>>> you
>>> know that the number of search terms is low you could do this in memory
>>> and
>>> do it in one m/r job, but this version will be scalable.
>>>
>>> If this solution doesn't make sense, I can help explain it. It's
>>> important
>>> to know what format you want the output in. This would give you every
>>> country (in ascending alphabetical order), and then the search term and
>>> count starting with the highest.
>>>
>>> 2012/5/10 Mark<static.void.dev@gmail.com**>
>>>
>>>  We have logs in the following format
>>>>
>>>> us, foo
>>>> us, foo
>>>> fr, fizz
>>>> us, bar
>>>> fr, baz
>>>> fr, fizz
>>>> us, foo
>>>> fr, fizz
>>>>
>>>> Where the first column is a country and the second column is a search
>>>> term.
>>>>
>>>> How in the world can I output the country followed by the top terms in
>>>> order of occurrence... ie:
>>>>
>>>> us, (foo, bar)      # Top term for 'us' is foo then bar then ...
>>>> fr, (fizz, baz)      # Top term for 'fr' is fizz then baz then ...
>>>>
>>>> Thanks
>>>>
>>>>
>>>>
>>>>

Re: Please help with grouped count

Posted by Mark <st...@gmail.com>.
Also, using your example, how could I limit the number of terms per country?

On 5/11/12 9:47 AM, Mark wrote:
> Thank you so much, that's pretty much what I was going for but with a 
> slightly different output.
>
> Just to be clear... are these equivalent?
>
> b = foreach (group a by (country, search_term)) generate 
> flatten(group) as
> (country, search_term), COUNT(a) as ct;
>
>
> b = group a by (country, search_term);
> c = foreach b generate flatten(group) as (country, search_term), 
> COUNT(a) as ct;
>
> I'm guessing so... I didn't know you could combine/nest these statements.
>
>
> After experimenting with your example I'm pretty sure I understand 
> everything that's going on. I can work with this format but I was 
> wondering how would I massage this into something like:
>
> (country1, top term1, topterm2, topterm3, ...)
> (country2, top term1, topterm2, topterm3, ...)
> (country3, top term1, topterm2, topterm3, ...)
>
> Maybe it has to be something like this:
>
> (country1, (top term1, topterm2, topterm3, ...))
>
> So one row per country with the first value being the country and the 
> following values the top terms in order? Is this even possible with Pig?
>
> Thanks for the clarification.
>
>
> On 5/10/12 5:32 PM, Jonathan Coveney wrote:
>> a = load 'log' as (country:chararray, search_term:chararray);
>> b = foreach (group a by (country, search_term)) generate 
>> flatten(group) as
>> (country, search_term), COUNT(a) as ct;
>> c = order b by country asc, ct desc;
>>
>> It sort of depends what format you want the output in, though. Note: 
>> if you
>> know that the number of search terms is low you could do this in 
>> memory and
>> do it in one m/r job, but this version will be scalable.
>>
>> If this solution doesn't make sense, I can help explain it. It's 
>> important
>> to know what format you want the output in. This would give you every
>> country (in ascending alphabetical order), and then the search term and
>> count starting with the highest.
>>
>> 2012/5/10 Mark<st...@gmail.com>
>>
>>> We have logs in the following format
>>>
>>> us, foo
>>> us, foo
>>> fr, fizz
>>> us, bar
>>> fr, baz
>>> fr, fizz
>>> us, foo
>>> fr, fizz
>>>
>>> Where the first column is a country and the second column is a 
>>> search term.
>>>
>>> How in the world can I output the country followed by the top terms in
>>> order of occurrence... ie:
>>>
>>> us, (foo, bar)      # Top term for 'us' is foo then bar then ...
>>> fr, (fizz, baz)      # Top term for 'fr' is fizz then baz then ...
>>>
>>> Thanks
>>>
>>>
>>>

Re: Please help with grouped count

Posted by Mark <st...@gmail.com>.
Thank you so much, that's pretty much what I was going for but with a 
slightly different output.

Just to be clear... are these equivalent?

b = foreach (group a by (country, search_term)) generate flatten(group) as
(country, search_term), COUNT(a) as ct;


b = group a by (country, search_term);
c = foreach b generate flatten(group) as (country, search_term), COUNT(a) as ct;

I'm guessing so... I didn't know you could combine/nest these statements.


After experimenting with your example I'm pretty sure I understand 
everything that's going on. I can work with this format but I was 
wondering how would I massage this into something like:

(country1, top term1, topterm2, topterm3, ...)
(country2, top term1, topterm2, topterm3, ...)
(country3, top term1, topterm2, topterm3, ...)

Maybe it has to be something like this:

(country1, (top term1, topterm2, topterm3, ...))

So one row per country with the first value being the country and the 
following values the top terms in order? Is this even possible with Pig?

Thanks for the clarification.


On 5/10/12 5:32 PM, Jonathan Coveney wrote:
> a = load 'log' as (country:chararray, search_term:chararray);
> b = foreach (group a by (country, search_term)) generate flatten(group) as
> (country, search_term), COUNT(a) as ct;
> c = order b by country asc, ct desc;
>
> It sort of depends what format you want the output in, though. Note: if you
> know that the number of search terms is low you could do this in memory and
> do it in one m/r job, but this version will be scalable.
>
> If this solution doesn't make sense, I can help explain it. It's important
> to know what format you want the output in. This would give you every
> country (in ascending alphabetical order), and then the search term and
> count starting with the highest.
>
> 2012/5/10 Mark<st...@gmail.com>
>
>> We have logs in the following format
>>
>> us, foo
>> us, foo
>> fr, fizz
>> us, bar
>> fr, baz
>> fr, fizz
>> us, foo
>> fr, fizz
>>
>> Where the first column is a country and the second column is a search term.
>>
>> How in the world can I output the country followed by the top terms in
>> order of occurrence... ie:
>>
>> us, (foo, bar)      # Top term for 'us' is foo then bar then ...
>> fr, (fizz, baz)      # Top term for 'fr' is fizz then baz then ...
>>
>> Thanks
>>
>>
>>

Re: Please help with grouped count

Posted by Jonathan Coveney <jc...@gmail.com>.
a = load 'log' as (country:chararray, search_term:chararray);
b = foreach (group a by (country, search_term)) generate flatten(group) as
(country, search_term), COUNT(a) as ct;
c = order b by country asc, ct desc;

It sort of depends what format you want the output in, though. Note: if you
know that the number of search terms is low you could do this in memory and
do it in one m/r job, but this version will be scalable.

If this solution doesn't make sense, I can help explain it. It's important
to know what format you want the output in. This would give you every
country (in ascending alphabetical order), and then the search term and
count starting with the highest.

2012/5/10 Mark <st...@gmail.com>

> We have logs in the following format
>
> us, foo
> us, foo
> fr, fizz
> us, bar
> fr, baz
> fr, fizz
> us, foo
> fr, fizz
>
> Where the first column is a country and the second column is a search term.
>
> How in the world can I output the country followed by the top terms in
> order of occurrence... ie:
>
> us, (foo, bar)      # Top term for 'us' is foo then bar then ...
> fr, (fizz, baz)      # Top term for 'fr' is fizz then baz then ...
>
> Thanks
>
>
>