Posted to user@pig.apache.org by Arun Ahuja <aa...@gmail.com> on 2012/09/19 19:09:35 UTC

Counting elements in a bag

Looking for an elegant way to do this:

Suppose there is a bag with names { James, John, Lisa, Larry, Amanda,
Amanda, John, James, Lisa, John}
I'd like to get something back along the lines of a tuple (2, 2, 3, 1,
2) where those are the counts for Amanda, James, John, Larry, Lisa
respectively.

Obviously I could write a UDF to do this, but I want to ensure that
there are the same columns in every row, i.e. Bag { Amanda } gives me
(1, 0, 0, 0.. ). I could precompute the possible bag entries and pass
that along to the UDF, but is this the only possibility? Anything
better?

Thanks,

Arun

Re: Counting elements in a bag

Posted by Arun Ahuja <aa...@gmail.com>.

Right, but my problem is a little bit different.

My input is more along the lines of:

1: {James, John, Lisa, Larry, Amanda, Amanda, John, James, Lisa, John}
2: {Amanda, Lisa, Lisa}

and with output:

(2, 2, 3, 1, 2)
(1, 0, 0, 0, 2)

I've done it for now as a UDF where I precompute the full set of names
and pass it as an argument to the function.
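
The counting logic described above can be sketched in plain Python. This is only an illustration of the idea, not the UDF actually used in this thread: the names count_by_name and ALL_NAMES are hypothetical, the name list is hard-coded rather than precomputed from the data, and the glue needed to register it as a Pig UDF is left out.

```python
from collections import Counter


def count_by_name(bag, all_names):
    """Return one count per entry of all_names, in that order.

    bag       -- the names in one row's bag (may contain repeats)
    all_names -- the precomputed full list of names, in column order
    """
    counts = Counter(bag)
    # A Counter returns 0 for keys it has never seen, so names that are
    # absent from this bag still get a column.
    return tuple(counts[name] for name in all_names)


# Hypothetical precomputed name list; sorted so the column order is stable.
ALL_NAMES = ["Amanda", "James", "John", "Larry", "Lisa"]

rows = {
    1: ["James", "John", "Lisa", "Larry", "Amanda",
        "Amanda", "John", "James", "Lisa", "John"],
    2: ["Amanda", "Lisa", "Lisa"],
}

for key in sorted(rows):
    print(key, count_by_name(rows[key], ALL_NAMES))
```

Because the column order comes from the shared name list and not from the bag itself, every row yields a tuple of the same width, which is the property a plain group-and-count does not give.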

Re: Counting elements in a bag

Posted by Ruslan Al-Fakikh <me...@gmail.com>.

Sorry,

I meant:
or just
c = foreach b generate COUNT(a); --without group
to eliminate the keys

Re: Counting elements in a bag

Posted by Ruslan Al-Fakikh <me...@gmail.com>.

Hey, try this:

[cloudera@localhost workpig]$ cat input
James
John
Lisa
Larry
Amanda
Amanda
John
James
Lisa
John
[cloudera@localhost workpig]$ pig -x local
2012-09-20 13:35:06,225 [main] INFO  org.apache.pig.Main - Logging error messages to: /home/cloudera/workpig/pig_1348133706198.log
2012-09-20 13:35:06,524 [main] INFO  org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: file:///
grunt> a = load 'input';
grunt> b = group a by $0;
grunt> c = foreach b generate group, COUNT(a);
grunt> dump c;
(John,3)
(Lisa,2)
(James,2)
(Larry,1)
(Amanda,2)

or just
c = foreach b generate group, COUNT(a);
to eliminate the keys

Best regards,
Ruslan
