You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@pig.apache.org by Michael Lok <fu...@gmail.com> on 2012/01/25 08:19:49 UTC

Distinct by column, generate tuple

Hi folks,

I've got a dataset as below:

10,234324234,NAME 1,3
10,346464646,NAME 1,3
10,438389232,NAME 1,3
20,397383737,NAME 2,4
20,383783234,NAME 2,4
20,387382828,NAME 2,4
20,309323333,NAME 2,4
30,439378283,NAME 3,2
30,010191923,NAME 3,2
40,439837434,NAME 4,4
40,383723443,NAME 4,4
40,100182321,NAME 4,4
40,992173732,NAME 4,4

I'd like to just print out the distinct records by column 1.  Here's
what I have:

A = group FULL by $0;

B = foreach FULL {
        C0 = FULL.$0;
        UC0 = DISTINCT C0;
        generate group, COUNT(UC0);
};

The script above prints out only the first column and count (not
really required).  But I need to print out just a single tuple for
each of the distinct row.

Is this possible?

Any help is greatly appreciated.


Thanks!

Re: Distinct by column, generate tuple

Posted by Michael Lok <fu...@gmail.com>.
Hi Grig,

Thanks for the code.  Works like a charm!

On Fri, Jan 27, 2012 at 6:23 AM, Grig Gheorghiu
<gr...@gmail.com> wrote:
> If you just want to obtain a sample value for column #1 and column #2
> associated with each unique value for column #0, this worked for me on
> your dataset:
>
> TEST = LOAD 'test.txt.gz' USING PigStorage(',') as (
>        id1: chararray,
>        id2: chararray,
>        name: chararray,
>        cnt: chararray);
>
> G = GROUP TEST BY id1;
> R = FOREACH G GENERATE FLATTEN(group), MAX(TEST.id2), MAX(TEST.name);
> DUMP R;
>
> I got
>
> (10,438389232,NAME 1)
> (20,397383737,NAME 2)
> (30,439378283,NAME 3)
> (40,992173732,NAME 4)
>
>
> You can replace MAX with some other aggregate function that works on
> strings or numbers.
>
> Grig
>
>
> On Tue, Jan 24, 2012 at 11:55 PM, Michael Lok <fu...@gmail.com> wrote:
>> Hi Prashant,
>>
>> You're partially correct.  Based on the distinct values of the 1st
>> column, I also need to grab the other columns for each distinct
>> record.  Referring back to my original data set; the output should be:
>>
>> 10,234324234,NAME 1
>> 20,397383737,NAME 2
>> 30,439378283,NAME 3
>> 40,439837434,NAME 4
>>
>> Thanks.
>>
>> On Wed, Jan 25, 2012 at 3:48 PM, Prashant Kommireddi
>> <pr...@gmail.com> wrote:
>>> Hi Michael,
>>>
>>> If I understand correctly you are trying to get the distinct 1st column
>>> elements from the dataset? Something like this:
>>>
>>> grunt> A = load 'aaa' using PigStorage(',');
>>> grunt> B = foreach A GENERATE $0;
>>> grunt> C = DISTINCT B;
>>> grunt> DUMP C;
>>>
>>> Thanks,
>>> Prashant
>>>
>>> On Tue, Jan 24, 2012 at 11:19 PM, Michael Lok <fu...@gmail.com> wrote:
>>>
>>>> Hi folks,
>>>>
>>>> I've got a dataset as below:
>>>>
>>>> 10,234324234,NAME 1,3
>>>> 10,346464646,NAME 1,3
>>>> 10,438389232,NAME 1,3
>>>> 20,397383737,NAME 2,4
>>>> 20,383783234,NAME 2,4
>>>> 20,387382828,NAME 2,4
>>>> 20,309323333,NAME 2,4
>>>> 30,439378283,NAME 3,2
>>>> 30,010191923,NAME 3,2
>>>> 40,439837434,NAME 4,4
>>>> 40,383723443,NAME 4,4
>>>> 40,100182321,NAME 4,4
>>>> 40,992173732,NAME 4,4
>>>>
>>>> I'd like to just print out the distinct records by column 1.  Here's
>>>> what I have:
>>>>
>>>> A = group FULL by $0;
>>>>
>>>> B = foreach FULL {
>>>>        C0 = FULL.$0;
>>>>        UC0 = DISTINCT C0;
>>>>        generate group, COUNT(UC0);
>>>> };
>>>>
>>>> The script above prints out only the first column and count (not
>>>> really required).  But I need to print out just a single tuple for
>>>> each of the distinct row.
>>>>
>>>> Is this possible?
>>>>
>>>> Any help is greatly appreciated.
>>>>
>>>>
>>>> Thanks!
>>>>

Re: Distinct by column, generate tuple

Posted by Grig Gheorghiu <gr...@gmail.com>.
If you just want to obtain a sample value for column #1 and column #2
associated with each unique value for column #0, this worked for me on
your dataset:

TEST = LOAD 'test.txt.gz' USING PigStorage(',') as (
	id1: chararray,
	id2: chararray,
	name: chararray,
	cnt: chararray);

G = GROUP TEST BY id1;
R = FOREACH G GENERATE FLATTEN(group), MAX(TEST.id2), MAX(TEST.name);
DUMP R;

I got

(10,438389232,NAME 1)
(20,397383737,NAME 2)
(30,439378283,NAME 3)
(40,992173732,NAME 4)


You can replace MAX with some other aggregate function that works on
strings or numbers.

Grig


On Tue, Jan 24, 2012 at 11:55 PM, Michael Lok <fu...@gmail.com> wrote:
> Hi Prashant,
>
> You're partially correct.  Based on the distinct values of the 1st
> column, I also need to grab the other columns for each distinct
> record.  Referring back to my original data set; the output should be:
>
> 10,234324234,NAME 1
> 20,397383737,NAME 2
> 30,439378283,NAME 3
> 40,439837434,NAME 4
>
> Thanks.
>
> On Wed, Jan 25, 2012 at 3:48 PM, Prashant Kommireddi
> <pr...@gmail.com> wrote:
>> Hi Michael,
>>
>> If I understand correctly you are trying to get the distinct 1st column
>> elements from the dataset? Something like this:
>>
>> grunt> A = load 'aaa' using PigStorage(',');
>> grunt> B = foreach A GENERATE $0;
>> grunt> C = DISTINCT B;
>> grunt> DUMP C;
>>
>> Thanks,
>> Prashant
>>
>> On Tue, Jan 24, 2012 at 11:19 PM, Michael Lok <fu...@gmail.com> wrote:
>>
>>> Hi folks,
>>>
>>> I've got a dataset as below:
>>>
>>> 10,234324234,NAME 1,3
>>> 10,346464646,NAME 1,3
>>> 10,438389232,NAME 1,3
>>> 20,397383737,NAME 2,4
>>> 20,383783234,NAME 2,4
>>> 20,387382828,NAME 2,4
>>> 20,309323333,NAME 2,4
>>> 30,439378283,NAME 3,2
>>> 30,010191923,NAME 3,2
>>> 40,439837434,NAME 4,4
>>> 40,383723443,NAME 4,4
>>> 40,100182321,NAME 4,4
>>> 40,992173732,NAME 4,4
>>>
>>> I'd like to just print out the distinct records by column 1.  Here's
>>> what I have:
>>>
>>> A = group FULL by $0;
>>>
>>> B = foreach FULL {
>>>        C0 = FULL.$0;
>>>        UC0 = DISTINCT C0;
>>>        generate group, COUNT(UC0);
>>> };
>>>
>>> The script above prints out only the first column and count (not
>>> really required).  But I need to print out just a single tuple for
>>> each of the distinct row.
>>>
>>> Is this possible?
>>>
>>> Any help is greatly appreciated.
>>>
>>>
>>> Thanks!
>>>

Re: Distinct by column, generate tuple

Posted by Michael Lok <fu...@gmail.com>.
Hi Prashant,

You're partially correct.  Based on the distinct values of the 1st
column, I also need to grab the other columns for each distinct
record.  Referring back to my original data set; the output should be:

10,234324234,NAME 1
20,397383737,NAME 2
30,439378283,NAME 3
40,439837434,NAME 4

Thanks.

On Wed, Jan 25, 2012 at 3:48 PM, Prashant Kommireddi
<pr...@gmail.com> wrote:
> Hi Michael,
>
> If I understand correctly you are trying to get the distinct 1st column
> elements from the dataset? Something like this:
>
> grunt> A = load 'aaa' using PigStorage(',');
> grunt> B = foreach A GENERATE $0;
> grunt> C = DISTINCT B;
> grunt> DUMP C;
>
> Thanks,
> Prashant
>
> On Tue, Jan 24, 2012 at 11:19 PM, Michael Lok <fu...@gmail.com> wrote:
>
>> Hi folks,
>>
>> I've got a dataset as below:
>>
>> 10,234324234,NAME 1,3
>> 10,346464646,NAME 1,3
>> 10,438389232,NAME 1,3
>> 20,397383737,NAME 2,4
>> 20,383783234,NAME 2,4
>> 20,387382828,NAME 2,4
>> 20,309323333,NAME 2,4
>> 30,439378283,NAME 3,2
>> 30,010191923,NAME 3,2
>> 40,439837434,NAME 4,4
>> 40,383723443,NAME 4,4
>> 40,100182321,NAME 4,4
>> 40,992173732,NAME 4,4
>>
>> I'd like to just print out the distinct records by column 1.  Here's
>> what I have:
>>
>> A = group FULL by $0;
>>
>> B = foreach FULL {
>>        C0 = FULL.$0;
>>        UC0 = DISTINCT C0;
>>        generate group, COUNT(UC0);
>> };
>>
>> The script above prints out only the first column and count (not
>> really required).  But I need to print out just a single tuple for
>> each of the distinct row.
>>
>> Is this possible?
>>
>> Any help is greatly appreciated.
>>
>>
>> Thanks!
>>

Re: Distinct by column, generate tuple

Posted by Prashant Kommireddi <pr...@gmail.com>.
Hi Michael,

If I understand correctly you are trying to get the distinct 1st column
elements from the dataset? Something like this:

grunt> A = load 'aaa' using PigStorage(',');
grunt> B = foreach A GENERATE $0;
grunt> C = DISTINCT B;
grunt> DUMP C;

Thanks,
Prashant

On Tue, Jan 24, 2012 at 11:19 PM, Michael Lok <fu...@gmail.com> wrote:

> Hi folks,
>
> I've got a dataset as below:
>
> 10,234324234,NAME 1,3
> 10,346464646,NAME 1,3
> 10,438389232,NAME 1,3
> 20,397383737,NAME 2,4
> 20,383783234,NAME 2,4
> 20,387382828,NAME 2,4
> 20,309323333,NAME 2,4
> 30,439378283,NAME 3,2
> 30,010191923,NAME 3,2
> 40,439837434,NAME 4,4
> 40,383723443,NAME 4,4
> 40,100182321,NAME 4,4
> 40,992173732,NAME 4,4
>
> I'd like to just print out the distinct records by column 1.  Here's
> what I have:
>
> A = group FULL by $0;
>
> B = foreach FULL {
>        C0 = FULL.$0;
>        UC0 = DISTINCT C0;
>        generate group, COUNT(UC0);
> };
>
> The script above prints out only the first column and count (not
> really required).  But I need to print out just a single tuple for
> each of the distinct row.
>
> Is this possible?
>
> Any help is greatly appreciated.
>
>
> Thanks!
>