Posted to user@pig.apache.org by Tony Burton <TB...@SportingIndex.com> on 2011/07/01 12:11:50 UTC

selecting DISTINCT using a subset of fields

Hello

My dataset has five fields. I want to select DISTINCT lines based on the first four fields, and then append the fifth field from the first line in each set of lines that share those four fields. Is this possible using Pig? The Pig Latin Reference Manual 2 page says "You cannot use DISTINCT on a subset of fields. To do this, use FOREACH...GENERATE to select the fields, and then use DISTINCT (see Example: Nested Block<http://pig.apache.org/docs/r0.8.0/piglatin_ref2.html#nestedblock>)", but I'm not sure how to adapt this example to my problem. My attempts so far, which did not use the nested FOREACH syntax, have resulted in either (1) duplicate lines repeated with their own previous fifth column or (2) duplicate lines repeated with the same fifth column; i.e. in each case, the output is not distinct on fields 1-4.

Is there a way to do this, or should I create a UDF that operates on lines GROUPed by columns 1-4?

Thanks,

Tony


**********************************************************************
This email and any attachments are confidential, protected by copyright and may be legally privileged.  If you are not the intended recipient, then the dissemination or copying of this email is prohibited. If you have received this in error, please notify the sender by replying by email and then delete the email completely from your system.  Neither Sporting Index nor the sender accepts responsibility for any virus, or any other defect which might affect any computer or IT system into which the email is received and/or opened.  It is the responsibility of the recipient to scan the email and no responsibility is accepted for any loss or damage arising in any way from receipt or use of this email.  Sporting Index Ltd is a company registered in England and Wales with company number 2636842, whose registered office is at Brookfield House, Green Lane, Ivinghoe, Leighton Buzzard, LU7 9ES.  Sporting Index Ltd is authorised and regulated by the UK Financial Services Authority (reg. no. 150404). Any financial promotion contained herein has been issued 
and approved by Sporting Index Ltd.

Outbound email has been scanned for viruses and SPAM

RE: selecting DISTINCT using a subset of fields

Posted by Tony Burton <TB...@SportingIndex.com>.
Hi - thanks for the suggestion. Exactly what I was looking for.

Tony


-----Original Message-----
From: Dmitriy Ryaboy [mailto:dvryaboy@gmail.com] 
Sent: 01 July 2011 13:20
To: user@pig.apache.org
Subject: Re: selecting DISTINCT using a subset of fields

foo = group my_relation by (a, b, c, d) parallel $p;
bar = foreach foo {
  first_e = limit my_relation.e 1;
  generate flatten(group) as (a, b, c, d), flatten(first_e) as e;
}

That should do the trick.


Re: selecting DISTINCT using a subset of fields

Posted by Dmitriy Ryaboy <dv...@gmail.com>.
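-- group on the four key fields; $p is a substituted parameter for the number of reducers
foo = group my_relation by (a, b, c, d) parallel $p;
bar = foreach foo {
  -- keep a single e value from each group's bag
  first_e = limit my_relation.e 1;
  -- emit one row per group: the four key fields plus that e
  generate flatten(group) as (a, b, c, d), flatten(first_e) as e;
}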

That should do the trick.
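
To check what output to expect on a small sample, the group-then-take-one logic above can be modelled outside Pig. The following is only a sketch in Python, not part of the original reply: the function name `distinct_on_first_four` and the sample rows are made up for illustration, and it assumes "first" means first in input order. Pig's LIMIT does not promise which tuple of an unordered bag it returns, so the Pig script may pick a different fifth field unless the bag is ordered first.

```python
from typing import Dict, Iterable, List, Tuple

Row = Tuple[str, str, str, str, str]
Key = Tuple[str, str, str, str]


def distinct_on_first_four(rows: Iterable[Row]) -> List[Row]:
    """Keep one row per distinct (a, b, c, d), taking e from the first row seen."""
    first_e: Dict[Key, str] = {}  # dicts preserve insertion order (Python 3.7+)
    for a, b, c, d, e in rows:
        # setdefault stores e only the first time this key appears
        first_e.setdefault((a, b, c, d), e)
    return [key + (e,) for key, e in first_e.items()]


# Hypothetical sample data: the first two rows share fields 1-4.
sample = [
    ("x", "1", "p", "q", "first"),
    ("x", "1", "p", "q", "second"),
    ("y", "2", "p", "q", "third"),
]
result = distinct_on_first_four(sample)
print(result)
```

Here the two rows sharing ("x", "1", "p", "q") collapse into one row that carries "first", which is the behaviour asked for in the original question.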
