Posted to user@pig.apache.org by Mohit Anchlia <mo...@gmail.com> on 2012/04/11 22:53:19 UTC

DISTINCT with 2 fields in a tuple

I am trying to get the distinct values of 2 fields in a record, something like
"select distinct a, b from c;" in SQL. I wrote the following in Pig, but it is
not working:


A = LOAD '/examples/form_out/part-m-00000' USING PigStorage('\t') AS
(FILE_NAME:chararray,FORM_ID:chararray,SET_ID:chararray);

B = foreach A {dist = DISTINCT A.FORM_ID, A.SET_ID; GENERATE dist;}

ERROR 1000: Error during parsing. Invalid alias: A in {FILE_NAME: chararray
...

This doesn't seem to work. I thought A was a tuple and FORM_ID and SET_ID were
fields that I could apply DISTINCT to. I saw a similar example online, but not
exactly the same.

Re: DISTINCT with 2 fields in a tuple

Posted by Gianmarco De Francisci Morales <gd...@apache.org>.
Exactly like you posted.

Cheers,
--
Gianmarco



On Thu, Apr 12, 2012 at 16:55, Mohit Anchlia <mo...@gmail.com> wrote:

> How can I do a DISTINCT with foreach? Is it 2 separate statements, like the
> ones I posted, or something different?

Re: DISTINCT with 2 fields in a tuple

Posted by Mohit Anchlia <mo...@gmail.com>.
How can I do a DISTINCT with foreach? Is it 2 separate statements, like the
ones I posted, or something different?

On Thu, Apr 12, 2012 at 7:49 AM, Gianmarco De Francisci Morales <
gdfm@apache.org> wrote:

> Hi,
>
> DISTINCT with the foreach is more efficient than grouping. As long as you
> don't need the rest of the data, you are better off with this solution.
>
> With the syntax A.FORM_ID, A.SET_ID you are invoking scalar projection;
> that is, you are telling Pig to treat the relation A as a single scalar
> value. The right syntax is the first one (without the "A." in front).

Re: DISTINCT with 2 fields in a tuple

Posted by Gianmarco De Francisci Morales <gd...@apache.org>.
Hi,

DISTINCT with the foreach is more efficient than grouping. As long as you
don't need the rest of the data, you are better off with this solution.

With the syntax A.FORM_ID, A.SET_ID you are invoking scalar projection;
that is, you are telling Pig to treat the relation A as a single scalar
value. The right syntax is the first one (without the "A." in front).
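
As a minimal sketch of the two-statement approach, assuming the same input
path and schema as in the original post: project the two fields first, then
apply DISTINCT to the resulting relation. The aliases B and C follow the
earlier message; the DUMP at the end is only there for illustration.

```pig
-- Load the tab-separated input with the schema from the original post.
A = LOAD '/examples/form_out/part-m-00000' USING PigStorage('\t')
    AS (FILE_NAME:chararray, FORM_ID:chararray, SET_ID:chararray);

-- Project only the two fields of interest. There is no "A." prefix:
-- inside FOREACH A, the fields are referenced by their bare names.
B = FOREACH A GENERATE FORM_ID, SET_ID;

-- DISTINCT compares whole tuples of a relation, so this keeps one copy
-- of each (FORM_ID, SET_ID) pair, like SELECT DISTINCT a, b FROM c.
C = DISTINCT B;

DUMP C;
```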

Cheers,
--
Gianmarco



On Wed, Apr 11, 2012 at 23:06, Mohit Anchlia <mo...@gmail.com> wrote:

> Thanks, I tried something like this and it worked, but I have one more
> question:
>
>
> grunt> B = foreach A GENERATE FORM_ID, SET_ID;
>
> grunt> C = DISTINCT B;
>
> What's the difference between "foreach A GENERATE FORM_ID, SET_ID;" and
> "foreach A GENERATE A.FORM_ID, A.SET_ID;"? To me they look the same, but the
> results are different.

Re: DISTINCT with 2 fields in a tuple

Posted by Mohit Anchlia <mo...@gmail.com>.
Thanks, I tried something like this and it worked, but I have one more
question:


grunt> B = foreach A GENERATE FORM_ID, SET_ID;

grunt> C = DISTINCT B;

What's the difference between "foreach A GENERATE FORM_ID, SET_ID;" and
"foreach A GENERATE A.FORM_ID, A.SET_ID;"? To me they look the same, but the
results are different.

On Wed, Apr 11, 2012 at 1:57 PM, Prashant Kommireddi <pr...@gmail.com> wrote:

> You are doing a DISTINCT on a tuple, and not a bag?
>
> In your example, DISTINCT on a field of each record/tuple would not make
> sense, as it's always a single value. You need to group by a certain key
> before a DISTINCT.

Re: DISTINCT with 2 fields in a tuple

Posted by Prashant Kommireddi <pr...@gmail.com>.
You are doing a DISTINCT on a tuple, and not a bag?

In your example, DISTINCT on a field of each record/tuple would not make
sense, as it's always a single value. You need to group by a certain key
before a DISTINCT.
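
To illustrate the point about bags, here is a hedged sketch of where a nested
DISTINCT does apply: after a GROUP, each record carries a bag of the original
tuples, and DISTINCT can be run on a projection of that bag. Grouping by
FILE_NAME is an assumption chosen only for illustration, and the aliases G,
pairs, and R are hypothetical.

```pig
A = LOAD '/examples/form_out/part-m-00000' USING PigStorage('\t')
    AS (FILE_NAME:chararray, FORM_ID:chararray, SET_ID:chararray);

-- After GROUP, each record of G is (group, A), where A is now a bag
-- holding all input tuples that share the same FILE_NAME.
G = GROUP A BY FILE_NAME;

R = FOREACH G {
    -- Here A names the bag inside the current group, so "A." is valid.
    pairs = A.(FORM_ID, SET_ID);
    dist  = DISTINCT pairs;
    GENERATE group AS FILE_NAME, dist;
};

DUMP R;
```

Note that this yields the distinct pairs per file rather than across the whole
input, so it answers a slightly different question from the original
"select distinct a, b from c".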



Re: DISTINCT with 2 fields in a tuple

Posted by Mehmet Tepedelenlioglu <me...@yahoo.com>.
Just group on those 2 fields. The 'group' field of the output will contain all the
distinct combinations. That is, of course, if that is what you wanted to do in the first place.
So no 'DISTINCT' is really necessary.
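
A minimal sketch of this grouping approach, assuming the schema from the
original post; the aliases G and D are hypothetical:

```pig
A = LOAD '/examples/form_out/part-m-00000' USING PigStorage('\t')
    AS (FILE_NAME:chararray, FORM_ID:chararray, SET_ID:chararray);

-- Group on the pair of fields; the 'group' field of each output record
-- is a (FORM_ID, SET_ID) tuple, one record per distinct combination.
G = GROUP A BY (FORM_ID, SET_ID);

-- Keep only the key; FLATTEN turns the tuple back into two columns.
D = FOREACH G GENERATE FLATTEN(group) AS (FORM_ID, SET_ID);

DUMP D;
```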
