Posted to user@pig.apache.org by "Rodriguez, John" <jr...@verisign.com> on 2010/08/01 16:48:43 UTC

RE: dereference bag of tuples of fields

Does this mean there is no way to access the fields t1, t2, t3?

 

cat data

{(1,1,1)}

{(2,2,2)(3,3,3)}

{(4,4,4)(5,5,5)(6,6,6)}

A = LOAD 'data' AS (B: bag {T: tuple(t1:int, t2:int, t3:int)});

 

 

From: Scott Carey [mailto:scott@richrelevance.com] 
Sent: Saturday, July 31, 2010 9:39 AM
To: pig-user@hadoop.apache.org; Rodriguez, John
Subject: Re: dereference bag of tuples of fields

 

data.isValid

All bags are bags of tuples. The tuple is intrinsic and invisible at
the syntax level, although it is visible to UDFs. If you nest one more
tuple in that nested tuple, Pig gets confused. So 'bag.field' is
actually a double dereference: one for the bag and one for the
intrinsic tuple.
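
As a minimal sketch of the double dereference described above, assuming the schema from the original question and that MyLoader is already registered, the field can be projected straight off the bag alias without naming the tuple t. The alias V is hypothetical:

```pig
-- Load with the schema from the original question (MyLoader is the poster's own loader).
A = LOAD '/data/NetFlowDigests/rk/DigestMessage/part-r-00000' USING MyLoader()
    AS (data: bag{t: tuple(isValid:int)});

-- Write data.isValid, not data.t.isValid: the tuple level is implicit.
V = FOREACH A GENERATE data.isValid;
DESCRIBE V;
```

The result of data.isValid is itself a bag (one single-field tuple per tuple in data), not a scalar int.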

----- Reply message -----
From: "Rodriguez, John" <jr...@verisign.com>
Date: Fri, Jul 30, 2010 3:11 pm
Subject: dereference bag of tuples of fields
To: "pig-user@hadoop.apache.org" <pi...@hadoop.apache.org>

I have built a bag of tuples where the tuples contain fields.

 

I am reading SequenceFiles and have written MyLoader to do this. I
created a subset of all the fields, "isValid", to make the example
simpler.

 

I am not sure how to apply a dereference operator to this?

 

A = LOAD '/data/NetFlowDigests/rk/DigestMessage/part-r-00000' using
MyLoader() AS (data: bag{t: tuple(isValid:int)});

DESCRIBE A;

A: {data: {t: (isValid: int)}}

 

So all the ways that I have tried to dereference have syntax errors.

 

B = GROUP A BY (data.t);

2010-07-30 21:51:29,881 [main] ERROR org.apache.pig.tools.grunt.Grunt -
ERROR 1028: Access to the tuple (t) of the bag is disallowed. Only
access to the elements of the tuple in the bag is allowed.

 

B = GROUP A BY (data.t.isValid);

2010-07-30 21:54:11,157 [main] ERROR org.apache.pig.tools.grunt.Grunt -
ERROR 1028: Access to the tuple (t) of the bag is disallowed. Only
access to the elements of the tuple in the bag is allowed.

 

B = GROUP A BY (t.isValid);

2010-07-30 21:55:31,475 [main] ERROR org.apache.pig.tools.grunt.Grunt -
ERROR 1000: Error during parsing. Invalid alias: t in {data: {t:
(isValid: int)}}

 

What is the proper way to do this?

 

John Rodriguez

Re: dereference bag of tuples of fields

Posted by Xiaomeng Wan <sh...@gmail.com>.

Try FLATTENing the loaded bags first.
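
A rough sketch of that suggestion, using the 'data' file and schema from the quoted message. The aliases F and C and the AS clause are illustrative assumptions, not something posted in the thread:

```pig
A = LOAD 'data' AS (B: bag {T: tuple(t1:int, t2:int, t3:int)});

-- FLATTEN un-nests the bag: one output row per tuple in B.
F = FOREACH A GENERATE FLATTEN(B) AS (t1:int, t2:int, t3:int);

-- t1 and t2 are now plain ints, so the subtraction type-checks.
C = FOREACH F GENERATE t1 - t2;
DUMP C;
```

Note that flattening changes the row count: an input row whose bag holds three tuples becomes three rows in F.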

On Mon, Aug 2, 2010 at 1:35 PM, Rodriguez, John <jr...@verisign.com> wrote:
> And an expression like the one below in the GENERATE does not work with a bag dereference.
>
> grunt> A = LOAD 'data' AS (B: bag {T: tuple(t1:int, t2:int, t3:int)});
> grunt> C = FOREACH A GENERATE B.t1 - B.t2;
> grunt> dump C;
> 2010-08-02 19:32:41,226 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1039: In alias C, incompatible types in Subtract Operator left hand side:bag right hand side:bag

RE: dereference bag of tuples of fields

Posted by "Rodriguez, John" <jr...@verisign.com>.

And an expression like the one below in the GENERATE does not work with a bag dereference.

grunt> A = LOAD 'data' AS (B: bag {T: tuple(t1:int, t2:int, t3:int)});
grunt> C = FOREACH A GENERATE B.t1 - B.t2;
grunt> dump C;
2010-08-02 19:32:41,226 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1039: In alias C, incompatible types in Subtract Operator left hand side:bag right hand side:bag


RE: dereference bag of tuples of fields

Posted by "Rodriguez, John" <jr...@verisign.com>.

Thanks all, for your help.

The "generate" syntax you showed now works.

But if I do this:

X = GROUP A BY B.t1;
DUMP X;

Then I get the following error:

2010-08-02 09:48:54,064 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1068: Using Bag as key not supported.

Pig Stack Trace
---------------
ERROR 1068: Using Bag as key not supported.

org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1066: Unable to open iterator for alias X
        at org.apache.pig.PigServer.openIterator(PigServer.java:521)
        at org.apache.pig.tools.grunt.GruntParser.processDump(GruntParser.java:544)
        at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:241)
        at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:162)
        at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:138)
        at org.apache.pig.tools.grunt.Grunt.run(Grunt.java:75)
        at org.apache.pig.Main.main(Main.java:357)
Caused by: org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1002: Unable to store alias X
        at org.apache.pig.PigServer.store(PigServer.java:577)
        at org.apache.pig.PigServer.openIterator(PigServer.java:504)
        ... 6 more
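
One possible way around ERROR 1068, sketched under the assumption that one group per distinct t1 value is what is wanted: flatten the bag first so that the grouping key is an int rather than a bag. The aliases F and G are hypothetical:

```pig
A = LOAD 'data' AS (B: bag {T: tuple(t1:int, t2:int, t3:int)});

-- B.t1 is a bag and cannot be a group key; un-nest it into scalar fields first.
F = FOREACH A GENERATE FLATTEN(B) AS (t1:int, t2:int, t3:int);

-- The key is now a scalar int.
G = GROUP F BY t1;
DUMP G;
```

This groups individual tuples from all bags together, so the original row boundaries are lost unless another field is carried along to identify them.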

-----Original Message-----
From: Ashutosh Chauhan [mailto:ashutosh.chauhan@gmail.com] 
Sent: Sunday, August 01, 2010 12:19 PM
To: pig-user@hadoop.apache.org
Subject: Re: dereference bag of tuples of fields

If you are loading data through PigStorage (which will be used if you
don't specify a loader) then there should be a comma separating tuples in
the bag, so your data should look like

cat data
{(1,1,1)}
{(2,2,2),(3,3,3)}
{(4,4,4),(5,5,5),(6,6,6)}

then
grunt> A = LOAD 'data' AS (B: bag {T: tuple(t1:int, t2:int, t3:int)});
grunt> C = foreach A generate B.t1, B.t2, B.t3;
grunt> dump C;

({(1)},{(1)},{(1)})
({(2),(3)},{(2),(3)},{(2),(3)})
({(4),(5),(6)},{(4),(5),(6)},{(4),(5),(6)})


Ashutosh

Re: dereference bag of tuples of fields

Posted by Ashutosh Chauhan <as...@gmail.com>.

If you are loading data through PigStorage (which will be used if you
don't specify a loader) then there should be a comma separating tuples in
the bag, so your data should look like

cat data
{(1,1,1)}
{(2,2,2),(3,3,3)}
{(4,4,4),(5,5,5),(6,6,6)}

then
grunt> A = LOAD 'data' AS (B: bag {T: tuple(t1:int, t2:int, t3:int)});
grunt> C = foreach A generate B.t1, B.t2, B.t3;
grunt> dump C;

({(1)},{(1)},{(1)})
({(2),(3)},{(2),(3)},{(2),(3)})
({(4),(5),(6)},{(4),(5),(6)},{(4),(5),(6)})


Ashutosh
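
Because each projection such as B.t1 is still a bag, it can be passed to functions that take a bag argument. A small sketch, assuming the comma-separated 'data' file above; the alias S and the output field names are made up for illustration:

```pig
A = LOAD 'data' AS (B: bag {T: tuple(t1:int, t2:int, t3:int)});

-- SUM and COUNT are Pig built-ins that operate on a bag.
S = FOREACH A GENERATE SUM(B.t1) AS t1_total, COUNT(B) AS n;
DUMP S;
```

For the second input row, for example, t1_total would be 2 + 3. This keeps one output row per input row, unlike FLATTEN.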