Posted to user@pig.apache.org by Kevin Weil <ke...@gmail.com> on 2008/10/20 07:36:21 UTC

issue loading bags in the types branch

Hi,

I'm trying to analyze a dataset (in the pig-types branch) whose records look
like (string, number, bag { string, number }).

In my load function, what should the AS clause for my bag look like?  I'm
doing

... AS (site: chararray, count: int, itemCounts: bag { itemCountsTuple:
tuple (type: chararray, typeCount: int) } )

This parses, and seems to work for some things, but I think there's an issue
down the line with naming the bag's inner tuple.  I'd rather NOT name the
inner tuple, but bag { type: chararray, typeCount: int } doesn't parse, and
neither does bag:{tuple(type: chararray, typeCount:int)}, which is the
syntax suggested on the "TrunkToTypesChanges" wiki page
<http://wiki.apache.org/pig/TrunkToTypesChanges>.


My problem comes when I try to flatten this bag.  If I load the data into
'a' and do

b = FOREACH a GENERATE site, count, FLATTEN(itemCounts)

and then dump b, the data looks like a flat list of four elements as it
should.  However, my schema appears to be messed up.  The schema is

b: {site: chararray,count: integer,itemCounts::itemCountsTuple: (type:
chararray, typeCount: int)}

That is, the itemCounts::itemCountsTuple variable still appears to have a
tuple structure!  Once again, this is NOT borne out when I dump the data --
the data itself is flat.  However, I have to refer to the variable as
itemCounts::itemCountsTuple.type in order for any statement to parse, and if
I ever do a FILTER b BY itemCounts::itemCountsTuple.type EQ 'blah' I get an
exception stemming from Pig's attempt to cast a String to a Tuple in
POProject.java (result.res = (Tuple)ret; on line 277 of POProject.java in my
checkout).  I think this is related to the strange post-flatten schema,
because FILTER works in other cases.
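
One way to picture the mismatch described above is the small Python model
below. It is only an illustrative sketch, not Pig code: the declared schema
still says the third column is a tuple, while the row is flat, so projecting
a field out of that column fails much like the (Tuple) cast does. The sample
row values and the helper name project_nested are hypothetical.

```python
# Hypothetical model of the schema/data mismatch; not Pig internals.

# Schema as reported for b: the third column is still declared as a tuple.
declared_schema = [
    ("site", "chararray"),
    ("count", "int"),
    ("itemCounts::itemCountsTuple",
     ("tuple", [("type", "chararray"), ("typeCount", "int")])),
]

# A row as DUMP shows it: four flat values (made-up sample data).
row = ("example.com", 10, "blah", 3)


def project_nested(row, schema, column, field):
    """Project `field` out of `column`, trusting the declared schema."""
    index = [name for name, _ in schema].index(column)
    col_type = schema[index][1]
    if not isinstance(col_type, tuple):
        raise ValueError(f"{column!r} is not declared as a tuple")
    _, inner = col_type
    value = row[index]
    if not isinstance(value, tuple):
        # Analogous to the failed (Tuple) cast: the schema promises a tuple,
        # but the flattened data holds a plain string at this position.
        raise TypeError(
            f"expected a tuple for {column!r}, got {type(value).__name__}"
        )
    inner_index = [name for name, _ in inner].index(field)
    return value[inner_index]
```

Calling project_nested(row, declared_schema, "itemCounts::itemCountsTuple",
"type") on the flat row raises TypeError, because the value at that position
is the string "blah" rather than a tuple.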

This is blocking us entirely for now, and it hopefully is just user error on
my part.  Thanks in advance for any help you can offer!

Kevin

Re: issue loading bags in the types branch

Posted by Kevin Weil <ke...@gmail.com>.
Santhosh,

Great, that makes sense.  I wanted to be sure that what I was experiencing
was indeed a bug before I filed a spurious JIRA ticket.  We are working
around this for now, but are definitely looking forward to seeing the issue
fixed.

Keep up the good work all -- the types branch will be a fantastic release.

Kevin

On Tue, Oct 21, 2008 at 11:44 AM, Santhosh Srinivasan <sm...@yahoo-inc.com> wrote:

> Kevin,
>
> By definition, bags are containers of tuples. As a result, the parser
> does not allow you to declare a bag without specifying the tuple inside
> the bag. We need a JIRA to fix the issue regarding naming the tuple
> inside the bag.
>
> Currently, the Pig front-end is not consistent in the way schemas for
> bags are handled. When a column is flattened, the expected behaviour is
> to remove one level of indirection if it's a tuple and two levels of
> indirection if it's a bag, i.e., to expose the elements of the tuple
> and the elements of the tuple inside the bag, respectively. Because of
> this inconsistency, when you flatten the bag you are seeing the inner
> tuple in the schema instead of its contents (i.e., type and typeCount).
>
> There is a JIRA to track this issue:
> https://issues.apache.org/jira/browse/PIG-449
> This bug has to be resolved in order to unblock you.
>
> Santhosh

RE: issue loading bags in the types branch

Posted by Santhosh Srinivasan <sm...@yahoo-inc.com>.
Kevin,

By definition, bags are containers of tuples. As a result, the parser
does not allow you to declare a bag without specifying the tuple inside
the bag. We need a JIRA to fix the issue regarding naming the tuple
inside the bag.

Currently, the Pig front-end is not consistent in the way schemas for
bags are handled. When a column is flattened, the expected behaviour is
to remove one level of indirection if it's a tuple and two levels of
indirection if it's a bag, i.e., to expose the elements of the tuple
and the elements of the tuple inside the bag, respectively. Because of
this inconsistency, when you flatten the bag you are seeing the inner
tuple in the schema instead of its contents (i.e., type and typeCount).
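
The expected behaviour described above can be sketched as a small Python
model. This is an illustration only, not Pig's implementation: the type
encoding, the function name flatten_schema, and the assumption that
flattened fields are prefixed with the column name and "::" are all
hypothetical choices made for the sketch.

```python
def flatten_schema(name, col_type):
    """Return the (name, type) columns produced by flattening one column.

    Types are modeled as: a string for a scalar, ("tuple", fields) for a
    tuple, and ("bag", inner_name, fields) for a bag holding one named
    inner tuple. `fields` is a list of (field_name, field_type) pairs.
    """
    if isinstance(col_type, str):
        # Scalars have nothing to unwrap.
        return [(name, col_type)]
    if col_type[0] == "tuple":
        # One level of indirection removed: the tuple itself.
        fields = col_type[1]
    elif col_type[0] == "bag":
        # Two levels removed: the bag, then the tuple inside it.
        fields = col_type[2]
    else:
        raise ValueError(f"unknown type kind: {col_type[0]!r}")
    # Assumed naming: each exposed field is prefixed with the column name.
    return [(f"{name}::{field}", ftype) for field, ftype in fields]


# The bag from the original message, in the sketch's encoding.
item_counts = (
    "bag",
    "itemCountsTuple",
    [("type", "chararray"), ("typeCount", "int")],
)
flattened = flatten_schema("itemCounts", item_counts)
```

Under these assumptions, flattening itemCounts yields two scalar columns,
itemCounts::type and itemCounts::typeCount, with no itemCountsTuple level
left in the schema.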

There is a JIRA to track this issue:
https://issues.apache.org/jira/browse/PIG-449
This bug has to be resolved in order to unblock you.

Santhosh 
