Posted to user@pig.apache.org by Dave Viner <da...@gmail.com> on 2010/06/01 20:31:17 UTC

cogroup and flattening optionally empty bags

I am having some trouble getting cogroup and flattening to work as I'd like.
The cogroup statement looks like:

cg = COGROUP A BY aid INNER,  B BY bid;

The cg group has rows in which the information in B may be empty (as
expected).  I'd like to output a series of rows each of which has the same
number of columns.  If the cg group has empty information for B, then it
should output either NULL or an empty string.  But, I can't seem to make it
work.


for_output = FOREACH cg
    GENERATE FLATTEN(A.aid) AS aid,
        FLATTEN(B.optional_b_col);

If the cogroup cg has empty values in the B bag, then there is no
corresponding row in for_output.
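
To illustrate why the row disappears, here is a small Python model of the behavior (it is not Pig code). It assumes that GENERATE with two FLATTENs behaves like a cross product of the two bags, so an empty bag contributes zero combinations. The function names and the sample data are hypothetical stand-ins, not part of Pig.

```python
from itertools import product

# Hypothetical sample relations: A has (aid), B has (bid, optional_b_col).
A = [(1,), (2,)]
B = [(1, "x")]

def cogroup_a_inner(a_rows, b_rows):
    """Group both relations by key, keeping only keys that occur in A."""
    groups = []
    for key in sorted({row[0] for row in a_rows}):
        a_bag = [row for row in a_rows if row[0] == key]
        b_bag = [row for row in b_rows if row[0] == key]
        groups.append((key, a_bag, b_bag))
    return groups

def flatten_two(bag_left, bag_right):
    """Model of GENERATE FLATTEN(x), FLATTEN(y): cross product of the bags."""
    return list(product(bag_left, bag_right))

cg = cogroup_a_inner(A, B)
for_output = []
for key, a_bag, b_bag in cg:
    aids = [row[0] for row in a_bag]      # models A.aid
    b_cols = [row[1] for row in b_bag]    # models B.optional_b_col
    for_output.extend(flatten_two(aids, b_cols))

# The group for aid 2 exists in cg, but its empty B bag yields no output row.
print(for_output)
```

In this model the cogroup still contains a group for aid 2; the row is lost only at the flatten step.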

How do I get the row to be added to for_output with an empty value for
"optional_b_col"?

I also tried something like:

for_output = FOREACH cg
    GENERATE FLATTEN(A.aid) AS aid,
        (B.optional_b_col IS NOT NULL ? B.optional_b_col : '');

But, this gives an error when trying to dump the results:
ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1050: Unsupported input type
for BinCond: left hand side: bag; right hand side: chararray


I imagine there must be some way to output empty strings; I just can't seem
to figure it out.

Thanks
Dave Viner

Re: cogroup and flattening optionally empty bags

Posted by hc busy <hc...@gmail.com>.
yeah, something like this should work:

o = foreach cg generate FLATTEN(A.a_column) as a_column,
((IsEmpty(B))?toBag(toTuple(null)):(B.b_column)) as B2;
o2 = foreach o generate a_column, FLATTEN(B2);


Although I seem to recall a more elegant way of doing this, it escapes
me at the moment ...
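
The idea behind the snippet above can be sketched in Python (a model of the semantics, not Pig code): before flattening, an empty bag is swapped for a bag holding a single null, so the cross product still produces one row. The helper `bag_or_null_bag` and the sample groups are hypothetical.

```python
from itertools import product

def bag_or_null_bag(bag):
    """Return the bag unchanged, or a one-element bag holding None if it is empty."""
    return bag if bag else [None]

# Hypothetical cogroup output: (group key, A.a_column bag, B.b_column bag).
cg = [
    (1, ["a1"], ["b1", "b2"]),
    (2, ["a2"], []),             # no matching rows in B
]

o2 = []
for _key, a_bag, b_bag in cg:
    b2 = bag_or_null_bag(b_bag)      # models the IsEmpty(B) ? ... : ... step
    o2.extend(product(a_bag, b2))    # models flattening both bags

print(o2)
```

Here the group with the empty B bag still yields a row, with None in the second column.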


How come you didn't try the outer join?

cg = JOIN A by aid LEFT OUTER, B by bid;


?
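
For comparison, this is roughly what an outer join that preserves every row of A would produce, modeled in Python rather than Pig. The function `outer_join_keep_a` and the sample data are hypothetical; the point is only that unmatched rows of A come out with a null in B's column.

```python
def outer_join_keep_a(a_rows, b_rows):
    """Join on the first field; rows of A without a match get None for B's column."""
    joined = []
    for (aid,) in a_rows:
        matches = [b_col for (bid, b_col) in b_rows if bid == aid]
        if not matches:
            joined.append((aid, None))
        else:
            joined.extend((aid, b_col) for b_col in matches)
    return joined

# Hypothetical sample relations: A has (aid), B has (bid, optional_b_col).
A = [(1,), (2,), (3,)]
B = [(1, "x"), (3, "y"), (3, "z")]

rows = outer_join_keep_a(A, B)
print(rows)
```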



Re: cogroup and flattening optionally empty bags

Posted by Scott Carey <sc...@richrelevance.com>.
On Jun 1, 2010, at 11:31 AM, Dave Viner wrote:

> I also tried something like:
> 
> for_output = FOREACH cg
>    GENERATE FLATTEN(A.aid) AS aid,
>        (B.optional_b_col IS NOT NULL ? B.optional_b_col : '');
> 

Given that something similar is in the documentation, you would expect it to work.  But it doesn't.

See http://hadoop.apache.org/pig/docs/r0.7.0/piglatin_ref2.html#Nulls. The second example after "Nulls and Constants" says you can do:
------
"In this example of an outer join, if the join key is missing from a table it is replaced by null."
A = LOAD 'student' AS (name: chararray, age: int, gpa: float);
B = LOAD 'votertab10k' AS (name: chararray, age: int, registration: chararray, donation: float);
C = COGROUP A BY name, B BY name;
D = FOREACH C GENERATE FLATTEN((IsEmpty(A) ? null : A)), FLATTEN((IsEmpty(B) ? null : B));
-------

But I have had trouble getting this to work as described, as an 'outer join' using COGROUP.
The techniques that hc busy mentions work, but they are clunky, and there aren't good alternatives that I know of at the moment.  I'd love to hear what the "official" way to do an outer join using COGROUP is.
FLATTEN hates being one of the sides of a conditional, so you can't do the intuitive:
  (IsEmpty(B.optional_b_col) ? null : FLATTEN(B.optional_b_col))
Instead you have to put the conditional inside FLATTEN, and replace the null with a 'bag of one tuple with one null field' so that FLATTEN returns a row containing a null instead of collapsing the row.  With no built-in way to produce the 'bag of one tuple with one null field' (as of 0.7), this means writing your own UDF.

OUTER JOIN is sometimes an option, but it isn't always an option, especially if you don't want to produce the cross product of the bags or need to do a more custom join.
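
A rough Python illustration of that difference, using hypothetical bags for a single key: a join emits every pairing of the two bags, while the cogrouped bags stay intact and can feed custom per-group logic. The "count of B matches" logic is only an example, not something from the thread.

```python
from itertools import product

# Hypothetical bags for a single key after a cogroup.
a_bag = ["a1", "a2", "a3"]
b_bag = ["b1", "b2"]

# A join emits every pairing: len(a_bag) * len(b_bag) rows.
join_rows = list(product(a_bag, b_bag))

# With the bags still separate, custom logic is possible, for example
# pairing each A value with just the number of B matches.
custom_rows = [(a, len(b_bag)) for a in a_bag]

print(len(join_rows), len(custom_rows))
```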


