Posted to user@pig.apache.org by Dexin Wang <wa...@gmail.com> on 2010/12/31 01:35:39 UTC

FLATTEN eats null rows?

It seems that after FLATTEN, the rows with null values get dropped.

I have two test files:

% cat test1.txt
1 a b
2 c d
3 e f

% cat test2.txt
1 x
2 y
6 z
8 w

I'm trying to cogroup the two on the first column:

A = LOAD 'test1.txt' AS (id, f1, f2);
B = LOAD 'test2.txt' AS (id, f3);
C = COGROUP A BY id, B BY id;
DUMP C;

(1,{(1,a,b)},{(1,x)})
(2,{(2,c,d)},{(2,y)})
(3,{(3,e,f)},{})
(6,{},{(6,z)})
(8,{},{(8,w)})

D = FOREACH C GENERATE group, FLATTEN(A.(f1, f2)), FLATTEN(B.f3);
DUMP D;

(1,a,b,x)
(2,c,d,y)

E = FOREACH C GENERATE group, A.(f1, f2), B.f3;
DUMP E;

(1,{(a,b)},{(x)})
(2,{(c,d)},{(y)})
(3,{(e,f)},{})
(6,{},{(z)})
(8,{},{(w)})

You can see that with FLATTEN, the rows with null values are all missing (in
D). Without FLATTEN, as in E, I get all the rows, but not flattened,
obviously. What I want as the end result is:

(1,a,b,x)
(2,c,d,y)
(3,e,f,{})
(6,,,z)
(8,,,w)

How can I get that? Thanks.

Dexin

P.S.

I realize I could do a FULL JOIN, but the problem is that after the join I
wouldn't know which id is null. I would have to write many if-then conditions
in the subsequent GENERATE statement, and I hope I can avoid that. E.g.,

C = JOIN A BY id FULL, B BY id;
DUMP C;
(1,a,b,1,x)
(2,c,d,2,y)
(3,e,f,,)
(,,,6,z)
(,,,8,w)

DESCRIBE C;
C: {A::id: bytearray,A::f1: bytearray,A::f2: bytearray,B::id:
bytearray,B::f3: bytearray}

Sometimes A::id is null, sometimes B::id is null; I only ever want the
non-null id in my output.

Re: FLATTEN eats null rows?

Posted by Dexin Wang <wa...@gmail.com>.
Thanks. Both worked fine.

I think I'll write a MyFlatten that doesn't drop rows with an empty bag. Say
you want to COGROUP 3 or more bags: you would have to do many COGROUPs or
JOINs, then use IsEmpty or bincond every time. Instead, with MyFlatten, I would do:

X = COGROUP A BY id, B BY id, C BY id, D BY id;
Y = FOREACH X GENERATE group, FLATTEN(A.(f1, f2)), FLATTEN(B.(f3,f4,f5)),
FLATTEN(C.f6), FLATTEN(D.f7);

The code would be a lot more concise and cleaner.

On Thu, Dec 30, 2010 at 6:46 PM, Thejas M Nair <te...@yahoo-inc.com> wrote:

>
>
>
> On 12/30/10 4:35 PM, "Dexin Wang" <wa...@gmail.com> wrote:
>
> > Seems after FLATTEN, the rows with null values get dropped.
> >
> What you are seeing is the expected/documented behavior of flatten -
> http://pig.apache.org/docs/r0.8.0/piglatin_ref2.html#Flatten+Operator
> "Note that the flatten of empty bag will result in that row being
> discarded"
> (Note that it's 'empty bag', not 'null').
>
>
> >
> > You see if I do FLATTEN, all the rows with null values are all missing
> (in
> > D). If I don't do FLATTEN, as in E, I have all the rows but not
> flattened,
> > obviously. What I want as the end result is:
> >
> > (1,a,b,x)
> > (2,c,d,y)
> > (3,e,f,{})
> > (6,,,z)
> > (8,,,w)
> >
> > How can I get that? Thanks.
> >
>
>  D = FOREACH C GENERATE group,  FLATTEN((IsEmpty(A) ? null : A.(f1,f2) )),
> FLATTEN((IsEmpty(B) ? null : B.f3));
>
>
> > I realize I could do FULL JOIN, but the problem is that after join, I
> > wouldn't know which id is null, I would have to do many if then in the
> > following generate command and I hope I can avoid that. E.g.,
> >
> > C = JOIN A BY id FULL, B BY id;
> > DUMP C
> > (1,a,b,1,x)
> > (2,c,d,2,y)
> > (3,e,f,,)
> > (,,,6,z)
> > (,,,8,w)
> >
>
>
>
> > DESCRIBE C;
> > C: {A::id: bytearray,A::f1: bytearray,A::f2: bytearray,B::id:
> > bytearray,B::f3: bytearray}
> >
> > Sometimes A::id is null, sometimes B::id null, I always only want the
> > non-null id in my output.
> >
>
> You can get this by using the conditional expression (called bincond in pig
> documents) (? : ).
>
> E = foreach C generate (A::id is null ? B::id : A::id), A::f1, A::f2,
> B::f3;
>
> -Thejas
>
>
>
>

Re: FLATTEN eats null rows?

Posted by Thejas M Nair <te...@yahoo-inc.com>.


On 12/30/10 4:35 PM, "Dexin Wang" <wa...@gmail.com> wrote:

> Seems after FLATTEN, the rows with null values get dropped.
> 
What you are seeing is the expected/documented behavior of flatten -
http://pig.apache.org/docs/r0.8.0/piglatin_ref2.html#Flatten+Operator
"Note that the flatten of empty bag will result in that row being discarded"
(Note that it's 'empty bag', not 'null').


> 
> You see if I do FLATTEN, all the rows with null values are all missing (in
> D). If I don't do FLATTEN, as in E, I have all the rows but not flattened,
> obviously. What I want as the end result is:
> 
> (1,a,b,x)
> (2,c,d,y)
> (3,e,f,{})
> (6,,,z)
> (8,,,w)
> 
> How can I get that? Thanks.
> 

D = FOREACH C GENERATE group, FLATTEN((IsEmpty(A) ? null : A.(f1, f2))),
FLATTEN((IsEmpty(B) ? null : B.f3));
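
For reference, here is a minimal end-to-end sketch of how that statement fits
into the original script. It assumes the test files are tab-delimited, which
is what LOAD expects when no USING clause is given (the listings in this
thread do not show the delimiter), and it relies on the Pig 0.8 behavior
described in the documentation quoted above. No output is shown, since the
exact form of the null columns may differ between Pig versions.

```pig
-- Load both inputs with the default loader (tab-delimited; an assumption).
A = LOAD 'test1.txt' AS (id, f1, f2);
B = LOAD 'test2.txt' AS (id, f3);

-- COGROUP keeps every id from either input; a missing side is an empty bag.
C = COGROUP A BY id, B BY id;

-- Replace each empty bag with null before flattening, so that FLATTEN
-- does not discard the row.
D = FOREACH C GENERATE group,
        FLATTEN((IsEmpty(A) ? null : A.(f1, f2))),
        FLATTEN((IsEmpty(B) ? null : B.f3));

DUMP D;
```

The IsEmpty guard has to be repeated for each bag that is flattened, one
bincond per bag.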
 

> I realize I could do FULL JOIN, but the problem is that after join, I
> wouldn't know which id is null, I would have to do many if then in the
> following generate command and I hope I can avoid that. E.g.,
> 
> C = JOIN A BY id FULL, B BY id;
> DUMP C
> (1,a,b,1,x)
> (2,c,d,2,y)
> (3,e,f,,)
> (,,,6,z)
> (,,,8,w)
> 



> DESCRIBE C;
> C: {A::id: bytearray,A::f1: bytearray,A::f2: bytearray,B::id:
> bytearray,B::f3: bytearray}
> 
> Sometimes A::id is null, sometimes B::id null, I always only want the
> non-null id in my output.
> 

You can get this by using the conditional expression (called bincond in pig
documents) (? : ).

E = foreach C generate (A::id is null ? B::id : A::id), A::f1, A::f2, B::f3;
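
Put together with the FULL JOIN from the question, a minimal sketch of the
whole script might look like this. It uses the lowercase field names f1, f2
and f3 declared in the LOAD statements, and again assumes tab-delimited input.
The "AS id" alias on the bincond is an addition for readability and is not
part of the statement above.

```pig
-- Load both inputs with the default loader (tab-delimited; an assumption).
A = LOAD 'test1.txt' AS (id, f1, f2);
B = LOAD 'test2.txt' AS (id, f3);

-- A full outer join keeps unmatched rows from both sides; the columns of
-- the missing side are null.
C = JOIN A BY id FULL, B BY id;

-- Take whichever id is non-null; "AS id" is an illustrative alias.
E = FOREACH C GENERATE (A::id IS NULL ? B::id : A::id) AS id,
        A::f1, A::f2, B::f3;

DUMP E;
```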

-Thejas