You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@pig.apache.org by "Dan DeCapria, CivicScience" <da...@civicscience.com> on 2013/04/19 20:24:27 UTC

LEFT OUTER JOIN?

Consider two aliases (T) and (U), loaded from data with schema defined
below. I was considering a left outer join to 'merge' the two records,
overriding those in U with the join fields in T, but the result isn't what
I desired.  Possibly a filter then union is required? I'm just not sure;
looking for a quick way to get to this result (R):

R: ?[JOIN T BY (a,b,c,d) LEFT OUTER, U BY (a,b,c,d)]: schema : a,b,c,d,x,y
a1 b1 c1 d1 *g1 h1*
a2 b2 c2 d2 e2 f2
a3 b3 c3 d3 *g3 h3*
a4 b4 c4 d4 e4 f4
a5 b5 c5 d5 e5 f5
a6 b6 c6 d6 g6 h6

T : schema : a,b,c,d,x,y
a1 b1 c1 d1 g1 h1
a3 b3 c3 d3 g3 h3
a6 b6 c6 d6 g6 h6

U: schema : a,b,c,d,x,y
a1 b1 c1 d1 e1 f1
a2 b2 c2 d2 e2 f2
a3 b3 c3 d3 e3 f3
a4 b4 c4 d4 e4 f4
a5 b5 c5 d5 e5 f5

Many thanks,  -Dan

Re: LEFT OUTER JOIN?

Posted by "Dan DeCapria, CivicScience" <da...@civicscience.com>.
Have a solution which I personally don't like, but it seems to work for now:

T = LOAD 'T.dat' AS (a:chararray, b:chararray, c:chararray, d:chararray,
x:chararray, y:chararray);
U = LOAD 'U.dat' AS (a:chararray, b:chararray, c:chararray, d:chararray,
x:chararray, y:chararray);
A = JOIN T BY (a,b,c,d) FULL OUTER, U BY (a,b,c,d);
B = FOREACH A GENERATE
(T::a IS NOT NULL AND T::b IS NOT NULL AND T::c IS NOT NULL AND T::d IS NOT
NULL ? T::a : U::a) AS a:chararray,
(T::a IS NOT NULL AND T::b IS NOT NULL AND T::c IS NOT NULL AND T::d IS NOT
NULL ? T::b : U::b) AS b:chararray,
(T::a IS NOT NULL AND T::b IS NOT NULL AND T::c IS NOT NULL AND T::d IS NOT
NULL ? T::c : U::c) AS c:chararray,
(T::a IS NOT NULL AND T::b IS NOT NULL AND T::c IS NOT NULL AND T::d IS NOT
NULL ? T::d : U::d) AS d:chararray,
(T::a IS NOT NULL AND T::b IS NOT NULL AND T::c IS NOT NULL AND T::d IS NOT
NULL ? T::x : U::x) AS x:chararray,
(T::a IS NOT NULL AND T::b IS NOT NULL AND T::c IS NOT NULL AND T::d IS NOT
NULL ? T::y : U::y) AS y:chararray;
DUMP B;
----
(a1,b1,c1,d1,g1,h1)
(a2,b2,c2,d2,e2,f2)
(a3,b3,c3,d3,g3,h3)
(a4,b4,c4,d4,e4,f4)
(a5,b5,c5,d5,e5,f5)
(a6,b6,c6,d6,g6,h6)

Bumping for an optimization here.

Thanks,  -Dan



On Fri, Apr 19, 2013 at 2:24 PM, Dan DeCapria, CivicScience <
dan.decapria@civicscience.com> wrote:

> Consider two aliases (T) and (U), loaded from data with schema defined
> below. I was considering a left outer join to 'merge' the two records,
> overriding those in U with the join fields in T, but the result isn't what
> I desired.  Possibly a filter then union is required? I'm just not sure;
> looking for a quick way to get to this result (R):
>
> R: ?[JOIN T BY (a,b,c,d) LEFT OUTER, U BY (a,b,c,d)]: schema : a,b,c,d,x,y
> a1 b1 c1 d1 *g1 h1*
> a2 b2 c2 d2 e2 f2
> a3 b3 c3 d3 *g3 h3*
> a4 b4 c4 d4 e4 f4
> a5 b5 c5 d5 e5 f5
> a6 b6 c6 d6 g6 h6
>
> T : schema : a,b,c,d,x,y
> a1 b1 c1 d1 g1 h1
> a3 b3 c3 d3 g3 h3
> a6 b6 c6 d6 g6 h6
>
> U: schema : a,b,c,d,x,y
> a1 b1 c1 d1 e1 f1
> a2 b2 c2 d2 e2 f2
> a3 b3 c3 d3 e3 f3
> a4 b4 c4 d4 e4 f4
> a5 b5 c5 d5 e5 f5
>
> Many thanks,  -Dan
>