You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@pig.apache.org by "Walker, Cameron" <Ca...@aig.com> on 2014/08/29 17:58:05 UTC

How JOIN works ?

I am stumped by what, clearly, is my misunderstanding of how PIG does a JOIN.

Data in 'data' file:

A1 B1
A1 B1
A2 B2
A2 B2

My little program:

A = load 'data' using PigStorage(' ') AS (var1:chararray,var2:chararray);

B1 = foreach A generate var1,var2;
B2 = foreach A generate var1,var2;

C = join B1 by (var1,var2), B2 by (var1,var2);

dump C;

What I expect to get:

(A1,B1,A1,B1)
(A2,B2,A2,B2)

What I actually get:

(A1,B1,A1,B1)
(A1,B1,A1,B1)
(A1,B1,A1,B1)
(A1,B1,A1,B1)
(A2,B2,A2,B2)
(A2,B2,A2,B2)
(A2,B2,A2,B2)
(A2,B2,A2,B2)


I can't understand why I'm getting 4 results instead of 1 for each pair. Clearly it's a 2*2 thing so, if I had 3 of each, I'm sure I'd end up with 9 per set. My question is why the Cartesian join? JOIN must have a subtle difference to it in PIG than regular relational databases.



Cameron Walker
Americas IT | AIG Property Casualty
Senior DBA, Enterprise Application Technology Services
200 South College
13th Floor
Charlotte, NC 28202
(Off): 704-338-7423
(Cell): 980-201-0496
mailto:cameron.walker1@aig.com | http://www.aig.com


Re: How JOIN works ?

Posted by Olga Natkovich <on...@yahoo.com.INVALID>.
Join is a cross product in Pig and in SQL meaning that each matching pair would be retained. Since you have 2 rows on each side with the same key, you get 4 rows in the output. If you only want 1, you need to run distinct on both relationships first.


On Friday, August 29, 2014 8:59 AM, "Walker, Cameron" <Ca...@aig.com> wrote:
 


I am stumped by what, clearly, is my misunderstanding of how PIG does a JOIN.

Data in 'data' file:

A1 B1
A1 B1
A2 B2
A2 B2

My little program:

A = load 'data' using PigStorage(' ') AS (var1:chararray,var2:chararray);

B1 = foreach A generate var1,var2;
B2 = foreach A generate var1,var2;

C = join B1 by (var1,var2), B2 by (var1,var2);

dump C;

What I expect to get:

(A1,B1,A1,B1)
(A2,B2,A2,B2)

What I actually get:

(A1,B1,A1,B1)
(A1,B1,A1,B1)
(A1,B1,A1,B1)
(A1,B1,A1,B1)
(A2,B2,A2,B2)
(A2,B2,A2,B2)
(A2,B2,A2,B2)
(A2,B2,A2,B2)


I can't understand why I'm getting 4 results instead of 1 for each pair. Clearly it's a 2*2 thing so, if I had 3 of each, I'm sure I'd end up with 9 per set. My question is why the Cartesian join? JOIN must have a subtle difference to it in PIG than regular relational databases.



Cameron Walker
Americas IT | AIG Property Casualty
Senior DBA, Enterprise Application Technology Services
200 South College
13th Floor
Charlotte, NC 28202
(Off): 704-338-7423
(Cell): 980-201-0496
mailto:cameron.walker1@aig.com | http://www.aig.com