Posted to user@pig.apache.org by Marco Cadetg <ma...@zattoo.com> on 2012/06/26 15:35:52 UTC

join result dataset bigger than before

Hi there,

I'm doing a join like this:

A = LOAD '/data/sessions' USING PigStorage(',') AS
(userid:chararray, client_type:chararray, flag:long);

A1 = GROUP A ALL;
A1 = FOREACH A1 GENERATE COUNT(A);
DUMP A1;
(543872)

B = LOAD '/data/userdb'  USING PigStorage(',') AS (uid:chararray,
birth_year:int);
A = JOIN A by userid, B by uid;
A1 = GROUP A ALL;
A1 = FOREACH A1 GENERATE COUNT(A);
DUMP A1;
(1079122)

Now the dataset has more rows than before the join, which is basically the
opposite of what I'm expecting, as not all userids in A have a matching uid
in the B dataset.

Does anyone have a hint as to what the problem is here?

Thanks,
-Marco

Re: join result dataset bigger than before

Posted by Marco Cadetg <ma...@zattoo.com>.
Hrm, this is obviously my bad. The right dataset (B) simply had multiple rows
per key, so each matching row of A came out of the join once per duplicate.
Sorry if anyone took the time to read the garbage.
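
For anyone hitting the same thing, a rough sketch of how one might check the
right-hand side for repeated join keys before joining. The path and schema are
the ones from the original post; the aliases B_by_uid, B_counts and B_dups are
made up for illustration.

```pig
-- Load the user table with the same schema as in the original script.
B = LOAD '/data/userdb' USING PigStorage(',') AS (uid:chararray, birth_year:int);

-- Count how many rows share each uid (aliases here are hypothetical).
B_by_uid = GROUP B BY uid;
B_counts = FOREACH B_by_uid GENERATE group AS uid, COUNT(B) AS n;

-- Keep only the uids that occur more than once; each of these multiplies
-- the matching rows of A in an inner join.
B_dups = FILTER B_counts BY n > 1;
DUMP B_dups;
```

If B_dups comes back non-empty, the row count after the join can exceed the
row count of A, even though unmatched userids are dropped.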

Cheers,
-Marco
