You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@pig.apache.org by James Kebinger <jk...@gmail.com> on 2011/02/17 00:57:03 UTC

Problems with union, projection producing unexpected results

Hello all, I've been scratching my head over a problem with a pig script I'm
having, and hoping another set of eyeballs will help. I'm using pig 0.8, in
local mode

Here's my simplified use case:

I have a log file with events on pages, and the id of the event can be a
users login or a users numeric id:

2010-05-14,james
2010-05-15,123
2010-05-15,23
2010-05-15,456
2010-05-15,notjames

So i want to join a set of users on either the login or user id.

Here's my users:

123,james,11
234,notjames,11
456,someoneelse,11


So I thought I would be clever and load the user list, union it with itself
to generate a relation where each user is represented twice, once by login,
once by id:

logins = FOREACH users GENERATE LOWER(login) as matching_id, user_id as
actual_user_id;
user_ids = FOREACH users GENERATE user_id as matching_id, user_id as
actual_user_id;
user_id_or_login_lookup = UNION logins, user_ids;



user_id_or_login_lookup: {matching_id: chararray,actual_user_id: chararray}
(123,123)
(234,234)
(456,456)
(james,123)
(notjames,234)
(someoneelse,456)

Then join on that, by the first column, and project that away, leaving just
the event info and the numeric id.

views_with_id = JOIN profile_views by viewed_user_id,
user_id_or_login_lookup by matching_id;

That is not working however. My joined relation looks like this (which is
what I expect)

views_with_id: {profile_views::date:
chararray,profile_views::viewed_user_id:
chararray,user_id_or_login_lookup::matching_id:
chararray,user_id_or_login_lookup::actual_user_id: chararray}

(2010-05-15,123,123,123)
(2010-05-15,456,456,456)
(2010-05-14,james,james,123)
(2010-05-15,notjames,notjames,234)

But when I project as follows: views_with_id_projected = FOREACH
views_with_id GENERATE date, viewed_user_id,
user_id_or_login_lookup::actual_user_id;

The result is not what I expect

(2010-05-15,123,123)
(2010-05-15,456,456)
(2010-05-14,james,james)
(2010-05-15,notjames,notjames)

To be clear, I expect

(2010-05-15,123,123)
(2010-05-15,456,456)
(2010-05-14,james,123)
(2010-05-15,notjames,456)


Can anyone give me a push in the right direction?

Thanks, James

Here's my full pig script:

users = LOAD 'patients-test.txt' USING PigStorage(',') AS
(user_id:chararray, login:chararray, disease_id: chararray);
profile_views = LOAD 'patient-views-test.txt' USING PigStorage(',') AS(date:
chararray, viewed_user_id:chararray);

dump users;
dump profile_views;

-- build a relation so that users are present to join by login or user_id
logins = FOREACH users GENERATE LOWER(login) as matching_id, user_id as
actual_user_id;
user_ids = FOREACH users GENERATE user_id as matching_id, user_id as
actual_user_id;
user_id_or_login_lookup = UNION logins, user_ids;

dump user_id_or_login_lookup;
describe user_id_or_login_lookup;

views_with_id = JOIN profile_views by viewed_user_id,
user_id_or_login_lookup by matching_id;

describe views_with_id;
--STORE views_with_id into 'ep-views.txt';

dump views_with_id;

views_with_id_projected = FOREACH views_with_id GENERATE date,
viewed_user_id, user_id_or_login_lookup::actual_user_id;


dump views_with_id_projected;

Re: Problems with union, projection producing unexpected results

Posted by James Kebinger <jk...@gmail.com>.
https://issues.apache.org/jira/browse/PIG-1859

On Thu, Feb 17, 2011 at 12:32 PM, James Kebinger <jk...@gmail.com>wrote:

> Interesting, maybe I should file a bug report then?
>
>
> On Thu, Feb 17, 2011 at 10:41 AM, Jonathan Coveney <jc...@gmail.com>wrote:
>
>> I am glad that you got this in a replicatable form! I have seen this error
>> as well (where the output is just the last value repeated instead of the
>> multiples that you want), but wasn't able to give a concrete example.
>>
>> 2011/2/16 James Kebinger <jk...@gmail.com>
>>
>> > Hello all, I've been scratching my head over a problem with a pig script
>> > I'm
>> > having, and hoping another set of eyeballs will help. I'm using pig 0.8,
>> in
>> > local mode
>> >
>> > Here's my simplified use case:
>> >
>> > I have a log file with events on pages, and the id of the event can be a
>> > users login or a users numeric id:
>> >
>> > 2010-05-14,james
>> > 2010-05-15,123
>> > 2010-05-15,23
>> > 2010-05-15,456
>> > 2010-05-15,notjames
>> >
>> > So i want to join a set of users on either the login or user id.
>> >
>> > Here's my users:
>> >
>> > 123,james,11
>> > 234,notjames,11
>> > 456,someoneelse,11
>> >
>> >
>> > So I thought I would be clever and load the user list, union it with
>> itself
>> > to generate a relation where each user is represented twice, once by
>> login,
>> > once by id:
>> >
>> > logins = FOREACH users GENERATE LOWER(login) as matching_id, user_id as
>> > actual_user_id;
>> > user_ids = FOREACH users GENERATE user_id as matching_id, user_id as
>> > actual_user_id;
>> > user_id_or_login_lookup = UNION logins, user_ids;
>> >
>> >
>> >
>> > user_id_or_login_lookup: {matching_id: chararray,actual_user_id:
>> chararray}
>> > (123,123)
>> > (234,234)
>> > (456,456)
>> > (james,123)
>> > (notjames,234)
>> > (someoneelse,456)
>> >
>> > Then join on that, by the first column, and project that away, leaving
>> just
>> > the event info and the numeric id.
>> >
>> > views_with_id = JOIN profile_views by viewed_user_id,
>> > user_id_or_login_lookup by matching_id;
>> >
>> > That is not working however. My joined relation looks like this (which
>> is
>> > what I expect)
>> >
>> > views_with_id: {profile_views::date:
>> > chararray,profile_views::viewed_user_id:
>> > chararray,user_id_or_login_lookup::matching_id:
>> > chararray,user_id_or_login_lookup::actual_user_id: chararray}
>> >
>> > (2010-05-15,123,123,123)
>> > (2010-05-15,456,456,456)
>> > (2010-05-14,james,james,123)
>> > (2010-05-15,notjames,notjames,234)
>> >
>> > But when I project as follows: views_with_id_projected = FOREACH
>> > views_with_id GENERATE date, viewed_user_id,
>> > user_id_or_login_lookup::actual_user_id;
>> >
>> > The result is not what I expect
>> >
>> > (2010-05-15,123,123)
>> > (2010-05-15,456,456)
>> > (2010-05-14,james,james)
>> > (2010-05-15,notjames,notjames)
>> >
>> > To be clear, I expect
>> >
>> > (2010-05-15,123,123)
>> > (2010-05-15,456,456)
>> > (2010-05-14,james,123)
>> > (2010-05-15,notjames,456)
>> >
>> >
>> > Can anyone give me a push in the right direction?
>> >
>> > Thanks, James
>> >
>> > Here's my full pig script:
>> >
>> > users = LOAD 'patients-test.txt' USING PigStorage(',') AS
>> > (user_id:chararray, login:chararray, disease_id: chararray);
>> > profile_views = LOAD 'patient-views-test.txt' USING PigStorage(',')
>> > AS(date:
>> > chararray, viewed_user_id:chararray);
>> >
>> > dump users;
>> > dump profile_views;
>> >
>> > -- build a relation so that users are present to join by login or
>> user_id
>> > logins = FOREACH users GENERATE LOWER(login) as matching_id, user_id as
>> > actual_user_id;
>> > user_ids = FOREACH users GENERATE user_id as matching_id, user_id as
>> > actual_user_id;
>> > user_id_or_login_lookup = UNION logins, user_ids;
>> >
>> > dump user_id_or_login_lookup;
>> > describe user_id_or_login_lookup;
>> >
>> > views_with_id = JOIN profile_views by viewed_user_id,
>> > user_id_or_login_lookup by matching_id;
>> >
>> > describe views_with_id;
>> > --STORE views_with_id into 'ep-views.txt';
>> >
>> > dump views_with_id;
>> >
>> > views_with_id_projected = FOREACH views_with_id GENERATE date,
>> > viewed_user_id, user_id_or_login_lookup::actual_user_id;
>> >
>> >
>> > dump views_with_id_projected;
>> >
>>
>
>

Re: Problems with union, projection producing unexpected results

Posted by James Kebinger <jk...@gmail.com>.
Interesting, maybe I should file a bug report then?

On Thu, Feb 17, 2011 at 10:41 AM, Jonathan Coveney <jc...@gmail.com>wrote:

> I am glad that you got this in a replicatable form! I have seen this error
> as well (where the output is just the last value repeated instead of the
> multiples that you want), but wasn't able to give a concrete example.
>
> 2011/2/16 James Kebinger <jk...@gmail.com>
>
> > Hello all, I've been scratching my head over a problem with a pig script
> > I'm
> > having, and hoping another set of eyeballs will help. I'm using pig 0.8,
> in
> > local mode
> >
> > Here's my simplified use case:
> >
> > I have a log file with events on pages, and the id of the event can be a
> > users login or a users numeric id:
> >
> > 2010-05-14,james
> > 2010-05-15,123
> > 2010-05-15,23
> > 2010-05-15,456
> > 2010-05-15,notjames
> >
> > So i want to join a set of users on either the login or user id.
> >
> > Here's my users:
> >
> > 123,james,11
> > 234,notjames,11
> > 456,someoneelse,11
> >
> >
> > So I thought I would be clever and load the user list, union it with
> itself
> > to generate a relation where each user is represented twice, once by
> login,
> > once by id:
> >
> > logins = FOREACH users GENERATE LOWER(login) as matching_id, user_id as
> > actual_user_id;
> > user_ids = FOREACH users GENERATE user_id as matching_id, user_id as
> > actual_user_id;
> > user_id_or_login_lookup = UNION logins, user_ids;
> >
> >
> >
> > user_id_or_login_lookup: {matching_id: chararray,actual_user_id:
> chararray}
> > (123,123)
> > (234,234)
> > (456,456)
> > (james,123)
> > (notjames,234)
> > (someoneelse,456)
> >
> > Then join on that, by the first column, and project that away, leaving
> just
> > the event info and the numeric id.
> >
> > views_with_id = JOIN profile_views by viewed_user_id,
> > user_id_or_login_lookup by matching_id;
> >
> > That is not working however. My joined relation looks like this (which is
> > what I expect)
> >
> > views_with_id: {profile_views::date:
> > chararray,profile_views::viewed_user_id:
> > chararray,user_id_or_login_lookup::matching_id:
> > chararray,user_id_or_login_lookup::actual_user_id: chararray}
> >
> > (2010-05-15,123,123,123)
> > (2010-05-15,456,456,456)
> > (2010-05-14,james,james,123)
> > (2010-05-15,notjames,notjames,234)
> >
> > But when I project as follows: views_with_id_projected = FOREACH
> > views_with_id GENERATE date, viewed_user_id,
> > user_id_or_login_lookup::actual_user_id;
> >
> > The result is not what I expect
> >
> > (2010-05-15,123,123)
> > (2010-05-15,456,456)
> > (2010-05-14,james,james)
> > (2010-05-15,notjames,notjames)
> >
> > To be clear, I expect
> >
> > (2010-05-15,123,123)
> > (2010-05-15,456,456)
> > (2010-05-14,james,123)
> > (2010-05-15,notjames,456)
> >
> >
> > Can anyone give me a push in the right direction?
> >
> > Thanks, James
> >
> > Here's my full pig script:
> >
> > users = LOAD 'patients-test.txt' USING PigStorage(',') AS
> > (user_id:chararray, login:chararray, disease_id: chararray);
> > profile_views = LOAD 'patient-views-test.txt' USING PigStorage(',')
> > AS(date:
> > chararray, viewed_user_id:chararray);
> >
> > dump users;
> > dump profile_views;
> >
> > -- build a relation so that users are present to join by login or user_id
> > logins = FOREACH users GENERATE LOWER(login) as matching_id, user_id as
> > actual_user_id;
> > user_ids = FOREACH users GENERATE user_id as matching_id, user_id as
> > actual_user_id;
> > user_id_or_login_lookup = UNION logins, user_ids;
> >
> > dump user_id_or_login_lookup;
> > describe user_id_or_login_lookup;
> >
> > views_with_id = JOIN profile_views by viewed_user_id,
> > user_id_or_login_lookup by matching_id;
> >
> > describe views_with_id;
> > --STORE views_with_id into 'ep-views.txt';
> >
> > dump views_with_id;
> >
> > views_with_id_projected = FOREACH views_with_id GENERATE date,
> > viewed_user_id, user_id_or_login_lookup::actual_user_id;
> >
> >
> > dump views_with_id_projected;
> >
>

Re: Problems with union, projection producing unexpected results

Posted by Jonathan Coveney <jc...@gmail.com>.
I am glad that you got this in a replicatable form! I have seen this error
as well (where the output is just the last value repeated instead of the
multiples that you want), but wasn't able to give a concrete example.

2011/2/16 James Kebinger <jk...@gmail.com>

> Hello all, I've been scratching my head over a problem with a pig script
> I'm
> having, and hoping another set of eyeballs will help. I'm using pig 0.8, in
> local mode
>
> Here's my simplified use case:
>
> I have a log file with events on pages, and the id of the event can be a
> users login or a users numeric id:
>
> 2010-05-14,james
> 2010-05-15,123
> 2010-05-15,23
> 2010-05-15,456
> 2010-05-15,notjames
>
> So i want to join a set of users on either the login or user id.
>
> Here's my users:
>
> 123,james,11
> 234,notjames,11
> 456,someoneelse,11
>
>
> So I thought I would be clever and load the user list, union it with itself
> to generate a relation where each user is represented twice, once by login,
> once by id:
>
> logins = FOREACH users GENERATE LOWER(login) as matching_id, user_id as
> actual_user_id;
> user_ids = FOREACH users GENERATE user_id as matching_id, user_id as
> actual_user_id;
> user_id_or_login_lookup = UNION logins, user_ids;
>
>
>
> user_id_or_login_lookup: {matching_id: chararray,actual_user_id: chararray}
> (123,123)
> (234,234)
> (456,456)
> (james,123)
> (notjames,234)
> (someoneelse,456)
>
> Then join on that, by the first column, and project that away, leaving just
> the event info and the numeric id.
>
> views_with_id = JOIN profile_views by viewed_user_id,
> user_id_or_login_lookup by matching_id;
>
> That is not working however. My joined relation looks like this (which is
> what I expect)
>
> views_with_id: {profile_views::date:
> chararray,profile_views::viewed_user_id:
> chararray,user_id_or_login_lookup::matching_id:
> chararray,user_id_or_login_lookup::actual_user_id: chararray}
>
> (2010-05-15,123,123,123)
> (2010-05-15,456,456,456)
> (2010-05-14,james,james,123)
> (2010-05-15,notjames,notjames,234)
>
> But when I project as follows: views_with_id_projected = FOREACH
> views_with_id GENERATE date, viewed_user_id,
> user_id_or_login_lookup::actual_user_id;
>
> The result is not what I expect
>
> (2010-05-15,123,123)
> (2010-05-15,456,456)
> (2010-05-14,james,james)
> (2010-05-15,notjames,notjames)
>
> To be clear, I expect
>
> (2010-05-15,123,123)
> (2010-05-15,456,456)
> (2010-05-14,james,123)
> (2010-05-15,notjames,456)
>
>
> Can anyone give me a push in the right direction?
>
> Thanks, James
>
> Here's my full pig script:
>
> users = LOAD 'patients-test.txt' USING PigStorage(',') AS
> (user_id:chararray, login:chararray, disease_id: chararray);
> profile_views = LOAD 'patient-views-test.txt' USING PigStorage(',')
> AS(date:
> chararray, viewed_user_id:chararray);
>
> dump users;
> dump profile_views;
>
> -- build a relation so that users are present to join by login or user_id
> logins = FOREACH users GENERATE LOWER(login) as matching_id, user_id as
> actual_user_id;
> user_ids = FOREACH users GENERATE user_id as matching_id, user_id as
> actual_user_id;
> user_id_or_login_lookup = UNION logins, user_ids;
>
> dump user_id_or_login_lookup;
> describe user_id_or_login_lookup;
>
> views_with_id = JOIN profile_views by viewed_user_id,
> user_id_or_login_lookup by matching_id;
>
> describe views_with_id;
> --STORE views_with_id into 'ep-views.txt';
>
> dump views_with_id;
>
> views_with_id_projected = FOREACH views_with_id GENERATE date,
> viewed_user_id, user_id_or_login_lookup::actual_user_id;
>
>
> dump views_with_id_projected;
>