Posted to user@pig.apache.org by James Leek <le...@llnl.gov> on 2009/11/25 20:41:54 UTC

Diffing two bags?

Hi, I'm trying to do something with Hadoop or Pig that I thought would 
be pretty straightforward, but it is turning out to be difficult for me to 
implement.  Of course, I'm very new to this, so I'm probably missing 
something obvious.

What I want to do is a set difference.  I would like to take 2 bags, and 
remove the values they have in common between them.  Let's say I have 
two bags, 'students' and 'employees'.  I want to find which students are 
just students, and which employees are just employees.  So, an example:

Students:
(Jane)
(John)
(Dave)

Employees:
(Dave)
(Sue)
(Anne)

If I were to join these, I would get the students who are also 
employees, or: (Dave).

However, what I want is the distinct values:

Only_Student:
(Jane)
(John)

Only_Employee:
(Sue)
(Anne)

This should be do-able in a single map-reduce pass, but I found I was 
going to have to write a custom inputter for it so I could remember 
which values were from the students file and which were from the 
employees file.  (At least, I wasn't able to figure that bit out.)  I 
also wasn't sure how to write the output to two separate files.

So I thought pig might have some quick way to do this, but so far I've 
had no luck even expressing set subtraction in pig.  (I could do this 
less efficiently with set subtraction like so: only_employee = employees 
- join(students, employees)  )

Does anyone know what I'm missing?
Thanks,
Jim

Re: Diffing two bags?

Posted by Alan Gates <ga...@yahoo-inc.com>.
Do you want to keep the distinct values separate by input, or mingle  
them?  The following script will keep them separate.
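-- Load both inputs with a single field called name.
A = load 'students' as (name);
B = load 'employees' as (name);
-- One tuple per distinct name: (group, bag of A rows, bag of B rows).
C = cogroup A by name, B by name;
-- Empty A bag: the name appears only in employees.
D = filter C by IsEmpty(A);
E = foreach D generate flatten(B);
store E into 'only_employees';
-- Empty B bag: the name appears only in students.
F = filter C by IsEmpty(B);
G = foreach F generate flatten(A);
store G into 'only_students';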

To mingle them, replace the two store calls with:

H = union E, G;
store H into 'only_employees_or_students';
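
As an illustration (not part of the original message), this is roughly what the cogrouped relation C should look like for the sample data in the question, assuming each input file holds one name per line. The order of the tuples is not guaranteed.

```pig
-- dump C; on the sample students/employees data (tuple order may vary):
-- (Anne,{},{(Anne)})
-- (Dave,{(Dave)},{(Dave)})
-- (Jane,{(Jane)},{})
-- (John,{(John)},{})
-- (Sue,{},{(Sue)})
```

Each tuple is (group key, bag of matching A rows, bag of matching B rows), so IsEmpty(A) selects Anne and Sue, and IsEmpty(B) selects Jane and John.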

Alan.


Re: Diffing two bags?

Posted by Dmitriy Ryaboy <dv...@gmail.com>.
Alan's use of cogroup is better and more "piggly".
I'm still mentally sql-bound....

-D

Re: Diffing two bags?

Posted by James Leek <le...@llnl.gov>.
Dmitriy Ryaboy wrote:
> Hi Jim,
> This sounds like a full outer join, with the nulls on the left meaning an
> employee is just an employee, and a null on the right meaning a student is
> just a student.
>   
Ah, good call.  Thanks.  I see there is a way to do outer join given in 
the pig latin manual.

Thanks,
Jim

Re: Diffing two bags?

Posted by Dmitriy Ryaboy <dv...@gmail.com>.
Hi Jim,
This sounds like a full outer join, with the nulls on the left meaning an
employee is just an employee, and a null on the right meaning a student is
just a student.
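
A minimal sketch of the full outer join approach, assuming a Pig release in which join accepts the full outer modifier (older releases may need the cogroup form instead), and assuming the same 'students' and 'employees' input paths used elsewhere in this thread. The relation names and the chararray schema are illustrative, and names are assumed to be non-null.

```pig
-- Assumption: one name per line in each input; a schema is declared
-- because outer joins need to know the fields of both sides.
students  = load 'students'  as (name:chararray);
employees = load 'employees' as (name:chararray);

-- Full outer join keeps unmatched rows from both sides, padded with nulls.
joined = join students by name full outer, employees by name;

-- Null on the right: the name appears only in students.
s_only = filter joined by employees::name is null;
only_students = foreach s_only generate students::name as name;
store only_students into 'only_students';

-- Null on the left: the name appears only in employees.
e_only = filter joined by students::name is null;
only_employees = foreach e_only generate employees::name as name;
store only_employees into 'only_employees';
```

If a name can appear more than once in an input, the join keeps the duplicates of unmatched names, so a distinct step may be wanted before storing.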
