Posted to user@pig.apache.org by dhaval deshpande <dh...@gmail.com> on 2009/11/24 09:56:32 UTC

Wanted to create a custom group function

Hi,
I wanted to create a custom group function in Pig, but I was not sure where
to start. I checked some documentation online but couldn't figure it out. I
also checked the wiki, which says I need to extend an abstract class
GroupFunc, but when I try to do that it says the GroupFunc class does not
exist. Please help me figure out where to start.

Thanks,
Dhaval.

Re: Wanted to create a custom group function

Posted by dhaval deshpande <dh...@gmail.com>.
Thanks for the guidance, Dmitriy. That gives me a picture of where to start :)

On Tue, Nov 24, 2009 at 3:47 AM, Dmitriy Ryaboy <dv...@gmail.com> wrote:

> Hi Dhaval,
> What do you mean by "a custom group function"? To create a function that
> turns a tuple or a part of a tuple into a key you want to group by, you can
> use a regular EvalFunc. To create a custom aggregation function that
> performs some calculation on the result of grouping, you still write a
> regular EvalFunc, except it must work on a bag. You can implement the
> Algebraic interface to make it run faster:
> http://hadoop.apache.org/pig/docs/r0.5.0/udf.html .
>
> If you are working with the version in trunk, there is another interface you
> can implement for further efficiency gains, if it is applicable to your use
> case: http://wiki.apache.org/pig/PigAccumulatorSpec
>
> -Dmitriy
>
> On Tue, Nov 24, 2009 at 3:56 AM, dhaval deshpande <
> dhaval.deshpande@gmail.com> wrote:
>
> > Hi,
> > I wanted to create a custom group function in Pig, but I was not sure where
> > to start. I checked some documentation online but couldn't figure it out. I
> > also checked the wiki, which says I need to extend an abstract class
> > GroupFunc, but when I try to do that it says the GroupFunc class does not
> > exist. Please help me figure out where to start.
> >
> > Thanks,
> > Dhaval.
> >
>

Re: Wanted to create a custom group function

Posted by Dmitriy Ryaboy <dv...@gmail.com>.
Hi Dhaval,
What do you mean by "a custom group function"? To create a function that
turns a tuple or a part of a tuple into a key you want to group by, you can
use a regular EvalFunc. To create a custom aggregation function that
performs some calculation on the result of grouping, you still write a
regular EvalFunc, except it must work on a bag. You can implement the
Algebraic interface to make it run faster:
http://hadoop.apache.org/pig/docs/r0.5.0/udf.html .
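
As a rough illustration of why an algebraic function can run faster, the sketch below models the idea in plain Python rather than Pig's Java API. The names initial, intermediate and final_step are hypothetical stand-ins for the three stages an Algebraic UDF supplies; the only point is that partial results can be computed per chunk (for example in a combiner) and merged later, so the whole bag never has to reach one place at once.

```python
# Plain-Python model of an algebraic COUNT; this is not Pig's Java API.

def initial(record):
    # Runs once per input record: emit a partial count of 1.
    return 1

def intermediate(partials):
    # May run any number of times on partial results (combiner stage).
    return sum(partials)

def final_step(partials):
    # Runs once at the end to produce the answer for the group.
    return sum(partials)

def algebraic_count(chunks):
    # Each chunk stands for the records one map task saw for a group.
    combined = [intermediate([initial(r) for r in chunk]) for chunk in chunks]
    return final_step(combined)

chunks = [["a", "b"], ["c"], ["d", "e", "f"]]
total = algebraic_count(chunks)
```

A function fits this pattern only if merging partial results gives the same answer as processing the whole bag in one go; counts, sums, minima and maxima do, a median does not.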

If you are working with the version in trunk, there is another interface you
can implement for further efficiency gains, if it is applicable to your use
case: http://wiki.apache.org/pig/PigAccumulatorSpec
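
To give a feel for the accumulator idea, here is a minimal Python sketch, assuming the function is handed the bag in batches and asked for its value at the end. The class MaxAccumulator and its method names are illustrative only and should not be read as the signatures in the linked spec.

```python
# Plain-Python model of accumulator-style aggregation; names are illustrative.

class MaxAccumulator:
    def __init__(self):
        self.current = None

    def accumulate(self, batch):
        # Called repeatedly, each time with only part of the bag.
        for value in batch:
            if self.current is None or value > self.current:
                self.current = value

    def get_value(self):
        # Called once all batches for the current key have been seen.
        return self.current

    def cleanup(self):
        # Reset state before the next key is processed.
        self.current = None

acc = MaxAccumulator()
for batch in ([3, 9, 1], [4], [12, 7]):
    acc.accumulate(batch)
result = acc.get_value()
acc.cleanup()
```

Only the running state is kept in memory, which is where the gain comes from when the bags are large.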

-Dmitriy

On Tue, Nov 24, 2009 at 3:56 AM, dhaval deshpande <
dhaval.deshpande@gmail.com> wrote:

> Hi,
> I wanted to create a custom group function in Pig, but I was not sure where
> to start. I checked some documentation online but couldn't figure it out. I
> also checked the wiki, which says I need to extend an abstract class
> GroupFunc, but when I try to do that it says the GroupFunc class does not
> exist. Please help me figure out where to start.
>
> Thanks,
> Dhaval.
>

Re: Diffing two bags?

Posted by Alan Gates <ga...@yahoo-inc.com>.
Do you want to keep the distinct values separate by input, or mingle  
them?  The following script will keep them separate.
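A = load 'students' as (name);
B = load 'employees' as (name);
-- cogroup keeps one bag per input for every distinct name
C = cogroup A by name, B by name;
-- names with no student record: employees only
D = filter C by IsEmpty(A);
E = foreach D generate flatten(B);
store E into 'only_employees';
-- names with no employee record: students only
F = filter C by IsEmpty(B);
G = foreach F generate flatten(A);
store G into 'only_students';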

To mingle them, replace the two store calls with:

H = union E, G;
store H into 'only_employees_or_students';
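
For readers new to cogroup, the following plain-Python sketch mimics what the script does on the sample data from the question. It is a model of the semantics only, not how Pig executes it; the helper cogroup is hypothetical and assumes each record is just a name.

```python
from collections import defaultdict

def cogroup(left, right):
    # For every distinct name, keep one bag (list) per input.
    groups = defaultdict(lambda: ([], []))
    for name in left:
        groups[name][0].append(name)
    for name in right:
        groups[name][1].append(name)
    return groups

students = ["Jane", "John", "Dave"]
employees = ["Dave", "Sue", "Anne"]
groups = cogroup(students, employees)

# Student bag empty: the name appears only among employees.
only_employees = [n for a, b in groups.values() if not a for n in b]
# Employee bag empty: the name appears only among students.
only_students = [n for a, b in groups.values() if not b for n in a]
# The "mingled" variant is simply the union of the two results.
mingled = only_employees + only_students
```

Dave ends up in neither list because both of his bags are non-empty.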

Alan.

On Nov 25, 2009, at 11:41 AM, James Leek wrote:

> Hi, I'm trying to do something with Hadoop or Pig that I thought would be
> pretty straightforward, but it is turning out to be difficult for me to
> implement.  Of course, I'm very new to this, so I'm probably missing
> something obvious.
>
> What I want to do is a set difference.  I would like to take 2 bags,  
> and remove the values they have in common between them.  Let's say I  
> have two bags, 'students' and 'employees'.  I want to find which  
> students are just students, and which employees are just employees.   
> So, an example:
>
> Students:
> (Jane)
> (John)
> (Dave)
>
> Employees:
> (Dave)
> (Sue)
> (Anne)
>
> If I were to join these, I would get the students who are also  
> employees, or: (Dave).
>
> However, what I want is the distinct values:
>
> Only_Student:
> (Jane)
> (John)
>
> Only_Employee:
> (Sue)
> (Anne)
>
> This should be do-able in a single map-reduce pass, but I found I  
> was going to have to write a custom inputter for it so I could  
> remember which values were from the students file and which were  
> from the employees file.  (At least, I wasn't able to figure that  
> bit out.)  I also wasn't sure how to write the output to two  
> separate files.
>
> So I thought pig might have some quick way to do this, but so far  
> I've had no luck even expressing set subtraction in pig.  (I could  
> do this less efficiently with set subtraction like so: only_employee  
> = employees - join(students, employees)  )
>
> Does anyone know what I'm missing?
> Thanks,
> Jim


Re: Diffing two bags?

Posted by Dmitriy Ryaboy <dv...@gmail.com>.
Alan's use of cogroup is better and more "piggly".
I'm still mentally sql-bound....

-D

On Wed, Nov 25, 2009 at 2:58 PM, James Leek <le...@llnl.gov> wrote:

> Dmitriy Ryaboy wrote:
>
>> Hi Jim,
>> This sounds like a full outer join, with the nulls on the left meaning an
>> employee is just an employee, and a null on the right meaning a student is
>> just a student.
>>
>>
> Ah, good call.  Thanks.  I see there is a way to do outer join given in the
> pig latin manual.
>
> Thanks,
> Jim
>

Re: Diffing two bags?

Posted by James Leek <le...@llnl.gov>.
Dmitriy Ryaboy wrote:
> Hi Jim,
> This sounds like a full outer join, with the nulls on the left meaning an
> employee is just an employee, and a null on the right meaning a student is
> just a student.
>   
Ah, good call.  Thanks.  I see there is a way to do outer join given in 
the pig latin manual.

Thanks,
Jim

Re: Diffing two bags?

Posted by Dmitriy Ryaboy <dv...@gmail.com>.
Hi Jim,
This sounds like a full outer join, with the nulls on the left meaning an
employee is just an employee, and a null on the right meaning a student is
just a student.
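
A small Python sketch of that reading, assuming students are the left input, employees the right, and names are unique within each input. The function full_outer_join is a hypothetical helper written for this example, not a Pig call.

```python
def full_outer_join(left, right):
    # Returns (left_name, right_name) pairs; None marks the missing side.
    left_set, right_set = set(left), set(right)
    rows = []
    for name in left:
        rows.append((name, name if name in right_set else None))
    for name in right:
        if name not in left_set:
            rows.append((None, name))
    return rows

students = ["Jane", "John", "Dave"]
employees = ["Dave", "Sue", "Anne"]
rows = full_outer_join(students, employees)

# Null on the left: the person is only an employee.
only_employees = [r for l, r in rows if l is None]
# Null on the right: the person is only a student.
only_students = [l for l, r in rows if r is None]
```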

On Wed, Nov 25, 2009 at 2:41 PM, James Leek <le...@llnl.gov> wrote:

> Hi, I'm trying to do something with Hadoop or Pig that I thought would be
> pretty straightforward, but it is turning out to be difficult for me to
> implement.  Of course, I'm very new to this, so I'm probably missing
> something obvious.
>
> What I want to do is a set difference.  I would like to take 2 bags, and
> remove the values they have in common between them.  Let's say I have two
> bags, 'students' and 'employees'.  I want to find which students are just
> students, and which employees are just employees.  So, an example:
>
> Students:
> (Jane)
> (John)
> (Dave)
>
> Employees:
> (Dave)
> (Sue)
> (Anne)
>
> If I were to join these, I would get the students who are also employees,
> or: (Dave).
>
> However, what I want is the distinct values:
>
> Only_Student:
> (Jane)
> (John)
>
> Only_Employee:
> (Sue)
> (Anne)
>
> This should be do-able in a single map-reduce pass, but I found I was going
> to have to write a custom inputter for it so I could remember which values
> were from the students file and which were from the employees file.  (At
> least, I wasn't able to figure that bit out.)  I also wasn't sure how to
> write the output to two separate files.
>
> So I thought pig might have some quick way to do this, but so far I've had
> no luck even expressing set subtraction in pig.  (I could do this less
> efficiently with set subtraction like so: only_employee = employees -
> join(students, employees)  )
>
> Does anyone know what I'm missing?
> Thanks,
> Jim
>

Diffing two bags?

Posted by James Leek <le...@llnl.gov>.
Hi, I'm trying to do something with Hadoop or Pig that I thought would
be pretty straightforward, but it is turning out to be difficult for me
to implement.  Of course, I'm very new to this, so I'm probably missing
something obvious.

What I want to do is a set difference.  I would like to take 2 bags, and 
remove the values they have in common between them.  Let's say I have 
two bags, 'students' and 'employees'.  I want to find which students are 
just students, and which employees are just employees.  So, an example:

Students:
(Jane)
(John)
(Dave)

Employees:
(Dave)
(Sue)
(Anne)

If I were to join these, I would get the students who are also 
employees, or: (Dave).

However, what I want is the distinct values:

Only_Student:
(Jane)
(John)

Only_Employee:
(Sue)
(Anne)

This should be do-able in a single map-reduce pass, but I found I was 
going to have to write a custom inputter for it so I could remember 
which values were from the students file and which were from the 
employees file.  (At least, I wasn't able to figure that bit out.)  I 
also wasn't sure how to write the output to two separate files.

So I thought pig might have some quick way to do this, but so far I've 
had no luck even expressing set subtraction in pig.  (I could do this 
less efficiently with set subtraction like so: only_employee = employees 
- join(students, employees)  )

Does anyone know what I'm missing?
Thanks,
Jim

Re: Wanted to create a custom group function

Posted by dhaval deshpande <dh...@gmail.com>.
Thanks, Alan. I will try out the way you suggested. I found the GroupFunc at
http://wiki.apache.org/pig/GroupFunction. I hope that helps.

On Tue, Nov 24, 2009 at 10:15 AM, Alan Gates <ga...@yahoo-inc.com> wrote:

> Originally Pig had the concept of a GroupFunc, similar to EvalFunc and
> FilterFunc.  That was removed between 0.1 and 0.2.  Where did you still find
> this referenced in the documentation?  We should remove it to avoid
> confusion.
>
> You can group using a UDF to transform your group by key:
>
> A = load 'mydata';
> B = group A by myUDF($0);
> ...
>
> myUDF needs to extend EvalFunc in this case.
>
> Alan.
>
>
> On Nov 24, 2009, at 12:56 AM, dhaval deshpande wrote:
>
>  Hi,
>> I wanted to create a custom group function in Pig, but I was not sure where
>> to start. I checked some documentation online but couldn't figure it out. I
>> also checked the wiki, which says I need to extend an abstract class
>> GroupFunc, but when I try to do that it says the GroupFunc class does not
>> exist. Please help me figure out where to start.
>>
>> Thanks,
>> Dhaval.
>>
>
>

Re: Wanted to create a custom group function

Posted by Alan Gates <ga...@yahoo-inc.com>.
Originally Pig had the concept of a GroupFunc, similar to EvalFunc and  
FilterFunc.  That was removed between 0.1 and 0.2.  Where did you  
still find this referenced in the documentation?  We should remove it  
to avoid confusion.

You can group using a UDF to transform your group by key:

A = load 'mydata';
B = group A by myUDF($0);
...

myUDF needs to extend EvalFunc in this case.
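
To make the idea concrete, here is a plain-Python model of grouping by a transformed key. The function my_udf and the sample records are made up for illustration; in Pig the transform would be a Java class extending EvalFunc.

```python
from collections import defaultdict

def my_udf(field):
    # Hypothetical key transform: lower-cased first letter of the field.
    return field[:1].lower()

def group_by(records, key_func):
    # Mirrors "group A by myUDF($0)": the key is computed from field 0.
    groups = defaultdict(list)
    for record in records:
        groups[key_func(record[0])].append(record)
    return dict(groups)

records = [("Apple", 3), ("avocado", 1), ("Banana", 2)]
grouped = group_by(records, my_udf)
```

Each group holds the original, untransformed records; only the key is derived.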

Alan.

On Nov 24, 2009, at 12:56 AM, dhaval deshpande wrote:

> Hi,
> I wanted to create a custom group function in Pig, but I was not sure where
> to start. I checked some documentation online but couldn't figure it out. I
> also checked the wiki, which says I need to extend an abstract class
> GroupFunc, but when I try to do that it says the GroupFunc class does not
> exist. Please help me figure out where to start.
>
> Thanks,
> Dhaval.