Posted to common-user@hadoop.apache.org by Dan Milstein <dm...@hubteam.com> on 2009/04/23 22:44:42 UTC

Writing a New Aggregate Function

Hello all,

I've been using streaming + the aggregate package (available via
-reducer aggregate), and have been very happy with what it gives me.

I'm interested in writing my own new aggregate functions (in Java)  
which I could then access from my streaming code.

Can anyone give me pointers towards how to make that happen?  I've  
read through the aggregate package source, but I'm not seeing how to  
define my own, and get access to it from streaming.

To be specific, here's the sort of thing I'd like to be able to do:

  - In Java, define a SampleValues aggregator, which chooses a sample  
of the input given to it

  - From my streaming program, in say python, output:

SampleValues:some_key \t some_value

  - Have the aggregate framework somehow call my new aggregator for  
the combiner and reducer steps

Thanks,
-Dan Milstein
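
For readers following along, the record format described in this message can be sketched as a small streaming mapper. This is only an illustration under stated assumptions: the input is assumed to be tab-separated key/value lines, the helper names (format_record, map_lines, main) are made up for this note, and SampleValues is the hypothetical aggregator type from the message, not one the stock aggregate package provides. With a stock type such as LongValueSum the same "Type:key<TAB>value" shape applies.

```python
import sys

# The aggregate package expects each mapper output line to look like
#   <AggregatorType>:<key>\t<value>
TYPE_SEPARATOR = ":"


def format_record(agg_type, key, value):
    """Build one output line (without newline) for the aggregate reducer."""
    return f"{agg_type}{TYPE_SEPARATOR}{key}\t{value}"


def map_lines(lines, agg_type="SampleValues"):
    """Turn 'key<TAB>value' input lines into aggregate-style records.

    agg_type defaults to the hypothetical SampleValues type; pass
    "LongValueSum" (a stock type) to try this against an unpatched Hadoop.
    """
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank input lines
        key, _, value = line.partition("\t")
        yield format_record(agg_type, key, value)


def main(stdin=sys.stdin, stdout=sys.stdout):
    """Read records from stdin and write aggregate-style lines to stdout."""
    for record in map_lines(stdin):
        stdout.write(record + "\n")

# In a real mapper script, finish with a call to main().
```

A script like this would be passed to streaming with -mapper, alongside -reducer aggregate. The SampleValues prefix only does anything once the Java side knows that type name.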

Re: Writing a New Aggregate Function

Posted by Runping Qi <ru...@gmail.com>.
You are right; you have to patch the code in the aggregate package.


On Fri, Apr 24, 2009 at 10:24 AM, Dan Milstein <dm...@hubteam.com> wrote:

> Runping,
>
> Thanks for the response.  A question about case (2) below, (which is, in
> fact, what I want to do):
>
>  - Is there any way to do this without patching the code within the
> aggregator package?
>
> It sure doesn't look like it, but just to make sure.
>
> Thanks again,
> -Dan M

Re: Writing a New Aggregate Function

Posted by Dan Milstein <dm...@hubteam.com>.
Runping,

Thanks for the response.  A question about case (2) below (which is,
in fact, what I want to do):

  - Is there any way to do this without patching the code within the  
aggregator package?

It sure doesn't look like it, but just to make sure.

Thanks again,
-Dan M


Re: Writing a New Aggregate Function

Posted by Runping Qi <ru...@gmail.com>.
A few general goals behind the aggregate package:

1. If you are an application developer using the aggregate package, you only
need to develop your own (user-defined) value aggregator descriptor classes,
which are typically implementations of ValueAggregatorDescriptor. You can use
the existing aggregator types (such as LongValueSum, ValueHistogram, etc.).

2. If you want to contribute a new type of aggregator (for example, a
ValueAverage class that keeps track of the average of the values would be a
much-needed one), then you need to write a class that implements the
ValueAggregator interface, and update the generateValueAggregator method of
ValueAggregatorBaseDescriptor to handle your new aggregator.

3. If you want to contribute to the aggregate framework itself, you may
need to touch every bit of the code in the package.

Runping
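
To make case (2) more concrete, here is a rough model of the two pieces involved, written in Python purely for illustration. The real classes are Java and live inside the aggregate package; nothing below is Hadoop code. The method names are modeled on ValueAggregator's addNextValue, reset, getReport and getCombinerOutput, and generate_value_aggregator stands in for the type-name dispatch that generateValueAggregator performs. The SampleValues class, its size and seed parameters, and the AGGREGATOR_TYPES table are assumptions made up for this sketch.

```python
import random


class SampleValues:
    """Hypothetical aggregator: keeps a uniform sample of at most `size` values."""

    def __init__(self, size=10, seed=None):
        self.size = size
        self._rng = random.Random(seed)
        self.reset()

    def reset(self):
        """Forget everything seen so far so the instance can be reused."""
        self._seen = 0
        self._sample = []

    def add_next_value(self, value):
        """Reservoir sampling: the i-th value is kept with probability size/i."""
        self._seen += 1
        if len(self._sample) < self.size:
            self._sample.append(value)
        else:
            slot = self._rng.randrange(self._seen)
            if slot < self.size:
                self._sample[slot] = value

    def get_combiner_output(self):
        """Values the combiner would pass on; the reducer re-adds each one."""
        return list(self._sample)

    def get_report(self):
        """Final string the reducer would emit for this key."""
        return ",".join(str(v) for v in self._sample)


# Stand-in for the type-name dispatch; registering the new type here plays the
# role of the patch to generateValueAggregator described above.
AGGREGATOR_TYPES = {"SampleValues": SampleValues}


def generate_value_aggregator(type_name):
    """Return a fresh aggregator for type_name, or None if unknown (in this sketch)."""
    factory = AGGREGATOR_TYPES.get(type_name)
    return factory() if factory else None
```

One caveat with this simplified version: when the reducer re-samples values that combiners have already sampled, the final sample is only approximately uniform, because the sketch does not carry per-combiner counts forward.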



Re: Writing a New Aggregate Function

Posted by jason hadoop <ja...@gmail.com>.
It really isn't documented anywhere. There is a small section about it in
chapter 8 of my book, but it didn't make it into the alpha version of that
chapter that is currently posted.


-- 
Alpha Chapters of my book on Hadoop are available
http://www.apress.com/book/view/9781430219422