Posted to user@hive.apache.org by buddhika chamith <ch...@gmail.com> on 2013/01/12 06:01:57 UTC

Incremental Data Processing With Hive UDAF

Hi All,

In order to achieve the above I am researching the feasibility of using a
set of custom UDAFs for distributive aggregate operations (e.g. sum, count,
etc.). The idea is to incorporate state persisted from earlier aggregations
into the current aggregation value inside the UDAF's merge. For distributing
the state data I was thinking of utilizing the Hadoop distributed cache. But
I am not sure how exactly UDAFs are executed at runtime. Would putting the
logic that adds the persisted state to the current result in terminate()
ensure that it is added only once? (Assuming all the aggregations fan in at
terminate; I may have gotten it all wrong here. :)) Or is there a better way
of achieving the same?
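
Roughly what I have in mind is the sketch below, using the old-style UDAF
interface. The prev_sum.txt file name, its single-number format, and the
assumption that it has been shipped to the task's working directory (e.g.
via ADD FILE / the distributed cache) are all made up for illustration.

import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;

import org.apache.hadoop.hive.ql.exec.UDAF;
import org.apache.hadoop.hive.ql.exec.UDAFEvaluator;
import org.apache.hadoop.io.LongWritable;

// Sketch of an "incremental" SUM: the total persisted by the previous run is
// folded in at terminate(). Whether terminate() really fires exactly once per
// group is the open question above.
public final class IncrementalSum extends UDAF {

  public static class IncrementalSumEvaluator implements UDAFEvaluator {
    private long sum;

    public IncrementalSumEvaluator() {
      init();
    }

    public void init() {
      sum = 0L;
    }

    // Called once per input row.
    public boolean iterate(LongWritable value) {
      if (value != null) {
        sum += value.get();
      }
      return true;
    }

    // Emits a partial aggregate (e.g. from map-side aggregation).
    public LongWritable terminatePartial() {
      return new LongWritable(sum);
    }

    // Combines partial aggregates.
    public boolean merge(LongWritable partial) {
      if (partial != null) {
        sum += partial.get();
      }
      return true;
    }

    // Emits the result; the persisted total from the previous run is added
    // here, assuming this happens only once per aggregation.
    public LongWritable terminate() {
      return new LongWritable(sum + loadPreviousTotal());
    }

    // Hypothetical: "prev_sum.txt" holds a single number and is localized to
    // the task's working directory by the distributed cache.
    private long loadPreviousTotal() {
      File f = new File("prev_sum.txt");
      if (!f.exists()) {
        return 0L;
      }
      try {
        BufferedReader reader = new BufferedReader(new FileReader(f));
        try {
          return Long.parseLong(reader.readLine().trim());
        } finally {
          reader.close();
        }
      } catch (IOException e) {
        throw new RuntimeException("Could not read previous total", e);
      }
    }
  }
}

The function would be registered with CREATE TEMPORARY FUNCTION as usual; the
previous run's total would have to be exported and added to the job before
each new run.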

Regards
Buddhika

Re: Incremental Data Processing With Hive UDAF

Posted by buddhika chamith <ch...@gmail.com>.
Hi All,

Greatly appreciate any feedback on this. Maybe it sounds infeasible; I just
wanted to check with the experts. In any case, the problem of incremental
data processing is a very interesting one if it can be accommodated.

Best Regards
Buddhika


Re: Incremental Data Processing With Hive UDAF

Posted by buddhika chamith <ch...@gmail.com>.
Hi All,

After digging into the code more I realized that the GroupByOperator can be
present on the map side of the computation as well, in which case it does
partial computations. So in that case the terminate of the UDAF will get
called for partial results. However, for the queries that I tried, the
terminate methods inside the UDAFs of the reduce-side GroupByOperator in the
operator tree finish with fully completed aggregation results, as expected.
Can this behaviour be expected for any query? (i.e. the reduce side computing
the fully aggregated result for any aggregation function)
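
One option I am considering (a sketch only, assuming the GenericUDAFEvaluator
API rather than the old UDAF interface): the Mode passed to init() already
says whether a given evaluator instance will produce a final result
(FINAL/COMPLETE) or only partials (PARTIAL1/PARTIAL2), so the previous-run
state could be folded in only in the final case.

import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDAFEvaluator;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;

// Sketch: remember whether this evaluator instance emits the final result.
public abstract class FinalAwareEvaluator extends GenericUDAFEvaluator {

  // True when terminate() on this instance produces the fully aggregated value.
  protected boolean producesFinalResult;

  @Override
  public ObjectInspector init(Mode m, ObjectInspector[] parameters) throws HiveException {
    super.init(m, parameters);
    // PARTIAL1/PARTIAL2 instances only ever call terminatePartial();
    // FINAL and COMPLETE instances call terminate() with the complete result.
    producesFinalResult = (m == Mode.FINAL || m == Mode.COMPLETE);
    return null; // a concrete subclass would return its real output ObjectInspector
  }
}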

The problem I am having is that I need a point where the previous aggregation
results can be merged with the results of the current run. But since terminate
can behave a bit differently depending on whether it is on the map side or the
reduce side, would it make sense to selectively add this logic on the reduce
side based on some configuration property? (I see the property
mapred.task.is.map could be of potential use here.)
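
For example (a sketch; this assumes configure(MapredContext) is available on
GenericUDAFEvaluator in the Hive version being used):

import org.apache.hadoop.hive.ql.exec.MapredContext;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDAFEvaluator;

// Sketch: decide map side vs. reduce side from the job configuration.
public abstract class SideAwareEvaluator extends GenericUDAFEvaluator {

  // True when this evaluator instance is running inside a reduce task.
  protected boolean reduceSide;

  @Override
  public void configure(MapredContext context) {
    super.configure(context);
    // mapred.task.is.map is true in map tasks and false in reduce tasks.
    reduceSide = !context.getJobConf().getBoolean("mapred.task.is.map", true);
  }
}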

Also, there needs to be some identifier to uniquely identify the aggregation
UDAF in the operator tree, so that the previous aggregations can be fetched
from the result cache using that identifier. Is there a way an aggregation
function can be uniquely identified within the query?
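
One workaround I am considering (just a sketch; the tag argument, the example
call and the helper below are hypothetical): instead of deriving an identifier
from the operator tree, pass an explicit tag to the UDAF as a constant string
argument, e.g. incr_sum(amount, 'q1_total_revenue'), and use that tag as the
key into the result cache.

import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.serde2.objectinspector.ConstantObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;

// Sketch: read the user-supplied tag from the UDAF's second (constant) argument.
// Applicable when the original arguments are visible to init() (PARTIAL1/COMPLETE);
// otherwise the tag would have to travel inside the partial aggregation buffer.
public final class AggregationKeys {

  private AggregationKeys() {
  }

  public static String tagFromConstantArg(ObjectInspector[] parameters) throws HiveException {
    if (parameters.length < 2 || !(parameters[1] instanceof ConstantObjectInspector)) {
      throw new HiveException("Second argument must be a constant tag string");
    }
    Object tag = ((ConstantObjectInspector) parameters[1]).getWritableConstantValue();
    return tag == null ? null : tag.toString();
  }
}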

I realize this might be a long shot, but I am still up for it if it is
feasible, albeit with some work. Any other possible ways to achieve this
would also be highly appreciated.

Regards
Buddhika


Re: Incremental Data Processing With Hive UDAF

Posted by buddhika chamith <ch...@gmail.com>.
Any suggestions on this are greatly appreciated. Does anyone see major
roadblocks with this?

Regards
Buddhika
