You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@hive.apache.org by Igor Tatarinov <ig...@decide.com> on 2011/02/22 07:45:20 UTC

implementing moving average as a UDF

I would like to implement the moving average as a UDF (instead of a
streaming reducer). Here is what I am thinking. Please let me know if I am
missing something here:

SELECT product, date, mavg(product, price, 10)
FROM (
  SELECT *
  FROM prices
  DISTRIBUTE BY product
  SORT BY product, date
)

I have to pass the key to mavg() because it has to detect when one product
grouping ends and another starts.

Unfortunately, mavg will also need to maintain a state (moving sum and
count). That's where I am worried that Hive (Hadoop?) will use a single
instance of my UDF to process concurrent groupings and this idea won't
work.

Is that the main issue? Is there something I can do to fix that?

Thanks!
igor

Re: implementing moving average as a UDF

Posted by John Sichi <js...@fb.com>.

Yes, your query makes sense and should already work as expected.  The idea of HIVE-1994 is that once the new annotation is available, we'll make a guarantee that your query as written below will continue to work in the face of any new optimizer changes (with the downside being that in some cases you won't be able to take advantage of such optimizer changes).

Each mapper or reducer gets its own instance of the UDF, so (a) you don't have to worry about any unwanted sharing between them and (b) you have to make sure that your DISTRIBUTE/SORT clauses are present and correct (Hive won't know anything about the dependency).

Long term, an implementation of the SQL/OLAP frameworks would be preferable since it would allow Hive to fully understand the semantics and apply all relevant validations and optimizations transparently, but in the meantime, stateful UDF's will be the duct tape.

JVS

On Feb 22, 2011, at 11:55 AM, Igor Tatarinov wrote:

> Thank you, John.
> 
> It's not quite clear from the page whether my solution:
> 1. makes sense
> 2. works now
> 3. will work in the future if the issue is resolved/implemented
> 
> Could you elaborate?
> 
> Also, there is no mentioning of UDF object sharing (between mappers) in the current implementation. Is this a problem? do I need to use ThreadLocal or something like that?
> 
> On Tue, Feb 22, 2011 at 11:42 AM, John Sichi <js...@fb.com> wrote:
> Please see the discussion in this JIRA issue:
> 
> https://issues.apache.org/jira/browse/HIVE-1994
> 
> JVS
> 
> On Feb 21, 2011, at 10:45 PM, Igor Tatarinov wrote:
> 
> > I would like to implement the moving average as a UDF (instead of a streaming reducer). Here is what I am thinking. Please let me know if I am missing something here:
> >
> > SELECT product, date, mavg(product, price, 10)
> > FROM (
> >   SELECT *
> >   FROM prices
> >   DISTRIBUTE BY product
> >   SORT BY product, date
> > )
> >
> > I have to pass the key to mavg() because it has to detect when one product grouping ends and another starts.
> >
> > Unfortunately, mavg will also need to maintain a state (moving sum and count). That's where I am worried that Hive (Hadoop?) will use a single instance of my UDF to process concurrent groupings and this idea won't work.
> >
> > Is that the main issue? Is there something I can do to fix that?
> >
> > Thanks!
> > igor
> >
> 
>

Re: implementing moving average as a UDF

Posted by Igor Tatarinov <ig...@decide.com>.

Thank you, John.

It's not quite clear from the page whether my solution:
1. makes sense
2. works now
3. will work in the future if the issue is resolved/implemented

Could you elaborate?

Also, there is no mentioning of UDF object sharing (between mappers) in the
current implementation. Is this a problem? do I need to use ThreadLocal or
something like that?

On Tue, Feb 22, 2011 at 11:42 AM, John Sichi <js...@fb.com> wrote:

> Please see the discussion in this JIRA issue:
>
> https://issues.apache.org/jira/browse/HIVE-1994
>
> JVS
>
> On Feb 21, 2011, at 10:45 PM, Igor Tatarinov wrote:
>
> > I would like to implement the moving average as a UDF (instead of a
> streaming reducer). Here is what I am thinking. Please let me know if I am
> missing something here:
> >
> > SELECT product, date, mavg(product, price, 10)
> > FROM (
> >   SELECT *
> >   FROM prices
> >   DISTRIBUTE BY product
> >   SORT BY product, date
> > )
> >
> > I have to pass the key to mavg() because it has to detect when one
> product grouping ends and another starts.
> >
> > Unfortunately, mavg will also need to maintain a state (moving sum and
> count). That's where I am worried that Hive (Hadoop?) will use a single
> instance of my UDF to process concurrent groupings and this idea won't work.
> >
> > Is that the main issue? Is there something I can do to fix that?
> >
> > Thanks!
> > igor
> >
>
>

Re: implementing moving average as a UDF

Posted by John Sichi <js...@fb.com>.

Please see the discussion in this JIRA issue:

https://issues.apache.org/jira/browse/HIVE-1994

JVS

On Feb 21, 2011, at 10:45 PM, Igor Tatarinov wrote:

> I would like to implement the moving average as a UDF (instead of a streaming reducer). Here is what I am thinking. Please let me know if I am missing something here:
> 
> SELECT product, date, mavg(product, price, 10)
> FROM (
>   SELECT *
>   FROM prices
>   DISTRIBUTE BY product
>   SORT BY product, date
> )
> 
> I have to pass the key to mavg() because it has to detect when one product grouping ends and another starts.
> 
> Unfortunately, mavg will also need to maintain a state (moving sum and count). That's where I am worried that Hive (Hadoop?) will use a single instance of my UDF to process concurrent groupings and this idea won't work. 
> 
> Is that the main issue? Is there something I can do to fix that?
> 
> Thanks!
> igor
>