Posted to user@hive.apache.org by Michael Roessler <mi...@keyevent.com> on 2010/11/16 21:49:31 UTC

Running and Sliding on Aggregates

I suffer from "sql-itis", especially the SQL:2003 variant. I can barely get
a data-related programming line out of my fingers that doesn't include a
"partition by" or an "over order by".

Many of the data sets I analyze can be simplified for comprehension by
thinking of them as having only three columns: a descriptive column
(ENTITY_NAME), a date column (20101116), and a numeric or double value
column (456.789). A time series. Much of what I do is partition the
descriptive column, order by the date column, and apply a moving aggregate
(like average) over a sliding window of the value column.

Example:

select entity_name, value_date, avg(value) over (partition by entity_name
order by value_date rows between 19 preceding and current row) as
twenty_period_moving_average
from table_with_a_billion_rows;

I am aware that HQL does not currently include this type of SQL analytic
function. What are my options, if any, to accomplish the above with Hive?
I'm looking at writing a UDAF, but I haven't yet grokked it.
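To make the target concrete, here is a rough sketch of the kind of streaming script I imagine feeding to Hive's TRANSFORM clause, assuming the rows arrive pre-sorted by (entity_name, value_date), e.g. via DISTRIBUTE BY and SORT BY (untested; the tab-separated I/O format is an assumption on my part):

```python
import sys
from collections import deque

WINDOW = 20  # twenty-period window

def moving_averages(rows, window_size=WINDOW):
    """rows: (entity, value_date, value) tuples already sorted by entity
    and then by date. Yields (entity, value_date, average of up to the
    last window_size values seen for that entity)."""
    window = deque(maxlen=window_size)  # automatically drops the oldest value
    current = None
    for entity, value_date, value in rows:
        if entity != current:  # new partition: reset the window
            current = entity
            window.clear()
        window.append(float(value))
        yield entity, value_date, sum(window) / len(window)

if __name__ == "__main__":
    # Hive's TRANSFORM streams tab-separated rows on stdin
    parsed = (line.rstrip("\n").split("\t") for line in sys.stdin)
    for entity, value_date, avg in moving_averages(parsed):
        print("%s\t%s\t%f" % (entity, value_date, avg))
```

The HiveQL wrapper would be something like SELECT TRANSFORM (...) USING 'moving_avg.py' over a DISTRIBUTE BY / SORT BY subquery; the script name is hypothetical, and I haven't verified the per-reducer sort guarantees.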

Thanks,

Michael

Michael Roessler, CFA
michael.roessler@keyevent.com

Re: Running and Sliding on Aggregates

Posted by Michael Roessler <mi...@keyevent.com>.
Thanks Jeff. I'll add something to the JIRA case. I'm after what I think are
fairly common use cases, such as:

create table corporate_expenses(cost_center INT, fiscal_date STRING,
expense_amount DOUBLE);

given tab-delimited file data.txt:

SALES    20101118    39392020
SALES    20101116     4848484
SALES    20101005    73737373
R&D      20101118    38383838
R&D      20101116    38383838
R&D      20101005     8383838

load data local inpath '/mypath/data.txt' overwrite into table
corporate_expenses;

*Goal 1: Determine the delta in expense_amount for each date relative to its
immediately preceding date (have to partition by cost_center):*

SQL using any database that supports SQL:2003 window functions (like postgresql) >

select cost_center, fiscal_date, expense_amount / lag(expense_amount) over
(partition by cost_center order by fiscal_date) as delta from
corporate_expenses;
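Just to pin down the semantics, here is the lag-and-divide logic from Goal 1 sketched in python on the data.txt rows above (only an illustration of what the query returns, not a Hive workaround):

```python
from itertools import groupby
from operator import itemgetter

# The data.txt rows above: (cost_center, fiscal_date, expense_amount)
rows = [
    ("SALES", "20101118", 39392020.0),
    ("SALES", "20101116", 4848484.0),
    ("SALES", "20101005", 73737373.0),
    ("R&D",   "20101118", 38383838.0),
    ("R&D",   "20101116", 38383838.0),
    ("R&D",   "20101005", 8383838.0),
]

def lag_deltas(rows):
    """Mimics expense_amount / lag(expense_amount) over
    (partition by cost_center order by fiscal_date)."""
    out = []
    ordered = sorted(rows, key=itemgetter(0, 1))  # partition key, then date
    for center, group in groupby(ordered, key=itemgetter(0)):
        prev = None
        for _, fiscal_date, amount in group:
            # lag() of the first row in a partition is NULL
            delta = amount / prev if prev is not None else None
            out.append((center, fiscal_date, delta))
            prev = amount
    return out
```

Each cost_center's earliest date gets a NULL delta, matching what lag() returns for the first row of a partition.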

*Goal 2: Give me the 30-period moving average of expense_amount by
cost_center by applying an aggregate over a moving window:*

SQL using any database that supports SQL:2003 window functions (like postgresql) >

select cost_center, fiscal_date, avg(expense_amount) over (partition by
cost_center order by fiscal_date rows between 29 preceding and current row) as
moving_average from corporate_expenses;
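Likewise for Goal 2, a python sketch of what the windowed average produces on the data.txt rows (again just to illustrate the semantics I'm after):

```python
from collections import deque
from itertools import groupby
from operator import itemgetter

# The data.txt rows above: (cost_center, fiscal_date, expense_amount)
rows = [
    ("SALES", "20101118", 39392020.0),
    ("SALES", "20101116", 4848484.0),
    ("SALES", "20101005", 73737373.0),
    ("R&D",   "20101118", 38383838.0),
    ("R&D",   "20101116", 38383838.0),
    ("R&D",   "20101005", 8383838.0),
]

def windowed_avg(rows, window_size=30):
    """Per cost_center, ordered by fiscal_date, the average of up to the
    last window_size rows (rows between 29 preceding and current row)."""
    result = []
    ordered = sorted(rows, key=itemgetter(0, 1))
    for center, group in groupby(ordered, key=itemgetter(0)):
        window = deque(maxlen=window_size)  # drops rows older than the window
        for _, fiscal_date, amount in group:
            window.append(amount)
            result.append((center, fiscal_date, sum(window) / len(window)))
    return result
```

With only three rows per cost_center the window never fills, so each average is over all rows so far; the SALES value on 20101116, for instance, is the mean of the two earliest SALES rows.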

Now, my concern is that I might be rather ignorant about all this. Perhaps
my experience implementing this functionality through SQL is too limited.
Perhaps the rest of you implement this functionality on data sets of
respectable size by other means of which I am ignorant.


On Tue, Nov 16, 2010 at 5:02 PM, Jeff Hammerbacher <ha...@cloudera.com> wrote:

> Hey Michael,
>
> I unfortunately can't tell you how to get around this limitation, but I can
> point you to the JIRA issue that attempts to capture the requirements:
> https://issues.apache.org/jira/browse/HIVE-896. If you could describe your
> use case there, we can capture it for posterity in hope that some future
> Hive engineer will find time to crank out a patch.
>
> Thanks,
> Jeff
>

Re: Running and Sliding on Aggregates

Posted by Jeff Hammerbacher <ha...@cloudera.com>.
Hey Michael,

I unfortunately can't tell you how to get around this limitation, but I can
point you to the JIRA issue that attempts to capture the requirements:
https://issues.apache.org/jira/browse/HIVE-896. If you could describe your
use case there, we can capture it for posterity in hope that some future
Hive engineer will find time to crank out a patch.

Thanks,
Jeff
