You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@hive.apache.org by Lefty Leverenz <le...@gmail.com> on 2015/04/01 08:13:44 UTC
Re: Standard deviation (STDDEV) function calculation in Hive
Mich, the linked documentation is for Impala, not Hive. (Perhaps Hive is
the same, I don't know.) But the Hive documentation doesn't explain
much: Built-in
Aggregate Functions (UDAF)
<https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-Built-inAggregateFunctions(UDAF)>
.
-- Lefty
On Mon, Mar 30, 2015 at 5:26 PM, Mich Talebzadeh <mi...@peridale.co.uk>
wrote:
> Hi,
>
>
>
> Basically, the standard deviation (STDDEV) is a measure that is used to
> quantify the amount of variation or dispersion of a set of data values. It
> is widely used in trading systems such as FX.
>
>
>
> STDEDV in Hive is explained here
> <http://www.cloudera.com/content/cloudera/en/documentation/cloudera-impala/v2-0-x/topics/impala_stddev.html>
> as below and I quote:
>
>
>
> The STDDEV_POP() and STDDEV_SAMP() functions compute the population
> standard deviation and sample standard deviation, respectively, of the
> input values. (*STDDEV()* *is an alias for STDDEV_SAMP().)* Both
> functions evaluate all input rows matched by the query. The difference is
> that STDDEV_SAMP() is scaled by 1/(N-1) while STDDEV_POP() is scaled by
> 1/N.
>
>
>
> Fair enough to me the common STDDEV is what Hive calls STDDEV_SAMP and
> according to above it ought to be STDDEV_SAMP.
>
>
>
> However, when I work these out (and also use straight forward calculation
> myself), it turns out that the alias seems to be to STDDVE_POP as opposed
> to STDDEV_SAMP!.
>
>
>
> The following calculation shows this
>
>
>
> SELECT
>
> rs.Customer_ID
>
> , rs.Total_customer_amount
>
> , rs.stddev
>
> , rs.sdddev_samp
>
> , rs.mystddev
>
> FROM
>
> (
>
> SELECT cust_id AS Customer_ID,
>
> COUNT(amount_sold) AS Number_of_orders,
>
> SUM(amount_sold) AS Total_customer_amount,
>
> AVG(amount_sold) AS Average_order,
>
> stddev(amount_sold) AS stddev,
>
> stddev_samp(amount_sold) AS sdddev_samp,
>
> CASE
>
> WHEN COUNT (amount_sold) <= 1
>
> THEN 0
>
> ELSE
> SQRT((SUM(POWER(amount_sold,2))-(COUNT(1)*POWER(AVG(amount_sold),2)))/(COUNT(1)-1))
>
> END AS mystddev
>
> FROM sales
>
> GROUP BY cust_id
>
> HAVING SUM(amount_sold) > 94000
>
> AND AVG(amount_sold) < stddev(amount_sold)
>
> ) rs
>
> ORDER BY
>
> rs.Total_customer_amount DESC
>
> ;
>
>
> +-----------------+---------------------------+--------------------+--------------------+--------------------+--+
>
> | rs.customer_id | rs.total_customer_amount | rs.stddev |
> rs.sdddev_samp | rs.mystddev |
>
>
> +-----------------+---------------------------+--------------------+--------------------+--------------------+--+
>
> | 11407.0 | 103412.65999999995 | 622.221465710723 |
> 623.4797510518939 | 623.4797510518938 |
>
> | 10747.0 | 99578.08999999997 | 600.7615005975689 |
> 601.9383117167412 | 601.9383117167412 |
>
>
>
> OK so looking above, we notice that rs.sdddev_samp and rs.mystddev are
> practically identical, whereas what is referred to as rs.stddev in Hive
> is not the one used in industry?
>
>
>
> To show I ran the same in Oracle and the below is the result.
>
>
>
> Connected to:
>
> Oracle Database 11g Enterprise Edition Release 11.2.0.1.0 - 64bit
> Production
>
> With the Partitioning, OLAP, Data Mining and Real Application Testing
> options
>
>
>
> CUSTOMER_ID TOTAL_CUSTOMER_AMOUNT STDDEV
> MYSTDDEV
>
> ----------- --------------------- ---------------------
> ---------------------
>
> 11407 103412.66 623.4797510518940
> 623.4797510518940
>
> 10747 99578.09 601.9383117167410
> 601.9383117167410
>
>
>
> So sounds like for one reason or other what is called STDDEV in Hive and
> aliased to STDDEV_SAMP is incorrect?
>
>
>
> Thanks,
>
>
>
> Mich Talebzadeh
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> *Publications due shortly:*
>
> *Creating in-memory Data Grid for Trading Systems with Oracle TimesTen and
> Coherence Cache*
>
>
>
> NOTE: The information in this email is proprietary and confidential. This
> message is for the designated recipient only, if you are not the intended
> recipient, you should destroy it immediately. Any information in this
> message shall not be understood as given or endorsed by Peridale Ltd, its
> subsidiaries or their employees, unless expressly so stated. It is the
> responsibility of the recipient to ensure that this email is virus free,
> therefore neither Peridale Ltd, its subsidiaries nor their employees accept
> any responsibility.
>
>
>
RE: Standard deviation (STDDEV) function calculation in Hive
Posted by Mich Talebzadeh <mi...@peridale.co.uk>.
Hi Lefty,
The Hive aggregate functions as you provide just states:
DOUBLE
stddev_pop(col)
Returns the standard deviation of a numeric column in the group.
DOUBLE
stddev_samp(col)
Returns the unbiased sample standard deviation of a numeric column in the group.
There is no mention of STDDEV here. So this is what I did with the previous query runni9ng and displaying STDDEV, STDDEV_SAMP and STDDDEV_POP
0: jdbc:hive2://rhes564:10010/default> SELECT cust_id AS Customer_ID,
. . . . . . . . . . . . . . . . . . .> SUM(amount_sold) AS Total_customer_amount,
. . . . . . . . . . . . . . . . . . .> AVG(amount_sold) AS Average_order,
. . . . . . . . . . . . . . . . . . .> STDDEV(amount_sold) AS STDDEV,
. . . . . . . . . . . . . . . . . . .> STDDEV_SAMP(amount_sold) AS STDDEV_SAMP,
. . . . . . . . . . . . . . . . . . .> STDDEV_POP(amount_sold) AS STDDEV_POP
. . . . . . . . . . . . . . . . . . .> FROM sales
. . . . . . . . . . . . . . . . . . .> GROUP BY cust_id
. . . . . . . . . . . . . . . . . . .> HAVING SUM(amount_sold) > 94000
. . . . . . . . . . . . . . . . . . .> AND AVG(amount_sold) < STDDEV_SAMP(amount_sold)
. . . . . . . . . . . . . . . . . . .> limit 3
. . . . . . . . . . . . . . . . . . .> ;
+--------------+------------------------+---------------------+--------------------+--------------------+--------------------+--+
| customer_id | total_customer_amount | average_order | stddev | stddev_samp | stddev_pop |
+--------------+------------------------+---------------------+--------------------+--------------------+--------------------+--+
| 429.0 | 94819.41000000006 | 410.4736363636366 | 613.7057080691426 | 615.0384039014772 | 613.7057080691426 |
| 1743.0 | 94786.12999999993 | 398.2610504201678 | 581.0439095219863 | 582.2684502048478 | 581.0439095219863 |
| 2994.0 | 94862.61000000006 | 417.89696035242315 | 623.1607772763742 | 624.5379298449825 | 623.1607772763742 |
+--------------+------------------------+---------------------+--------------------+--------------------+--------------------+--+
3 rows selected (205.266 seconds)
So according to above what Hive calls STDDEV is a pointer to STDDEV_POP.
But Oracle and Sybase point STDDEV to STDDEV_SAMP not STDDEV_POP as shown in the results below from Oracle and Sybase respectively
CUSTOMER_ID TOTAL_CUSTOMER_AMOUNT AVERAGE_ORDER STDDEV STDDEV_SAMP STDDEV_POP
----------- --------------------- ------------- ---------- ----------- ----------
429 94819.41 410.473636 615.038404 615.038404 613.705708
1743 94786.13 398.26105 582.26845 582.26845 581.04391
2994 94862.61 417.89696 624.53793 624.53793 623.160777
Customer_ID Total_customer_amount Average_order STDDEV STDDEV_SAMP STDDEV_POP
-------------------- ----------------------------------------- ---------------------------------------------------- --------------------------- --------------------------- ---------------------------
429 94819.41 410.4736363636363 615.038404 615.038404 613.705708
1743 94786.13 398.2610504201680 582.268450 582.268450 581.043910
2994 94862.61 417.8969603524229 624.537930 624.537930 623.160777
So may be the point goes beyond Hive documentation. The value provided by STDDEV in Hive does not appear to be industry standard
HTH
Mich Talebzadeh
http://talebzadehmich.wordpress.com
Publications due shortly:
Creating in-memory Data Grid for Trading Systems with Oracle TimesTen and Coherence Cache
NOTE: The information in this email is proprietary and confidential. This message is for the designated recipient only, if you are not the intended recipient, you should destroy it immediately. Any information in this message shall not be understood as given or endorsed by Peridale Ltd, its subsidiaries or their employees, unless expressly so stated. It is the responsibility of the recipient to ensure that this email is virus free, therefore neither Peridale Ltd, its subsidiaries nor their employees accept any responsibility.
From: Lefty Leverenz [mailto:leftyleverenz@gmail.com]
Sent: 01 April 2015 07:14
To: user@hive.apache.org
Subject: Re: Standard deviation (STDDEV) function calculation in Hive
Mich, the linked documentation is for Impala, not Hive. (Perhaps Hive is the same, I don't know.) But the Hive documentation doesn't explain much: Built-in Aggregate Functions (UDAF) <https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-Built-inAggregateFunctions(UDAF)> .
-- Lefty
On Mon, Mar 30, 2015 at 5:26 PM, Mich Talebzadeh <mi...@peridale.co.uk> wrote:
Hi,
Basically, the standard deviation (STDDEV) is a measure that is used to quantify the amount of variation or dispersion of a set of data values. It is widely used in trading systems such as FX.
STDEDV in Hive is explained here <http://www.cloudera.com/content/cloudera/en/documentation/cloudera-impala/v2-0-x/topics/impala_stddev.html> as below and I quote:
The STDDEV_POP() and STDDEV_SAMP() functions compute the population standard deviation and sample standard deviation, respectively, of the input values. (STDDEV() is an alias for STDDEV_SAMP().) Both functions evaluate all input rows matched by the query. The difference is that STDDEV_SAMP() is scaled by 1/(N-1) while STDDEV_POP() is scaled by 1/N.
Fair enough to me the common STDDEV is what Hive calls STDDEV_SAMP and according to above it ought to be STDDEV_SAMP.
However, when I work these out (and also use straight forward calculation myself), it turns out that the alias seems to be to STDDVE_POP as opposed to STDDEV_SAMP!.
The following calculation shows this
SELECT
rs.Customer_ID
, rs.Total_customer_amount
, rs.stddev
, rs.sdddev_samp
, rs.mystddev
FROM
(
SELECT cust_id AS Customer_ID,
COUNT(amount_sold) AS Number_of_orders,
SUM(amount_sold) AS Total_customer_amount,
AVG(amount_sold) AS Average_order,
stddev(amount_sold) AS stddev,
stddev_samp(amount_sold) AS sdddev_samp,
CASE
WHEN COUNT (amount_sold) <= 1
THEN 0
ELSE SQRT((SUM(POWER(amount_sold,2))-(COUNT(1)*POWER(AVG(amount_sold),2)))/(COUNT(1)-1))
END AS mystddev
FROM sales
GROUP BY cust_id
HAVING SUM(amount_sold) > 94000
AND AVG(amount_sold) < stddev(amount_sold)
) rs
ORDER BY
rs.Total_customer_amount DESC
;
+-----------------+---------------------------+--------------------+--------------------+--------------------+--+
| rs.customer_id | rs.total_customer_amount | rs.stddev | rs.sdddev_samp | rs.mystddev |
+-----------------+---------------------------+--------------------+--------------------+--------------------+--+
| 11407.0 | 103412.65999999995 | 622.221465710723 | 623.4797510518939 | 623.4797510518938 |
| 10747.0 | 99578.08999999997 | 600.7615005975689 | 601.9383117167412 | 601.9383117167412 |
OK so looking above, we notice that rs.sdddev_samp and rs.mystddev are practically identical, whereas what is referred to as rs.stddev in Hive is not the one used in industry?
To show I ran the same in Oracle and the below is the result.
Connected to:
Oracle Database 11g Enterprise Edition Release 11.2.0.1.0 - 64bit Production
With the Partitioning, OLAP, Data Mining and Real Application Testing options
CUSTOMER_ID TOTAL_CUSTOMER_AMOUNT STDDEV MYSTDDEV
----------- --------------------- --------------------- ---------------------
11407 103412.66 623.4797510518940 623.4797510518940
10747 99578.09 601.9383117167410 601.9383117167410
So sounds like for one reason or other what is called STDDEV in Hive and aliased to STDDEV_SAMP is incorrect?
Thanks,
Mich Talebzadeh
http://talebzadehmich.wordpress.com
Publications due shortly:
Creating in-memory Data Grid for Trading Systems with Oracle TimesTen and Coherence Cache
NOTE: The information in this email is proprietary and confidential. This message is for the designated recipient only, if you are not the intended recipient, you should destroy it immediately. Any information in this message shall not be understood as given or endorsed by Peridale Ltd, its subsidiaries or their employees, unless expressly so stated. It is the responsibility of the recipient to ensure that this email is virus free, therefore neither Peridale Ltd, its subsidiaries nor their employees accept any responsibility.
Re: Standard deviation (STDDEV) function calculation in Hive
Posted by Gopal Vijayaraghavan <go...@apache.org>.
Hi Lefty,
ql/src/java/org/apache/hadoop/hive/ql/exec/FunctionRegistry.java:
system.registerGenericUDAF("stddev", new GenericUDAFStd());
ql/src/java/org/apache/hadoop/hive/ql/exec/FunctionRegistry.java:
system.registerGenericUDAF("stddev_pop", new GenericUDAFStd());
ql/src/java/org/apache/hadoop/hive/ql/exec/FunctionRegistry.java:
system.registerGenericUDAF("stddev_samp", new GenericUDAFStdSample());
Looks like stddev() in hive is stddev_pop(), you can update the UDF docs to
match the FunctionRegistry aliases.
Cheers,
Gopal
From: Lefty Leverenz <le...@gmail.com>
Reply-To: "user@hive.apache.org" <us...@hive.apache.org>
Date: Tuesday, March 31, 2015 at 11:13 PM
To: "user@hive.apache.org" <us...@hive.apache.org>
Subject: Re: Standard deviation (STDDEV) function calculation in Hive
Mich, the linked documentation is for Impala, not Hive. (Perhaps Hive is
the same, I don't know.) But the Hive documentation doesn't explain much:
Built-in Aggregate Functions (UDAF)
<https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#Languag
eManualUDF-Built-inAggregateFunctions(UDAF)> .
-- Lefty
On Mon, Mar 30, 2015 at 5:26 PM, Mich Talebzadeh <mi...@peridale.co.uk>
wrote:
> Hi,
>
> Basically, the standard deviation (STDDEV) is a measure that is used to
> quantify the amount of variation or dispersion of a set of data values. It is
> widely used in trading systems such as FX.
>
> STDEDV in Hive is explained here
> <http://www.cloudera.com/content/cloudera/en/documentation/cloudera-impala/v2-
> 0-x/topics/impala_stddev.html> as below and I quote:
>
> The STDDEV_POP() and STDDEV_SAMP() functions compute the population standard
> deviation and sample standard deviation, respectively, of the input values.
> (STDDEV() is an alias for STDDEV_SAMP().) Both functions evaluate all input
> rows matched by the query. The difference is that STDDEV_SAMP() is scaled by
> 1/(N-1) while STDDEV_POP() is scaled by 1/N.
>
> Fair enough to me the common STDDEV is what Hive calls STDDEV_SAMP and
> according to above it ought to be STDDEV_SAMP.
>
> However, when I work these out (and also use straight forward calculation
> myself), it turns out that the alias seems to be to STDDVE_POP as opposed to
> STDDEV_SAMP!.
>
> The following calculation shows this
>
> SELECT
> rs.Customer_ID
> , rs.Total_customer_amount
> , rs.stddev
> , rs.sdddev_samp
> , rs.mystddev
> FROM
> (
> SELECT cust_id AS Customer_ID,
> COUNT(amount_sold) AS Number_of_orders,
> SUM(amount_sold) AS Total_customer_amount,
> AVG(amount_sold) AS Average_order,
> stddev(amount_sold) AS stddev,
> stddev_samp(amount_sold) AS sdddev_samp,
> CASE
> WHEN COUNT (amount_sold) <= 1
> THEN 0
> ELSE
> SQRT((SUM(POWER(amount_sold,2))-(COUNT(1)*POWER(AVG(amount_sold),2)))/(COUNT(1
> )-1))
> END AS mystddev
> FROM sales
> GROUP BY cust_id
> HAVING SUM(amount_sold) > 94000
> AND AVG(amount_sold) < stddev(amount_sold)
> ) rs
> ORDER BY
> rs.Total_customer_amount DESC
> ;
> +-----------------+---------------------------+--------------------+----------
> ----------+--------------------+--+
> | rs.customer_id | rs.total_customer_amount | rs.stddev |
> rs.sdddev_samp | rs.mystddev |
> +-----------------+---------------------------+--------------------+----------
> ----------+--------------------+--+
> | 11407.0 | 103412.65999999995 | 622.221465710723 |
> 623.4797510518939 | 623.4797510518938 |
> | 10747.0 | 99578.08999999997 | 600.7615005975689 |
> 601.9383117167412 | 601.9383117167412 |
>
> OK so looking above, we notice that rs.sdddev_sampandrs.mystddevare
> practically identical, whereas what is referred to as rs.stddev in Hive is not
> the one used in industry?
>
> To show I ran the same in Oracle and the below is the result.
>
> Connected to:
> Oracle Database 11g Enterprise Edition Release 11.2.0.1.0 - 64bit Production
> With the Partitioning, OLAP, Data Mining and Real Application Testing options
>
> CUSTOMER_ID TOTAL_CUSTOMER_AMOUNT STDDEV MYSTDDEV
> ----------- --------------------- --------------------- ---------------------
> 11407 103412.66 623.4797510518940 623.4797510518940
> 10747 99578.09 601.9383117167410 601.9383117167410
>
> So sounds like for one reason or other what is called STDDEV in Hive and
> aliased to STDDEV_SAMP is incorrect?
>
> Thanks,
>
> Mich Talebzadeh
>
> http://talebzadehmich.wordpress.com
>
> Publications due shortly:
> Creating in-memory Data Grid for Trading Systems with Oracle TimesTen and
> Coherence Cache
>
> NOTE: The information in this email is proprietary and confidential. This
> message is for the designated recipient only, if you are not the intended
> recipient, you should destroy it immediately. Any information in this message
> shall not be understood as given or endorsed by Peridale Ltd, its subsidiaries
> or their employees, unless expressly so stated. It is the responsibility of
> the recipient to ensure that this email is virus free, therefore neither
> Peridale Ltd, its subsidiaries nor their employees accept any responsibility.
>