You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@hive.apache.org by Lefty Leverenz <le...@gmail.com> on 2015/04/01 08:13:44 UTC

Re: Standard deviation (STDDEV) function calculation in Hive

Mich, the linked documentation is for Impala, not Hive.  (Perhaps Hive is
the same, I don't know.)  But the Hive documentation doesn't explain
much:  Built-in
Aggregate Functions (UDAF)
<https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-Built-inAggregateFunctions(UDAF)>
.

-- Lefty

On Mon, Mar 30, 2015 at 5:26 PM, Mich Talebzadeh <mi...@peridale.co.uk>
wrote:

> Hi,
>
>
>
> Basically, the standard deviation (STDDEV) is a measure that is used to
> quantify the amount of variation or dispersion of a set of data values. It
> is widely used in trading systems such as FX.
>
>
>
> STDEDV in Hive is explained here
> <http://www.cloudera.com/content/cloudera/en/documentation/cloudera-impala/v2-0-x/topics/impala_stddev.html>
> as below and I quote:
>
>
>
> The STDDEV_POP() and STDDEV_SAMP() functions compute the population
> standard deviation and sample standard deviation, respectively, of the
> input values. (*STDDEV()* *is an alias for STDDEV_SAMP().)* Both
> functions evaluate all input rows matched by the query. The difference is
> that STDDEV_SAMP() is scaled by 1/(N-1) while STDDEV_POP() is scaled by
> 1/N.
>
>
>
> Fair enough to me the common STDDEV is what Hive calls STDDEV_SAMP and
> according to above it ought to be STDDEV_SAMP.
>
>
>
> However, when I work these out (and also use straight forward calculation
> myself), it turns out that the alias seems to be to STDDVE_POP as opposed
> to STDDEV_SAMP!.
>
>
>
> The following calculation shows this
>
>
>
> SELECT
>
>           rs.Customer_ID
>
>         , rs.Total_customer_amount
>
>        , rs.stddev
>
>         , rs.sdddev_samp
>
>         , rs.mystddev
>
> FROM
>
> (
>
>         SELECT cust_id AS Customer_ID,
>
>         COUNT(amount_sold) AS Number_of_orders,
>
>         SUM(amount_sold) AS Total_customer_amount,
>
>         AVG(amount_sold) AS Average_order,
>
>         stddev(amount_sold) AS stddev,
>
>         stddev_samp(amount_sold) AS sdddev_samp,
>
>         CASE
>
>         WHEN  COUNT (amount_sold) <= 1
>
>            THEN  0
>
>          ELSE
> SQRT((SUM(POWER(amount_sold,2))-(COUNT(1)*POWER(AVG(amount_sold),2)))/(COUNT(1)-1))
>
>          END AS mystddev
>
>         FROM sales
>
>         GROUP BY cust_id
>
>         HAVING SUM(amount_sold) > 94000
>
>         AND AVG(amount_sold) < stddev(amount_sold)
>
> ) rs
>
> ORDER BY
>
>           rs.Total_customer_amount DESC
>
> ;
>
>
> +-----------------+---------------------------+--------------------+--------------------+--------------------+--+
>
> | rs.customer_id  | rs.total_customer_amount  |     rs.stddev      |
> rs.sdddev_samp   |    rs.mystddev     |
>
>
> +-----------------+---------------------------+--------------------+--------------------+--------------------+--+
>
> | 11407.0         | 103412.65999999995        | 622.221465710723   |
> 623.4797510518939  | 623.4797510518938  |
>
> | 10747.0         | 99578.08999999997         | 600.7615005975689  |
> 601.9383117167412  | 601.9383117167412  |
>
>
>
> OK so looking above, we notice that rs.sdddev_samp and rs.mystddev are
> practically identical, whereas what is referred to as rs.stddev in Hive
> is not the one used in industry?
>
>
>
> To show I ran the same in Oracle and the below is the result.
>
>
>
> Connected to:
>
> Oracle Database 11g Enterprise Edition Release 11.2.0.1.0 - 64bit
> Production
>
> With the Partitioning, OLAP, Data Mining and Real Application Testing
> options
>
>
>
> CUSTOMER_ID TOTAL_CUSTOMER_AMOUNT                STDDEV
> MYSTDDEV
>
> ----------- --------------------- ---------------------
> ---------------------
>
>       11407             103412.66     623.4797510518940
> 623.4797510518940
>
>       10747              99578.09     601.9383117167410
> 601.9383117167410
>
>
>
> So sounds like for one reason or other what is called STDDEV in Hive and
> aliased to STDDEV_SAMP is incorrect?
>
>
>
> Thanks,
>
>
>
> Mich Talebzadeh
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> *Publications due shortly:*
>
> *Creating in-memory Data Grid for Trading Systems with Oracle TimesTen and
> Coherence Cache*
>
>
>
> NOTE: The information in this email is proprietary and confidential. This
> message is for the designated recipient only, if you are not the intended
> recipient, you should destroy it immediately. Any information in this
> message shall not be understood as given or endorsed by Peridale Ltd, its
> subsidiaries or their employees, unless expressly so stated. It is the
> responsibility of the recipient to ensure that this email is virus free,
> therefore neither Peridale Ltd, its subsidiaries nor their employees accept
> any responsibility.
>
>
>

RE: Standard deviation (STDDEV) function calculation in Hive

Posted by Mich Talebzadeh <mi...@peridale.co.uk>.
Hi Lefty,

 

The Hive aggregate functions as you provide just states:

 


DOUBLE

stddev_pop(col)

Returns the standard deviation of a numeric column in the group.


DOUBLE

stddev_samp(col)

Returns the unbiased sample standard deviation of a numeric column in the group.

 

There is no mention of STDDEV here. So this is what I did with the previous query runni9ng and displaying STDDEV, STDDEV_SAMP and STDDDEV_POP

 

0: jdbc:hive2://rhes564:10010/default> SELECT cust_id AS Customer_ID,

. . . . . . . . . . . . . . . . . . .> SUM(amount_sold) AS Total_customer_amount,

. . . . . . . . . . . . . . . . . . .> AVG(amount_sold) AS Average_order,

. . . . . . . . . . . . . . . . . . .> STDDEV(amount_sold) AS STDDEV,

. . . . . . . . . . . . . . . . . . .> STDDEV_SAMP(amount_sold) AS STDDEV_SAMP,

. . . . . . . . . . . . . . . . . . .> STDDEV_POP(amount_sold) AS STDDEV_POP

. . . . . . . . . . . . . . . . . . .> FROM sales

. . . . . . . . . . . . . . . . . . .> GROUP BY cust_id

. . . . . . . . . . . . . . . . . . .> HAVING SUM(amount_sold) > 94000

. . . . . . . . . . . . . . . . . . .> AND AVG(amount_sold) < STDDEV_SAMP(amount_sold)

. . . . . . . . . . . . . . . . . . .> limit 3

. . . . . . . . . . . . . . . . . . .> ;

 

 

+--------------+------------------------+---------------------+--------------------+--------------------+--------------------+--+

| customer_id  | total_customer_amount  |    average_order    |       stddev       |    stddev_samp     |     stddev_pop     |

+--------------+------------------------+---------------------+--------------------+--------------------+--------------------+--+

| 429.0        | 94819.41000000006      | 410.4736363636366   | 613.7057080691426  | 615.0384039014772  | 613.7057080691426  |

| 1743.0       | 94786.12999999993      | 398.2610504201678   | 581.0439095219863  | 582.2684502048478  | 581.0439095219863  |

| 2994.0       | 94862.61000000006      | 417.89696035242315  | 623.1607772763742  | 624.5379298449825  | 623.1607772763742  |

+--------------+------------------------+---------------------+--------------------+--------------------+--------------------+--+

3 rows selected (205.266 seconds)

 

So according to above what Hive calls STDDEV is a pointer to STDDEV_POP.

 

But Oracle and Sybase point STDDEV to STDDEV_SAMP not STDDEV_POP as shown in the results below from Oracle and Sybase respectively

 

CUSTOMER_ID TOTAL_CUSTOMER_AMOUNT AVERAGE_ORDER     STDDEV STDDEV_SAMP STDDEV_POP

----------- --------------------- ------------- ---------- ----------- ----------

          429              94819.41    410.473636 615.038404  615.038404 613.705708

         1743              94786.13     398.26105  582.26845   582.26845  581.04391

         2994              94862.61     417.89696  624.53793   624.53793 623.160777

 

 

 

Customer_ID                               Total_customer_amount                                   Average_order                       STDDEV                      STDDEV_SAMP                 STDDEV_POP

-------------------- ----------------------------------------- ---------------------------------------------------- --------------------------- --------------------------- ---------------------------

                  429                                  94819.41                                    410.4736363636363                  615.038404                  615.038404                  613.705708

                 1743                                  94786.13                                    398.2610504201680                  582.268450                  582.268450                  581.043910

                 2994                                  94862.61                                    417.8969603524229                  624.537930                  624.537930                  623.160777

 

So may be the point goes beyond Hive documentation. The value provided by STDDEV in Hive does not appear to be industry standard

 

HTH

 

Mich Talebzadeh

 

http://talebzadehmich.wordpress.com

 

Publications due shortly:

Creating in-memory Data Grid for Trading Systems with Oracle TimesTen and Coherence Cache

 

NOTE: The information in this email is proprietary and confidential. This message is for the designated recipient only, if you are not the intended recipient, you should destroy it immediately. Any information in this message shall not be understood as given or endorsed by Peridale Ltd, its subsidiaries or their employees, unless expressly so stated. It is the responsibility of the recipient to ensure that this email is virus free, therefore neither Peridale Ltd, its subsidiaries nor their employees accept any responsibility.

 

From: Lefty Leverenz [mailto:leftyleverenz@gmail.com] 
Sent: 01 April 2015 07:14
To: user@hive.apache.org
Subject: Re: Standard deviation (STDDEV) function calculation in Hive

 

Mich, the linked documentation is for Impala, not Hive.  (Perhaps Hive is the same, I don't know.)  But the Hive documentation doesn't explain much:  Built-in Aggregate Functions (UDAF) <https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-Built-inAggregateFunctions(UDAF)> .




-- Lefty

 

On Mon, Mar 30, 2015 at 5:26 PM, Mich Talebzadeh <mi...@peridale.co.uk> wrote:

Hi,

 

Basically, the standard deviation (STDDEV) is a measure that is used to quantify the amount of variation or dispersion of a set of data values. It is widely used in trading systems such as FX.

 

STDEDV in Hive is explained here <http://www.cloudera.com/content/cloudera/en/documentation/cloudera-impala/v2-0-x/topics/impala_stddev.html>  as below and I quote:

 

The STDDEV_POP() and STDDEV_SAMP() functions compute the population standard deviation and sample standard deviation, respectively, of the input values. (STDDEV() is an alias for STDDEV_SAMP().) Both functions evaluate all input rows matched by the query. The difference is that STDDEV_SAMP() is scaled by 1/(N-1) while STDDEV_POP() is scaled by 1/N.

 

Fair enough to me the common STDDEV is what Hive calls STDDEV_SAMP and according to above it ought to be STDDEV_SAMP.

 

However, when I work these out (and also use straight forward calculation myself), it turns out that the alias seems to be to STDDVE_POP as opposed to STDDEV_SAMP!.

 

The following calculation shows this

 

SELECT

          rs.Customer_ID

        , rs.Total_customer_amount

       , rs.stddev

        , rs.sdddev_samp

        , rs.mystddev

FROM

(

        SELECT cust_id AS Customer_ID,

        COUNT(amount_sold) AS Number_of_orders,

        SUM(amount_sold) AS Total_customer_amount,

        AVG(amount_sold) AS Average_order,

        stddev(amount_sold) AS stddev,

        stddev_samp(amount_sold) AS sdddev_samp,

        CASE

        WHEN  COUNT (amount_sold) <= 1

           THEN  0

         ELSE  SQRT((SUM(POWER(amount_sold,2))-(COUNT(1)*POWER(AVG(amount_sold),2)))/(COUNT(1)-1))

         END AS mystddev

        FROM sales

        GROUP BY cust_id

        HAVING SUM(amount_sold) > 94000

        AND AVG(amount_sold) < stddev(amount_sold)

) rs

ORDER BY

          rs.Total_customer_amount DESC

;

+-----------------+---------------------------+--------------------+--------------------+--------------------+--+

| rs.customer_id  | rs.total_customer_amount  |     rs.stddev      |   rs.sdddev_samp   |    rs.mystddev     |

+-----------------+---------------------------+--------------------+--------------------+--------------------+--+

| 11407.0         | 103412.65999999995        | 622.221465710723   | 623.4797510518939  | 623.4797510518938  |

| 10747.0         | 99578.08999999997         | 600.7615005975689  | 601.9383117167412  | 601.9383117167412  |

 

OK so looking above, we notice that rs.sdddev_samp and rs.mystddev are practically identical, whereas what is referred to as rs.stddev in Hive is not the one used in industry?  

 

To show I ran the same in Oracle and the below is the result.

 

Connected to:

Oracle Database 11g Enterprise Edition Release 11.2.0.1.0 - 64bit Production

With the Partitioning, OLAP, Data Mining and Real Application Testing options

 

CUSTOMER_ID TOTAL_CUSTOMER_AMOUNT                STDDEV              MYSTDDEV

----------- --------------------- --------------------- ---------------------

      11407             103412.66     623.4797510518940     623.4797510518940

      10747              99578.09     601.9383117167410     601.9383117167410

 

So sounds like for one reason or other what is called STDDEV in Hive and aliased to STDDEV_SAMP is incorrect?

 

Thanks,

 

Mich Talebzadeh

 

http://talebzadehmich.wordpress.com

 

Publications due shortly:

Creating in-memory Data Grid for Trading Systems with Oracle TimesTen and Coherence Cache

 

NOTE: The information in this email is proprietary and confidential. This message is for the designated recipient only, if you are not the intended recipient, you should destroy it immediately. Any information in this message shall not be understood as given or endorsed by Peridale Ltd, its subsidiaries or their employees, unless expressly so stated. It is the responsibility of the recipient to ensure that this email is virus free, therefore neither Peridale Ltd, its subsidiaries nor their employees accept any responsibility.

 

 


Re: Standard deviation (STDDEV) function calculation in Hive

Posted by Gopal Vijayaraghavan <go...@apache.org>.
Hi Lefty,

ql/src/java/org/apache/hadoop/hive/ql/exec/FunctionRegistry.java:
system.registerGenericUDAF("stddev", new GenericUDAFStd());
ql/src/java/org/apache/hadoop/hive/ql/exec/FunctionRegistry.java:
system.registerGenericUDAF("stddev_pop", new GenericUDAFStd());
ql/src/java/org/apache/hadoop/hive/ql/exec/FunctionRegistry.java:
system.registerGenericUDAF("stddev_samp", new GenericUDAFStdSample());

Looks like stddev() in hive is stddev_pop(), you can update the UDF docs to
match the FunctionRegistry aliases.

Cheers,
Gopal

From:  Lefty Leverenz <le...@gmail.com>
Reply-To:  "user@hive.apache.org" <us...@hive.apache.org>
Date:  Tuesday, March 31, 2015 at 11:13 PM
To:  "user@hive.apache.org" <us...@hive.apache.org>
Subject:  Re: Standard deviation (STDDEV) function calculation in Hive

Mich, the linked documentation is for Impala, not Hive.  (Perhaps Hive is
the same, I don't know.)  But the Hive documentation doesn't explain much:
Built-in Aggregate Functions (UDAF)
<https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#Languag
eManualUDF-Built-inAggregateFunctions(UDAF)> .

-- Lefty

On Mon, Mar 30, 2015 at 5:26 PM, Mich Talebzadeh <mi...@peridale.co.uk>
wrote:
> Hi,
>  
> Basically, the standard deviation (STDDEV) is a measure that is used to
> quantify the amount of variation or dispersion of a set of data values. It is
> widely used in trading systems such as FX.
>  
> STDEDV in Hive is explained here
> <http://www.cloudera.com/content/cloudera/en/documentation/cloudera-impala/v2-
> 0-x/topics/impala_stddev.html>  as below and I quote:
>  
> The STDDEV_POP() and STDDEV_SAMP() functions compute the population standard
> deviation and sample standard deviation, respectively, of the input values.
> (STDDEV() is an alias for STDDEV_SAMP().) Both functions evaluate all input
> rows matched by the query. The difference is that STDDEV_SAMP() is scaled by
> 1/(N-1) while STDDEV_POP() is scaled by 1/N.
>  
> Fair enough to me the common STDDEV is what Hive calls STDDEV_SAMP and
> according to above it ought to be STDDEV_SAMP.
>  
> However, when I work these out (and also use straight forward calculation
> myself), it turns out that the alias seems to be to STDDVE_POP as opposed to
> STDDEV_SAMP!.
>  
> The following calculation shows this
>  
> SELECT
>           rs.Customer_ID
>         , rs.Total_customer_amount
>        , rs.stddev
>         , rs.sdddev_samp
>         , rs.mystddev
> FROM
> (
>         SELECT cust_id AS Customer_ID,
>         COUNT(amount_sold) AS Number_of_orders,
>         SUM(amount_sold) AS Total_customer_amount,
>         AVG(amount_sold) AS Average_order,
>         stddev(amount_sold) AS stddev,
>         stddev_samp(amount_sold) AS sdddev_samp,
>         CASE
>         WHEN  COUNT (amount_sold) <= 1
>            THEN  0
>          ELSE  
> SQRT((SUM(POWER(amount_sold,2))-(COUNT(1)*POWER(AVG(amount_sold),2)))/(COUNT(1
> )-1))
>          END AS mystddev
>         FROM sales
>         GROUP BY cust_id
>         HAVING SUM(amount_sold) > 94000
>         AND AVG(amount_sold) < stddev(amount_sold)
> ) rs
> ORDER BY
>           rs.Total_customer_amount DESC
> ;
> +-----------------+---------------------------+--------------------+----------
> ----------+--------------------+--+
> | rs.customer_id  | rs.total_customer_amount  |     rs.stddev      |
> rs.sdddev_samp   |   rs.mystddev     |
> +-----------------+---------------------------+--------------------+----------
> ----------+--------------------+--+
> | 11407.0         | 103412.65999999995        | 622.221465710723   |
> 623.4797510518939  | 623.4797510518938  |
> | 10747.0         | 99578.08999999997         | 600.7615005975689  |
> 601.9383117167412  | 601.9383117167412  |
>  
> OK so looking above, we notice that rs.sdddev_sampandrs.mystddevare
> practically identical, whereas what is referred to as rs.stddev in Hive is not
> the one used in industry?
>  
> To show I ran the same in Oracle and the below is the result.
>  
> Connected to:
> Oracle Database 11g Enterprise Edition Release 11.2.0.1.0 - 64bit Production
> With the Partitioning, OLAP, Data Mining and Real Application Testing options
>  
> CUSTOMER_ID TOTAL_CUSTOMER_AMOUNT                STDDEV              MYSTDDEV
> ----------- --------------------- --------------------- ---------------------
>       11407             103412.66     623.4797510518940     623.4797510518940
>       10747              99578.09     601.9383117167410     601.9383117167410
>  
> So sounds like for one reason or other what is called STDDEV in Hive and
> aliased to STDDEV_SAMP is incorrect?
>  
> Thanks,
>  
> Mich Talebzadeh
>  
> http://talebzadehmich.wordpress.com
>  
> Publications due shortly:
> Creating in-memory Data Grid for Trading Systems with Oracle TimesTen and
> Coherence Cache
>  
> NOTE: The information in this email is proprietary and confidential. This
> message is for the designated recipient only, if you are not the intended
> recipient, you should destroy it immediately. Any information in this message
> shall not be understood as given or endorsed by Peridale Ltd, its subsidiaries
> or their employees, unless expressly so stated. It is the responsibility of
> the recipient to ensure that this email is virus free, therefore neither
> Peridale Ltd, its subsidiaries nor their employees accept any responsibility.
>