You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@hive.apache.org by Mich Talebzadeh <mi...@gmail.com> on 2016/04/05 00:27:25 UTC

Standard Deviation in Hive 2 is still incorrect

Hi,

I reported back in April 2015 that what Hive calls Standard Deviation
Function  STDDEV is a pointer to STDDEV_POP. This is incorrect and has not
been rectified in Hive 2

Both Oracle and Sybase point STDDEV to STDDEV_SAMP not STDDEV_POP. Also I
did tests with Spark 1.6 as well and Spark correctly points STTDEV to
STDDEV_SAMP.

The following query was used

SELECT

SQRT((SUM(POWER(AMOUNT_SOLD,2))-(COUNT(1)*POWER(AVG(AMOUNT_SOLD),2)))/(COUNT(1)-1))
AS MYSTDDEV,
        STDDEV(amount_sold) AS STDDEV,
        STDDEV_SAMP(amount_sold) AS STDDEV_SAMP,
        STDDEV_POP(amount_sold) AS STDDEV_POP
from    sales;

The following is from running the above query on Hive where STDDEV -->
 STDDEV_POP which is incorrect

+--------------------+---------------------+--------------------+---------------------+--+
|      mystddev      |       stddev        |    stddev_samp     |
stddev_pop      |
+--------------------+---------------------+--------------------+---------------------+--+
| 260.7270919450411  | 260.72704617040444  | 260.7270722861465  |
260.72704617040444
|
+--------------------+---------------------+--------------------+---------------------+--+

The following is from Spark-sql where STDDEV -->  STDDEV_SAMP which is
correct

+--------------------+---------------------+--------------------+---------------------+--+
|      mystddev      |       stddev        |    stddev_samp     |
stddev_pop      |
+--------------------+---------------------+--------------------+---------------------+--+
| 260.7270919450411  | 260.7270722861637   | 260.7270722861637  |
260.72704617042166  |
+--------------------+---------------------+--------------------+---------------------+--+

Hopefully The Hive one will be corrected.

Thanks


Dr Mich Talebzadeh



LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com

Re: Standard Deviation in Hive 2 is still incorrect

Posted by Mich Talebzadeh <mi...@gmail.com>.
HIVE-13574 <https://issues.apache.org/jira/browse/HIVE-13574>

Created and assigned to myself

Thanks

Dr Mich Talebzadeh



LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com



On 20 April 2016 at 06:54, Mich Talebzadeh <mi...@gmail.com>
wrote:

> Will do thanks
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 19 April 2016 at 23:33, Alan Gates <al...@gmail.com> wrote:
>
>> Have you filed a JIRA ticket for this?  If not, please do so we can track
>> it and fix it.  Patches are welcomed as well. :)
>>
>> Alan.
>>
>> > On Apr 4, 2016, at 15:27, Mich Talebzadeh <mi...@gmail.com>
>> wrote:
>> >
>> >
>> > Hi,
>> >
>> > I reported back in April 2015 that what Hive calls Standard Deviation
>> Function  STDDEV is a pointer to STDDEV_POP. This is incorrect and has not
>> been rectified in Hive 2
>> >
>> > Both Oracle and Sybase point STDDEV to STDDEV_SAMP not STDDEV_POP. Also
>> I did tests with Spark 1.6 as well and Spark correctly points STTDEV to
>> STDDEV_SAMP.
>> >
>> > The following query was used
>> >
>> > SELECT
>> >
>>  SQRT((SUM(POWER(AMOUNT_SOLD,2))-(COUNT(1)*POWER(AVG(AMOUNT_SOLD),2)))/(COUNT(1)-1))
>> AS MYSTDDEV,
>> >         STDDEV(amount_sold) AS STDDEV,
>> >         STDDEV_SAMP(amount_sold) AS STDDEV_SAMP,
>> >         STDDEV_POP(amount_sold) AS STDDEV_POP
>> > from    sales;
>> >
>> > The following is from running the above query on Hive where STDDEV -->
>> STDDEV_POP which is incorrect
>> >
>> >
>> >
>> +--------------------+---------------------+--------------------+---------------------+--+
>> > |      mystddev      |       stddev        |    stddev_samp     |
>>  stddev_pop      |
>> >
>> +--------------------+---------------------+--------------------+---------------------+--+
>> > | 260.7270919450411  | 260.72704617040444  | 260.7270722861465  |
>> 260.72704617040444  |
>> >
>> +--------------------+---------------------+--------------------+---------------------+--+
>> >
>> > The following is from Spark-sql where STDDEV -->  STDDEV_SAMP which is
>> correct
>> >
>> >
>> +--------------------+---------------------+--------------------+---------------------+--+
>> > |      mystddev      |       stddev        |    stddev_samp     |
>>  stddev_pop      |
>> >
>> +--------------------+---------------------+--------------------+---------------------+--+
>> > | 260.7270919450411  | 260.7270722861637   | 260.7270722861637  |
>> 260.72704617042166  |
>> >
>> +--------------------+---------------------+--------------------+---------------------+--+
>> >
>> > Hopefully The Hive one will be corrected.
>> >
>> > Thanks
>> >
>> >
>> > Dr Mich Talebzadeh
>> >
>> > LinkedIn
>> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> >
>> > http://talebzadehmich.wordpress.com
>> >
>>
>>
>

Re: Standard Deviation in Hive 2 is still incorrect

Posted by Mich Talebzadeh <mi...@gmail.com>.
Will do thanks

Dr Mich Talebzadeh



LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com



On 19 April 2016 at 23:33, Alan Gates <al...@gmail.com> wrote:

> Have you filed a JIRA ticket for this?  If not, please do so we can track
> it and fix it.  Patches are welcomed as well. :)
>
> Alan.
>
> > On Apr 4, 2016, at 15:27, Mich Talebzadeh <mi...@gmail.com>
> wrote:
> >
> >
> > Hi,
> >
> > I reported back in April 2015 that what Hive calls Standard Deviation
> Function  STDDEV is a pointer to STDDEV_POP. This is incorrect and has not
> been rectified in Hive 2
> >
> > Both Oracle and Sybase point STDDEV to STDDEV_SAMP not STDDEV_POP. Also
> I did tests with Spark 1.6 as well and Spark correctly points STTDEV to
> STDDEV_SAMP.
> >
> > The following query was used
> >
> > SELECT
> >
>  SQRT((SUM(POWER(AMOUNT_SOLD,2))-(COUNT(1)*POWER(AVG(AMOUNT_SOLD),2)))/(COUNT(1)-1))
> AS MYSTDDEV,
> >         STDDEV(amount_sold) AS STDDEV,
> >         STDDEV_SAMP(amount_sold) AS STDDEV_SAMP,
> >         STDDEV_POP(amount_sold) AS STDDEV_POP
> > from    sales;
> >
> > The following is from running the above query on Hive where STDDEV -->
> STDDEV_POP which is incorrect
> >
> >
> >
> +--------------------+---------------------+--------------------+---------------------+--+
> > |      mystddev      |       stddev        |    stddev_samp     |
>  stddev_pop      |
> >
> +--------------------+---------------------+--------------------+---------------------+--+
> > | 260.7270919450411  | 260.72704617040444  | 260.7270722861465  |
> 260.72704617040444  |
> >
> +--------------------+---------------------+--------------------+---------------------+--+
> >
> > The following is from Spark-sql where STDDEV -->  STDDEV_SAMP which is
> correct
> >
> >
> +--------------------+---------------------+--------------------+---------------------+--+
> > |      mystddev      |       stddev        |    stddev_samp     |
>  stddev_pop      |
> >
> +--------------------+---------------------+--------------------+---------------------+--+
> > | 260.7270919450411  | 260.7270722861637   | 260.7270722861637  |
> 260.72704617042166  |
> >
> +--------------------+---------------------+--------------------+---------------------+--+
> >
> > Hopefully The Hive one will be corrected.
> >
> > Thanks
> >
> >
> > Dr Mich Talebzadeh
> >
> > LinkedIn
> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> >
> > http://talebzadehmich.wordpress.com
> >
>
>

Re: Standard Deviation in Hive 2 is still incorrect

Posted by Alan Gates <al...@gmail.com>.
Have you filed a JIRA ticket for this?  If not, please do so we can track it and fix it.  Patches are welcomed as well. :)

Alan.

> On Apr 4, 2016, at 15:27, Mich Talebzadeh <mi...@gmail.com> wrote:
> 
> 
> Hi,
> 
> I reported back in April 2015 that what Hive calls Standard Deviation Function  STDDEV is a pointer to STDDEV_POP. This is incorrect and has not been rectified in Hive 2
> 
> Both Oracle and Sybase point STDDEV to STDDEV_SAMP not STDDEV_POP. Also I did tests with Spark 1.6 as well and Spark correctly points STTDEV to STDDEV_SAMP.
> 
> The following query was used
> 
> SELECT
>         SQRT((SUM(POWER(AMOUNT_SOLD,2))-(COUNT(1)*POWER(AVG(AMOUNT_SOLD),2)))/(COUNT(1)-1)) AS MYSTDDEV,
>         STDDEV(amount_sold) AS STDDEV,
>         STDDEV_SAMP(amount_sold) AS STDDEV_SAMP,
>         STDDEV_POP(amount_sold) AS STDDEV_POP
> from    sales;
> 
> The following is from running the above query on Hive where STDDEV -->  STDDEV_POP which is incorrect
> 
> 
> +--------------------+---------------------+--------------------+---------------------+--+
> |      mystddev      |       stddev        |    stddev_samp     |     stddev_pop      |
> +--------------------+---------------------+--------------------+---------------------+--+
> | 260.7270919450411  | 260.72704617040444  | 260.7270722861465  | 260.72704617040444  |
> +--------------------+---------------------+--------------------+---------------------+--+
> 
> The following is from Spark-sql where STDDEV -->  STDDEV_SAMP which is correct
> 
> +--------------------+---------------------+--------------------+---------------------+--+
> |      mystddev      |       stddev        |    stddev_samp     |     stddev_pop      |
> +--------------------+---------------------+--------------------+---------------------+--+
> | 260.7270919450411  | 260.7270722861637   | 260.7270722861637  | 260.72704617042166  |
> +--------------------+---------------------+--------------------+---------------------+--+
> 
> Hopefully The Hive one will be corrected.
> 
> Thanks
> 
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> http://talebzadehmich.wordpress.com
>