You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@hive.apache.org by Mich Talebzadeh <mi...@gmail.com> on 2016/04/05 00:27:25 UTC
Standard Deviation in Hive 2 is still incorrect
Hi,
I reported back in April 2015 that what Hive calls Standard Deviation
Function STDDEV is a pointer to STDDEV_POP. This is incorrect and has not
been rectified in Hive 2
Both Oracle and Sybase point STDDEV to STDDEV_SAMP not STDDEV_POP. Also I
did tests with Spark 1.6 as well and Spark correctly points STTDEV to
STDDEV_SAMP.
The following query was used
SELECT
SQRT((SUM(POWER(AMOUNT_SOLD,2))-(COUNT(1)*POWER(AVG(AMOUNT_SOLD),2)))/(COUNT(1)-1))
AS MYSTDDEV,
STDDEV(amount_sold) AS STDDEV,
STDDEV_SAMP(amount_sold) AS STDDEV_SAMP,
STDDEV_POP(amount_sold) AS STDDEV_POP
from sales;
The following is from running the above query on Hive where STDDEV -->
STDDEV_POP which is incorrect
+--------------------+---------------------+--------------------+---------------------+--+
| mystddev | stddev | stddev_samp |
stddev_pop |
+--------------------+---------------------+--------------------+---------------------+--+
| 260.7270919450411 | 260.72704617040444 | 260.7270722861465 |
260.72704617040444
|
+--------------------+---------------------+--------------------+---------------------+--+
The following is from Spark-sql where STDDEV --> STDDEV_SAMP which is
correct
+--------------------+---------------------+--------------------+---------------------+--+
| mystddev | stddev | stddev_samp |
stddev_pop |
+--------------------+---------------------+--------------------+---------------------+--+
| 260.7270919450411 | 260.7270722861637 | 260.7270722861637 |
260.72704617042166 |
+--------------------+---------------------+--------------------+---------------------+--+
Hopefully The Hive one will be corrected.
Thanks
Dr Mich Talebzadeh
LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
http://talebzadehmich.wordpress.com
Re: Standard Deviation in Hive 2 is still incorrect
Posted by Mich Talebzadeh <mi...@gmail.com>.
HIVE-13574 <https://issues.apache.org/jira/browse/HIVE-13574>
Created and assigned to myself
Thanks
Dr Mich Talebzadeh
LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
http://talebzadehmich.wordpress.com
On 20 April 2016 at 06:54, Mich Talebzadeh <mi...@gmail.com>
wrote:
> Will do thanks
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 19 April 2016 at 23:33, Alan Gates <al...@gmail.com> wrote:
>
>> Have you filed a JIRA ticket for this? If not, please do so we can track
>> it and fix it. Patches are welcomed as well. :)
>>
>> Alan.
>>
>> > On Apr 4, 2016, at 15:27, Mich Talebzadeh <mi...@gmail.com>
>> wrote:
>> >
>> >
>> > Hi,
>> >
>> > I reported back in April 2015 that what Hive calls Standard Deviation
>> Function STDDEV is a pointer to STDDEV_POP. This is incorrect and has not
>> been rectified in Hive 2
>> >
>> > Both Oracle and Sybase point STDDEV to STDDEV_SAMP not STDDEV_POP. Also
>> I did tests with Spark 1.6 as well and Spark correctly points STTDEV to
>> STDDEV_SAMP.
>> >
>> > The following query was used
>> >
>> > SELECT
>> >
>> SQRT((SUM(POWER(AMOUNT_SOLD,2))-(COUNT(1)*POWER(AVG(AMOUNT_SOLD),2)))/(COUNT(1)-1))
>> AS MYSTDDEV,
>> > STDDEV(amount_sold) AS STDDEV,
>> > STDDEV_SAMP(amount_sold) AS STDDEV_SAMP,
>> > STDDEV_POP(amount_sold) AS STDDEV_POP
>> > from sales;
>> >
>> > The following is from running the above query on Hive where STDDEV -->
>> STDDEV_POP which is incorrect
>> >
>> >
>> >
>> +--------------------+---------------------+--------------------+---------------------+--+
>> > | mystddev | stddev | stddev_samp |
>> stddev_pop |
>> >
>> +--------------------+---------------------+--------------------+---------------------+--+
>> > | 260.7270919450411 | 260.72704617040444 | 260.7270722861465 |
>> 260.72704617040444 |
>> >
>> +--------------------+---------------------+--------------------+---------------------+--+
>> >
>> > The following is from Spark-sql where STDDEV --> STDDEV_SAMP which is
>> correct
>> >
>> >
>> +--------------------+---------------------+--------------------+---------------------+--+
>> > | mystddev | stddev | stddev_samp |
>> stddev_pop |
>> >
>> +--------------------+---------------------+--------------------+---------------------+--+
>> > | 260.7270919450411 | 260.7270722861637 | 260.7270722861637 |
>> 260.72704617042166 |
>> >
>> +--------------------+---------------------+--------------------+---------------------+--+
>> >
>> > Hopefully The Hive one will be corrected.
>> >
>> > Thanks
>> >
>> >
>> > Dr Mich Talebzadeh
>> >
>> > LinkedIn
>> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> >
>> > http://talebzadehmich.wordpress.com
>> >
>>
>>
>
Re: Standard Deviation in Hive 2 is still incorrect
Posted by Mich Talebzadeh <mi...@gmail.com>.
Will do thanks
Dr Mich Talebzadeh
LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
http://talebzadehmich.wordpress.com
On 19 April 2016 at 23:33, Alan Gates <al...@gmail.com> wrote:
> Have you filed a JIRA ticket for this? If not, please do so we can track
> it and fix it. Patches are welcomed as well. :)
>
> Alan.
>
> > On Apr 4, 2016, at 15:27, Mich Talebzadeh <mi...@gmail.com>
> wrote:
> >
> >
> > Hi,
> >
> > I reported back in April 2015 that what Hive calls Standard Deviation
> Function STDDEV is a pointer to STDDEV_POP. This is incorrect and has not
> been rectified in Hive 2
> >
> > Both Oracle and Sybase point STDDEV to STDDEV_SAMP not STDDEV_POP. Also
> I did tests with Spark 1.6 as well and Spark correctly points STTDEV to
> STDDEV_SAMP.
> >
> > The following query was used
> >
> > SELECT
> >
> SQRT((SUM(POWER(AMOUNT_SOLD,2))-(COUNT(1)*POWER(AVG(AMOUNT_SOLD),2)))/(COUNT(1)-1))
> AS MYSTDDEV,
> > STDDEV(amount_sold) AS STDDEV,
> > STDDEV_SAMP(amount_sold) AS STDDEV_SAMP,
> > STDDEV_POP(amount_sold) AS STDDEV_POP
> > from sales;
> >
> > The following is from running the above query on Hive where STDDEV -->
> STDDEV_POP which is incorrect
> >
> >
> >
> +--------------------+---------------------+--------------------+---------------------+--+
> > | mystddev | stddev | stddev_samp |
> stddev_pop |
> >
> +--------------------+---------------------+--------------------+---------------------+--+
> > | 260.7270919450411 | 260.72704617040444 | 260.7270722861465 |
> 260.72704617040444 |
> >
> +--------------------+---------------------+--------------------+---------------------+--+
> >
> > The following is from Spark-sql where STDDEV --> STDDEV_SAMP which is
> correct
> >
> >
> +--------------------+---------------------+--------------------+---------------------+--+
> > | mystddev | stddev | stddev_samp |
> stddev_pop |
> >
> +--------------------+---------------------+--------------------+---------------------+--+
> > | 260.7270919450411 | 260.7270722861637 | 260.7270722861637 |
> 260.72704617042166 |
> >
> +--------------------+---------------------+--------------------+---------------------+--+
> >
> > Hopefully The Hive one will be corrected.
> >
> > Thanks
> >
> >
> > Dr Mich Talebzadeh
> >
> > LinkedIn
> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> >
> > http://talebzadehmich.wordpress.com
> >
>
>
Re: Standard Deviation in Hive 2 is still incorrect
Posted by Alan Gates <al...@gmail.com>.
Have you filed a JIRA ticket for this? If not, please do so we can track it and fix it. Patches are welcomed as well. :)
Alan.
> On Apr 4, 2016, at 15:27, Mich Talebzadeh <mi...@gmail.com> wrote:
>
>
> Hi,
>
> I reported back in April 2015 that what Hive calls Standard Deviation Function STDDEV is a pointer to STDDEV_POP. This is incorrect and has not been rectified in Hive 2
>
> Both Oracle and Sybase point STDDEV to STDDEV_SAMP not STDDEV_POP. Also I did tests with Spark 1.6 as well and Spark correctly points STTDEV to STDDEV_SAMP.
>
> The following query was used
>
> SELECT
> SQRT((SUM(POWER(AMOUNT_SOLD,2))-(COUNT(1)*POWER(AVG(AMOUNT_SOLD),2)))/(COUNT(1)-1)) AS MYSTDDEV,
> STDDEV(amount_sold) AS STDDEV,
> STDDEV_SAMP(amount_sold) AS STDDEV_SAMP,
> STDDEV_POP(amount_sold) AS STDDEV_POP
> from sales;
>
> The following is from running the above query on Hive where STDDEV --> STDDEV_POP which is incorrect
>
>
> +--------------------+---------------------+--------------------+---------------------+--+
> | mystddev | stddev | stddev_samp | stddev_pop |
> +--------------------+---------------------+--------------------+---------------------+--+
> | 260.7270919450411 | 260.72704617040444 | 260.7270722861465 | 260.72704617040444 |
> +--------------------+---------------------+--------------------+---------------------+--+
>
> The following is from Spark-sql where STDDEV --> STDDEV_SAMP which is correct
>
> +--------------------+---------------------+--------------------+---------------------+--+
> | mystddev | stddev | stddev_samp | stddev_pop |
> +--------------------+---------------------+--------------------+---------------------+--+
> | 260.7270919450411 | 260.7270722861637 | 260.7270722861637 | 260.72704617042166 |
> +--------------------+---------------------+--------------------+---------------------+--+
>
> Hopefully The Hive one will be corrected.
>
> Thanks
>
>
> Dr Mich Talebzadeh
>
> LinkedIn https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
> http://talebzadehmich.wordpress.com
>