You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@spark.apache.org by Mungeol Heo <mu...@gmail.com> on 2016/07/07 08:23:37 UTC

stddev_samp() gives NaN

Hello,

As I mentioned at the title, stddev_samp function gives a NaN while
stddev_pop gives a numeric value on the same data.
The stddev_samp function will give a numeric value, if I cast it to decimal.
E.g. cast(stddev_samp(column_name) as decimal(16,3))
Is it a bug?

Thanks

- mungeol

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscribe@spark.apache.org

Re: stddev_samp() gives NaN

Posted by Mungeol Heo <mu...@gmail.com>.

The N is much bigger than 1 in my case.

Here is an example describes my issue.
"select column1, stddev_samp(column2) from table1 group by column1" gives NaN
"select column1, cast(stddev_samp(column2) as decimal(16,3)) from
table1 group by column1" gives numeric values. e.g. 234.234
"select column1, stddev_pop(column2) from table1 group by column1"
gives numeric values. e.g. 123.123123123

The column1, column2, and table1 are same.
My guess is that the stddev_samp function returns double type that
does not exactly match standard floating point semantics in some case.
That's why spark gives NaN.
It seems stddev_samp does not handle NaN well. Not like stddev_pop.

On Thu, Jul 7, 2016 at 5:57 PM, Sean Owen <so...@cloudera.com> wrote:
> Sample standard deviation can't be defined in the case of N=1, because
> it has N-1 in the denominator. My guess is that this is the case
> you're seeing. A population of N=1 still has a standard deviation of
> course (which is 0).
>
> On Thu, Jul 7, 2016 at 9:51 AM, Mungeol Heo <mu...@gmail.com> wrote:
>> I know stddev_samp and stddev_pop gives different values, because they
>> have different definition. What I want to know is why stddev_samp
>> gives "NaN", and not a numeric value.

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscribe@spark.apache.org

Re: stddev_samp() gives NaN

Posted by Sean Owen <so...@cloudera.com>.

Sample standard deviation can't be defined in the case of N=1, because
it has N-1 in the denominator. My guess is that this is the case
you're seeing. A population of N=1 still has a standard deviation of
course (which is 0).

On Thu, Jul 7, 2016 at 9:51 AM, Mungeol Heo <mu...@gmail.com> wrote:
> I know stddev_samp and stddev_pop gives different values, because they
> have different definition. What I want to know is why stddev_samp
> gives "NaN", and not a numeric value.

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscribe@spark.apache.org

Re: stddev_samp() gives NaN

Posted by Mungeol Heo <mu...@gmail.com>.

I know stddev_samp and stddev_pop gives different values, because they
have different definition. What I want to know is why stddev_samp
gives "NaN", and not a numeric value.

On Thu, Jul 7, 2016 at 5:39 PM, Sean Owen <so...@cloudera.com> wrote:
> I don't think that's relevant here. The question is why would samp
> give a different result to pop, not the result of "stddev". Neither
> one is a 'correct' definition of standard deviation in the abstract;
> one or the other is correct depending on what standard deviation you
> are trying to measure.
>
> On Thu, Jul 7, 2016 at 9:37 AM, Mich Talebzadeh
> <mi...@gmail.com> wrote:
>> The correct STDDEV function used is STDDEV_SAMP not STDDEV_POP.  That is the
>> correct one.
>>
>> You can actually work that one out yourself
>>
>>
>> BTW Hive also gives a wrong value. This is what I reported back in April
>> about Hive giving incorrect value
>>
>> Both Oracle and Sybase point STDDEV to STDDEV_SAMP not STDDEV_POP. Also I
>> did tests with Spark 1.6 as well.  Spark correctly points STTDEV to
>> STDDEV_SAMP.
>>
>> The following query was used
>>
>> SELECT
>>
>> SQRT((SUM(POWER(AMOUNT_SOLD,2))-(COUNT(1)*POWER(AVG(AMOUNT_SOLD),2)))/(COUNT(1)-1))
>> AS MYSTDDEV,
>>         STDDEV(amount_sold) AS STDDEV,
>>         STDDEV_SAMP(amount_sold) AS STDDEV_SAMP,
>>         STDDEV_POP(amount_sold) AS STDDEV_POP
>> from    sales;
>>
>> The following is from running the above query on Hive where STDDEV -->
>> STDDEV_POP which is incorrect
>>
>>
>> +--------------------+---------------------+--------------------+---------------------+--+
>> |      mystddev      |       stddev        |    stddev_samp     |
>> stddev_pop      |
>> +--------------------+---------------------+--------------------+---------------------+--+
>> | 260.7270919450411  | 260.72704617040444  | 260.7270722861465  |
>> 260.72704617040444  |
>> +--------------------+---------------------+--------------------+---------------------+--+
>>
>> The following is from Spark-sql where STDDEV -->  STDDEV_SAMP which is
>> correct
>>
>> +--------------------+---------------------+--------------------+---------------------+--+
>> |      mystddev      |       stddev        |    stddev_samp     |
>> stddev_pop      |
>> +--------------------+---------------------+--------------------+---------------------+--+
>> | 260.7270919450411  | 260.7270722861637   | 260.7270722861637  |
>> 260.72704617042166  |
>> +--------------------+---------------------+--------------------+---------------------+--+
>>
>> HTH
>>
>>
>>
>>
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn
>> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>> Disclaimer: Use it at your own risk. Any and all responsibility for any
>> loss, damage or destruction of data or any other property which may arise
>> from relying on this email's technical content is explicitly disclaimed. The
>> author will in no case be liable for any monetary damages arising from such
>> loss, damage or destruction.
>>
>>
>>
>>
>> On 7 July 2016 at 09:29, Sean Owen <so...@cloudera.com> wrote:
>>>
>>> No, because these are different values defined differently. If you
>>> have 1 data point, the sample stdev is undefined while population
>>> stdev is defined. Refer to their definition.
>>>
>>> On Thu, Jul 7, 2016 at 9:23 AM, Mungeol Heo <mu...@gmail.com> wrote:
>>> > Hello,
>>> >
>>> > As I mentioned at the title, stddev_samp function gives a NaN while
>>> > stddev_pop gives a numeric value on the same data.
>>> > The stddev_samp function will give a numeric value, if I cast it to
>>> > decimal.
>>> > E.g. cast(stddev_samp(column_name) as decimal(16,3))
>>> > Is it a bug?
>>> >
>>> > Thanks
>>> >
>>> > - mungeol
>>> >
>>> > ---------------------------------------------------------------------
>>> > To unsubscribe e-mail: user-unsubscribe@spark.apache.org
>>> >
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe e-mail: user-unsubscribe@spark.apache.org
>>>
>>

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscribe@spark.apache.org

Re: stddev_samp() gives NaN

Posted by Sean Owen <so...@cloudera.com>.

The OP is not calling stddev though, so I still don't see that this is
the question at hand.

But while we're off on the topic -- while I certainly agree that
stddev is mapped to the sample standard deviation in DBs, it doesn't
actually make much sense as a default.

What you get back is not the standard deviation (as in, sqrt of second
central moment) of the values in the grouping or table, which is I
presume what people think they're getting.

You're getting an estimate the standard deviation of a population from
which the values are theoretically some random sample, but that's
rarely true. I disagree that this is the general use case, so, have
always thought this was a just a historical practice in RDBMSes that
was actually not a good decision.

Maybe that's why Hive defined it differently, but, even I would prefer
consistency in this regard.

On Thu, Jul 7, 2016 at 9:41 AM, Mich Talebzadeh
<mi...@gmail.com> wrote:
> stddev is mapped to stdddev_samp. That is the general use case or rather
> common use of standard deviation.
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn
> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> Disclaimer: Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed. The
> author will in no case be liable for any monetary damages arising from such
> loss, damage or destruction.

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscribe@spark.apache.org

Re: stddev_samp() gives NaN

Posted by Mich Talebzadeh <mi...@gmail.com>.

stddev is mapped to stdddev_samp. That is the general use case or rather
common use of standard deviation.

Dr Mich Talebzadeh



LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 7 July 2016 at 09:39, Sean Owen <so...@cloudera.com> wrote:

> I don't think that's relevant here. The question is why would samp
> give a different result to pop, not the result of "stddev". Neither
> one is a 'correct' definition of standard deviation in the abstract;
> one or the other is correct depending on what standard deviation you
> are trying to measure.
>
> On Thu, Jul 7, 2016 at 9:37 AM, Mich Talebzadeh
> <mi...@gmail.com> wrote:
> > The correct STDDEV function used is STDDEV_SAMP not STDDEV_POP.  That is
> the
> > correct one.
> >
> > You can actually work that one out yourself
> >
> >
> > BTW Hive also gives a wrong value. This is what I reported back in April
> > about Hive giving incorrect value
> >
> > Both Oracle and Sybase point STDDEV to STDDEV_SAMP not STDDEV_POP. Also I
> > did tests with Spark 1.6 as well.  Spark correctly points STTDEV to
> > STDDEV_SAMP.
> >
> > The following query was used
> >
> > SELECT
> >
> >
> SQRT((SUM(POWER(AMOUNT_SOLD,2))-(COUNT(1)*POWER(AVG(AMOUNT_SOLD),2)))/(COUNT(1)-1))
> > AS MYSTDDEV,
> >         STDDEV(amount_sold) AS STDDEV,
> >         STDDEV_SAMP(amount_sold) AS STDDEV_SAMP,
> >         STDDEV_POP(amount_sold) AS STDDEV_POP
> > from    sales;
> >
> > The following is from running the above query on Hive where STDDEV -->
> > STDDEV_POP which is incorrect
> >
> >
> >
> +--------------------+---------------------+--------------------+---------------------+--+
> > |      mystddev      |       stddev        |    stddev_samp     |
> > stddev_pop      |
> >
> +--------------------+---------------------+--------------------+---------------------+--+
> > | 260.7270919450411  | 260.72704617040444  | 260.7270722861465  |
> > 260.72704617040444  |
> >
> +--------------------+---------------------+--------------------+---------------------+--+
> >
> > The following is from Spark-sql where STDDEV -->  STDDEV_SAMP which is
> > correct
> >
> >
> +--------------------+---------------------+--------------------+---------------------+--+
> > |      mystddev      |       stddev        |    stddev_samp     |
> > stddev_pop      |
> >
> +--------------------+---------------------+--------------------+---------------------+--+
> > | 260.7270919450411  | 260.7270722861637   | 260.7270722861637  |
> > 260.72704617042166  |
> >
> +--------------------+---------------------+--------------------+---------------------+--+
> >
> > HTH
> >
> >
> >
> >
> >
> >
> > Dr Mich Talebzadeh
> >
> >
> >
> > LinkedIn
> >
> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> >
> >
> >
> > http://talebzadehmich.wordpress.com
> >
> >
> > Disclaimer: Use it at your own risk. Any and all responsibility for any
> > loss, damage or destruction of data or any other property which may arise
> > from relying on this email's technical content is explicitly disclaimed.
> The
> > author will in no case be liable for any monetary damages arising from
> such
> > loss, damage or destruction.
> >
> >
> >
> >
> > On 7 July 2016 at 09:29, Sean Owen <so...@cloudera.com> wrote:
> >>
> >> No, because these are different values defined differently. If you
> >> have 1 data point, the sample stdev is undefined while population
> >> stdev is defined. Refer to their definition.
> >>
> >> On Thu, Jul 7, 2016 at 9:23 AM, Mungeol Heo <mu...@gmail.com>
> wrote:
> >> > Hello,
> >> >
> >> > As I mentioned at the title, stddev_samp function gives a NaN while
> >> > stddev_pop gives a numeric value on the same data.
> >> > The stddev_samp function will give a numeric value, if I cast it to
> >> > decimal.
> >> > E.g. cast(stddev_samp(column_name) as decimal(16,3))
> >> > Is it a bug?
> >> >
> >> > Thanks
> >> >
> >> > - mungeol
> >> >
> >> > ---------------------------------------------------------------------
> >> > To unsubscribe e-mail: user-unsubscribe@spark.apache.org
> >> >
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe e-mail: user-unsubscribe@spark.apache.org
> >>
> >
>

Re: stddev_samp() gives NaN

Posted by Sean Owen <so...@cloudera.com>.

I don't think that's relevant here. The question is why would samp
give a different result to pop, not the result of "stddev". Neither
one is a 'correct' definition of standard deviation in the abstract;
one or the other is correct depending on what standard deviation you
are trying to measure.

On Thu, Jul 7, 2016 at 9:37 AM, Mich Talebzadeh
<mi...@gmail.com> wrote:
> The correct STDDEV function used is STDDEV_SAMP not STDDEV_POP.  That is the
> correct one.
>
> You can actually work that one out yourself
>
>
> BTW Hive also gives a wrong value. This is what I reported back in April
> about Hive giving incorrect value
>
> Both Oracle and Sybase point STDDEV to STDDEV_SAMP not STDDEV_POP. Also I
> did tests with Spark 1.6 as well.  Spark correctly points STTDEV to
> STDDEV_SAMP.
>
> The following query was used
>
> SELECT
>
> SQRT((SUM(POWER(AMOUNT_SOLD,2))-(COUNT(1)*POWER(AVG(AMOUNT_SOLD),2)))/(COUNT(1)-1))
> AS MYSTDDEV,
>         STDDEV(amount_sold) AS STDDEV,
>         STDDEV_SAMP(amount_sold) AS STDDEV_SAMP,
>         STDDEV_POP(amount_sold) AS STDDEV_POP
> from    sales;
>
> The following is from running the above query on Hive where STDDEV -->
> STDDEV_POP which is incorrect
>
>
> +--------------------+---------------------+--------------------+---------------------+--+
> |      mystddev      |       stddev        |    stddev_samp     |
> stddev_pop      |
> +--------------------+---------------------+--------------------+---------------------+--+
> | 260.7270919450411  | 260.72704617040444  | 260.7270722861465  |
> 260.72704617040444  |
> +--------------------+---------------------+--------------------+---------------------+--+
>
> The following is from Spark-sql where STDDEV -->  STDDEV_SAMP which is
> correct
>
> +--------------------+---------------------+--------------------+---------------------+--+
> |      mystddev      |       stddev        |    stddev_samp     |
> stddev_pop      |
> +--------------------+---------------------+--------------------+---------------------+--+
> | 260.7270919450411  | 260.7270722861637   | 260.7270722861637  |
> 260.72704617042166  |
> +--------------------+---------------------+--------------------+---------------------+--+
>
> HTH
>
>
>
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn
> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> Disclaimer: Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed. The
> author will in no case be liable for any monetary damages arising from such
> loss, damage or destruction.
>
>
>
>
> On 7 July 2016 at 09:29, Sean Owen <so...@cloudera.com> wrote:
>>
>> No, because these are different values defined differently. If you
>> have 1 data point, the sample stdev is undefined while population
>> stdev is defined. Refer to their definition.
>>
>> On Thu, Jul 7, 2016 at 9:23 AM, Mungeol Heo <mu...@gmail.com> wrote:
>> > Hello,
>> >
>> > As I mentioned at the title, stddev_samp function gives a NaN while
>> > stddev_pop gives a numeric value on the same data.
>> > The stddev_samp function will give a numeric value, if I cast it to
>> > decimal.
>> > E.g. cast(stddev_samp(column_name) as decimal(16,3))
>> > Is it a bug?
>> >
>> > Thanks
>> >
>> > - mungeol
>> >
>> > ---------------------------------------------------------------------
>> > To unsubscribe e-mail: user-unsubscribe@spark.apache.org
>> >
>>
>> ---------------------------------------------------------------------
>> To unsubscribe e-mail: user-unsubscribe@spark.apache.org
>>
>

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscribe@spark.apache.org

Re: stddev_samp() gives NaN

Posted by Mich Talebzadeh <mi...@gmail.com>.

The correct STDDEV function used is STDDEV_SAMP not STDDEV_POP.  That is
the correct one.

You can actually work that one out yourself


BTW Hive also gives a wrong value. This is what I reported back in April
about Hive giving incorrect value

Both Oracle and Sybase point STDDEV to STDDEV_SAMP not STDDEV_POP. Also I
did tests with Spark 1.6 as well.  Spark correctly points STTDEV to
STDDEV_SAMP.

The following query was used

SELECT

SQRT((SUM(POWER(AMOUNT_SOLD,2))-(COUNT(1)*POWER(AVG(AMOUNT_SOLD),2)))/(COUNT(1)-1))
AS MYSTDDEV,
        STDDEV(amount_sold) AS STDDEV,
        STDDEV_SAMP(amount_sold) AS STDDEV_SAMP,
        STDDEV_POP(amount_sold) AS STDDEV_POP
from    sales;

The following is from running the above query on Hive where STDDEV -->
 STDDEV_POP which is incorrect

+--------------------+---------------------+--------------------+---------------------+--+
|      mystddev      |       stddev        |    stddev_samp     |
stddev_pop      |
+--------------------+---------------------+--------------------+---------------------+--+
| 260.7270919450411  | 260.72704617040444  | 260.7270722861465  |
260.72704617040444
|
+--------------------+---------------------+--------------------+---------------------+--+

The following is from Spark-sql where STDDEV -->  STDDEV_SAMP which is
correct

+--------------------+---------------------+--------------------+---------------------+--+
|      mystddev      |       stddev        |    stddev_samp     |
stddev_pop      |
+--------------------+---------------------+--------------------+---------------------+--+
| 260.7270919450411  | 260.7270722861637   | 260.7270722861637  |
260.72704617042166  |
+--------------------+---------------------+--------------------+---------------------+--+

HTH





Dr Mich Talebzadeh



LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 7 July 2016 at 09:29, Sean Owen <so...@cloudera.com> wrote:

> No, because these are different values defined differently. If you
> have 1 data point, the sample stdev is undefined while population
> stdev is defined. Refer to their definition.
>
> On Thu, Jul 7, 2016 at 9:23 AM, Mungeol Heo <mu...@gmail.com> wrote:
> > Hello,
> >
> > As I mentioned at the title, stddev_samp function gives a NaN while
> > stddev_pop gives a numeric value on the same data.
> > The stddev_samp function will give a numeric value, if I cast it to
> decimal.
> > E.g. cast(stddev_samp(column_name) as decimal(16,3))
> > Is it a bug?
> >
> > Thanks
> >
> > - mungeol
> >
> > ---------------------------------------------------------------------
> > To unsubscribe e-mail: user-unsubscribe@spark.apache.org
> >
>
> ---------------------------------------------------------------------
> To unsubscribe e-mail: user-unsubscribe@spark.apache.org
>
>

Re: stddev_samp() gives NaN

Posted by Sean Owen <so...@cloudera.com>.

No, because these are different values defined differently. If you
have 1 data point, the sample stdev is undefined while population
stdev is defined. Refer to their definition.

On Thu, Jul 7, 2016 at 9:23 AM, Mungeol Heo <mu...@gmail.com> wrote:
> Hello,
>
> As I mentioned at the title, stddev_samp function gives a NaN while
> stddev_pop gives a numeric value on the same data.
> The stddev_samp function will give a numeric value, if I cast it to decimal.
> E.g. cast(stddev_samp(column_name) as decimal(16,3))
> Is it a bug?
>
> Thanks
>
> - mungeol
>
> ---------------------------------------------------------------------
> To unsubscribe e-mail: user-unsubscribe@spark.apache.org
>

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscribe@spark.apache.org