Posted to issues@spark.apache.org by "Joseph K. Bradley (JIRA)" <ji...@apache.org> on 2016/03/03 23:16:18 UTC

[jira] [Comment Edited] (SPARK-13639) Statistics.colStats(rdd).mean and variance should handle NaN in the input vectors

    [ https://issues.apache.org/jira/browse/SPARK-13639?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15178631#comment-15178631 ] 

Joseph K. Bradley edited comment on SPARK-13639 at 3/3/16 10:15 PM:
--------------------------------------------------------------------

-I agree with [~srowen] about leaving the current behavior to make it clear the user needs to clean their data.-

I didn't realize this was important for [SPARK-13568].  I'd still prefer not to change the behavior of Statistics.colStats.  Could it be an optional setting?
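
One possible shape for such a setting, sketched as a user-side helper rather than a change to MLlib itself (the name colStatsOpt and the ignoreNaN flag are hypothetical, purely for illustration):

    import org.apache.spark.mllib.linalg.Vector
    import org.apache.spark.mllib.stat.{MultivariateStatisticalSummary, Statistics}
    import org.apache.spark.rdd.RDD

    // Hypothetical helper, for discussion only -- not an MLlib API.
    // When ignoreNaN is true, vectors containing any NaN are dropped before
    // the summary is computed; otherwise current colStats behavior is kept.
    def colStatsOpt(data: RDD[Vector],
                    ignoreNaN: Boolean = false): MultivariateStatisticalSummary = {
      val input = if (ignoreNaN) data.filter(v => !v.toArray.exists(_.isNaN)) else data
      Statistics.colStats(input)
    }

Note that dropping whole vectors is only one possible semantics; ignoring NaN entry by entry (as sketched in the issue description below) yields different means even for the columns that contain no NaN.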


was (Author: josephkb):
I agree with [~srowen] about leaving the current behavior to make it clear the user needs to clean their data.

> Statistics.colStats(rdd).mean and variance should handle NaN in the input vectors
> ---------------------------------------------------------------------------------
>
>                 Key: SPARK-13639
>                 URL: https://issues.apache.org/jira/browse/SPARK-13639
>             Project: Spark
>          Issue Type: Improvement
>          Components: MLlib
>            Reporter: yuhao yang
>            Priority: Trivial
>
>     import org.apache.spark.mllib.linalg.Vectors
>     import org.apache.spark.mllib.stat.Statistics
>
>     val denseData = Array(
>       Vectors.dense(3.8, 0.0, 1.8),
>       Vectors.dense(1.7, 0.9, 0.0),
>       Vectors.dense(Double.NaN, 0.0, 0.0)
>     )
>     val rdd = sc.parallelize(denseData)   // sc: the spark-shell SparkContext
>     println(Statistics.colStats(rdd).mean)
>     // prints: [NaN,0.3,0.6]
> This is just a proposal for discussion on how to handle NaN values in the input vectors. We could either ignore NaN values in the computation, or keep outputting NaN as we do now, letting it serve as a warning sign; a sketch of the ignore option follows.
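> A sketch of what "ignore the NaN value in the computation" could mean for the mean (illustrative only, not existing MLlib behavior): average each column over its non-NaN entries.
>
>     // Per-column sums and counts over non-NaN entries only.
>     val (sums, counts) = rdd.map { v =>
>       val arr = v.toArray
>       (arr.map(x => if (x.isNaN) 0.0 else x),
>        arr.map(x => if (x.isNaN) 0L else 1L))
>     }.reduce { case ((s1, c1), (s2, c2)) =>
>       (s1.zip(s2).map { case (a, b) => a + b },
>        c1.zip(c2).map { case (a, b) => a + b })
>     }
>     // NaN survives only where a column has no non-NaN entries at all.
>     val nanIgnoringMean = sums.zip(counts).map { case (s, c) =>
>       if (c == 0) Double.NaN else s / c
>     }
>     // For the example above: [2.75, 0.3, 0.6] instead of [NaN, 0.3, 0.6]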



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org