Posted to issues@spark.apache.org by "Hyukjin Kwon (JIRA)" <ji...@apache.org> on 2019/05/21 04:22:04 UTC

[jira] [Updated] (SPARK-13639) Statistics.colStats(rdd).mean and variance should handle NaN in the input vectors

     [ https://issues.apache.org/jira/browse/SPARK-13639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon updated SPARK-13639:
---------------------------------
    Labels: bulk-closed  (was: )

> Statistics.colStats(rdd).mean and variance should handle NaN in the input vectors
> ---------------------------------------------------------------------------------
>
>                 Key: SPARK-13639
>                 URL: https://issues.apache.org/jira/browse/SPARK-13639
>             Project: Spark
>          Issue Type: Improvement
>          Components: MLlib
>            Reporter: yuhao yang
>            Priority: Trivial
>              Labels: bulk-closed
>
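>     import org.apache.spark.mllib.linalg.Vectors
>     import org.apache.spark.mllib.stat.Statistics
>
>     // One NaN in the first column is enough to make that column's mean NaN.
>     val denseData = Array(
>       Vectors.dense(3.8, 0.0, 1.8),
>       Vectors.dense(1.7, 0.9, 0.0),
>       Vectors.dense(Double.NaN, 0.0, 0.0)
>     )
>     // sc is the SparkContext, e.g. the one provided by spark-shell.
>     val rdd = sc.parallelize(denseData)
>     println(Statistics.colStats(rdd).mean)
> [NaN,0.3,0.6]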
> This is just a proposal for discussion on how to handle NaN values in the input vectors. We could either ignore NaN values in the computation, or keep returning NaN, as happens now, so that the result serves as a warning.
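
To make the "ignore NaN" option concrete, here is a minimal sketch in plain Scala with no Spark dependency. The object NaNAwareMean and the method colMeansIgnoringNaN are hypothetical names used only for illustration; they are not part of MLlib, and this is not how Statistics.colStats is implemented. The sketch assumes each column should be averaged over its non-NaN entries only, and that a column with no valid entries should still report NaN.

```scala
// Hypothetical helper, not part of MLlib: per-column mean that skips NaN entries.
object NaNAwareMean {
  // Averages each column over its non-NaN entries.
  // A column whose entries are all NaN yields NaN.
  def colMeansIgnoringNaN(rows: Seq[Array[Double]]): Array[Double] = {
    require(rows.nonEmpty, "need at least one row")
    val n = rows.head.length
    val sums = new Array[Double](n)   // running sum of valid entries per column
    val counts = new Array[Long](n)   // number of valid entries per column
    for (row <- rows) {
      require(row.length == n, "all rows must have the same length")
      var j = 0
      while (j < n) {
        if (!row(j).isNaN) {
          sums(j) += row(j)
          counts(j) += 1
        }
        j += 1
      }
    }
    Array.tabulate(n)(j => if (counts(j) == 0) Double.NaN else sums(j) / counts(j))
  }
}
```

Applied to the three vectors from the description, the first column would be averaged over 3.8 and 1.7 only, giving roughly 2.75 instead of NaN, while the other two columns stay at roughly 0.3 and 0.6. Note that this changes the denominator per column, which is one of the points the proposal leaves open for discussion.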



