Posted to issues@spark.apache.org by "Sean R. Owen (Jira)" <ji...@apache.org> on 2022/04/02 14:43:00 UTC

[jira] [Updated] (SPARK-38584) Unify the data validation

     [ https://issues.apache.org/jira/browse/SPARK-38584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean R. Owen updated SPARK-38584:
---------------------------------
    Priority: Minor  (was: Major)

> Unify the data validation
> -------------------------
>
>                 Key: SPARK-38584
>                 URL: https://issues.apache.org/jira/browse/SPARK-38584
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML
>    Affects Versions: 3.4.0
>            Reporter: zhengruifeng
>            Assignee: zhengruifeng
>            Priority: Minor
>             Fix For: 3.4.0
>
>
> 1. Input vector validation is missing in most algorithms. When the input dataset contains invalid values (NaN/Infinity):
>  * the training may run successfully and return a model containing invalid coefficients, as with LinearSVC
>  * the training may fail with an irrelevant error message, as with KMeans
>  
> {code:scala}
> import org.apache.spark.ml.feature._
> import org.apache.spark.ml.linalg._
> import org.apache.spark.ml.classification._
> import org.apache.spark.ml.clustering._
> val df = sc.parallelize(Seq(
>   LabeledPoint(1.0, Vectors.dense(1.0, Double.NaN)),
>   LabeledPoint(0.0, Vectors.dense(Double.PositiveInfinity, 2.0))
> )).toDF()
> val svc = new LinearSVC()
> val model = svc.fit(df)
> scala> model.intercept
> res0: Double = NaN
> scala> model.coefficients
> res1: org.apache.spark.ml.linalg.Vector = [NaN,NaN]
> val km = new KMeans().setK(2)
> scala> km.fit(df)
> 22/03/17 14:29:10 ERROR Executor: Exception in task 11.0 in stage 10.0 (TID 113)
> java.lang.IllegalArgumentException: requirement failed: Both norms should be greater or equal to 0.0, found norm1=NaN, norm2=Infinity
>     at scala.Predef$.require(Predef.scala:281)
>     at org.apache.spark.mllib.util.MLUtils$.fastSquaredDistance(MLUtils.scala:543)
> {code}
>  
> We should make ML algorithms fail fast if the input dataset is invalid.
>  
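> A minimal sketch of the kind of fail-fast check this suggests (the helper {{validateVectors}} below is hypothetical, not an existing Spark API):
>  
> {code:scala}
> import org.apache.spark.sql.{Dataset, Row}
> import org.apache.spark.ml.linalg.Vector
> 
> // Hypothetical helper: throw immediately if any vector in `col` contains
> // NaN or Infinity, instead of silently producing NaN coefficients or
> // failing later with an unrelated error.
> def validateVectors(dataset: Dataset[_], col: String): Unit = {
>   dataset.select(col).rdd.foreach { case Row(v: Vector) =>
>     require(v.toArray.forall(d => !d.isNaN && !d.isInfinity),
>       s"Vector column '$col' contains NaN or Infinity: $v")
>   }
> }
> 
> // validateVectors(df, "features") would fail fast on the dataset above.
> {code}
>  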
> 2. There exist several methods to validate input labels and weights, spread across different files:
>  * {{org.apache.spark.ml.functions}}
>  * {{org.apache.spark.ml.util.DatasetUtils}}
>  * {{org.apache.spark.ml.util.MetadataUtils}}
>  * {{org.apache.spark.ml.Predictor}}
>  * etc.
>  
> I think it is time to unify the related methods into one source file, as sketched below.
>  
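> A rough sketch of what a unified home for these checks might look like (the object name {{DataValidators}} is an assumption, not a committed design; {{raise_error}} is available in Spark SQL since 3.1):
>  
> {code:scala}
> import org.apache.spark.sql.Column
> import org.apache.spark.sql.functions.{lit, raise_error, when}
> 
> // Hypothetical single source file gathering the label/weight checks
> // currently scattered across ml.functions, DatasetUtils, MetadataUtils
> // and Predictor.
> object DataValidators {
>   // Returns the label column unchanged, or raises a descriptive error
>   // at execution time if the label is null, NaN or negative.
>   def checkNonNegativeLabel(label: Column): Column =
>     when(label.isNull || label.isNaN || label < 0,
>       raise_error(lit("Labels MUST be finite and non-negative")))
>       .otherwise(label)
> }
> {code}
>  
> Each algorithm could then select the checked column in its fit path instead of re-implementing the validation locally.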



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org