Posted to commits@spark.apache.org by yl...@apache.org on 2016/09/29 07:54:33 UTC
spark git commit: [SPARK-16356][FOLLOW-UP][ML] Enforce ML test of
exception for local/distributed Dataset.
Repository: spark
Updated Branches:
refs/heads/master 37eb9184f -> a19a1bb59
[SPARK-16356][FOLLOW-UP][ML] Enforce ML test of exception for local/distributed Dataset.
## What changes were proposed in this pull request?
#14035 added ```testImplicits``` to the ML unit tests and promoted the use of ```toDF()```, but left one minor issue in ```VectorIndexerSuite```: for one of the test cases, a DataFrame created with ```Seq(...).toDF()``` throws a different exception than one created with ```sc.parallelize(Seq(...)).toDF()```.
After further investigation, I found that this is caused by the different behavior of local and distributed Datasets when a UDF fails at an ```assert```. If the data is a local Dataset, it throws ```AssertionError``` directly; if the data is a distributed Dataset, it throws ```SparkException```, which wraps the ```AssertionError```. I think we should strengthen this test to cover both cases.
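The difference described above can be reproduced outside the test suite with a sketch along these lines. It is only an illustration, not code from this patch: the object name ```LocalVsDistributedFailure```, the UDF ```failIfNegative``` and the helper ```outcome``` are hypothetical, and the UDF stands in for the size check that fails in ```VectorIndexerSuite```. The sketch assumes a Spark build that behaves as this commit message describes; other versions may report the failure differently, so it prints what it observes rather than asserting a result.

```scala
import org.apache.spark.SparkException
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, udf}

object LocalVsDistributedFailure {

  // Hypothetical helper: runs `body` and describes how it terminated.
  def outcome(body: => Any): String =
    try {
      body
      "no exception"
    } catch {
      case e: SparkException => s"SparkException wrapping ${e.getCause}"
      case _: AssertionError => "AssertionError thrown directly"
      case e: Throwable => s"other failure: ${e.getClass.getName}"
    }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("local-vs-distributed-failure")
      .getOrCreate()
    import spark.implicits._

    try {
      // Hypothetical UDF that fails at an assert for one of the rows.
      val failIfNegative = udf { (x: Int) =>
        assert(x >= 0, s"negative value: $x")
        x
      }

      // Built from a local Seq, as in Seq(...).toDF().
      val local = Seq(1, 2, -3).toDF("x")
      // The same rows after a shuffle, as in the repartition(2) call in the test.
      val distributed = local.repartition(2)

      println("local: " +
        outcome(local.select(failIfNegative(col("x"))).collect()))
      println("distributed: " +
        outcome(distributed.select(failIfNegative(col("x"))).collect()))
    } finally {
      spark.stop()
    }
  }
}
```

Catching both exception types in one helper makes the two code paths easy to compare side by side, which is the same comparison the updated test makes with two separate ```intercept``` blocks.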
## How was this patch tested?
Unit test.
Author: Yanbo Liang <yb...@gmail.com>
Closes #15261 from yanboliang/spark-16356.
Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/a19a1bb5
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/a19a1bb5
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/a19a1bb5
Branch: refs/heads/master
Commit: a19a1bb59411177caaf99581e89098826b7d0c7b
Parents: 37eb918
Author: Yanbo Liang <yb...@gmail.com>
Authored: Thu Sep 29 00:54:26 2016 -0700
Committer: Yanbo Liang <yb...@gmail.com>
Committed: Thu Sep 29 00:54:26 2016 -0700
----------------------------------------------------------------------
.../apache/spark/ml/feature/VectorIndexerSuite.scala | 13 +++++++++----
1 file changed, 9 insertions(+), 4 deletions(-)
----------------------------------------------------------------------
http://git-wip-us.apache.org/repos/asf/spark/blob/a19a1bb5/mllib/src/test/scala/org/apache/spark/ml/feature/VectorIndexerSuite.scala
----------------------------------------------------------------------
diff --git a/mllib/src/test/scala/org/apache/spark/ml/feature/VectorIndexerSuite.scala b/mllib/src/test/scala/org/apache/spark/ml/feature/VectorIndexerSuite.scala
index 4da1b13..b28ce2a 100644
--- a/mllib/src/test/scala/org/apache/spark/ml/feature/VectorIndexerSuite.scala
+++ b/mllib/src/test/scala/org/apache/spark/ml/feature/VectorIndexerSuite.scala
@@ -88,9 +88,7 @@ class VectorIndexerSuite extends SparkFunSuite with MLlibTestSparkContext
densePoints1 = densePoints1Seq.map(FeatureData).toDF()
sparsePoints1 = sparsePoints1Seq.map(FeatureData).toDF()
- // TODO: If we directly use `toDF` without parallelize, the test in
- // "Throws error when given RDDs with different size vectors" is failed for an unknown reason.
- densePoints2 = sc.parallelize(densePoints2Seq, 2).map(FeatureData).toDF()
+ densePoints2 = densePoints2Seq.map(FeatureData).toDF()
sparsePoints2 = sparsePoints2Seq.map(FeatureData).toDF()
badPoints = badPointsSeq.map(FeatureData).toDF()
}
@@ -121,10 +119,17 @@ class VectorIndexerSuite extends SparkFunSuite with MLlibTestSparkContext
model.transform(densePoints1) // should work
model.transform(sparsePoints1) // should work
- intercept[SparkException] {
+ // If the data is local Dataset, it throws AssertionError directly.
+ intercept[AssertionError] {
model.transform(densePoints2).collect()
logInfo("Did not throw error when fit, transform were called on vectors of different lengths")
}
+ // If the data is distributed Dataset, it throws SparkException
+ // which is the wrapper of AssertionError.
+ intercept[SparkException] {
+ model.transform(densePoints2.repartition(2)).collect()
+ logInfo("Did not throw error when fit, transform were called on vectors of different lengths")
+ }
intercept[SparkException] {
vectorIndexer.fit(badPoints)
logInfo("Did not throw error when fitting vectors of different lengths in same RDD.")