Posted to commits@spark.apache.org by sr...@apache.org on 2020/05/11 23:26:37 UTC
[spark] branch branch-3.0 updated: [SPARK-31671][ML] Wrong error message in VectorAssembler
This is an automated email from the ASF dual-hosted git repository.
srowen pushed a commit to branch branch-3.0
in repository https://gitbox.apache.org/repos/asf/spark.git
The following commit(s) were added to refs/heads/branch-3.0 by this push:
new 7e226a2 [SPARK-31671][ML] Wrong error message in VectorAssembler
7e226a2 is described below
commit 7e226a25efeaf083c95f04ee0d9c3a6e5b6d763d
Author: fan31415 <fa...@gmail.com>
AuthorDate: Mon May 11 18:23:23 2020 -0500
[SPARK-31671][ML] Wrong error message in VectorAssembler
### What changes were proposed in this pull request?
When input column lengths cannot be inferred and handleInvalid = "keep", VectorAssembler throws a runtime exception. However, the error message lists all input columns instead of only the columns whose lengths could not be inferred. This change corrects the message so it names only the offending columns.
### Why are the changes needed?
This is a bug. Here is a simple example to reproduce it.
```
// create a DataFrame without vector size metadata
val df = Seq(
  (Vectors.dense(1.0), Vectors.dense(2.0))
).toDF("n1", "n2")
// set a vector size hint for the n1 column only
val hintedDf = new VectorSizeHint()
  .setInputCol("n1")
  .setSize(1)
  .transform(df)
// assemble n1, n2
val output = new VectorAssembler()
  .setInputCols(Array("n1", "n2"))
  .setOutputCol("features")
  .setHandleInvalid("keep")
  .transform(hintedDf)
// because only n1 has a vector size, the error message should tell us to set a vector size for n2 too
output.show()
```
Expected error message:
```
Can not infer column lengths with handleInvalid = "keep". Consider using VectorSizeHint to add metadata for columns: [n2].
```
Actual error message:
```
Can not infer column lengths with handleInvalid = "keep". Consider using VectorSizeHint to add metadata for columns: [n1, n2].
```
This introduces difficulties when trying to resolve the exception, because the message does not indicate which columns actually require a VectorSizeHint. This is especially troublesome when there are a large number of columns to deal with.
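For reference, the exception itself is resolved by giving every offending column a VectorSizeHint, as the corrected message suggests. A minimal sketch, assuming an active SparkSession and the same `df` as in the reproduction above (the variable names here are illustrative):

```
import org.apache.spark.ml.feature.{VectorAssembler, VectorSizeHint}
import org.apache.spark.ml.linalg.Vectors

// Hint the size of every vector column, not just n1,
// so VectorAssembler can infer all column lengths.
val hintedN1 = new VectorSizeHint().setInputCol("n1").setSize(1).transform(df)
val fullyHinted = new VectorSizeHint().setInputCol("n2").setSize(1).transform(hintedN1)

// With sizes known for both columns, handleInvalid = "keep" no longer throws.
new VectorAssembler()
  .setInputCols(Array("n1", "n2"))
  .setOutputCol("features")
  .setHandleInvalid("keep")
  .transform(fullyHinted)
  .show()
```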
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Add test in VectorAssemblerSuite.
Closes #28487 from fan31415/SPARK-31671.
Lead-authored-by: fan31415 <fa...@gmail.com>
Co-authored-by: yijiefan <fa...@gmail.com>
Signed-off-by: Sean Owen <sr...@gmail.com>
(cherry picked from commit 64fb358a994d3fff651a742fa067c194b7455853)
Signed-off-by: Sean Owen <sr...@gmail.com>
---
.../scala/org/apache/spark/ml/feature/VectorAssembler.scala | 2 +-
.../org/apache/spark/ml/feature/VectorAssemblerSuite.scala | 11 +++++++++++
2 files changed, 12 insertions(+), 1 deletion(-)
diff --git a/mllib/src/main/scala/org/apache/spark/ml/feature/VectorAssembler.scala b/mllib/src/main/scala/org/apache/spark/ml/feature/VectorAssembler.scala
index 3070012..7bc5e56 100644
--- a/mllib/src/main/scala/org/apache/spark/ml/feature/VectorAssembler.scala
+++ b/mllib/src/main/scala/org/apache/spark/ml/feature/VectorAssembler.scala
@@ -233,7 +233,7 @@ object VectorAssembler extends DefaultParamsReadable[VectorAssembler] {
getVectorLengthsFromFirstRow(dataset.na.drop(missingColumns), missingColumns)
case (true, VectorAssembler.KEEP_INVALID) => throw new RuntimeException(
s"""Can not infer column lengths with handleInvalid = "keep". Consider using VectorSizeHint
- |to add metadata for columns: ${columns.mkString("[", ", ", "]")}."""
+ |to add metadata for columns: ${missingColumns.mkString("[", ", ", "]")}."""
.stripMargin.replaceAll("\n", " "))
case (_, _) => Map.empty
}
diff --git a/mllib/src/test/scala/org/apache/spark/ml/feature/VectorAssemblerSuite.scala b/mllib/src/test/scala/org/apache/spark/ml/feature/VectorAssemblerSuite.scala
index a4d388f..4957f6f 100644
--- a/mllib/src/test/scala/org/apache/spark/ml/feature/VectorAssemblerSuite.scala
+++ b/mllib/src/test/scala/org/apache/spark/ml/feature/VectorAssemblerSuite.scala
@@ -261,4 +261,15 @@ class VectorAssemblerSuite
val output = vectorAssembler.transform(dfWithNullsAndNaNs)
assert(output.select("a").limit(1).collect().head == Row(Vectors.sparse(0, Seq.empty)))
}
+
+ test("SPARK-31671: should give explicit error message when can not infer column lengths") {
+ val df = Seq(
+ (Vectors.dense(1.0), Vectors.dense(2.0))
+ ).toDF("n1", "n2")
+ val hintedDf = new VectorSizeHint().setInputCol("n1").setSize(1).transform(df)
+ val assembler = new VectorAssembler()
+ .setInputCols(Array("n1", "n2")).setOutputCol("features")
+ assert(!intercept[RuntimeException](assembler.setHandleInvalid("keep").transform(hintedDf))
+ .getMessage.contains("n1"), "should only show no vector size columns' name")
+ }
}