Posted to issues@spark.apache.org by "YijieFan (Jira)" <ji...@apache.org> on 2020/05/09 17:34:00 UTC
[jira] [Created] (SPARK-31671) Wrong error message in VectorAssembler when column lengths can not be inferred
YijieFan created SPARK-31671:
--------------------------------
Summary: Wrong error message in VectorAssembler when column lengths can not be inferred
Key: SPARK-31671
URL: https://issues.apache.org/jira/browse/SPARK-31671
Project: Spark
Issue Type: Bug
Components: ML
Affects Versions: 2.4.4, 3.0.1
Environment: macOS Catalina
Reporter: YijieFan
In VectorAssembler, when the input column lengths cannot be inferred and handleInvalid = "keep", it throws a runtime exception with a message like the one below:
_Can not infer column lengths with handleInvalid = "keep". Consider using VectorSizeHint to add metadata for columns: [column1, column2]_
However, even if you set a vector size hint for *column1*, the message stays the same and does not change to list *[column2]* only. This is inconsistent with what the error message says.
This makes the exception hard to resolve, because I cannot tell which columns still require a VectorSizeHint. It is especially troublesome when there are a large number of columns to deal with.
Here is a simple example:
{code:java}
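// assumes a spark-shell session, where `spark` is already in scope
import org.apache.spark.ml.feature.{VectorAssembler, VectorSizeHint}
import org.apache.spark.ml.linalg.Vectors
import spark.implicits._

// create a DataFrame whose vector columns carry no size metadata
val df = Seq(
  (Vectors.dense(1.0), Vectors.dense(2.0))
).toDF("n1", "n2")

// set a vector size hint for the n1 column only
val hintedDf = new VectorSizeHint()
  .setInputCol("n1")
  .setSize(1)
  .transform(df)

// assemble n1 and n2
val output = new VectorAssembler()
  .setInputCols(Array("n1", "n2"))
  .setOutputCol("features")
  .setHandleInvalid("keep")
  .transform(hintedDf)

// because only n1 has a vector size, the error message should tell us to set the vector size for n2 as well
output.show()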
{code}
Expected error message:
{code:java}
Can not infer column lengths with handleInvalid = "keep". Consider using VectorSizeHint to add metadata for columns: [n2].
{code}
Actual error message:
{code:java}
Can not infer column lengths with handleInvalid = "keep". Consider using VectorSizeHint to add metadata for columns: [n1, n2].
{code}
I changed one line in VectorAssembler.scala so that it works as expected.
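Until the message is corrected, the columns that still lack a size can be listed by inspecting the schema metadata directly. The snippet below is only a sketch of such a workaround, not the one-line change mentioned above (which is not shown in this report). The helper name columnsMissingSize is hypothetical and is not part of Spark. It relies on AttributeGroup.fromStructField, whose size is -1 when the vector size is unknown.

{code:scala}
import org.apache.spark.ml.attribute.AttributeGroup
import org.apache.spark.ml.linalg.SQLDataTypes
import org.apache.spark.sql.DataFrame

// Hypothetical helper (not a Spark API): returns the vector columns among
// `cols` whose metadata does not record a vector size.
def columnsMissingSize(df: DataFrame, cols: Seq[String]): Seq[String] =
  cols.filter { c =>
    val field = df.schema(c)
    // only vector columns carry an AttributeGroup; skip numeric columns
    field.dataType == SQLDataTypes.VectorType &&
      AttributeGroup.fromStructField(field).size == -1
  }
{code}

With the example above, columnsMissingSize(hintedDf, Seq("n1", "n2")) should leave out n1, since VectorSizeHint has already recorded its size.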