Posted to issues@spark.apache.org by "YijieFan (Jira)" <ji...@apache.org> on 2020/05/09 17:34:00 UTC

[jira] [Created] (SPARK-31671) Wrong error message in VectorAssembler when column lengths can not be inferred

YijieFan created SPARK-31671:
--------------------------------

             Summary: Wrong error message in VectorAssembler when column lengths can not be inferred
                 Key: SPARK-31671
                 URL: https://issues.apache.org/jira/browse/SPARK-31671
             Project: Spark
          Issue Type: Bug
          Components: ML
    Affects Versions: 2.4.4, 3.0.1
         Environment: macOS Catalina
            Reporter: YijieFan


In VectorAssembler, when input column lengths cannot be inferred and handleInvalid = "keep", it throws a runtime exception with a message like the following:

_Can not infer column lengths with handleInvalid = "keep". Consider using VectorSizeHint to add metadata for columns: [column1, column2]._

However, even if you set a vector size hint for *column1*, the message stays the same instead of listing only *[column2]*. This is inconsistent with what the error message itself says.

This makes the exception hard to resolve, because I cannot tell which columns still require a VectorSizeHint. It is especially troublesome when there is a large number of columns to deal with.
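Until the message is fixed, the columns that still lack size metadata can be checked by hand. The following is only a sketch of one possible workaround: the helper name `columnsMissingSize` is hypothetical, and it assumes every listed column is a vector column (`AttributeGroup.fromStructField` rejects other types). It relies on `AttributeGroup.size` returning -1 when the size is unknown.

```scala
import org.apache.spark.ml.attribute.AttributeGroup
import org.apache.spark.sql.types.StructType

// Hypothetical helper, not part of Spark: returns the vector columns
// whose schema metadata does not carry a vector size.
def columnsMissingSize(schema: StructType, cols: Seq[String]): Seq[String] =
  cols.filter { c =>
    // size is -1 when no size metadata is attached to the column
    AttributeGroup.fromStructField(schema(c)).size == -1
  }
```

For a DataFrame `df`, calling `columnsMissingSize(df.schema, Seq("column1", "column2"))` returns the columns that still need a VectorSizeHint.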

Here is a simple example:

 
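{code:scala}
import org.apache.spark.ml.feature.{VectorAssembler, VectorSizeHint}
import org.apache.spark.ml.linalg.Vectors
// needed for toDF; assumes a SparkSession named spark, as in spark-shell
import spark.implicits._

// create a df without vector size
val df = Seq(
  (Vectors.dense(1.0), Vectors.dense(2.0))
).toDF("n1", "n2")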

// only set vector size hint for n1 column
val hintedDf = new VectorSizeHint()
  .setInputCol("n1")
  .setSize(1)
  .transform(df)

// assemble n1, n2
val output = new VectorAssembler()
  .setInputCols(Array("n1", "n2"))
  .setOutputCol("features")
  .setHandleInvalid("keep")
  .transform(hintedDf)

// because only n1 has vector size, the error message should tell us to set vector size for n2 too
output.show()
{code}
Expected error message:

 
{code:java}
Can not infer column lengths with handleInvalid = "keep". Consider using VectorSizeHint to add metadata for columns: [n2].
{code}
Actual error message:
{code:java}
Can not infer column lengths with handleInvalid = "keep". Consider using VectorSizeHint to add metadata for columns: [n1, n2].
{code}
I changed one line in VectorAssembler.scala so that it produces the expected message.
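The expected behavior can be illustrated with a small, Spark-free sketch. This is not the actual code in VectorAssembler.scala: the function name `keepInvalidMessage` and the convention of passing sizes as (column, size) pairs with -1 for "unknown" are assumptions made for illustration. The point is only that the message is built from the columns whose size is missing, not from all input columns.

```scala
// Hypothetical stand-in for the length check; not the Spark source.
// `sizes` pairs each input column with its known vector size, or -1 if unknown.
def keepInvalidMessage(sizes: Seq[(String, Int)]): Option[String] = {
  // Keep only the columns whose size could not be determined.
  val missing = sizes.collect { case (name, size) if size == -1 => name }
  if (missing.isEmpty) None
  else Some(
    "Can not infer column lengths with handleInvalid = \"keep\". " +
      "Consider using VectorSizeHint to add metadata for columns: " +
      missing.mkString("[", ", ", "]") + ".")
}
```

With `Seq("n1" -> 1, "n2" -> -1)` as input, the message names only `n2`, matching the expected error message above.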


