Posted to issues@spark.apache.org by "Sean R. Owen (Jira)" <ji...@apache.org> on 2020/02/24 14:37:00 UTC

[jira] [Created] (SPARK-30939) StringIndexer setOutputCols does not set output cols

Sean R. Owen created SPARK-30939:
------------------------------------

             Summary: StringIndexer setOutputCols does not set output cols
                 Key: SPARK-30939
                 URL: https://issues.apache.org/jira/browse/SPARK-30939
             Project: Spark
          Issue Type: Bug
          Components: ML
    Affects Versions: 3.0.0
            Reporter: Sean R. Owen
            Assignee: Sean R. Owen


(Credit to Brooke Wenig for finding it). Quoting:

".. The Python code works completely fine, but the Scala code is outputting

{code}
strIdx_8278ae6d55b3__output, strIdx_8278ae6d55b3__output, strIdx_8278ae6d55b3__output, strIdx_8278ae6d55b3__output, strIdx_8278ae6d55b3__output, strIdx_8278ae6d55b3__output, strIdx_8278ae6d55b3__output
{code}

for the output of the string indexer, instead of using the column names specified in here:

{code}
val stringIndexer = new StringIndexer()
  .setInputCols(categoricalCols)
  .setOutputCols(indexOutputCols)
  .setHandleInvalid("skip")
{code}

I was expecting the resulting column names to be

{code}
indexOutputCols: Array[String] = Array(host_is_superhostIndex, cancellation_policyIndex, instant_bookableIndex, neighbourhood_cleansedIndex, property_typeIndex, room_typeIndex, bed_typeIndex)
{code}


Indeed, I'm pretty sure this is the bug:

{code}
  private def validateAndTransformField(
      schema: StructType,
      inputColName: String,
      outputColName: String): StructField = {
    val inputDataType = schema(inputColName).dataType
    require(inputDataType == StringType || inputDataType.isInstanceOf[NumericType],
      s"The input column $inputColName must be either string type or numeric type, " +
        s"but got $inputDataType.")
    require(schema.fields.forall(_.name != outputColName),
      s"Output column $outputColName already exists.")
    NominalAttribute.defaultAttr.withName($(outputCol)).toStructField()
  }
{code}

The last line ignores the outputColName argument and instead uses $(outputCol), the default single output col parameter, so every output field ends up with the same name.
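
A minimal sketch of what a fix might look like, assuming the only change needed is to pass outputColName to withName. It is written as a standalone object so it can be tried outside of StringIndexer; the object name StringIndexerFieldSketch and the two-column sample schema are hypothetical and not part of Spark's source. It assumes spark-mllib 3.x is on the classpath.

```scala
import org.apache.spark.ml.attribute.NominalAttribute
import org.apache.spark.sql.types.{NumericType, StringType, StructField, StructType}

// Hypothetical standalone copy of the method quoted above, for illustration only.
object StringIndexerFieldSketch {

  def validateAndTransformField(
      schema: StructType,
      inputColName: String,
      outputColName: String): StructField = {
    val inputDataType = schema(inputColName).dataType
    require(inputDataType == StringType || inputDataType.isInstanceOf[NumericType],
      s"The input column $inputColName must be either string type or numeric type, " +
        s"but got $inputDataType.")
    require(schema.fields.forall(_.name != outputColName),
      s"Output column $outputColName already exists.")
    // Assumed fix: name the field after the per-column argument,
    // not after the single-column $(outputCol) param.
    NominalAttribute.defaultAttr.withName(outputColName).toStructField()
  }

  def main(args: Array[String]): Unit = {
    // Sample schema with two of the column names from the report.
    val schema = StructType(Seq(
      StructField("room_type", StringType),
      StructField("bed_type", StringType)))
    val pairs = Seq("room_type" -> "room_typeIndex", "bed_type" -> "bed_typeIndex")
    val fields = pairs.map { case (in, out) => validateAndTransformField(schema, in, out) }
    // Each output field should now carry its own requested name.
    fields.foreach(f => println(f.name))
  }
}
```

With this change each (input, output) pair yields a field named after its own output column, which is what setOutputCols is expected to produce.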


