Posted to issues@spark.apache.org by "Sean R. Owen (Jira)" <ji...@apache.org> on 2020/02/25 02:19:00 UTC

[jira] [Resolved] (SPARK-30939) StringIndexer setOutputCols does not set output cols

     [ https://issues.apache.org/jira/browse/SPARK-30939?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean R. Owen resolved SPARK-30939.
----------------------------------
    Fix Version/s: 3.0.0
       Resolution: Fixed

Issue resolved by pull request 27684
[https://github.com/apache/spark/pull/27684]

> StringIndexer setOutputCols does not set output cols
> ----------------------------------------------------
>
>                 Key: SPARK-30939
>                 URL: https://issues.apache.org/jira/browse/SPARK-30939
>             Project: Spark
>          Issue Type: Bug
>          Components: ML
>    Affects Versions: 3.0.0
>            Reporter: Sean R. Owen
>            Assignee: Sean R. Owen
>            Priority: Major
>             Fix For: 3.0.0
>
>
> (Credit to Brooke Wenig for finding it). Quoting:
> ".. The python code works completely fine, but the scala code is outputting
> {code}
> strIdx_8278ae6d55b3__output, strIdx_8278ae6d55b3__output, strIdx_8278ae6d55b3__output, strIdx_8278ae6d55b3__output, strIdx_8278ae6d55b3__output, strIdx_8278ae6d55b3__output, strIdx_8278ae6d55b3__output
> {code}
> for the output of the string indexer, instead of using the column names specified here:
> {code}
> val stringIndexer = new StringIndexer()
>   .setInputCols(categoricalCols)
>   .setOutputCols(indexOutputCols)
>   .setHandleInvalid("skip")
> {code}
> I was expecting the resulting column names to be
> {code}
> indexOutputCols: Array[String] = Array(host_is_superhostIndex, cancellation_policyIndex, instant_bookableIndex, neighbourhood_cleansedIndex, property_typeIndex, room_typeIndex, bed_typeIndex)
> {code}
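> A minimal reproduction sketch follows; the DataFrame and the exact column list are assumptions for illustration, and only the StringIndexer calls are taken from the snippet above:
> {code}
> import org.apache.spark.ml.feature.StringIndexer
> import org.apache.spark.sql.DataFrame
>
> // Placeholder for the listings DataFrame from the original report (assumed to exist)
> val df: DataFrame = ???
>
> // Hypothetical categorical input columns, mirroring the expected output names above
> val categoricalCols = Array("host_is_superhost", "cancellation_policy", "instant_bookable",
>   "neighbourhood_cleansed", "property_type", "room_type", "bed_type")
> // One output column name per input column
> val indexOutputCols = categoricalCols.map(_ + "Index")
>
> val stringIndexer = new StringIndexer()
>   .setInputCols(categoricalCols)
>   .setOutputCols(indexOutputCols)
>   .setHandleInvalid("skip")
>
> // Per the report above, the output names collapse to the indexer's default
> // single-output name (strIdx_..._output) instead of host_is_superhostIndex, etc.
> stringIndexer.fit(df).transform(df).columns.foreach(println)
> {code}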
> Indeed, I'm pretty sure this is the bug:
> {code}
>   private def validateAndTransformField(
>       schema: StructType,
>       inputColName: String,
>       outputColName: String): StructField = {
>     val inputDataType = schema(inputColName).dataType
>     require(inputDataType == StringType || inputDataType.isInstanceOf[NumericType],
>       s"The input column $inputColName must be either string type or numeric type, " +
>         s"but got $inputDataType.")
>     require(schema.fields.forall(_.name != outputColName),
>       s"Output column $outputColName already exists.")
>     NominalAttribute.defaultAttr.withName($(outputCol)).toStructField()
>   }
> {code}
> The last line does not use the per-column output name (outputColName) passed into the method; it uses the default single-output-column parameter, outputCol, so every output field ends up with the same name.
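>
> For reference, a minimal sketch of the kind of one-line change that addresses this (the actual patch is in the pull request linked above) would use the per-column name passed into the method:
> {code}
>     // use outputColName rather than the single-column outputCol param
>     NominalAttribute.defaultAttr.withName(outputColName).toStructField()
> {code}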



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org