Posted to issues@spark.apache.org by "Sean R. Owen (Jira)" <ji...@apache.org> on 2020/02/25 02:19:00 UTC
[jira] [Resolved] (SPARK-30939) StringIndexer setOutputCols does not set output cols
[ https://issues.apache.org/jira/browse/SPARK-30939?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sean R. Owen resolved SPARK-30939.
----------------------------------
Fix Version/s: 3.0.0
Resolution: Fixed
Issue resolved by pull request 27684
[https://github.com/apache/spark/pull/27684]
> StringIndexer setOutputCols does not set output cols
> ----------------------------------------------------
>
> Key: SPARK-30939
> URL: https://issues.apache.org/jira/browse/SPARK-30939
> Project: Spark
> Issue Type: Bug
> Components: ML
> Affects Versions: 3.0.0
> Reporter: Sean R. Owen
> Assignee: Sean R. Owen
> Priority: Major
> Fix For: 3.0.0
>
>
> (Credit to Brooke Wenig for finding it.) Quoting:
> ".. The python code works completely fine, but the scala code is outputting
> {code}
> strIdx_8278ae6d55b3__output, strIdx_8278ae6d55b3__output, strIdx_8278ae6d55b3__output, strIdx_8278ae6d55b3__output, strIdx_8278ae6d55b3__output, strIdx_8278ae6d55b3__output, strIdx_8278ae6d55b3__output
> {code}
> for the output of the string indexer, instead of using the column names specified in here:
> {code}
> val stringIndexer = new StringIndexer()
>   .setInputCols(categoricalCols)
>   .setOutputCols(indexOutputCols)
>   .setHandleInvalid("skip")
> {code}
> I was expecting the resulting column names to be
> {code}
> indexOutputCols: Array[String] = Array(host_is_superhostIndex, cancellation_policyIndex, instant_bookableIndex, neighbourhood_cleansedIndex, property_typeIndex, room_typeIndex, bed_typeIndex)
> {code}
> Indeed I'm pretty sure this is the bug:
> {code}
> private def validateAndTransformField(
>     schema: StructType,
>     inputColName: String,
>     outputColName: String): StructField = {
>   val inputDataType = schema(inputColName).dataType
>   require(inputDataType == StringType || inputDataType.isInstanceOf[NumericType],
>     s"The input column $inputColName must be either string type or numeric type, " +
>       s"but got $inputDataType.")
>   require(schema.fields.forall(_.name != outputColName),
>     s"Output column $outputColName already exists.")
>   NominalAttribute.defaultAttr.withName($(outputCol)).toStructField()
> }
> {code}
> The last line does not use the per-column outputColName parameter that was passed in; it reads the single default output column param, $(outputCol), so every output field ends up with the same name.
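
A minimal, Spark-free sketch of the bug pattern described above (the object and helper names here are hypothetical, for illustration only): the buggy helper ignores the per-column name it receives and instead reads a single shared default, so every column collapses to the same output name, as in the report; using the argument restores distinct names.

```scala
// Hypothetical standalone sketch of the SPARK-30939 bug pattern.
// Not Spark code: the real method builds a StructField via NominalAttribute.
object StringIndexerBugSketch {
  // Stands in for the single default param $(outputCol)
  val defaultOutputCol = "strIdx_8278ae6d55b3__output"

  // Buggy: ignores outputColName, mirrors withName($(outputCol))
  def transformFieldBuggy(inputColName: String, outputColName: String): String =
    defaultOutputCol // should have used outputColName

  // Fixed: uses the per-column name that was passed in
  def transformFieldFixed(inputColName: String, outputColName: String): String =
    outputColName

  def main(args: Array[String]): Unit = {
    val inputCols  = Seq("host_is_superhost", "cancellation_policy")
    val outputCols = Seq("host_is_superhostIndex", "cancellation_policyIndex")
    val pairs = inputCols.zip(outputCols)
    // Every entry is the same default name, matching the reported output
    println(pairs.map { case (i, o) => transformFieldBuggy(i, o) }.mkString(", "))
    // Distinct per-column names, matching the expected output
    println(pairs.map { case (i, o) => transformFieldFixed(i, o) }.mkString(", "))
  }
}
```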
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org