Posted to issues@spark.apache.org by "Sean R. Owen (Jira)" <ji...@apache.org> on 2020/02/24 14:37:00 UTC
[jira] [Created] (SPARK-30939) StringIndexer setOutputCols does not set output cols
Sean R. Owen created SPARK-30939:
------------------------------------
Summary: StringIndexer setOutputCols does not set output cols
Key: SPARK-30939
URL: https://issues.apache.org/jira/browse/SPARK-30939
Project: Spark
Issue Type: Bug
Components: ML
Affects Versions: 3.0.0
Reporter: Sean R. Owen
Assignee: Sean R. Owen
(Credit to Brooke Wenig for finding it). Quoting:
"... The Python code works completely fine, but the Scala code is outputting
{code}
strIdx_8278ae6d55b3__output, strIdx_8278ae6d55b3__output, strIdx_8278ae6d55b3__output, strIdx_8278ae6d55b3__output, strIdx_8278ae6d55b3__output, strIdx_8278ae6d55b3__output, strIdx_8278ae6d55b3__output
{code}
for the output of the string indexer, instead of using the column names specified in here:
{code}
val stringIndexer = new StringIndexer()
  .setInputCols(categoricalCols)
  .setOutputCols(indexOutputCols)
  .setHandleInvalid("skip")
{code}
I was expecting the resulting column names to be
{code}
indexOutputCols: Array[String] = Array(host_is_superhostIndex, cancellation_policyIndex, instant_bookableIndex, neighbourhood_cleansedIndex, property_typeIndex, room_typeIndex, bed_typeIndex)
{code}
Indeed I'm pretty sure this is the bug:
{code}
private def validateAndTransformField(
    schema: StructType,
    inputColName: String,
    outputColName: String): StructField = {
  val inputDataType = schema(inputColName).dataType
  require(inputDataType == StringType || inputDataType.isInstanceOf[NumericType],
    s"The input column $inputColName must be either string type or numeric type, " +
      s"but got $inputDataType.")
  require(schema.fields.forall(_.name != outputColName),
    s"Output column $outputColName already exists.")
  NominalAttribute.defaultAttr.withName($(outputCol)).toStructField()
}
{code}
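The last line names the output field with $(outputCol), the default single-column output parameter, instead of the outputColName argument passed in. Every field built for setOutputCols therefore gets the same default name, which matches the repeated strIdx_..._output columns in the report.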
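
A minimal sketch of what the fix could look like, assuming the only change needed is to pass outputColName to withName. The wrapper object StringIndexerFixSketch and its main method are hypothetical and exist only so the snippet compiles outside of StringIndexer; it assumes Spark MLlib and Spark SQL are on the classpath. This is an illustration, not necessarily the change that will be merged.

```scala
import org.apache.spark.ml.attribute.NominalAttribute
import org.apache.spark.sql.types.{NumericType, StringType, StructField, StructType}

// Hypothetical wrapper object, used only so the sketch compiles on its own.
object StringIndexerFixSketch {

  // Same checks as the method quoted above; only the last line differs.
  def validateAndTransformField(
      schema: StructType,
      inputColName: String,
      outputColName: String): StructField = {
    val inputDataType = schema(inputColName).dataType
    require(inputDataType == StringType || inputDataType.isInstanceOf[NumericType],
      s"The input column $inputColName must be either string type or numeric type, " +
        s"but got $inputDataType.")
    require(schema.fields.forall(_.name != outputColName),
      s"Output column $outputColName already exists.")
    // Use the per-column name passed in, not the single-column param $(outputCol).
    NominalAttribute.defaultAttr.withName(outputColName).toStructField()
  }

  def main(args: Array[String]): Unit = {
    // One string column, indexed into a column with an explicit output name.
    val schema = StructType(Seq(StructField("room_type", StringType)))
    val field = validateAndTransformField(schema, "room_type", "room_typeIndex")
    println(field.name)
  }
}
```

With this change, the returned field carries the name requested for that column, so a call with several input/output pairs would yield one distinctly named field per pair rather than repeated copies of the default.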