Posted to issues@spark.apache.org by "Wojciech Jurczyk (JIRA)" <ji...@apache.org> on 2016/01/18 08:36:39 UTC

[jira] [Created] (SPARK-12874) ML StringIndexer does not protect itself from column name duplication

Wojciech Jurczyk created SPARK-12874:
----------------------------------------

             Summary: ML StringIndexer does not protect itself from column name duplication
                 Key: SPARK-12874
                 URL: https://issues.apache.org/jira/browse/SPARK-12874
             Project: Spark
          Issue Type: Bug
          Components: ML
    Affects Versions: 1.6.0, 1.5.2
            Reporter: Wojciech Jurczyk


StringIndexerModel does not check the schema of the input DataFrame when performing transform(). As a result, if the output column name matches a column that already exists in the input, transform() produces a DataFrame containing columns with duplicated names.

This issue is similar to SPARK-12711. StringIndexerModel could call transformSchema at the start of transform() to ensure that the input DataFrame schema is consistent with the parameter values.
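
To make the proposed check concrete, here is a minimal sketch in plain Python. It does not use Spark and is not Spark's API: the schema is modeled as a list of (name, type) pairs, and the function names transform_schema and unchecked_append are hypothetical, chosen only for illustration. It assumes the indexed column is appended as a new numeric column.

```python
# Plain-Python model of the schema check; this is NOT Spark's API.
# A schema is modeled as a list of (column_name, type_name) pairs.

def transform_schema(schema, input_col, output_col):
    """Return the schema after indexing, or raise if the parameters do not fit."""
    names = [name for name, _ in schema]
    if input_col not in names:
        raise ValueError(f"Input column {input_col!r} does not exist.")
    if output_col in names:
        raise ValueError(f"Output column {output_col!r} already exists.")
    # Assumption: the indexed column is appended as a new numeric column.
    return schema + [(output_col, "double")]


def unchecked_append(schema, output_col):
    """Model of the reported behavior: append without looking at the schema."""
    return schema + [(output_col, "double")]


schema = [("id", "int"), ("category", "string")]

# Without a check, reusing an existing name yields two columns named "category".
duplicated = unchecked_append(schema, "category")
duplicate_count = [name for name, _ in duplicated].count("category")

# With the check, the same parameters are rejected before any data is touched.
try:
    transform_schema(schema, "category", "category")
    rejected = False
except ValueError:
    rejected = True

# A fresh output column name passes validation.
valid = transform_schema(schema, "category", "categoryIndex")
```

The point of validating first is that a bad parameter combination fails with a clear message instead of silently producing an ambiguous DataFrame that breaks later stages.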

Please confirm, and I will prepare a PR to resolve the bug.

https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/StringIndexer.scala#L147



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
