
Posted to issues@spark.apache.org by "Herman van Hovell tot Westerflier (JIRA)" <ji...@apache.org> on 2015/05/10 18:23:00 UTC

[jira] [Commented] (SPARK-5886) Add StringIndexer

    [ https://issues.apache.org/jira/browse/SPARK-5886?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14537252#comment-14537252 ] 

Herman van Hovell tot Westerflier commented on SPARK-5886:
----------------------------------------------------------

Currently, StringIndexerModel.transformSchema(...) returns the same schema as StringIndexer.transformSchema(...). The returned schema does not contain the label values, even though the model has access to them. This prevents me from doing the following:
{code:none}
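import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer}
import org.apache.spark.sql.DataFrame

val dataset: DataFrame = ???

// Fit the indexer up front, so the resulting model already knows the labels.
val indexer = new StringIndexer().
  setInputCol("A").
  setOutputCol("A_indexed")
val indexerModel = indexer.fit(dataset)

// The encoder consumes the indexed column produced by the model.
val encoder = new OneHotEncoder().
  setInputCol("A_indexed").
  setOutputCol("A_one_hot")

// Use the already fitted model (not the estimator) as a pipeline stage.
val pipeline = new Pipeline().setStages(Array(indexerModel, encoder))
val pipelineModel = pipeline.fit(dataset)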
{code}
Is it possible to change this? If needed, I can take a stab at it.

> Add StringIndexer
> -----------------
>
>                 Key: SPARK-5886
>                 URL: https://issues.apache.org/jira/browse/SPARK-5886
>             Project: Spark
>          Issue Type: Sub-task
>          Components: ML
>            Reporter: Xiangrui Meng
>            Assignee: Xiangrui Meng
>             Fix For: 1.4.0
>
>
> `StringIndexer` takes a column of string labels (raw categories) and outputs an integer column with labels indexed by their frequency.
> {code}
> val li = new StringIndexer()
>   .setInputCol("country")
>   .setOutputCol("countryIndex")
> {code}
> In the output column, we should store the label to index map as an ML attribute. The index should be ordered by frequency, where the most frequent label gets index 0, to enhance sparsity.
> We can discuss whether this should index multiple columns at the same time.
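
The frequency ordering described in the issue can be illustrated with a minimal sketch in plain Scala, without Spark. This is not the StringIndexer implementation: the object name FrequencyIndexSketch, both method names, and the alphabetical tie-break are assumptions made here for illustration only.

```scala
object FrequencyIndexSketch {
  // Order the distinct labels by descending count. Ties are broken
  // alphabetically; that tie-break is an assumption of this sketch,
  // chosen only to make the result deterministic.
  def labelsByFrequency(values: Seq[String]): Array[String] =
    values
      .groupBy(identity)
      .map { case (label, occurrences) => (label, occurrences.size) }
      .toSeq
      .sortBy { case (label, count) => (-count, label) }
      .map { case (label, _) => label }
      .toArray

  // Map each label to its position in that ordering, so the most
  // frequent label gets index 0.
  def labelToIndex(values: Seq[String]): Map[String, Int] =
    labelsByFrequency(values).zipWithIndex.toMap
}
```

For example, FrequencyIndexSketch.labelToIndex(Seq("US", "CN", "US", "FR", "CN", "US")) maps "US" to 0, "CN" to 1 and "FR" to 2, because "US" occurs three times, "CN" twice and "FR" once. The array returned by labelsByFrequency is the kind of label list that the issue proposes to store as an ML attribute on the output column.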



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org