Posted to issues@spark.apache.org by "Manoj Kumar (JIRA)" <ji...@apache.org> on 2015/07/13 22:36:04 UTC

[jira] [Commented] (SPARK-7126) For spark.ml Classifiers, automatically index labels if they are not yet indexed

    [ https://issues.apache.org/jira/browse/SPARK-7126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14625289#comment-14625289 ] 

Manoj Kumar commented on SPARK-7126:
------------------------------------

[~josephkb]

1. In scikit-learn, predict outputs the same labels as the inputs. (Internally we use sklearn.preprocessing.LabelEncoder to encode the input labels into [0, 1, ..., n_labels - 1], in contrast to StringIndexer, which gives the most frequent label the smallest index.)

2. I'm not sure it is necessary to show users what is being done internally. Shouldn't it be sufficient to just give them the predicted output in terms of the input labels? (I'm highly biased by my previous experience with sklearn ;) )
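
To make the two points above concrete, here is a minimal pure-Python sketch of the two encoding schemes and of decoding predictions back to the input labels. It is not the code of either library: the helper names (sorted_index, frequency_index, decode) and the sample labels are hypothetical, and the alphabetical tie-breaking in frequency_index is an assumption of this sketch.

```python
from collections import Counter


def sorted_index(labels):
    """LabelEncoder-style: distinct labels get indices in sorted order."""
    return {label: i for i, label in enumerate(sorted(set(labels)))}


def frequency_index(labels):
    """StringIndexer-style: the most frequent label gets index 0.

    Ties are broken alphabetically here; that is an assumption of this
    sketch, not a statement about Spark's behaviour.
    """
    counts = Counter(labels)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: i for i, label in enumerate(ordered)}


def decode(indices, index):
    """Map predicted indices back to the original labels."""
    inverse = {i: label for label, i in index.items()}
    return [inverse[i] for i in indices]


labels = ["spam", "ham", "spam", "eggs", "spam", "ham"]

by_sort = sorted_index(labels)     # {'eggs': 0, 'ham': 1, 'spam': 2}
by_freq = frequency_index(labels)  # {'spam': 0, 'ham': 1, 'eggs': 2}

# Whatever the internal scheme, the user only sees original labels.
predicted = decode([0, 2, 1], by_freq)  # ['spam', 'eggs', 'ham']
```

The same label set yields different indices under the two schemes, which is why returning predictions in terms of the input labels hides the choice from the user.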

Should we split the JIRA for different classifiers? (I haven't read the code yet, so I'm not sure whether there is a generic way of doing this across all classifiers.)


> For spark.ml Classifiers, automatically index labels if they are not yet indexed
> --------------------------------------------------------------------------------
>
>                 Key: SPARK-7126
>                 URL: https://issues.apache.org/jira/browse/SPARK-7126
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML
>    Affects Versions: 1.4.0
>            Reporter: Joseph K. Bradley
>
> Now that we have StringIndexer, we could have spark.ml.classification.Classifier (the abstraction) automatically handle label indexing if the labels are not yet indexed.
> This would require a bit of design:
> * Should predict() output the original labels or the indices?
> * How should we notify users that the labels are being automatically indexed?
> * How should we provide that index to the users?
> * If multiple parts of a Pipeline automatically index labels, what do we need to do to make sure they are consistent?


