
Posted to issues@spark.apache.org by "Joseph K. Bradley (JIRA)" <ji...@apache.org> on 2016/04/23 00:50:12 UTC

[jira] [Created] (SPARK-14862) Tree and ensemble classification: do not require label metadata

Joseph K. Bradley created SPARK-14862:
-----------------------------------------

             Summary: Tree and ensemble classification: do not require label metadata
                 Key: SPARK-14862
                 URL: https://issues.apache.org/jira/browse/SPARK-14862
             Project: Spark
          Issue Type: Improvement
          Components: ML
            Reporter: Joseph K. Bradley


In spark.ml, DecisionTreeClassifier, RandomForestClassifier, and GBTClassifier require the labelCol to carry metadata specifying the number of classes.  Instead, when that metadata is missing, we should automatically scan the column to determine numClasses.
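
As a rough illustration of the proposed scan, here is a minimal plain-Python sketch. It is not Spark code and not the eventual implementation: the function name infer_num_classes is hypothetical, and it assumes labels are stored as numeric class indices 0, 1, ..., k-1, so that numClasses is the largest index plus one.

```python
import math


def infer_num_classes(labels):
    """Scan numeric labels and return max label + 1.

    Hypothetical helper: assumes labels are class indices 0..k-1.
    """
    max_label = -1
    for value in labels:
        # Reject anything that cannot be a class index (NaN, negative, fractional).
        if not math.isfinite(value) or value < 0 or value != int(value):
            raise ValueError(
                f"Label {value!r} is not a non-negative integer class index"
            )
        max_label = max(max_label, int(value))
    if max_label < 0:
        raise ValueError("Cannot infer the number of classes from an empty label column")
    return max_label + 1


num_classes = infer_num_classes([0.0, 1.0, 2.0, 1.0])
```

The scan costs one extra pass over the label column, which is why it would only run when the metadata is absent.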

Note: This could cause problems when cross validation is run on very small datasets: if there are k classes but class index k-1 does not appear in the training data, the scan would infer fewer than k classes.  We should make sure the error thrown helps the user understand the solution, which is probably to use StringIndexer to index the whole dataset's labelCol before doing cross validation.
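
The failure mode can be reproduced in a few lines of plain Python. This is a sketch under stated assumptions, not Spark code: fit_label_index is a hypothetical stand-in for indexing the whole dataset's labels (most frequent label gets index 0, ties broken alphabetically here), and the "fold" is just a slice of the data.

```python
from collections import Counter


def fit_label_index(raw_labels):
    """Hypothetical stand-in for indexing labels over the WHOLE dataset.

    The most frequent label gets index 0; ties are broken alphabetically.
    """
    counts = Counter(raw_labels)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: index for index, label in enumerate(ordered)}


raw = ["cat", "dog", "cat", "bird", "dog", "cat"]

# Index once, over all rows, before any cross-validation split.
label_index = fit_label_index(raw)
indexed = [label_index[label] for label in raw]
num_classes = len(label_index)

# A small training fold that happens to miss the rarest class (index k-1).
train_fold = indexed[:3]
inferred_from_fold = max(train_fold) + 1
```

Here num_classes is 3 while inferred_from_fold is 2, which is the mismatch the note describes: a scan over the fold alone undercounts, whereas the index built from the full dataset still records all k classes.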


