Posted to issues@spark.apache.org by "Joseph K. Bradley (JIRA)" <ji...@apache.org> on 2015/09/24 07:08:04 UTC

[jira] [Created] (SPARK-10788) Decision Tree duplicates bins for unordered categorical features

Joseph K. Bradley created SPARK-10788:
-----------------------------------------

             Summary: Decision Tree duplicates bins for unordered categorical features
                 Key: SPARK-10788
                 URL: https://issues.apache.org/jira/browse/SPARK-10788
             Project: Spark
          Issue Type: Improvement
          Components: ML
            Reporter: Joseph K. Bradley


Decision trees in spark.ml (RandomForest.scala) effectively create a second copy of each split for unordered categorical features. E.g., if there are 3 categories A, B, C, then we should consider 3 splits:
* A vs. B, C
* A, B vs. C
* A, C vs. B

Currently, we also consider the 3 flipped splits:
* B, C vs. A
* C vs. A, B
* B vs. A, C

This means we communicate twice as much data as needed for these features.
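
As a rough illustration of the counting argument above (not the RandomForest.scala code), the sketch below enumerates each distinct split exactly once by pinning the first category to the left side, so a split and its flipped copy are never both produced. The function name unordered_splits is hypothetical and used only for this example.

```python
from itertools import combinations


def unordered_splits(categories):
    """Return each distinct binary split of `categories` exactly once.

    Hypothetical helper for illustration only; it is not part of Spark.
    """
    if not categories:
        return []
    first, rest = categories[0], categories[1:]
    splits = []
    # Pin `first` to the left side: a split and its mirror image differ only
    # in which side holds `first`, so fixing that side removes the duplicates.
    # `size` stops short of len(rest) so the right side is never empty.
    for size in range(len(rest)):
        for extra in combinations(rest, size):
            left = (first,) + extra
            right = tuple(c for c in rest if c not in extra)
            splits.append((left, right))
    return splits


if __name__ == "__main__":
    for left, right in unordered_splits(["A", "B", "C"]):
        print(", ".join(left), "vs.", ", ".join(right))
```

For k categories this yields 2^(k-1) - 1 splits (3 for A, B, C), compared with 2^k - 2 when the flipped copies are also kept (6 for A, B, C).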

We should eliminate these duplicate splits within the spark.ml implementation since the spark.mllib implementation will be removed before long (and will instead call into spark.ml).


