Posted to issues@spark.apache.org by "Joseph K. Bradley (JIRA)" <ji...@apache.org> on 2015/09/24 07:08:04 UTC
[jira] [Created] (SPARK-10788) Decision Tree duplicates bins for unordered categorical features
Joseph K. Bradley created SPARK-10788:
-----------------------------------------
Summary: Decision Tree duplicates bins for unordered categorical features
Key: SPARK-10788
URL: https://issues.apache.org/jira/browse/SPARK-10788
Project: Spark
Issue Type: Improvement
Components: ML
Reporter: Joseph K. Bradley
Decision trees in spark.ml (RandomForest.scala) effectively create a second copy of each split for unordered categorical features. E.g., if there are 3 categories A, B, C, then we should consider 3 splits:
* A vs. B, C
* A, B vs. C
* A, C vs. B
Currently, we also consider the 3 flipped splits:
* B, C vs. A
* C vs. A, B
* B vs. A, C
This means we communicate twice as much data as needed for these features.
We should eliminate these duplicate splits within the spark.ml implementation since the spark.mllib implementation will be removed before long (and will instead call into spark.ml).
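The duplication described above can be illustrated with a minimal sketch. This is not the code in RandomForest.scala; the function names are hypothetical and Python is used only for illustration. It enumerates every two-sided partition of a category set, then keeps one representative per flipped pair by requiring the first category to stay on the left side (one possible convention, assumed here).

```python
from itertools import combinations


def all_ordered_splits(categories):
    """Every (left, right) partition into two non-empty sides.

    A split and its flipped counterpart are counted separately.
    """
    cats = list(categories)
    splits = []
    for size in range(1, len(cats)):
        for left in combinations(cats, size):
            right = tuple(c for c in cats if c not in left)
            splits.append((left, right))
    return splits


def unique_splits(categories):
    """One representative per flipped pair.

    Assumption for this sketch: the first category is pinned to the left
    side, so exactly one of each (split, flipped split) pair survives.
    """
    cats = list(categories)
    pinned = cats[0]
    return [(left, right) for left, right in all_ordered_splits(cats) if pinned in left]


if __name__ == "__main__":
    cats = ["A", "B", "C"]
    for left, right in unique_splits(cats):
        print(", ".join(left), "vs.", ", ".join(right))
    print(len(all_ordered_splits(cats)), "splits with flips,",
          len(unique_splits(cats)), "without")
```

For M categories this gives 2^M - 2 splits when flips are counted and 2^(M-1) - 1 when they are not, which is the factor of two in the data communicated for these features.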