Posted to issues@spark.apache.org by "Joseph K. Bradley (JIRA)" <ji...@apache.org> on 2015/10/01 20:30:26 UTC
[jira] [Commented] (SPARK-10788) Decision Tree duplicates bins for
unordered categorical features
[ https://issues.apache.org/jira/browse/SPARK-10788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14940186#comment-14940186 ]
Joseph K. Bradley commented on SPARK-10788:
-------------------------------------------
Reading what I wrote now, I realize I didn't actually phrase it correctly. I'll update the description.
> Decision Tree duplicates bins for unordered categorical features
> ----------------------------------------------------------------
>
> Key: SPARK-10788
> URL: https://issues.apache.org/jira/browse/SPARK-10788
> Project: Spark
> Issue Type: Improvement
> Components: ML
> Reporter: Joseph K. Bradley
>
> Decision trees in spark.ml (RandomForest.scala) effectively create a second copy of each split for unordered categorical features. E.g., if there are 3 categories A, B, C, then we should consider only 3 splits:
> * A vs. B, C
> * A, B vs. C
> * A, C vs. B
> Currently, we also consider the 3 flipped splits:
> * B, C vs. A
> * C vs. A, B
> * B vs. A, C
> This means we communicate twice as much data as needed for these features.
> We should eliminate these duplicate splits within the spark.ml implementation since the spark.mllib implementation will be removed before long (and will instead call into spark.ml).
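The counting in the description can be checked with a small sketch. This is illustrative Python, not the code in RandomForest.scala; the function names are hypothetical, and it assumes the category labels are distinct. One common way to avoid the flipped duplicates is to pin one category to the right-hand side and enumerate non-empty subsets of the rest, which gives 2^(k-1) - 1 splits for k categories instead of 2^k - 2.

```python
from itertools import combinations


def all_ordered_splits(categories):
    """Enumerate every (left, right) split, including flipped duplicates.

    Each two-way partition shows up twice: once as (S, rest) and once
    as (rest, S). For k categories this yields 2^k - 2 splits.
    """
    cats = list(categories)
    splits = []
    for size in range(1, len(cats)):
        for left in combinations(cats, size):
            right = [c for c in cats if c not in left]
            splits.append((list(left), right))
    return splits


def unique_unordered_splits(categories):
    """Enumerate each two-way partition exactly once.

    The last category is pinned to the right-hand side, so a split and
    its flipped copy can never both be generated. For k categories this
    yields 2^(k-1) - 1 splits.
    """
    cats = list(categories)
    if len(cats) < 2:
        return []
    rest = cats[:-1]  # the last category never appears on the left
    splits = []
    for size in range(1, len(rest) + 1):
        for left in combinations(rest, size):
            right = [c for c in cats if c not in left]
            splits.append((list(left), right))
    return splits


if __name__ == "__main__":
    for left, right in unique_unordered_splits(["A", "B", "C"]):
        print(", ".join(left), "vs.", ", ".join(right))
```

For the categories A, B, C this produces the three splits listed in the description, while the naive enumeration produces six, which is the doubling the issue refers to.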