Posted to issues@spark.apache.org by "Joseph K. Bradley (JIRA)" <ji...@apache.org> on 2015/10/01 20:34:27 UTC

[jira] [Updated] (SPARK-10788) Decision Tree duplicates bins for unordered categorical features

     [ https://issues.apache.org/jira/browse/SPARK-10788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joseph K. Bradley updated SPARK-10788:
--------------------------------------
    Description: 
Decision trees in spark.ml (RandomForest.scala) communicate twice as much data as needed for unordered categorical features.  Here's an example.

Say there are 3 categories A, B, C.  We consider 3 splits:
* A vs. B, C
* A, B vs. C
* A, C vs. B
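
As a rough illustration of where these 3 splits come from, the sketch below enumerates each split of an unordered categorical feature exactly once by pinning the first category to the left-hand side. The name `unordered_splits` is hypothetical, and this is plain Python for illustration only, not the code in RandomForest.scala.

```python
from itertools import combinations

def unordered_splits(categories):
    """Return each split once as a (left, right) pair of tuples.

    The first category is pinned to the left-hand side so that a split
    and its flipped copy are not both produced.
    """
    first, rest = categories[0], categories[1:]
    splits = []
    # Choose 0 .. len(rest)-1 of the remaining categories to join `first`
    # on the left; taking all of them would leave the right side empty.
    for size in range(len(rest)):
        for extra in combinations(rest, size):
            left = (first,) + extra
            right = tuple(c for c in categories if c not in left)
            splits.append((left, right))
    return splits

splits = unordered_splits(["A", "B", "C"])
```

With k categories this enumeration produces 2^(k-1) - 1 splits, which is 3 for A, B, C.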

Currently, we collect statistics for each of the 6 subsets of categories (3 splits * 2 sides = 6).  However, we could instead collect statistics only for the 3 subsets on the left-hand side of the 3 possible splits: A and A,B and A,C.  If we also have stats for the entire node, then we can compute the stats for the 3 right-hand subsets by subtraction. In pseudomath: {{stats(B,C) = stats(A,B,C) - stats(A)}}.
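
A minimal sketch of the subtraction, assuming the statistics are additive count vectors (for example, per-class label counts). The counts in `per_category` and the helper names `subset_stats` and `right_side_stats` are made up for illustration and do not come from the Spark code.

```python
def right_side_stats(node_stats, left_stats):
    """Derive right-hand stats as stats(node) - stats(left), element-wise."""
    return [total - left for total, left in zip(node_stats, left_stats)]

# Hypothetical per-category label counts [class 0, class 1] at one node.
per_category = {"A": [4, 1], "B": [2, 3], "C": [0, 5]}

def subset_stats(subset):
    """Sum the count vectors of the categories in `subset`."""
    return [sum(col) for col in zip(*(per_category[c] for c in subset))]

# stats(A,B,C): statistics for the entire node.
node_stats = subset_stats("ABC")

# Only the 3 left-hand bins need to be collected and communicated.
left_bins = {left: subset_stats(left) for left in ("A", "AB", "AC")}

# The 3 right-hand bins are recovered locally by subtraction.
right_bins = {left: right_side_stats(node_stats, stats)
              for left, stats in left_bins.items()}
```

Here `right_bins["A"]` equals the stats that would otherwise have been collected separately for B,C, so only 3 bins plus the node total are needed instead of 6 bins.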

We should eliminate these extra bins within the spark.ml implementation since the spark.mllib implementation will be removed before long (and will instead call into spark.ml).

  was:
Decision trees in spark.ml (RandomForest.scala) effectively create a second copy of each split. E.g., if there are 3 categories A, B, C, then we should consider 3 splits:
* A vs. B, C
* A, B vs. C
* A, C vs. B

Currently, we also consider the 3 flipped splits:
* B,C vs. A
* C vs. A, B
* B vs. A, C

This means we communicate twice as much data as needed for these features.

We should eliminate these duplicate splits within the spark.ml implementation since the spark.mllib implementation will be removed before long (and will instead call into spark.ml).


> Decision Tree duplicates bins for unordered categorical features
> ----------------------------------------------------------------
>
>                 Key: SPARK-10788
>                 URL: https://issues.apache.org/jira/browse/SPARK-10788
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML
>            Reporter: Joseph K. Bradley


