Posted to issues@spark.apache.org by "Weichen Xu (JIRA)" <ji...@apache.org> on 2017/11/06 10:19:00 UTC

[jira] [Commented] (SPARK-3383) DecisionTree aggregate size could be smaller

    [ https://issues.apache.org/jira/browse/SPARK-3383?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16240106#comment-16240106 ] 

Weichen Xu commented on SPARK-3383:
-----------------------------------

[~facai] Oh, I did not notice that you had commented here. I think the idea you mentioned above is exactly the same as what is done in my PR:
https://github.com/apache/spark/pull/19666
Would you mind helping to review it? Thanks!


> DecisionTree aggregate size could be smaller
> --------------------------------------------
>
>                 Key: SPARK-3383
>                 URL: https://issues.apache.org/jira/browse/SPARK-3383
>             Project: Spark
>          Issue Type: Improvement
>          Components: MLlib
>    Affects Versions: 1.1.0
>            Reporter: Joseph K. Bradley
>            Priority: Minor
>
> Storage and communication optimization:
> DecisionTree aggregate statistics could store less data (described below).  The savings would be significant for datasets with many low-arity categorical features (binary features, or unordered categorical features).  Savings would be negligible for continuous features.
> DecisionTree stores a vector of sufficient statistics for each (node, feature, bin).  We could store 1 fewer bin per (node, feature): For a given (node, feature), if we store these vectors for all but the last bin, and also store the total statistics for each node, then we could compute the statistics for the last bin by subtracting the stored bins from the node total.  For binary and unordered categorical features, this would cut in half the number of bins to store and communicate.
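
The recovery step described in the issue can be sketched roughly as follows. This is only an illustration, not Spark's implementation: it assumes the sufficient statistics are additive across bins (for example, per-class label counts), and the function name `recover_last_bin` and its arguments are hypothetical.

```python
from typing import List


def recover_last_bin(node_total: List[float],
                     stored_bins: List[List[float]]) -> List[float]:
    """Recover the statistics of the one bin that was not stored.

    Hypothetical helper: assumes the statistics are additive, so the
    omitted bin equals the node total minus the sum of the stored bins.
    """
    last = list(node_total)  # copy so the caller's total is not mutated
    for bin_stats in stored_bins:
        if len(bin_stats) != len(node_total):
            raise ValueError("every bin must have the same length as the node total")
        for i, value in enumerate(bin_stats):
            last[i] -= value
    return last


# Binary feature with two label classes: only bin 0 is stored,
# and bin 1 is derived from the per-node total.
node_total = [5.0, 4.0]   # label counts over all instances at the node
stored = [[3.0, 1.0]]     # label counts for bin 0
last_bin = recover_last_bin(node_total, stored)
```

For a binary feature this stores one bin instead of two per (node, feature), at the cost of one extra total vector per node that is shared by all features.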



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
