Posted to issues@spark.apache.org by "Joseph K. Bradley (JIRA)" <ji...@apache.org> on 2014/08/14 20:22:12 UTC

[jira] [Created] (SPARK-3043) DecisionTree aggregation is inefficient

Joseph K. Bradley created SPARK-3043:
----------------------------------------

             Summary: DecisionTree aggregation is inefficient
                 Key: SPARK-3043
                 URL: https://issues.apache.org/jira/browse/SPARK-3043
             Project: Spark
          Issue Type: Improvement
          Components: MLlib
            Reporter: Joseph K. Bradley


There are two major efficiency issues, one in computation and one in storage:

(1) DecisionTree aggregation involves reshaping data unnecessarily.

E.g., the internal methods extractNodeInfo() and getBinDataForNode() reshape the data multiple times without doing any real computation.
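
One way to avoid intermediate reshaping is to accumulate bin statistics directly into a single flat array, computing each position from (node, feature, bin). The sketch below only illustrates that idea and is not MLlib's code: flat_index, aggregate, and the (node, binned_features, label) tuple layout are all hypothetical names and assumptions.

```python
def flat_index(node, feature, bin_idx, num_features, num_bins):
    """Position of (node, feature, bin) in a flat, row-major aggregate array."""
    return (node * num_features + feature) * num_bins + bin_idx


def aggregate(points, num_nodes, num_features, num_bins):
    """Sum labels per (node, feature, bin) in one pass, with no reshaping.

    points: iterable of (node, binned_features, label), where binned_features
    holds one bin index per feature (a hypothetical layout for this sketch).
    """
    agg = [0.0] * (num_nodes * num_features * num_bins)
    for node, bins, label in points:
        for feature, bin_idx in enumerate(bins):
            agg[flat_index(node, feature, bin_idx, num_features, num_bins)] += label
    return agg


# Two points, two nodes, two features, two bins per feature.
points = [(0, [1, 0], 1.0), (1, [0, 1], 2.0)]
agg = aggregate(points, num_nodes=2, num_features=2, num_bins=2)
```

Because every update writes straight to its final slot, the statistics never need to be copied into a differently shaped structure before splits are evaluated.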

(2) DecisionTree splits and aggregate bins can include many unused bins/splits.

The same number of splits/bins is used for all features.  E.g., if there is a continuous feature which uses 100 bins, then 100 bins will also be allocated for each binary feature, even though only 2 are necessary.
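
The storage cost can be sketched with a small calculation. The example below assumes one continuous feature with 100 bins and three binary features (the number of binary features is an assumption chosen for illustration), and compares uniform allocation with a packed layout that gives each feature only the bins it needs. The helper feature_offsets is hypothetical, not an MLlib function.

```python
from itertools import accumulate


def feature_offsets(bins_per_feature):
    """Return each feature's start offset in a packed array, plus total size."""
    ends = [0] + list(accumulate(bins_per_feature))
    return ends[:-1], ends[-1]


# One continuous feature (100 bins) and three binary features (2 bins each).
bins_per_feature = [100, 2, 2, 2]

offsets, packed_size = feature_offsets(bins_per_feature)

# Uniform allocation gives every feature as many bins as the largest one.
uniform_size = max(bins_per_feature) * len(bins_per_feature)
```

In this sketch the uniform layout allocates 400 bins per node while the packed layout allocates 106, and a bin for feature f is found at offsets[f] + bin_idx.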




--
This message was sent by Atlassian JIRA
(v6.2#6252)
