Posted to issues@spark.apache.org by "Joseph K. Bradley (JIRA)" <ji...@apache.org> on 2014/08/21 00:25:31 UTC

[jira] [Created] (SPARK-3163) Separate continuous and categorical features in DecisionTree

Joseph K. Bradley created SPARK-3163:
----------------------------------------

             Summary: Separate continuous and categorical features in DecisionTree
                 Key: SPARK-3163
                 URL: https://issues.apache.org/jira/browse/SPARK-3163
             Project: Spark
          Issue Type: Improvement
          Components: MLlib
            Reporter: Joseph K. Bradley
            Priority: Minor


Improvement: code clarity, memory usage

Currently, during DecisionTree training, some internal data structures have overloaded meanings and carry unused values.  The same data structures are shared across all feature types, but each feature type uses them differently.
Data structures: Split, Bins, aggregates
Feature types: continuous, ordered categorical, and unordered categorical
This causes a couple of issues:
(1) Overloading the meaning of these data (for different types of features) makes the code difficult to understand.
(2) This leads to extra storage (e.g., unused lowSplit for some categorical features), and extra computation (e.g., findAggForUnorderedFeatureClassification simply reshapes data).

Proposed fix: Use different storage formats to save space and separate out these semantically different types.
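
As a rough illustration of the proposed separation, here is a minimal sketch in which each feature type gets its own split representation and stores only the fields it needs. All names here (FeatureSplit, ContinuousSplit, CategoricalSplit, goesLeft) are hypothetical and are not the existing MLlib classes. For brevity, the sketch does not distinguish ordered from unordered categorical features.

```scala
// Hypothetical types for illustration only; not Spark's actual classes.
sealed trait FeatureSplit {
  def feature: Int
  // True if a data point with this feature value goes to the left child.
  def goesLeft(value: Double): Boolean
}

// Continuous feature: stores a single threshold and no category set.
final case class ContinuousSplit(feature: Int, threshold: Double)
    extends FeatureSplit {
  def goesLeft(value: Double): Boolean = value <= threshold
}

// Categorical feature: stores the categories sent left and no threshold.
final case class CategoricalSplit(feature: Int, leftCategories: Set[Int])
    extends FeatureSplit {
  def goesLeft(value: Double): Boolean = leftCategories.contains(value.toInt)
}
```

With a sealed trait, code that handles splits can pattern match on the feature type, and neither case carries fields that only make sense for the other.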

A related issue which could be fixed at the same time is that multiple copies of each split (about 3) are kept.
Currently: Splits are stored both on their own and inside bins.  I.e., there are separate splits and bins arrays, but bins also store copies of splits. (Total: 3 copies of each split.)
Possible fix: Keep separate arrays of splits and bins.  Do not store splits in bins.  There is a simple correspondence between the two, so it would be easy to match splits to bins.
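
One way the correspondence could look for a continuous feature is sketched below, assuming n sorted split thresholds define n + 1 bins and bin i covers the interval (threshold(i-1), threshold(i)].  The helper names (binIndex, binBounds) and the interval convention are assumptions made for illustration, not the existing MLlib code.

```scala
// Hypothetical helpers; the names and the (low, high] convention are assumptions.
// `thresholds` must be sorted in increasing order.

// Index of the bin containing `value`: the first threshold that is >= value,
// or the last bin if the value exceeds every threshold.
def binIndex(thresholds: Vector[Double], value: Double): Int = {
  val i = thresholds.indexWhere(t => value <= t)
  if (i == -1) thresholds.length else i
}

// Bounding splits of a bin, looked up from the splits array on demand
// instead of being stored in the bin. None means unbounded on that side.
def binBounds(thresholds: Vector[Double], bin: Int): (Option[Double], Option[Double]) = {
  val low = if (bin > 0) Some(thresholds(bin - 1)) else None
  val high = if (bin < thresholds.length) Some(thresholds(bin)) else None
  (low, high)
}
```

Under this convention a bin needs no split fields of its own, since its bounds can be recovered from its index into the splits array.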
