Posted to issues@spark.apache.org by "Hyukjin Kwon (JIRA)" <ji...@apache.org> on 2019/05/21 05:37:26 UTC

[jira] [Resolved] (SPARK-3163) Separate continuous and categorical features in DecisionTree

     [ https://issues.apache.org/jira/browse/SPARK-3163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-3163.
---------------------------------
    Resolution: Incomplete

> Separate continuous and categorical features in DecisionTree
> ------------------------------------------------------------
>
>                 Key: SPARK-3163
>                 URL: https://issues.apache.org/jira/browse/SPARK-3163
>             Project: Spark
>          Issue Type: Improvement
>          Components: MLlib
>            Reporter: Joseph K. Bradley
>            Priority: Minor
>              Labels: bulk-closed
>
> Improvement: code clarity, memory usage
> Currently, during DecisionTree training, some internal data structures have overloaded meanings and unused values.  These data structures are shared for all types of features, but they are used differently for different types of features.
> Data structures: Split, Bins, aggregates
> Feature types: continuous, ordered categorical, and unordered categorical
> This causes a couple of issues:
> (1) Overloading the meaning of these data (for different types of features) makes the code difficult to understand.
> (2) This leads to extra storage (e.g., unused lowSplit for some categorical features), and extra computation (e.g., findAggForUnorderedFeatureClassification simply reshapes data).
> Proposed fix: Use different storage formats to save space and separate out these semantically different types.
> A related issue, which could be fixed at the same time, is that multiple copies of each split (about 3) are kept.
> Currently: There are separate splits and bins arrays, but bins also store copies of splits, so splits are stored both on their own and inside bins. (Total: 3 copies of each split.)
> Possible fix: Keep separate arrays of splits and bins.  Do not store splits in bins.  There is a simple correspondence between the two, so it would be easy to match splits to bins.
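
Editor's sketch (not part of the original issue): the "simple correspondence" between splits and bins can be illustrated for a continuous feature as follows. This is a minimal Python sketch, not MLlib code. The class name ContinuousSplits and its methods are hypothetical, and it assumes that splits are sorted thresholds and that a value equal to a threshold falls in the lower bin. Under those assumptions a bin is only an index into the splits array, so no bin needs to hold a copy of a split.

```python
from bisect import bisect_left
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class ContinuousSplits:
    """Hypothetical container: sorted thresholds for one continuous feature."""
    thresholds: List[float]

    def num_bins(self) -> int:
        # k thresholds partition the real line into k + 1 bins.
        return len(self.thresholds) + 1

    def bin_index(self, value: float) -> int:
        # Bin i holds values with thresholds[i - 1] < value <= thresholds[i].
        return bisect_left(self.thresholds, value)

    def bin_bounds(self, i: int) -> Tuple[Optional[float], Optional[float]]:
        # Recover the bounding splits of bin i by index instead of storing
        # copies; None marks an unbounded side (first or last bin).
        low = self.thresholds[i - 1] if i > 0 else None
        high = self.thresholds[i] if i < len(self.thresholds) else None
        return low, high


splits = ContinuousSplits([1.0, 2.5, 4.0])
print(splits.bin_index(3.0), splits.bin_bounds(2))
```

With this layout each split is stored exactly once. Categorical features would need a different representation (for example, a category value or a set of categories per split), which this sketch does not attempt to cover.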



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
