Posted to issues@spark.apache.org by "Sean Owen (JIRA)" <ji...@apache.org> on 2017/05/03 09:56:04 UTC
[jira] [Updated] (SPARK-16957) Use weighted midpoints for split values.
[ https://issues.apache.org/jira/browse/SPARK-16957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sean Owen updated SPARK-16957:
------------------------------
Priority: Minor (was: Trivial)
> Use weighted midpoints for split values.
> ----------------------------------------
>
> Key: SPARK-16957
> URL: https://issues.apache.org/jira/browse/SPARK-16957
> Project: Spark
> Issue Type: Improvement
> Components: MLlib
> Reporter: Vladimir Feinberg
> Assignee: Yan Facai (颜发才)
> Priority: Minor
> Fix For: 2.3.0
>
>
> We should use weighted split points rather than the actual binned values of a continuous feature. For instance, in a dataset containing binary features (that are fed in as continuous ones), our splits are selected as {{x <= 0.0}} and {{x > 0.0}}. For any real data with some smoothness, this is asymptotically worse than GBM's approach. The split point should instead be a weighted point between the two values of the "innermost" feature bins; e.g., if there are 30 samples with {{x = 0}} and 10 with {{x = 1}}, the above split should be at {{0.75}}.
> Example:
> {code}
> +--------+--------+-----+-----+
> |feature0|feature1|label|count|
> +--------+--------+-----+-----+
> |     0.0|     0.0|  0.0|   23|
> |     1.0|     0.0|  0.0|    2|
> |     0.0|     0.0|  1.0|    2|
> |     0.0|     1.0|  0.0|    7|
> |     1.0|     0.0|  1.0|   23|
> |     0.0|     1.0|  1.0|   18|
> |     1.0|     1.0|  1.0|    7|
> |     1.0|     1.0|  0.0|   18|
> +--------+--------+-----+-----+
> DecisionTreeRegressionModel (uid=dtr_01ae90d489b1) of depth 2 with 7 nodes
>   If (feature 0 <= 0.0)
>    If (feature 1 <= 0.0)
>     Predict: -0.56
>    Else (feature 1 > 0.0)
>     Predict: 0.29333333333333333
>   Else (feature 0 > 0.0)
>    If (feature 1 <= 0.0)
>     Predict: 0.56
>    Else (feature 1 > 0.0)
>     Predict: -0.29333333333333333
> {code}
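
The description gives the 30/10 example but not the formula behind it. As a rough sketch only, one reading that reproduces the quoted {{0.75}} is to move the threshold from the left bin value toward the right one by the fraction of samples in the left bin. The helper name weighted_split and this weighting rule are assumptions made for illustration; they are not taken from the issue or from Spark's code.

```python
def weighted_split(x_left, n_left, x_right, n_right):
    """Hypothetical helper: count-weighted threshold between two adjacent bin values.

    Assumed rule (not stated in the issue): start at x_left and move toward
    x_right by the fraction of samples that sit in the left bin, so the
    threshold ends up farther from the more populated value.
    """
    if n_left <= 0 or n_right <= 0:
        raise ValueError("both bins must contain at least one sample")
    if not x_left < x_right:
        raise ValueError("x_left must be strictly less than x_right")
    left_fraction = n_left / (n_left + n_right)
    return x_left + (x_right - x_left) * left_fraction


# The example from the description: 30 samples at x = 0, 10 samples at x = 1.
print(weighted_split(0.0, 30, 1.0, 10))  # 0.75

# With equal counts the rule reduces to the plain midpoint.
print(weighted_split(0.0, 20, 1.0, 20))  # 0.5
```

Under this reading, a binary feature fed in as continuous would be split strictly between 0 and 1 rather than at {{x <= 0.0}}, and the threshold would shift as the counts in the two bins change.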