Posted to issues@spark.apache.org by "Hyukjin Kwon (JIRA)" <ji...@apache.org> on 2019/05/21 04:04:41 UTC
[jira] [Updated] (SPARK-18348) Improve tree ensemble model summary

     [ https://issues.apache.org/jira/browse/SPARK-18348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon updated SPARK-18348:
---------------------------------
    Labels: bulk-closed  (was: )

> Improve tree ensemble model summary
> -----------------------------------
>
>                 Key: SPARK-18348
>                 URL: https://issues.apache.org/jira/browse/SPARK-18348
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML, SparkR
>    Affects Versions: 2.0.0, 2.1.0
>            Reporter: Felix Cheung
>            Priority: Major
>              Labels: bulk-closed
>
> While working on the R APIs for tree ensemble models (e.g. Random Forest, GBT), we found that:
> - we don't have a good per-node or per-tree summary of observations, loss, probability, and so on
> - we don't have a shared API with nicely formatted output
> We believe this could be a shared API that benefits multiple language bindings, including R, when available.
> For example, here is what R's {{rpart}} shows for a model summary:
> {code}
> Call:
> rpart(formula = Kyphosis ~ Age + Number + Start, data = kyphosis,
>     method = "class")
>   n= 81
>           CP nsplit rel error    xerror      xstd
> 1 0.17647059      0 1.0000000 1.0000000 0.2155872
> 2 0.01960784      1 0.8235294 0.9411765 0.2107780
> 3 0.01000000      4 0.7647059 1.0588235 0.2200975
> Variable importance
>  Start    Age Number
>     64     24     12
> Node number 1: 81 observations,    complexity param=0.1764706
>   predicted class=absent   expected loss=0.2098765  P(node) =1
>     class counts:    64    17
>    probabilities: 0.790 0.210
>   left son=2 (62 obs) right son=3 (19 obs)
>   Primary splits:
>       Start  < 8.5  to the right, improve=6.762330, (0 missing)
>       Number < 5.5  to the left,  improve=2.866795, (0 missing)
>       Age    < 39.5 to the left,  improve=2.250212, (0 missing)
>   Surrogate splits:
>       Number < 6.5  to the left,  agree=0.802, adj=0.158, (0 split)
> Node number 2: 62 observations,    complexity param=0.01960784
>   predicted class=absent   expected loss=0.09677419  P(node) =0.7654321
>     class counts:    56     6
>    probabilities: 0.903 0.097
>   left son=4 (29 obs) right son=5 (33 obs)
>   Primary splits:
>       Start  < 14.5 to the right, improve=1.0205280, (0 missing)
>       Age    < 55   to the left,  improve=0.6848635, (0 missing)
>       Number < 4.5  to the left,  improve=0.2975332, (0 missing)
>   Surrogate splits:
>       Number < 3.5  to the left,  agree=0.645, adj=0.241, (0 split)
>       Age    < 16   to the left,  agree=0.597, adj=0.138, (0 split)
> Node number 3: 19 observations
>   predicted class=present  expected loss=0.4210526  P(node) =0.2345679
>     class counts:     8    11
>    probabilities: 0.421 0.579
> Node number 4: 29 observations
>   predicted class=absent   expected loss=0  P(node) =0.3580247
>     class counts:    29     0
>    probabilities: 1.000 0.000
> Node number 5: 33 observations,    complexity param=0.01960784
>   predicted class=absent   expected loss=0.1818182  P(node) =0.4074074
>     class counts:    27     6
>    probabilities: 0.818 0.182
>   left son=10 (12 obs) right son=11 (21 obs)
>   Primary splits:
>       Age    < 55   to the left,  improve=1.2467530, (0 missing)
>       Start  < 12.5 to the right, improve=0.2887701, (0 missing)
>       Number < 3.5  to the right, improve=0.1753247, (0 missing)
>   Surrogate splits:
>       Start  < 9.5  to the left,  agree=0.758, adj=0.333, (0 split)
>       Number < 5.5  to the right, agree=0.697, adj=0.167, (0 split)
> Node number 10: 12 observations
>   predicted class=absent   expected loss=0  P(node) =0.1481481
>     class counts:    12     0
>    probabilities: 1.000 0.000
> Node number 11: 21 observations,    complexity param=0.01960784
>   predicted class=absent   expected loss=0.2857143  P(node) =0.2592593
>     class counts:    15     6
>    probabilities: 0.714 0.286
>   left son=22 (14 obs) right son=23 (7 obs)
>   Primary splits:
>       Age    < 111  to the right, improve=1.71428600, (0 missing)
>       Start  < 12.5 to the right, improve=0.79365080, (0 missing)
>       Number < 3.5  to the right, improve=0.07142857, (0 missing)
> Node number 22: 14 observations
>   predicted class=absent   expected loss=0.1428571  P(node) =0.1728395
>     class counts:    12     2
>    probabilities: 0.857 0.143
> Node number 23: 7 observations
>   predicted class=present  expected loss=0.4285714  P(node) =0.08641975
>     class counts:     3     4
>    probabilities: 0.429 0.571
> {code}
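
The per-node figures in the rpart output above can be reproduced from the class counts alone, which hints at what a shared summary API would need to expose. The following is a minimal Python sketch, not an existing Spark or rpart API: the names NodeSummary, summarize_node and child_ids are hypothetical. It assumes the default 0/1 loss and priors taken from the data, so that expected loss is the share of observations outside the majority class and P(node) is the share of all training observations reaching the node. The child numbering (2k and 2k+1) is inferred from the node numbers shown in the output.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class NodeSummary:
    """Hypothetical per-node summary, modeled on the fields rpart prints."""
    node_id: int
    n_obs: int
    predicted_index: int                    # index of the majority class
    expected_loss: float                    # misclassification rate within the node
    node_probability: float                 # share of all observations reaching the node
    class_probabilities: Tuple[float, ...]  # class counts divided by n_obs


def summarize_node(node_id: int, class_counts: Sequence[int], n_total: int) -> NodeSummary:
    """Derive the summary fields from class counts (assumes 0/1 loss, data priors)."""
    n_obs = sum(class_counts)
    if n_obs <= 0 or n_total <= 0:
        raise ValueError("node and training set must both contain observations")
    # Majority class; ties resolve to the lowest class index.
    predicted = max(range(len(class_counts)), key=lambda i: class_counts[i])
    return NodeSummary(
        node_id=node_id,
        n_obs=n_obs,
        predicted_index=predicted,
        expected_loss=(n_obs - class_counts[predicted]) / n_obs,
        node_probability=n_obs / n_total,
        class_probabilities=tuple(c / n_obs for c in class_counts),
    )


def child_ids(node_id: int) -> Tuple[int, int]:
    """Left and right child numbers, following the numbering seen in the output."""
    return 2 * node_id, 2 * node_id + 1


if __name__ == "__main__":
    # Node 2 from the output above: class counts 56 and 6, out of 81 observations.
    print(summarize_node(2, [56, 6], 81))
    print(child_ids(2))
```

For node 2, this gives an expected loss of 6/62 and a node probability of 62/81, matching the values printed by rpart. Split improvement, surrogate splits and complexity parameters are not covered here because they cannot be derived from class counts alone.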


