Posted to issues@spark.apache.org by "Vladimir Feinberg (JIRA)" <ji...@apache.org> on 2016/07/07 18:04:11 UTC

[jira] [Comment Edited] (SPARK-4240) Refine Tree Predictions in Gradient Boosting to Improve Prediction Accuracy.

    [ https://issues.apache.org/jira/browse/SPARK-4240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15366502#comment-15366502 ] 

Vladimir Feinberg edited comment on SPARK-4240 at 7/7/16 6:03 PM:
------------------------------------------------------------------

Pending some dramatic response from [~sethah] telling me to back off, I'll take over this one. [~josephkb], mind reviewing the below outline?

I propose that this JIRA be resolved in the following manner:
API Change: Since a true "TreeBoost" uses the reduction in the actual loss as its split impurity, the impurity calculator should be derived from the loss function itself.
 * Set the impurity param's default in GBTs to a new 'auto' value, which selects the loss-based impurity; users can still override it to the standard RF impurities if desired.
 * Create a generic loss-reduction calculator which works by reducing a parameterizable loss criterion (or, rather, a Taylor approximation of it, as recommended by Friedman \[1\] and implemented to second order by XGBoost \[2\] \[code: 5\]); a sketch follows this list.
 * Instantiate the generic loss-reduction calculator (that supports different orders of losses) for regression:
 ** Add squared and absolute losses
 **  'auto' induces a second-order approximation for squared loss, and only a first-order approximation for absolute loss
 ** The former should perform better than LS_Boost from \[1\] (which only uses the first-order approximation) and the latter is equivalent to LAD_TreeBoost from \[1\]. It may be worthwhile to add an LS_Boost impurity and check it performs worse. Both these "generic loss" instantiations become new impurities that the user could set, just like 'gini' or 'entropy'. This calculator will implement corresponding terminal-leaf predictions, either the mean or median of the leaf's sample. Computing the median may require modifications to the internal developer API so that at some point the calculator can access the entire set of training samples a terminal node's partition corresponds to.
 * On the classifier side we need to do the same thing, with a logistic loss inducing a new impurity. Second order here is again feasible. First order corresponds to L2_TreeBoost from \[1\].
 * Because the new impurities apply only to GBTs, they'll only be available for them.
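As a rough sketch of the loss-derived impurity described in the list above (all names here are hypothetical, not the existing developer API): the calculator aggregates first- and, optionally, second-order gradient statistics per candidate split and scores the split by the Taylor-approximated loss reduction, along the lines of \[1\] and \[2\]:

{code:scala}
// Hypothetical sketch only: GradientStats and splitGain are illustrative names,
// not Spark API. Splits are scored by Taylor-approximated loss reduction.
object TreeBoostSketch {

  /** Sufficient statistics for a Taylor expansion of the loss at a node. */
  case class GradientStats(gradSum: Double, hessSum: Double, count: Long) {
    def +(o: GradientStats): GradientStats =
      GradientStats(gradSum + o.gradSum, hessSum + o.hessSum, count + o.count)

    /** Approximate loss decrease from fitting the best constant on this node:
      * second order uses G^2 / (2H); first order treats the Hessian as n. */
    def gain(secondOrder: Boolean): Double = {
      val denom = if (secondOrder) hessSum else count.toDouble
      if (denom > 0.0) gradSum * gradSum / (2.0 * denom) else 0.0
    }
  }

  /** Split "impurity gain": what the two children buy over the parent. */
  def splitGain(left: GradientStats, right: GradientStats, secondOrder: Boolean): Double =
    left.gain(secondOrder) + right.gain(secondOrder) - (left + right).gain(secondOrder)

  // Per-point statistics for the two regression losses above:
  //   squared loss:  grad = pred - label, hess = 1
  //   absolute loss: grad = signum(pred - label), no useful second order
}
{code}

The point of framing it this way is that the loss-derived impurities can share the same split-selection machinery as 'gini', 'entropy', and 'variance'.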

Questions for [~josephkb]:
1. Should I ditch the second-order approximation that \[2\] makes? It won't make the code any simpler, but it might make the theoretical offering of the new impurities easier to grasp. It would add another task ("try out second-order Taylor approximation") to the follow-up list below, and it also means we won't match xgboost until that second-order work happens.
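Roughly, and in my notation rather than anything in the code (constants aside): with {{g_i}} and {{h_i}} the first and second derivatives of the loss at the current prediction, and {{G}}, {{H}} their sums over a child node, the two approximations score a split as

{noformat}
\mathrm{gain}_{1} = \frac{G_L^2}{2 n_L} + \frac{G_R^2}{2 n_R} - \frac{(G_L + G_R)^2}{2 (n_L + n_R)}
\mathrm{gain}_{2} = \frac{G_L^2}{2 H_L} + \frac{G_R^2}{2 H_R} - \frac{(G_L + G_R)^2}{2 (H_L + H_R)}
{noformat}

Per point, squared loss has {{h_i = 1}} (its expansion is exact), logistic loss has {{h_i = p_i (1 - p_i)}}, and absolute loss has zero second derivative almost everywhere, which is why it gets the first-order-only treatment above.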

Note that the L2 (squared) loss corresponds to Gaussian regression, L1 (absolute) to Laplace, and logistic to Bernoulli. I'll add these aliases to the loss param; see the sketch below.
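Concretely, something like the following hypothetical alias table (names are just for illustration; the actual plumbing would go through the existing loss param):

{code:scala}
// Hypothetical alias table: distribution-style names map onto the losses above.
object LossAliases {
  val table: Map[String, String] = Map(
    "gaussian"  -> "squared",   // L2
    "laplace"   -> "absolute",  // L1
    "bernoulli" -> "logistic"
  )
}
{code}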

Differences between this and \[2\]:
* No leaf weight regularization, besides the default constant shrinkage, is implemented.

Differences between this and \[3\]:
* \[3\] uses variance impurity for split selection \[code: 6\]. I don't think this is even technically TreeBoost. Such behavior should be emulatable in the new code by overriding impurity='variance' (would be nice to see if we have comparable perf here).
* \[3\] implements GBTs for weighted input data. We don't support data weights, so for both the L1 and L2 losses the terminal-node computations don't need Newton-Raphson optimization; the closed-form leaf refinements are sketched below.
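A sketch of those closed-form terminal-leaf refinements in the unweighted case (illustrative only: the real calculator would work from partition statistics rather than an in-memory array, and {{refinedLeafPrediction}} is not an existing API):

{code:scala}
object LeafRefinementSketch {
  /** Constant offset that minimizes the given loss over one leaf's samples. */
  def refinedLeafPrediction(labels: Array[Double], preds: Array[Double], loss: String): Double = {
    val residuals = labels.zip(preds).map { case (y, f) => y - f }
    loss match {
      // Squared loss: the minimizing constant is the mean residual.
      case "squared" => residuals.sum / residuals.length
      // Absolute loss: the minimizing constant is the median residual
      // (LAD_TreeBoost in [1]); sample weights would complicate this, per the point above.
      case "absolute" =>
        val sorted = residuals.sorted
        val n = sorted.length
        if (n % 2 == 1) sorted(n / 2) else 0.5 * (sorted(n / 2 - 1) + sorted(n / 2))
      case other => throw new IllegalArgumentException(s"unsupported loss: $other")
    }
  }
}
{code}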

Probably not for this JIRA:
1. Implementing leaf weights (and leaf-weight regularization): this probably involves adding a regularization param to GBTs and creating new regularization-aware impurity calculators.
2. In {{RandomForest.scala}} the line {{val requiredSamples = math.max(metadata.maxBins * metadata.maxBins, 10000)}} performs row subsampling on our data. I don't know if it's sound from a statistical-learning perspective, but it is something we should take a look at (i.e., does a more precise sample-complexity calculation, in the PAC sense, lead to better perf?).
3. Add different "losses" corresponding to residual distributions; see the ones supported in \[3\] \[4\] \[7\]. Depending on what we add, we may need to implement Newton-Raphson optimization. Huber loss is the only one mentioned in \[1\] that we don't yet have; a sketch of it follows.
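For reference, the Huber loss and its pseudo-residual from \[1\], which is what we'd be adding (sketch only; {{delta}} would be re-estimated each iteration as a quantile of the absolute residuals):

{code:scala}
object HuberSketch {
  /** Huber loss: quadratic near zero, linear in the tails. */
  def loss(label: Double, pred: Double, delta: Double): Double = {
    val r = label - pred
    if (math.abs(r) <= delta) 0.5 * r * r
    else delta * (math.abs(r) - 0.5 * delta)
  }

  /** Negative gradient (the pseudo-residual the next tree is fit to). */
  def negativeGradient(label: Double, pred: Double, delta: Double): Double = {
    val r = label - pred
    if (math.abs(r) <= delta) r else delta * math.signum(r)
  }
}
{code}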

\[1\] Friedman paper: https://statweb.stanford.edu/~jhf/ftp/trebst.pdf
\[2\] xgboost paper: http://www.kdd.org/kdd2016/papers/files/Paper_697.pdf
\[3\] gbm impl paper: http://www.saedsayad.com/docs/gbm2.pdf
\[4\] xgboost docs: https://xgboost.readthedocs.io/en/latest//parameter.html#general-parameters
\[5\] xgboost code (regression objectives): https://github.com/dmlc/xgboost/blob/1625dab1cbc416d9d9a79dde141e3d236060387a/src/objective/regression_obj.cc
\[6\] gbm code (node parameters): https://github.com/gbm-developers/gbm/blob/master/src/node_parameters.h
\[7\] gbm api: https://cran.r-project.org/web/packages/gbm/gbm.pdf



> Refine Tree Predictions in Gradient Boosting to Improve Prediction Accuracy.
> ----------------------------------------------------------------------------
>
>                 Key: SPARK-4240
>                 URL: https://issues.apache.org/jira/browse/SPARK-4240
>             Project: Spark
>          Issue Type: New Feature
>          Components: MLlib
>    Affects Versions: 1.3.0
>            Reporter: Sung Chung
>
> The gradient boosting as currently implemented estimates the loss-gradient in each iteration using regression trees. At every iteration, the regression trees are trained/split to minimize predicted gradient variance. Additionally, the terminal node predictions are computed to minimize the prediction variance.
> However, such predictions won't be optimal for loss functions other than the mean-squared error. The TreeBoosting refinement can help mitigate this issue by modifying terminal node prediction values so that those predictions would directly minimize the actual loss function. Although this still doesn't change the fact that the tree splits were done through variance reduction, it should still lead to improvement in gradient estimations, and thus better performance.
> The details of this can be found in the R vignette. This paper also shows how to refine the terminal node predictions.
> http://www.saedsayad.com/docs/gbm2.pdf


