Posted to issues@spark.apache.org by "zhengruifeng (Jira)" <ji...@apache.org> on 2020/01/08 02:07:00 UTC
[jira] [Resolved] (SPARK-30381) GBT reuse treePoints for all trees
[ https://issues.apache.org/jira/browse/SPARK-30381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
zhengruifeng resolved SPARK-30381.
----------------------------------
Fix Version/s: 3.0.0
Resolution: Fixed
Issue resolved by pull request 27103
[https://github.com/apache/spark/pull/27103]
> GBT reuse treePoints for all trees
> ----------------------------------
>
> Key: SPARK-30381
> URL: https://issues.apache.org/jira/browse/SPARK-30381
> Project: Spark
> Issue Type: Improvement
> Components: ML
> Affects Versions: 3.0.0
> Reporter: zhengruifeng
> Assignee: zhengruifeng
> Priority: Major
> Fix For: 3.0.0
>
>
> In the existing GBT implementation, each iteration first computes the available splits of each feature (via RandomForest.findSplits) on the dataset sampled at that iteration. It then uses these splits to discretize the vectors into BaggedPoint[TreePoint]s. The BaggedPoints (one per input vector) are then cached and used to train that iteration's tree. Note that the splits used for discretization differ from tree to tree (if subsamplingRate < 1) only because the sampled vectors differ.
> However, the splits at different iterations should be similar if the sampled dataset is big enough, and even identical if subsamplingRate = 1, as the toy sketch below illustrates.
>
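> As a toy illustration of this per-iteration recomputation (plain Scala, not the actual Spark internals; findSplits here is a simplified equal-frequency stand-in for RandomForest.findSplits), split candidates recomputed on each iteration's sample drift slightly whenever subsamplingRate < 1:
> {code:scala}
> // Toy sketch, NOT Spark code: split candidates are recomputed on each
> // iteration's subsample, so each tree gets slightly different thresholds.
> object PerTreeSplits {
>   // simplified equal-frequency stand-in for RandomForest.findSplits
>   def findSplits(sample: Array[Double], numBins: Int): Array[Double] = {
>     val sorted = sample.sorted
>     (1 until numBins).map(i => sorted((i * sorted.length) / numBins)).toArray
>   }
>
>   def main(args: Array[String]): Unit = {
>     val rng  = new scala.util.Random(0)
>     val data = Array.fill(1000)(rng.nextGaussian())
>     for (iter <- 0 until 3) {
>       // subsamplingRate = 0.5: each iteration samples different vectors ...
>       val sampled = data.filter(_ => rng.nextDouble() < 0.5)
>       // ... and therefore finds slightly different split thresholds
>       println(s"iter=$iter splits=${findSplits(sampled, 4).mkString(", ")}")
>     }
>   }
> }
> {code}
>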
> In contrast, other well-known GBT implementations with binned features (such as XGBoost and LightGBM) use the same splits for discretization across all iterations:
> {code:python}
> import xgboost as xgb
> from sklearn.datasets import load_svmlight_file
> X, y = load_svmlight_file('/data0/Dev/Opensource/spark/data/mllib/sample_linear_regression_data.txt')
> dtrain = xgb.DMatrix(X[:, :2], label=y)
> num_round = 3
> param = {'max_depth': 2, 'objective': 'reg:squarederror', 'tree_method': 'hist', 'max_bin': 2, 'eta': 0.01, 'subsample': 0.5}
> bst = xgb.train(param, dtrain, num_round)
> bst.trees_to_dataframe()
> Out[61]:
>     Tree  Node   ID  Feature     Split  Yes   No  Missing        Gain  Cover
>  0     0     0  0-0       f1  0.000408  0-1  0-2      0-1  170.337143  256.0
>  1     0     1  0-1       f0  0.003531  0-3  0-4      0-3   44.865482  121.0
>  2     0     2  0-2       f0  0.003531  0-5  0-6      0-5  125.615570  135.0
>  3     0     3  0-3     Leaf       NaN  NaN  NaN      NaN   -0.010050   67.0
>  4     0     4  0-4     Leaf       NaN  NaN  NaN      NaN    0.002126   54.0
>  5     0     5  0-5     Leaf       NaN  NaN  NaN      NaN    0.020972   69.0
>  6     0     6  0-6     Leaf       NaN  NaN  NaN      NaN    0.001714   66.0
>  7     1     0  1-0       f0  0.003531  1-1  1-2      1-1   50.417793  263.0
>  8     1     1  1-1       f1  0.000408  1-3  1-4      1-3   48.732742  124.0
>  9     1     2  1-2       f1  0.000408  1-5  1-6      1-5   52.832161  139.0
> 10     1     3  1-3     Leaf       NaN  NaN  NaN      NaN   -0.012784   63.0
> 11     1     4  1-4     Leaf       NaN  NaN  NaN      NaN   -0.000287   61.0
> 12     1     5  1-5     Leaf       NaN  NaN  NaN      NaN    0.008661   64.0
> 13     1     6  1-6     Leaf       NaN  NaN  NaN      NaN   -0.003624   75.0
> 14     2     0  2-0       f1  0.000408  2-1  2-2      2-1   62.136013  242.0
> 15     2     1  2-1       f0  0.003531  2-3  2-4      2-3  150.537781  118.0
> 16     2     2  2-2       f0  0.003531  2-5  2-6      2-5    3.829046  124.0
> 17     2     3  2-3     Leaf       NaN  NaN  NaN      NaN   -0.016737   65.0
> 18     2     4  2-4     Leaf       NaN  NaN  NaN      NaN    0.005809   53.0
> 19     2     5  2-5     Leaf       NaN  NaN  NaN      NaN    0.005251   60.0
> 20     2     6  2-6     Leaf       NaN  NaN  NaN      NaN    0.001709   64.0
> {code}
>
> We can see that even with subsample=0.5, all three trees share the same splits (f1 at 0.000408 and f0 at 0.003531).
>
> So I think we could reuse the splits and the treePoints across all iterations:
> at iteration 0, compute the splits on the whole training dataset and use them to generate the treePoints;
> at each subsequent iteration, generate the baggedPoints directly from the shared treePoints.
> This way we no longer need to persist/unpersist an internal training dataset for each tree (see the sketch below).
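>
> A minimal runnable sketch of this proposed flow (again plain Scala, not the actual Spark code; findSplits and toBin are simplified stand-ins for RandomForest.findSplits and the TreePoint conversion):
> {code:scala}
> // Toy sketch, NOT Spark code: bin once at iteration 0, resample many times.
> object ReuseTreePoints {
>   // simplified equal-frequency stand-in for RandomForest.findSplits
>   def findSplits(data: Array[Double], numBins: Int): Array[Double] = {
>     val sorted = data.sorted
>     (1 until numBins).map(i => sorted((i * sorted.length) / numBins)).toArray
>   }
>
>   // discretize a value into its bin id (the "treePoint" representation)
>   def toBin(x: Double, splits: Array[Double]): Int = splits.count(_ <= x)
>
>   def main(args: Array[String]): Unit = {
>     val rng  = new scala.util.Random(0)
>     val data = Array.fill(1000)(rng.nextGaussian())
>
>     // iteration 0: compute splits ONCE on the whole dataset and
>     // discretize every vector ONCE into shared treePoints
>     val splits     = findSplits(data, 4)
>     val treePoints = data.map(toBin(_, splits))
>
>     // each iteration: only redraw the bagging (subsample) weights,
>     // so no per-tree persist/unpersist of a rebuilt dataset is needed
>     for (iter <- 0 until 3) {
>       val bagged = treePoints.map(_ => if (rng.nextDouble() < 0.5) 1 else 0)
>       println(s"iter=$iter sampled ${bagged.sum} of ${treePoints.length} points")
>     }
>   }
> }
> {code}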
>
>