Posted to issues@spark.apache.org by "zhengruifeng (Jira)" <ji...@apache.org> on 2019/12/29 11:05:00 UTC

[jira] [Created] (SPARK-30381) GBT reuse splits for all trees

zhengruifeng created SPARK-30381:
------------------------------------

             Summary: GBT reuse splits for all trees
                 Key: SPARK-30381
                 URL: https://issues.apache.org/jira/browse/SPARK-30381
             Project: Spark
          Issue Type: Improvement
          Components: ML
    Affects Versions: 3.0.0
            Reporter: zhengruifeng
            Assignee: zhengruifeng


In the existing GBT implementation, each tree first computes the available splits for each feature (via RandomForest.findSplits), based on the dataset sampled at that iteration. It then uses these splits to discretize the vectors into BaggedPoint[TreePoint]s. The resulting BaggedPoints (one per input vector) are cached and used for that iteration. Note that the splits used for discretization differ from tree to tree (when subsamplingRate < 1) only because the sampled vectors differ.

However, the splits at different iterations should be very similar if the sampled dataset is big enough, and exactly the same if subsamplingRate=1.
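
For intuition, here is a small self-contained sketch (plain NumPy rather than Spark code; the find_splits helper is made up for illustration) showing that quantile-based split candidates computed on a 50% sample are nearly identical to those computed on the full data once the dataset is reasonably large:
{code:python}
import numpy as np

rng = np.random.default_rng(seed=0)
feature = rng.normal(size=100_000)   # one continuous feature
num_bins = 32

def find_splits(values, num_bins):
    # roughly what split finding does for a continuous feature:
    # take (num_bins - 1) quantile thresholds as split candidates
    quantiles = np.linspace(0.0, 1.0, num_bins + 1)[1:-1]
    return np.quantile(values, quantiles)

splits_full = find_splits(feature, num_bins)
sample = feature[rng.random(feature.size) < 0.5]   # subsamplingRate = 0.5
splits_sample = find_splits(sample, num_bins)

# the two sets of thresholds differ only marginally
print(np.max(np.abs(splits_full - splits_sample)))
{code}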

 

However, in other well-known GBT implementations with binned features (like XGBoost/LightGBM), the splits used for discretization are the same across iterations:
{code:python}
import xgboost as xgb
from sklearn.datasets import load_svmlight_file

# train 3 boosted trees with the histogram ('hist') tree method, 2 bins and 50% row subsampling
X, y = load_svmlight_file('/data0/Dev/Opensource/spark/data/mllib/sample_linear_regression_data.txt')
dtrain = xgb.DMatrix(X[:, :2], label=y)
num_round = 3
param = {'max_depth': 2, 'objective': 'reg:squarederror', 'tree_method': 'hist',
         'max_bin': 2, 'eta': 0.01, 'subsample': 0.5}
bst = xgb.train(param, dtrain, num_round)
bst.trees_to_dataframe()
Out[61]: 
    Tree  Node   ID Feature     Split  Yes   No Missing        Gain  Cover
0      0     0  0-0      f1  0.000408  0-1  0-2     0-1  170.337143  256.0
1      0     1  0-1      f0  0.003531  0-3  0-4     0-3   44.865482  121.0
2      0     2  0-2      f0  0.003531  0-5  0-6     0-5  125.615570  135.0
3      0     3  0-3    Leaf       NaN  NaN  NaN     NaN   -0.010050   67.0
4      0     4  0-4    Leaf       NaN  NaN  NaN     NaN    0.002126   54.0
5      0     5  0-5    Leaf       NaN  NaN  NaN     NaN    0.020972   69.0
6      0     6  0-6    Leaf       NaN  NaN  NaN     NaN    0.001714   66.0
7      1     0  1-0      f0  0.003531  1-1  1-2     1-1   50.417793  263.0
8      1     1  1-1      f1  0.000408  1-3  1-4     1-3   48.732742  124.0
9      1     2  1-2      f1  0.000408  1-5  1-6     1-5   52.832161  139.0
10     1     3  1-3    Leaf       NaN  NaN  NaN     NaN   -0.012784   63.0
11     1     4  1-4    Leaf       NaN  NaN  NaN     NaN   -0.000287   61.0
12     1     5  1-5    Leaf       NaN  NaN  NaN     NaN    0.008661   64.0
13     1     6  1-6    Leaf       NaN  NaN  NaN     NaN   -0.003624   75.0
14     2     0  2-0      f1  0.000408  2-1  2-2     2-1   62.136013  242.0
15     2     1  2-1      f0  0.003531  2-3  2-4     2-3  150.537781  118.0
16     2     2  2-2      f0  0.003531  2-5  2-6     2-5    3.829046  124.0
17     2     3  2-3    Leaf       NaN  NaN  NaN     NaN   -0.016737   65.0
18     2     4  2-4    Leaf       NaN  NaN  NaN     NaN    0.005809   53.0
19     2     5  2-5    Leaf       NaN  NaN  NaN     NaN    0.005251   60.0
20     2     6  2-6    Leaf       NaN  NaN  NaN     NaN    0.001709   64.0
 {code}
 

We can see that even though subsample=0.5, the three trees share the same split thresholds (f0 at 0.003531 and f1 at 0.000408).

 

So I think we could reuse the splits and TreePoints across all iterations:

At iteration 0, compute the splits on the whole training dataset, and use these splits to generate the TreePoints.

At each subsequent iteration, generate the BaggedPoints directly from the cached TreePoints.

With this change, we no longer need to persist/unpersist an internal training dataset for each tree.
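
As a rough illustration of the proposed flow (again plain NumPy rather than the actual ml.tree.impl code, and with made-up variable names), discretization would happen once and only the bagging weights would change per iteration:
{code:python}
import numpy as np

rng = np.random.default_rng(seed=0)
X = rng.normal(size=(100_000, 10))
num_bins = 32
num_iterations = 50
subsampling_rate = 0.5

# iteration 0: compute the splits once on the whole training dataset ...
quantiles = np.linspace(0.0, 1.0, num_bins + 1)[1:-1]
splits = np.quantile(X, quantiles, axis=0)          # (num_bins - 1, num_features)

# ... and use them to discretize every row once ("tree points")
tree_points = np.stack(
    [np.searchsorted(splits[:, j], X[:, j]) for j in range(X.shape[1])],
    axis=1,
).astype(np.uint8)

# every iteration only draws bagging weights over the cached tree points;
# no splits are recomputed and no per-tree dataset has to be persisted/unpersisted
for i in range(num_iterations):
    sample_weights = rng.binomial(1, subsampling_rate, size=X.shape[0])
    # fit the tree for iteration i on (tree_points, sample_weights, residuals) ...
{code}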

 

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org