Posted to issues@spark.apache.org by "zhengruifeng (Jira)" <ji...@apache.org> on 2020/01/08 02:07:00 UTC

[jira] [Resolved] (SPARK-30381) GBT reuse treePoints for all trees

     [ https://issues.apache.org/jira/browse/SPARK-30381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

zhengruifeng resolved SPARK-30381.
----------------------------------
    Fix Version/s: 3.0.0
       Resolution: Fixed

Issue resolved by pull request 27103
[https://github.com/apache/spark/pull/27103]

> GBT reuse treePoints for all trees
> ----------------------------------
>
>                 Key: SPARK-30381
>                 URL: https://issues.apache.org/jira/browse/SPARK-30381
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML
>    Affects Versions: 3.0.0
>            Reporter: zhengruifeng
>            Assignee: zhengruifeng
>            Priority: Major
>             Fix For: 3.0.0
>
>
> In the existing GBT, each tree first computes the available splits for each feature (via RandomForest.findSplits) based on the dataset sampled at that iteration. It then uses these splits to discretize the vectors into BaggedPoint[TreePoint]s. The BaggedPoints (of the same size as the input vectors) are then cached and used for that iteration. Note that the splits used for discretization differ from tree to tree (when subsamplingRate<1) only because the sampled vectors differ.
> However, the splits at different iterations should be similar if the sampled dataset is big enough, and exactly the same if subsamplingRate=1.
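> As a rough illustration, the current per-iteration flow looks like the sketch below (plain Scala with local collections; findSplits and discretize are hypothetical stand-ins for RandomForest.findSplits and the TreePoint conversion, not the actual Spark ML internals):
> {code:scala}
> import scala.util.Random
> 
> object CurrentGbtFlowSketch {
> 
>   // stand-in for RandomForest.findSplits: per-feature split candidates taken
>   // from approximate quantiles of the given (sub)sample
>   def findSplits(data: Seq[Array[Double]], numBins: Int): Array[Array[Double]] = {
>     val numFeatures = data.head.length
>     Array.tabulate(numFeatures) { f =>
>       val sorted = data.map(_(f)).sorted
>       (1 until numBins).map(i => sorted(i * sorted.length / numBins)).toArray
>     }
>   }
> 
>   // stand-in for a TreePoint: continuous feature values replaced by bin indices
>   def discretize(v: Array[Double], splits: Array[Array[Double]]): Array[Int] =
>     v.indices.map(f => splits(f).count(_ <= v(f))).toArray
> 
>   def main(args: Array[String]): Unit = {
>     val rng = new Random(0)
>     val data = Seq.fill(1000)(Array.fill(3)(rng.nextGaussian()))
>     val subsamplingRate = 0.5
> 
>     for (iter <- 0 until 10) {
>       // the row sample changes every iteration ...
>       val sample = data.filter(_ => rng.nextDouble() < subsamplingRate)
>       // ... so the splits and the TreePoint-like representation are recomputed,
>       // cached and unpersisted again for every single tree
>       val splits = findSplits(sample, numBins = 32)
>       val treePoints = sample.map(discretize(_, splits))
>       // fitTree(treePoints)  // hypothetical: train the iter-th tree here
>       println(s"iteration $iter: ${treePoints.size} points discretized")
>     }
>   }
> }
> {code}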
>  
> Moreover, other well-known GBT implementations with binned features (such as XGBoost and LightGBM) use the same splits for discretization at every iteration:
> {code:python}
> import xgboost as xgb
> from sklearn.datasets import load_svmlight_file
> X, y = load_svmlight_file('/data0/Dev/Opensource/spark/data/mllib/sample_linear_regression_data.txt')
> dtrain = xgb.DMatrix(X[:, :2], label=y)
> num_round = 3
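> # 'hist' + a small max_bin forces binned (discretized) splits; subsample=0.5 samples rows per tree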
> param = {'max_depth': 2, 'objective': 'reg:squarederror', 'tree_method': 'hist', 'max_bin': 2, 'eta': 0.01, 'subsample': 0.5}
> bst = xgb.train(param, dtrain, num_round)
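> # dump the three trees to compare the split thresholds they use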
> bst.trees_to_dataframe()
> Out[61]: 
>     Tree  Node   ID Feature     Split  Yes   No Missing        Gain  Cover
> 0      0     0  0-0      f1  0.000408  0-1  0-2     0-1  170.337143  256.0
> 1      0     1  0-1      f0  0.003531  0-3  0-4     0-3   44.865482  121.0
> 2      0     2  0-2      f0  0.003531  0-5  0-6     0-5  125.615570  135.0
> 3      0     3  0-3    Leaf       NaN  NaN  NaN     NaN   -0.010050   67.0
> 4      0     4  0-4    Leaf       NaN  NaN  NaN     NaN    0.002126   54.0
> 5      0     5  0-5    Leaf       NaN  NaN  NaN     NaN    0.020972   69.0
> 6      0     6  0-6    Leaf       NaN  NaN  NaN     NaN    0.001714   66.0
> 7      1     0  1-0      f0  0.003531  1-1  1-2     1-1   50.417793  263.0
> 8      1     1  1-1      f1  0.000408  1-3  1-4     1-3   48.732742  124.0
> 9      1     2  1-2      f1  0.000408  1-5  1-6     1-5   52.832161  139.0
> 10     1     3  1-3    Leaf       NaN  NaN  NaN     NaN   -0.012784   63.0
> 11     1     4  1-4    Leaf       NaN  NaN  NaN     NaN   -0.000287   61.0
> 12     1     5  1-5    Leaf       NaN  NaN  NaN     NaN    0.008661   64.0
> 13     1     6  1-6    Leaf       NaN  NaN  NaN     NaN   -0.003624   75.0
> 14     2     0  2-0      f1  0.000408  2-1  2-2     2-1   62.136013  242.0
> 15     2     1  2-1      f0  0.003531  2-3  2-4     2-3  150.537781  118.0
> 16     2     2  2-2      f0  0.003531  2-5  2-6     2-5    3.829046  124.0
> 17     2     3  2-3    Leaf       NaN  NaN  NaN     NaN   -0.016737   65.0
> 18     2     4  2-4    Leaf       NaN  NaN  NaN     NaN    0.005809   53.0
> 19     2     5  2-5    Leaf       NaN  NaN  NaN     NaN    0.005251   60.0
> 20     2     6  2-6    Leaf       NaN  NaN  NaN     NaN    0.001709   64.0
>  {code}
>  
> We can see that even with subsample=0.5, the three trees share the same splits (f0 at 0.003531 and f1 at 0.000408).
>  
> So I think we could reuse the splits and treePoints across all iterations:
> At iteration 0, compute the splits on the whole training dataset and use them to generate the treePoints.
> At each iteration, generate the baggedPoints directly from the cached treePoints.
> This way we no longer need to persist/unpersist an internal training dataset for each tree.
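> A rough sketch of this proposed flow, reusing the hypothetical findSplits / discretize stand-ins from the sketch above (again plain Scala, not the actual Spark ML code):
> {code:scala}
> import scala.util.Random
> 
> object ProposedGbtFlowSketch {
>   import CurrentGbtFlowSketch.{findSplits, discretize}
> 
>   def main(args: Array[String]): Unit = {
>     val rng = new Random(0)
>     val data = Seq.fill(1000)(Array.fill(3)(rng.nextGaussian()))
>     val subsamplingRate = 0.5
> 
>     // iteration 0: compute the splits on the WHOLE training set, discretize
>     // once, and keep the resulting treePoints cached for all later iterations
>     val splits = findSplits(data, numBins = 32)
>     val treePoints = data.map(discretize(_, splits))
> 
>     for (iter <- 0 until 10) {
>       // each iteration only draws bagging weights over the cached treePoints
>       // (the BaggedPoint step); splits and treePoints are never rebuilt
>       val bagged = treePoints.map(tp => (tp, if (rng.nextDouble() < subsamplingRate) 1 else 0))
>       // fitTree(bagged)  // hypothetical: train the iter-th tree on the weighted points
>       println(s"iteration $iter: ${bagged.count(_._2 > 0)} points in this tree's bag")
>     }
>   }
> }
> {code}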
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
