Posted to issues@spark.apache.org by "zhengruifeng (JIRA)" <ji...@apache.org> on 2019/06/19 11:13:00 UTC

[jira] [Updated] (SPARK-13677) Support Tree-Based Feature Transformation for ML

     [ https://issues.apache.org/jira/browse/SPARK-13677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

zhengruifeng updated SPARK-13677:
---------------------------------
    Description: 
It would be nice to be able to use RF and GBT for feature transformation:
First, fit an ensemble of trees (such as RF, GBT, or other TreeEnsembleModels) on the training set. Each leaf of each tree in the ensemble is then assigned a fixed, arbitrary feature index in a new feature space, and the leaf that a sample falls into in each tree is encoded in a one-hot fashion.
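
The encoding step described above can be sketched in plain Scala without any Spark dependency. This is only an illustration under stated assumptions, not the proposed implementation: the names Node, Split, Leaf, leafIndex, numLeaves and encode are hypothetical, and each toy tree numbers its own leaves from 0.

```scala
// Hypothetical, dependency-free sketch of tree-based feature transformation.
// None of these types come from Spark; they only illustrate the idea.
sealed trait Node
final case class Leaf(index: Int) extends Node
final case class Split(feature: Int, threshold: Double, left: Node, right: Node) extends Node

object LeafEncodingSketch {
  // Walk a single tree and return the index of the leaf the sample lands in.
  @annotation.tailrec
  def leafIndex(node: Node, x: Array[Double]): Int = node match {
    case Leaf(i)           => i
    case Split(f, t, l, r) => leafIndex(if (x(f) <= t) l else r, x)
  }

  // Count the leaves of a tree so each tree gets its own block of output slots.
  def numLeaves(node: Node): Int = node match {
    case Leaf(_)           => 1
    case Split(_, _, l, r) => numLeaves(l) + numLeaves(r)
  }

  // One-hot encode the leaf reached in every tree and concatenate the blocks.
  def encode(trees: Seq[Node], x: Array[Double]): Array[Double] = {
    val offsets = trees.map(numLeaves).scanLeft(0)(_ + _)
    val out = Array.fill(offsets.last)(0.0)
    trees.zip(offsets).foreach { case (tree, offset) =>
      out(offset + leafIndex(tree, x)) = 1.0
    }
    out
  }

  def main(args: Array[String]): Unit = {
    val tree1 = Split(0, 0.5, Leaf(0), Split(1, 2.0, Leaf(1), Leaf(2)))
    val tree2 = Split(1, 1.0, Leaf(0), Leaf(1))
    // The sample reaches leaf 1 of tree1 and leaf 1 of tree2.
    println(encode(Seq(tree1, tree2), Array(0.9, 1.5)).mkString(","))
    // prints 0.0,1.0,0.0,0.0,1.0
  }
}
```

The output vector has one slot per leaf across the whole ensemble, so exactly one entry per tree is set to 1.0. A linear model trained on these vectors can then weight each leaf separately.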

This method was first introduced by Facebook (http://www.herbrich.me/papers/adclicksfacebook.pdf) and is implemented in two well-known libraries:
 sklearn (http://scikit-learn.org/stable/auto_examples/ensemble/plot_feature_transformation.html#example-ensemble-plot-feature-transformation-py)
 xgboost (https://github.com/dmlc/xgboost/blob/master/demo/guide-python/predict_leaf_indices.py)

I have implemented it in mllib:


val model1: DecisionTreeClassificationModel = ...
model1.setLeafCol("leaves")
model1.transform(df)

val model2: GBTClassificationModel = ...
model2.transform(df)

 

 

design doc: https://docs.google.com/document/d/1d81qS0zfb6vqbt3dn6zFQUmWeh2ymoRALvhzPpTZqvo/edit?usp=sharing

  was:
It would be nice to be able to use RF and GBT for feature transformation:
First, fit an ensemble of trees (such as RF, GBT, or other TreeEnsembleModels) on the training set. Each leaf of each tree in the ensemble is then assigned a fixed, arbitrary feature index in a new feature space, and the leaf that a sample falls into in each tree is encoded in a one-hot fashion.

This method was first introduced by Facebook (http://www.herbrich.me/papers/adclicksfacebook.pdf) and is implemented in two well-known libraries:
sklearn (http://scikit-learn.org/stable/auto_examples/ensemble/plot_feature_transformation.html#example-ensemble-plot-feature-transformation-py)
xgboost (https://github.com/dmlc/xgboost/blob/master/demo/guide-python/predict_leaf_indices.py)

I have implemented it in mllib:

val features : RDD[Vector] = ...
val model1 : RandomForestModel = ...
val transformed1 : RDD[Vector] = model1.leaf(features)

val model2 : GradientBoostedTreesModel = ...
val transformed2 : RDD[Vector] = model2.leaf(features)




> Support Tree-Based Feature Transformation for ML
> ------------------------------------------------
>
>                 Key: SPARK-13677
>                 URL: https://issues.apache.org/jira/browse/SPARK-13677
>             Project: Spark
>          Issue Type: New Feature
>          Components: ML
>            Reporter: zhengruifeng
>            Priority: Minor


