Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2019/02/13 14:13:54 UTC

[GitHub] mgaido91 opened a new pull request #23773: [SPARK-26721][ML] Avoid per-tree normalization in featureImportance for GBT

mgaido91 opened a new pull request #23773: [SPARK-26721][ML] Avoid per-tree normalization in featureImportance for GBT
URL: https://github.com/apache/spark/pull/23773
 
 
   ## What changes were proposed in this pull request?
   
   Our feature importance calculation is based on sklearn's, which was recently fixed (in https://github.com/scikit-learn/scikit-learn/pull/11176). Quoting the description of that PR:
   
   > Because the feature importances are (currently, by default) normalized and then averaged, feature importances from later stages are overweighted.
   
   This PR applies a fix similar to sklearn's: the per-tree normalization of the feature importance is skipped for GBT.
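
To illustrate the overweighting described in the quote, here is a toy sketch in plain Python. It is not Spark's or sklearn's code: the function names and the raw per-tree importance numbers are made up for illustration, and the only assumption is that later boosting stages tend to produce much smaller raw gains than early ones.

```python
def normalize(v):
    """Scale a vector so it sums to 1 (left unchanged if the sum is 0)."""
    total = sum(v)
    return [x / total for x in v] if total > 0 else list(v)


def importance_with_per_tree_normalization(trees):
    """Old behaviour (hypothetical helper): normalize each tree, then combine."""
    normalized = [normalize(t) for t in trees]
    summed = [sum(col) for col in zip(*normalized)]
    return normalize(summed)


def importance_without_per_tree_normalization(trees):
    """Fixed behaviour (hypothetical helper): combine raw values, normalize once."""
    summed = [sum(col) for col in zip(*trees)]
    return normalize(summed)


# Made-up raw importances per tree, one entry per feature.
# The first (early) tree has large gains, the second (late) tree tiny ones.
trees = [
    [90.0, 10.0],  # early stage: feature 0 dominates
    [0.1, 0.9],    # late stage: feature 1 dominates, but gains are tiny
]

old = importance_with_per_tree_normalization(trees)
new = importance_without_per_tree_normalization(trees)
print("per-tree normalization:   ", old)
print("no per-tree normalization:", new)
```

With per-tree normalization both features end up with roughly equal importance, because the late tree counts as much as the early one despite its tiny gains. Without it, feature 0 keeps most of the importance (about 0.89 in this example).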
   
   Credit to Daniel Jumper for clearly pointing out the issue and the sklearn PR.
   
   ## How was this patch tested?
   
   Modified the unit test and checked that the `featureImportance` computed in that test is similar to sklearn's (it can't be identical, because the trees may be slightly different).

