Posted to issues@spark.apache.org by "yuhao yang (JIRA)" <ji...@apache.org> on 2016/12/07 01:48:58 UTC

[jira] [Updated] (SPARK-18755) Add Randomized Grid Search to Spark ML

     [ https://issues.apache.org/jira/browse/SPARK-18755?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

yuhao yang updated SPARK-18755:
-------------------------------
    Description: 
Randomized Grid Search implements a randomized search over parameters, where each setting is sampled from a distribution over possible parameter values. This has two main benefits over an exhaustive search:
1. A budget can be chosen independently of the number of parameters and possible values.
2. Adding parameters that do not influence the performance does not decrease efficiency.

Randomized grid search usually gives results similar to an exhaustive search, while its run time is drastically lower.
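To illustrate benefit 1: a pure random search draws a fixed number of settings regardless of how large the parameter space is, so the budget is set up front. A minimal Python sketch, independent of Spark (the parameter names and distributions below are hypothetical, not Spark API):

```python
import random

def random_search(param_space, n_iter, seed=0):
    """Draw n_iter independent settings, each parameter sampled from its
    own distribution (here modeled as a callable taking an RNG). The
    budget n_iter is fixed no matter how many parameters or candidate
    values the space contains."""
    rng = random.Random(seed)
    return [{name: draw(rng) for name, draw in param_space.items()}
            for _ in range(n_iter)]

# Hypothetical space mixing a continuous and a discrete parameter.
space = {
    "regParam": lambda r: 10 ** r.uniform(-3, 0),  # log-uniform on [1e-3, 1]
    "maxIter": lambda r: r.choice([50, 100, 200]),
}
settings = random_search(space, n_iter=5)
print(len(settings))  # 5
```

Adding another parameter to `space` leaves the number of trained models unchanged, which is exactly the efficiency argument above.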

For more background, please refer to:

sklearn: http://scikit-learn.org/stable/modules/grid_search.html
http://blog.kaggle.com/2015/07/16/scikit-learn-video-8-efficiently-searching-for-optimal-tuning-parameters/
http://www.jmlr.org/papers/volume13/bergstra12a/bergstra12a.pdf
https://www.r-bloggers.com/hyperparameter-optimization-in-h2o-grid-search-random-search-and-the-future/

There are two ways I see to implement this in Spark:
1. Add searchRatio to ParamGridBuilder and conduct sampling directly during build. Only 1 new public function is required.
2. Add a trait RandomizedSearch and create new classes RandomizedCrossValidator and RandomizedTrainValidationSplit, which can be complicated since we need to deal with the models.

I'd prefer option 1 as it's simpler and more straightforward.
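For option 1, the sampling during build could amount to keeping a random fraction of the full cartesian product. A minimal Python sketch of that idea, independent of Spark (`search_ratio` mirrors the proposed searchRatio; the grid values are hypothetical, not actual Spark API):

```python
import itertools
import random

def build_random_grid(param_grid, search_ratio, seed=42):
    """Build the full cartesian product of parameter values, then keep a
    random fraction (search_ratio) of the settings -- the sampling that
    option 1 would perform inside ParamGridBuilder.build()."""
    names = list(param_grid)
    full = [dict(zip(names, combo))
            for combo in itertools.product(*(param_grid[n] for n in names))]
    k = max(1, int(len(full) * search_ratio))  # keep at least one setting
    rng = random.Random(seed)
    return rng.sample(full, k)               # sample without replacement

# Hypothetical grid: 3 x 4 = 12 settings; search_ratio=0.25 keeps 3.
grid = {"regParam": [0.01, 0.1, 1.0],
        "elasticNetParam": [0.0, 0.25, 0.5, 1.0]}
sampled = build_random_grid(grid, 0.25)
print(len(sampled))  # 3
```

Since the result is still just an array of param maps, existing CrossValidator and TrainValidationSplit would consume it unchanged, which is why option 1 needs no new model-handling code.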


  was:
Randomized Grid Search implements a randomized search over parameters, where each setting is sampled from a distribution over possible parameter values. This has two main benefits over an exhaustive search:
1. A budget can be chosen independently of the number of parameters and possible values.
2. Adding parameters that do not influence the performance does not decrease efficiency.

Randomized grid search usually gives results similar to an exhaustive search, while its run time is drastically lower.

For more background, please refer to:

sklearn: http://scikit-learn.org/stable/modules/grid_search.html
http://blog.kaggle.com/2015/07/16/scikit-learn-video-8-efficiently-searching-for-optimal-tuning-parameters/
http://www.jmlr.org/papers/volume13/bergstra12a/bergstra12a.pdf
https://www.r-bloggers.com/hyperparameter-optimization-in-h2o-grid-search-random-search-and-the-future/

There are two ways I see to implement this in Spark:
1. Add searchRatio to ParamGridBuilder and conduct sampling directly during build.
2. Add a trait RandomizedSearch and create new classes RandomizedCrossValidator and RandomizedTrainValidationSplit.

I'd prefer option 1 as it's simpler and more straightforward.



> Add Randomized Grid Search to Spark ML
> --------------------------------------
>
>                 Key: SPARK-18755
>                 URL: https://issues.apache.org/jira/browse/SPARK-18755
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML
>            Reporter: yuhao yang
>



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org