You are viewing a plain text version of this content. The canonical link for it is here.

Posted to reviews@spark.apache.org by leifker <gi...@git.apache.org> on 2017/03/15 16:59:01 UTC

[GitHub] spark pull request #17306: [MLLIB] Allow multiple pipelines when tuning

GitHub user leifker opened a pull request:

    https://github.com/apache/spark/pull/17306

        [MLLIB] Allow multiple pipelines when tuning

        ## What changes were proposed in this pull request?
    
        Updated CrossValidator and TrainValidationSplit to be able to
        accept multiple pipelines for testing different algorithms
        and/or being able to better control tuning combinations.
        Maintains backwards compatible API and reads legacy
        serialized objects.
    
        ## How was this patch tested?
    
        Unit tests
    
        Author: David Leifker <dl...@gmail.com>


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/leifker/spark multiple-estimators

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/17306.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #17306
    
----

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17306: [SPARK-19979][MLLIB] Allow multiple pipelines when tunin...

Posted by leifker <gi...@git.apache.org>.

Github user leifker commented on the issue:

    https://github.com/apache/spark/pull/17306
  
    Interesting, let me think about this a bit. I think that there is probably a better api around this approach for sure.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17306: [MLLIB] Allow multiple pipelines when tuning

Posted by srowen <gi...@git.apache.org>.

Github user srowen commented on the issue:

    https://github.com/apache/spark/pull/17306
  
    Have a look at http://spark.apache.org/contributing.html first
    
    So the idea is to search over pipelines with different components, potentially, not just one set of components but varying its parameters?
    
    What you have here amounts to running n pipeline evaluations with n grid searches. That could just be done with the existing machinery, run n times. In this new model it seems hard to work out which pipeline was selected? except by inspecting it.
    
    It makes some sense but the alternative isn't much code either, to just combine the results of n grid searches.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17306: [MLLIB] Allow multiple pipelines when tuning

Posted by srowen <gi...@git.apache.org>.

Github user srowen commented on the issue:

    https://github.com/apache/spark/pull/17306
  
    Yes re: JIRA
    CC @MLNick or @jkbradley as well
    
    I agree about reusing the cached folds, that's a good point. I think this is worth considering.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17306: [MLLIB] Allow multiple pipelines when tuning

Posted by leifker <gi...@git.apache.org>.

Github user leifker commented on the issue:

https://github.com/apache/spark/pull/17306

Looking over the contributing link, I should open a jira issue it seems?

The intent is like you said, to run the CrossValidator with different pipelines.

The same could be done using an external iterative approach. Build different pipelines, throwing each into a CrossValidator, and then taking the best model from each of those CrossValidators. Then finally picking the best from those. This is the initial approach I explored. It resulted in a lot of boiler plate code that felt like it shouldn't need to exist if the api simply allowed for arrays of estimators and their parameters.

A couple advantages to this implementation to consider come from keeping a the functional interface to the CrossValidator.

1. The caching of the folds is better utilized. An external iterative approach creates a new set of k folds for each CrossValidator fit and the folds are discarded after each CrossValidator run. In this implementation a single set of k folds is created and cached for all of the pipelines.
2. A potential advantage of using this implementation is for future parallelization of the pipelines within the CrossValdiator. It is of course possible to handle the parallelization outside of the CrossValidator here too, however I believe there is already work in progress to parallelize the grid parameters and that could be extended to multiple pipelines.

Both of those behind-the-scene optimizations are possible because of providing the CrossValidator with the data and the complete set of pipelines/estimators to evaluate up front allowing one to abstract away the implementation.

---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17306: [SPARK-19979][MLLIB] Allow multiple pipelines when tunin...

Posted by leifker <gi...@git.apache.org>.

Github user leifker commented on the issue:

    https://github.com/apache/spark/pull/17306
  
    Closing this PR, will get back to this eventually, but dealing with some other priorities at the moment.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17306: [SPARK-19979][MLLIB] Allow multiple pipelines when tunin...

Posted by BryanCutler <gi...@git.apache.org>.

Github user BryanCutler commented on the issue:

    https://github.com/apache/spark/pull/17306
  
    Thanks @leifker for the PR, this is a good idea.  I think though it can already be accomplished with the current param grid builder.  Since the stages of a pipeline are actually a param, you can add these to the param grid and will evaluate the different pipelines, also reusing the cached splits in cross-val.  I tried this out by modifying the example `ModelSelectionViaCrossValidation` and it seems to work
    
    ```scala
        val tokenizer = new Tokenizer()
          .setInputCol("text")
          .setOutputCol("words")
        val hashingTF = new HashingTF()
          .setInputCol(tokenizer.getOutputCol)
          .setOutputCol("features")
        val lr = new LogisticRegression()
          .setMaxIter(10)
        val dt = new DecisionTreeClassifier()
          .setMaxDepth(5)
        val pipeline = new Pipeline()
    
        val pipeline1: Array[PipelineStage] = Array(tokenizer, hashingTF, lr)
        val pipeline2: Array[PipelineStage] = Array(tokenizer, hashingTF, dt)
    
        val paramGrid = new ParamGridBuilder()
          .addGrid[Array[PipelineStage]](pipeline.stages, Array(pipeline1, pipeline2))
          .addGrid(hashingTF.numFeatures, Array(10, 100, 1000))
          .addGrid(lr.regParam, Array(0.1, 0.01))
          .build()
    ```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17306: [SPARK-19979][MLLIB] Allow multiple pipelines when tunin...

Posted by MLnick <gi...@git.apache.org>.

Github user MLnick commented on the issue:

https://github.com/apache/spark/pull/17306

I commented on the linked JIRA also.

In principle I think this can be a useful enhancement and yes the better efficiency on the caching side is a good benefit. I'd actually been thinking about a (simpler) version that would allow multiple`Predictors` to be the last stages of a pipeline, and return the best from those - so evaluating a set of algorithms at the same time (but keeping the preceding pipeline stages common).

This is a generalization of that idea.

I think it gels with [SPARK-19071](https://issues.apache.org/jira/browse/SPARK-19071) and the next phase of that work that would look at smarter caching & re-use of stages in the pipeline. Also related to the subtask of that ticket - the parallel CV work in [SPARK-19357](https://issues.apache.org/jira/browse/SPARK-19357).

Since Spark 2.2 code freeze is imminent, I think this will have to be delayed a little but it would be good to coordinate with #16774 where appropriate.

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17306: [SPARK-19979][MLLIB] Allow multiple pipelines when tunin...

Posted by leifker <gi...@git.apache.org>.

Github user leifker commented on the issue:

    https://github.com/apache/spark/pull/17306
  
    Sorry of the delayed response @BryanCutler, that's pretty neat, however this will perform unneeded work as it will execute nonsensical combinations of parameters. For example, if pipeline2 is executed (decision tree), the code will run the pipeline with lr.regParam set to 0.1 and 0.01 which doesn't matter since we're using the decision tree algorithm.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17306: [MLLIB] Allow multiple pipelines when tuning

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17306
  
    Can one of the admins verify this patch?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #17306: [SPARK-19979][MLLIB] Allow multiple pipelines whe...

Posted by leifker <gi...@git.apache.org>.

Github user leifker closed the pull request at:

    https://github.com/apache/spark/pull/17306


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17306: [SPARK-19979][MLLIB] Allow multiple pipelines when tunin...

Posted by BryanCutler <gi...@git.apache.org>.

Github user BryanCutler commented on the issue:

    https://github.com/apache/spark/pull/17306
  
    Yeah, that's true in this case.  You could just build the grids separately and combine them like this
    
    ```scala
        val pipeline1_grid = new ParamGridBuilder()
          .baseOn(pipeline.stages -> pipeline1)
          .addGrid(hashingTF.numFeatures, Array(10, 100, 1000))
          .addGrid(lr.regParam, Array(0.1, 0.01))
          .build()
    
        val pipeline2_grid = new ParamGridBuilder()
          .baseOn(pipeline.stages -> pipeline2)
          .addGrid(hashingTF.numFeatures, Array(10, 100, 1000))
          .build()
    
        val paramGrid = pipeline1_grid ++ pipeline2_grid
    ```
    Maybe not all that intuitive though... 
    It might be worth it to add some smarts in the `ParamGridBuilder` or `CrossValidator` to prune the grid if there is a pipeline and params not used in any of the stages.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org