Posted to reviews@spark.apache.org by MLnick <gi...@git.apache.org> on 2017/12/13 09:19:56 UTC

[GitHub] spark pull request #19350: [SPARK-22126][ML] Fix model-specific optimization...

Github user MLnick commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19350#discussion_r156599955
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/Estimator.scala ---
    @@ -82,5 +86,49 @@ abstract class Estimator[M <: Model[M]] extends PipelineStage {
         paramMaps.map(fit(dataset, _))
       }
     
    +  /**
    +   * (Java-specific)
    +   */
    +  @Since("2.3.0")
    +  def fit(dataset: Dataset[_], paramMaps: Array[ParamMap],
    +    unpersistDatasetAfterFitting: Boolean, executionContext: ExecutionContext,
    +    modelCallback: VoidFunction2[Model[_], Int]): Unit = {
    +    // Fit models in a Future for training in parallel
    +    val modelFutures = paramMaps.map { paramMap =>
    +      Future[Model[_]] {
    +        fit(dataset, paramMap).asInstanceOf[Model[_]]
    --- End diff --
    
    How will this work in a pipeline?
    
If the `Estimator` in CV is a `Pipeline`, then here it will call `fit(dataset, paramMap)` on the `Pipeline`, which will in turn fit each stage with that `paramMap`. This is what the current parallel CV is doing.
    
But if we have a stage with model-specific optimization (say, for argument's sake, a `LinearRegression` that can internally optimize `maxIter`), then its `fit` will be called with only a single `paramMap` arg.
    
So doesn't pushing the parallel fit into `Estimator` nullify any benefit from model-specific optimizations?
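
To make the concern concrete, here is a toy, plain-Scala sketch (not Spark API; all names are illustrative) of what a model-specific `maxIter` optimization buys when the estimator sees all the requested param maps at once, versus one independent fit per `ParamMap`:

    // Toy sketch, plain Scala (not Spark API): a model-specific maxIter optimization
    // only helps when the estimator sees all the requested maxIter values at once.
    object MaxIterOptimizationSketch {
      // Toy "model": just records how many training iterations produced it.
      final case class ToyModel(iters: Int)

      // Generic path (what the default multi-ParamMap fit reduces to):
      // one independent training run per ParamMap, each starting from scratch.
      def fitPerParamMap(maxIters: Seq[Int]): Seq[ToyModel] =
        maxIters.map { n =>
          var model = ToyModel(0)
          (1 to n).foreach(_ => model = ToyModel(model.iters + 1))
          model
        }

      // Model-specific optimization: a single training run up to max(maxIters),
      // snapshotting a model each time a requested iteration count is reached.
      def fitShared(maxIters: Seq[Int]): Seq[ToyModel] = {
        var model = ToyModel(0)
        maxIters.sorted.map { n =>
          while (model.iters < n) model = ToyModel(model.iters + 1)
          model
        }
      }

      def main(args: Array[String]): Unit = {
        println(fitPerParamMap(Seq(10, 50, 100))) // 160 iterations in total
        println(fitShared(Seq(10, 50, 100)))      // 100 iterations in total
      }
    }

Once the parallel loop in `Estimator` calls `fit(dataset, paramMap)` once per map, only the first path is available to the stage, which is the point of the question above.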


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org