You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by MrBago <gi...@git.apache.org> on 2017/12/27 20:08:27 UTC

[GitHub] spark pull request #20095: [SPARK-22126][ML] Added fitMultiple method with d...

GitHub user MrBago opened a pull request:

    https://github.com/apache/spark/pull/20095

    [SPARK-22126][ML] Added fitMultiple method with default implementation

    …mator.
    
    Update TrainValidationSplit & CrossValidator to use fitMultiple method.
    
    ## What changes were proposed in this pull request?
    
    The fitMultiple API allows both model specific optimizations for fitting ML estimators & fitting models in parallel when appropriate. 
    
    ## How was this patch tested?
    
    Existing tests.
    
    Please review http://spark.apache.org/contributing.html before opening a pull request.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/MrBago/spark scala-fitMultipleIterator

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/20095.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #20095
    
----
commit 1992931d08b65046d63090760297cc7d41e7373a
Author: Bago Amirbekian <ba...@...>
Date:   2017-12-27T20:03:59Z

    Added fitMultiple method with default implementation to spark.ml.Estimator.
    Update TrainValidationSplit & CrossValidator to use fitMultiple method.

----


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #20095: [SPARK-22126][ML] Added fitMultiple method with d...

Posted by WeichenXu123 <gi...@git.apache.org>.
Github user WeichenXu123 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20095#discussion_r158930992
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/Estimator.scala ---
    @@ -79,7 +82,51 @@ abstract class Estimator[M <: Model[M]] extends PipelineStage {
        */
       @Since("2.0.0")
       def fit(dataset: Dataset[_], paramMaps: Array[ParamMap]): Seq[M] = {
    -    paramMaps.map(fit(dataset, _))
    +    val modelIter = fitMultiple(dataset, paramMaps)
    +    val models = paramMaps.map{ _ =>
    +      val (index, model) = modelIter.next()
    +      (index.toInt, model)
    +    }
    +    paramMaps.indices.map(models.toMap)
    +}
    +
    +  /**
    +   * Fits multiple models to the input data with multiple sets of parameters. The default
    +   * implementation calls `fit` once for each call to the iterator's `next` method. Subclasses
    +   * could override this to optimize multi-model training.
    +   *
    +   * @param dataset input dataset
    +   * @param paramMaps An array of parameter maps.
    +   *                  These values override any specified in this Estimator's embedded ParamMap.
    +   * @return An iterator which produces one model per call to `next`. The models may be produced in
    +   *         a different order than the order of the parameters in paramMap. The next method of
    +   *         the iterator will return a tuple of the form `(index, model)` where model corresponds
    +   *         to `paramMaps(index)`. This Iterator should be thread safe, meaning concurrent calls
    +   *         to `next` should always produce unique values of `index`.
    +   *
    +   * :: Experimental ::
    +   */
    +  @Experimental
    +  @Since("2.3.0")
    +  def fitMultiple(
    +      dataset: Dataset[_],
    +      paramMaps: Array[ParamMap]): JIterator[(Integer, M)] = {
    +
    +    val numModel = paramMaps.length
    +    val counter = new AtomicInteger(0)
    +    new JIterator[(Integer, M)] {
    +      def next(): (Integer, M) = {
    +        val index = counter.getAndIncrement()
    +        if (index < numModel) {
    +          (index, fit(dataset, paramMaps(index)))
    +        } else {
    +          counter.set(numModel)
    +          throw new NoSuchElementException("Iterator finished.")
    +        }
    +      }
    +
    +      override def hasNext: Boolean = counter.get() < numModel
    --- End diff --
    
    Suppose we have 2 threads, and at the time the iterator remaining only one element, the 2 threads both call `hasNext` first, and all passed, and then they both call `next`, then one of the thread will throw exception. 
    Is this the expected activity ? `hasNext` return `true` but `next` is possible to fail. It seems to break the iterator API ?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #20095: [SPARK-22126][ML] Added fitMultiple method with d...

Posted by MrBago <gi...@git.apache.org>.
Github user MrBago commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20095#discussion_r159006471
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/Estimator.scala ---
    @@ -79,7 +82,51 @@ abstract class Estimator[M <: Model[M]] extends PipelineStage {
        */
       @Since("2.0.0")
       def fit(dataset: Dataset[_], paramMaps: Array[ParamMap]): Seq[M] = {
    -    paramMaps.map(fit(dataset, _))
    +    val modelIter = fitMultiple(dataset, paramMaps)
    +    val models = paramMaps.map{ _ =>
    +      val (index, model) = modelIter.next()
    +      (index.toInt, model)
    +    }
    +    paramMaps.indices.map(models.toMap)
    +}
    +
    +  /**
    +   * Fits multiple models to the input data with multiple sets of parameters. The default
    +   * implementation calls `fit` once for each call to the iterator's `next` method. Subclasses
    +   * could override this to optimize multi-model training.
    +   *
    +   * @param dataset input dataset
    +   * @param paramMaps An array of parameter maps.
    +   *                  These values override any specified in this Estimator's embedded ParamMap.
    +   * @return An iterator which produces one model per call to `next`. The models may be produced in
    +   *         a different order than the order of the parameters in paramMap. The next method of
    +   *         the iterator will return a tuple of the form `(index, model)` where model corresponds
    +   *         to `paramMaps(index)`. This Iterator should be thread safe, meaning concurrent calls
    +   *         to `next` should always produce unique values of `index`.
    +   *
    +   * :: Experimental ::
    +   */
    +  @Experimental
    +  @Since("2.3.0")
    +  def fitMultiple(
    +      dataset: Dataset[_],
    +      paramMaps: Array[ParamMap]): JIterator[(Integer, M)] = {
    +
    +    val numModel = paramMaps.length
    +    val counter = new AtomicInteger(0)
    +    new JIterator[(Integer, M)] {
    +      def next(): (Integer, M) = {
    +        val index = counter.getAndIncrement()
    +        if (index < numModel) {
    +          (index, fit(dataset, paramMaps(index)))
    +        } else {
    +          counter.set(numModel)
    +          throw new NoSuchElementException("Iterator finished.")
    +        }
    +      }
    +
    +      override def hasNext: Boolean = counter.get() < numModel
    --- End diff --
    
    This is true for any concurrent class with multiple methods, for example take `scala.collections.concurrent.Map`. If I were to write `if (! myMap.contains(key)) { myMap += (key, value) }` I could not guarantee that `key` was not added between my calls to `contain` & `+=`.
    
    In the multithreaded case folks will need to do something like this:
    
    ```
    try {
      while (true) {
        val next = iter.next()
      }
    } catch (NoSuchElementException)
    ```
    
    or simply count the number of calls to `next` and ensure that it's equal to the number of `paramMaps` passes to `fitMultiple` (that's mostly what I do in this PR).
    
    We need to define `hasNext` to implement the Java Iterator interface and it could be useful in the single threaded case, but we could drop `hasNext` if using a Java Iterator doesn't feel important.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20095: [SPARK-22126][ML] Added fitMultiple method with default ...

Posted by MrBago <gi...@git.apache.org>.
Github user MrBago commented on the issue:

    https://github.com/apache/spark/pull/20095
  
    @jkbradley I pushed changes in response to your comments. I think we should split the `TrainValidationSplit` memory split into another PR, I may have time to work on it tomorrow.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20095: [SPARK-22126][ML] Added fitMultiple method with default ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20095
  
    **[Test build #85441 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85441/testReport)** for PR 20095 at commit [`1992931`](https://github.com/apache/spark/commit/1992931d08b65046d63090760297cc7d41e7373a).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #20095: [SPARK-22126][ML] Added fitMultiple method with d...

Posted by WeichenXu123 <gi...@git.apache.org>.
Github user WeichenXu123 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20095#discussion_r186381507
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/Estimator.scala ---
    @@ -79,7 +82,52 @@ abstract class Estimator[M <: Model[M]] extends PipelineStage {
        */
       @Since("2.0.0")
       def fit(dataset: Dataset[_], paramMaps: Array[ParamMap]): Seq[M] = {
    --- End diff --
    
    Seemingly we should use the following note to deprecate it:
    """
    .. note:: Deprecated in 2.3.0. Use :func:`Estimator.fitMultiple` instead.
    """
    and like other places, add a warning
    ```
    warnings.warn("Deprecated in 2.3.0. Use Estimator.fitMultiple instead.", DeprecationWarning)
    ```
    ?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #20095: [SPARK-22126][ML] Added fitMultiple method with d...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20095#discussion_r159131555
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/Estimator.scala ---
    @@ -79,7 +82,52 @@ abstract class Estimator[M <: Model[M]] extends PipelineStage {
        */
       @Since("2.0.0")
       def fit(dataset: Dataset[_], paramMaps: Array[ParamMap]): Seq[M] = {
    --- End diff --
    
    Let's deprecate this method, say it will be removed in the 3.0.0 release, and tell users and developers to use fitMultiple instead.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #20095: [SPARK-22126][ML] Added fitMultiple method with d...

Posted by WeichenXu123 <gi...@git.apache.org>.
Github user WeichenXu123 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20095#discussion_r158931079
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/Estimator.scala ---
    @@ -79,7 +82,51 @@ abstract class Estimator[M <: Model[M]] extends PipelineStage {
        */
       @Since("2.0.0")
       def fit(dataset: Dataset[_], paramMaps: Array[ParamMap]): Seq[M] = {
    -    paramMaps.map(fit(dataset, _))
    +    val modelIter = fitMultiple(dataset, paramMaps)
    +    val models = paramMaps.map{ _ =>
    +      val (index, model) = modelIter.next()
    +      (index.toInt, model)
    +    }
    +    paramMaps.indices.map(models.toMap)
    +}
    +
    +  /**
    +   * Fits multiple models to the input data with multiple sets of parameters. The default
    +   * implementation calls `fit` once for each call to the iterator's `next` method. Subclasses
    +   * could override this to optimize multi-model training.
    +   *
    +   * @param dataset input dataset
    +   * @param paramMaps An array of parameter maps.
    +   *                  These values override any specified in this Estimator's embedded ParamMap.
    +   * @return An iterator which produces one model per call to `next`. The models may be produced in
    +   *         a different order than the order of the parameters in paramMap. The next method of
    +   *         the iterator will return a tuple of the form `(index, model)` where model corresponds
    +   *         to `paramMaps(index)`. This Iterator should be thread safe, meaning concurrent calls
    +   *         to `next` should always produce unique values of `index`.
    +   *
    +   * :: Experimental ::
    +   */
    +  @Experimental
    +  @Since("2.3.0")
    +  def fitMultiple(
    +      dataset: Dataset[_],
    +      paramMaps: Array[ParamMap]): JIterator[(Integer, M)] = {
    +
    +    val numModel = paramMaps.length
    +    val counter = new AtomicInteger(0)
    +    new JIterator[(Integer, M)] {
    +      def next(): (Integer, M) = {
    +        val index = counter.getAndIncrement()
    +        if (index < numModel) {
    +          (index, fit(dataset, paramMaps(index)))
    +        } else {
    +          counter.set(numModel)
    --- End diff --
    
    This `counter.set` seems to be meaningless. 


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20095: [SPARK-22126][ML] Added fitMultiple method with default ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20095
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20095: [SPARK-22126][ML] Added fitMultiple method with default ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20095
  
    **[Test build #85489 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85489/testReport)** for PR 20095 at commit [`cf89704`](https://github.com/apache/spark/commit/cf89704252e2436ee5333db7fdf6d55ffc094634).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20095: [SPARK-22126][ML] Added fitMultiple method with default ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20095
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85489/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #20095: [SPARK-22126][ML] Added fitMultiple method with d...

Posted by MrBago <gi...@git.apache.org>.
Github user MrBago commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20095#discussion_r159011656
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/Estimator.scala ---
    @@ -79,7 +82,51 @@ abstract class Estimator[M <: Model[M]] extends PipelineStage {
        */
       @Since("2.0.0")
       def fit(dataset: Dataset[_], paramMaps: Array[ParamMap]): Seq[M] = {
    -    paramMaps.map(fit(dataset, _))
    +    val modelIter = fitMultiple(dataset, paramMaps)
    +    val models = paramMaps.map{ _ =>
    +      val (index, model) = modelIter.next()
    +      (index.toInt, model)
    +    }
    +    paramMaps.indices.map(models.toMap)
    +}
    +
    +  /**
    +   * Fits multiple models to the input data with multiple sets of parameters. The default
    +   * implementation calls `fit` once for each call to the iterator's `next` method. Subclasses
    +   * could override this to optimize multi-model training.
    +   *
    +   * @param dataset input dataset
    +   * @param paramMaps An array of parameter maps.
    +   *                  These values override any specified in this Estimator's embedded ParamMap.
    +   * @return An iterator which produces one model per call to `next`. The models may be produced in
    +   *         a different order than the order of the parameters in paramMap. The next method of
    +   *         the iterator will return a tuple of the form `(index, model)` where model corresponds
    +   *         to `paramMaps(index)`. This Iterator should be thread safe, meaning concurrent calls
    +   *         to `next` should always produce unique values of `index`.
    +   *
    +   * :: Experimental ::
    +   */
    +  @Experimental
    +  @Since("2.3.0")
    +  def fitMultiple(
    +      dataset: Dataset[_],
    +      paramMaps: Array[ParamMap]): JIterator[(Integer, M)] = {
    +
    +    val numModel = paramMaps.length
    +    val counter = new AtomicInteger(0)
    +    new JIterator[(Integer, M)] {
    +      def next(): (Integer, M) = {
    +        val index = counter.getAndIncrement()
    +        if (index < numModel) {
    +          (index, fit(dataset, paramMaps(index)))
    +        } else {
    +          counter.set(numModel)
    --- End diff --
    
    I've updated this part to use `compareAndSet` because it's more correct and it makes the intentions here clear :).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20095: [SPARK-22126][ML] Added fitMultiple method with default ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20095
  
    **[Test build #85441 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85441/testReport)** for PR 20095 at commit [`1992931`](https://github.com/apache/spark/commit/1992931d08b65046d63090760297cc7d41e7373a).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20095: [SPARK-22126][ML] Added fitMultiple method with default ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20095
  
    **[Test build #85488 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85488/testReport)** for PR 20095 at commit [`aab321b`](https://github.com/apache/spark/commit/aab321bccdadbb2255530ab8d594554ae30923d5).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20095: [SPARK-22126][ML] Added fitMultiple method with default ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20095
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20095: [SPARK-22126][ML] Added fitMultiple method with default ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20095
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85572/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20095: [SPARK-22126][ML] Added fitMultiple method with default ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20095
  
    **[Test build #85571 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85571/testReport)** for PR 20095 at commit [`4d10c19`](https://github.com/apache/spark/commit/4d10c19e004e6f7b265f7028accbda6d83af8ae4).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20095: [SPARK-22126][ML] Added fitMultiple method with default ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20095
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85441/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20095: [SPARK-22126][ML] Added fitMultiple method with default ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20095
  
    **[Test build #85572 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85572/testReport)** for PR 20095 at commit [`8c7c8e3`](https://github.com/apache/spark/commit/8c7c8e3e14c6c2d7f64c5a3f54c129e794176672).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20095: [SPARK-22126][ML] Added fitMultiple method with default ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20095
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20095: [SPARK-22126][ML] Added fitMultiple method with default ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20095
  
    Build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20095: [SPARK-22126][ML] Added fitMultiple method with default ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20095
  
    **[Test build #85488 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85488/testReport)** for PR 20095 at commit [`aab321b`](https://github.com/apache/spark/commit/aab321bccdadbb2255530ab8d594554ae30923d5).
     * This patch **fails to build**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #20095: [SPARK-22126][ML] Added fitMultiple method with d...

Posted by WeichenXu123 <gi...@git.apache.org>.
Github user WeichenXu123 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20095#discussion_r158929523
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/Estimator.scala ---
    @@ -79,7 +82,51 @@ abstract class Estimator[M <: Model[M]] extends PipelineStage {
        */
       @Since("2.0.0")
       def fit(dataset: Dataset[_], paramMaps: Array[ParamMap]): Seq[M] = {
    -    paramMaps.map(fit(dataset, _))
    +    val modelIter = fitMultiple(dataset, paramMaps)
    +    val models = paramMaps.map{ _ =>
    --- End diff --
    
    style: blank `.map {...`


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20095: [SPARK-22126][ML] Added fitMultiple method with default ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20095
  
    **[Test build #85572 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85572/testReport)** for PR 20095 at commit [`8c7c8e3`](https://github.com/apache/spark/commit/8c7c8e3e14c6c2d7f64c5a3f54c129e794176672).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20095: [SPARK-22126][ML] Added fitMultiple method with default ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20095
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20095: [SPARK-22126][ML] Added fitMultiple method with default ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20095
  
    **[Test build #85489 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85489/testReport)** for PR 20095 at commit [`cf89704`](https://github.com/apache/spark/commit/cf89704252e2436ee5333db7fdf6d55ffc094634).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20095: [SPARK-22126][ML] Added fitMultiple method with default ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20095
  
    **[Test build #85487 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85487/testReport)** for PR 20095 at commit [`0b85a60`](https://github.com/apache/spark/commit/0b85a603f5ce12170f4fcda62ffc6280ad70caab).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20095: [SPARK-22126][ML] Added fitMultiple method with default ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20095
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/2879/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20095: [SPARK-22126][ML] Added fitMultiple method with default ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20095
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85487/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20095: [SPARK-22126][ML] Added fitMultiple method with default ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20095
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85488/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20095: [SPARK-22126][ML] Added fitMultiple method with default ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20095
  
    **[Test build #85487 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85487/testReport)** for PR 20095 at commit [`0b85a60`](https://github.com/apache/spark/commit/0b85a603f5ce12170f4fcda62ffc6280ad70caab).
     * This patch **fails to build**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #20095: [SPARK-22126][ML] Added fitMultiple method with d...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20095#discussion_r159127966
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/Estimator.scala ---
    @@ -79,7 +82,52 @@ abstract class Estimator[M <: Model[M]] extends PipelineStage {
        */
       @Since("2.0.0")
       def fit(dataset: Dataset[_], paramMaps: Array[ParamMap]): Seq[M] = {
    -    paramMaps.map(fit(dataset, _))
    +    val modelIter = fitMultiple(dataset, paramMaps)
    +    val models = paramMaps.map { _ =>
    +      val (index, model) = modelIter.next()
    +      (index.toInt, model)
    +    }
    +    paramMaps.indices.map(models.toMap)
    +}
    --- End diff --
    
    style: indentation


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20095: [SPARK-22126][ML] Added fitMultiple method with default ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20095
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20095: [SPARK-22126][ML] Added fitMultiple method with default ...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/20095
  
    ping @MrBago 


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #20095: [SPARK-22126][ML] Added fitMultiple method with d...

Posted by MrBago <gi...@git.apache.org>.
Github user MrBago commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20095#discussion_r159007817
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/Estimator.scala ---
    @@ -79,7 +82,51 @@ abstract class Estimator[M <: Model[M]] extends PipelineStage {
        */
       @Since("2.0.0")
       def fit(dataset: Dataset[_], paramMaps: Array[ParamMap]): Seq[M] = {
    -    paramMaps.map(fit(dataset, _))
    +    val modelIter = fitMultiple(dataset, paramMaps)
    +    val models = paramMaps.map{ _ =>
    +      val (index, model) = modelIter.next()
    +      (index.toInt, model)
    +    }
    +    paramMaps.indices.map(models.toMap)
    +}
    +
    +  /**
    +   * Fits multiple models to the input data with multiple sets of parameters. The default
    +   * implementation calls `fit` once for each call to the iterator's `next` method. Subclasses
    +   * could override this to optimize multi-model training.
    +   *
    +   * @param dataset input dataset
    +   * @param paramMaps An array of parameter maps.
    +   *                  These values override any specified in this Estimator's embedded ParamMap.
    +   * @return An iterator which produces one model per call to `next`. The models may be produced in
    +   *         a different order than the order of the parameters in paramMap. The next method of
    +   *         the iterator will return a tuple of the form `(index, model)` where model corresponds
    +   *         to `paramMaps(index)`. This Iterator should be thread safe, meaning concurrent calls
    +   *         to `next` should always produce unique values of `index`.
    +   *
    +   * :: Experimental ::
    +   */
    +  @Experimental
    +  @Since("2.3.0")
    +  def fitMultiple(
    +      dataset: Dataset[_],
    +      paramMaps: Array[ParamMap]): JIterator[(Integer, M)] = {
    +
    +    val numModel = paramMaps.length
    +    val counter = new AtomicInteger(0)
    +    new JIterator[(Integer, M)] {
    +      def next(): (Integer, M) = {
    +        val index = counter.getAndIncrement()
    +        if (index < numModel) {
    +          (index, fit(dataset, paramMaps(index)))
    +        } else {
    +          counter.set(numModel)
    --- End diff --
    
    I don't think it's very important but I'm setting the counter here because on the call to `val index = counter.getAndIncrement()` the counter gets incremented even if we throw an exception. Eventually the counter  "could overflow" ..., I don't think it's a major concern but that's why the set is here.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org