You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by MrBago <gi...@git.apache.org> on 2017/12/27 20:08:27 UTC
[GitHub] spark pull request #20095: [SPARK-22126][ML] Added fitMultiple method with d...
GitHub user MrBago opened a pull request:
https://github.com/apache/spark/pull/20095
[SPARK-22126][ML] Added fitMultiple method with default implementation
…mator.
Update TrainValidationSplit & CrossValidator to use fitMultiple method.
## What changes were proposed in this pull request?
The fitMultiple API allows both model specific optimizations for fitting ML estimators & fitting models in parallel when appropriate.
## How was this patch tested?
Existing tests.
Please review http://spark.apache.org/contributing.html before opening a pull request.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/MrBago/spark scala-fitMultipleIterator
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/20095.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #20095
----
commit 1992931d08b65046d63090760297cc7d41e7373a
Author: Bago Amirbekian <ba...@...>
Date: 2017-12-27T20:03:59Z
Added fitMultiple method with default implementation to spark.ml.Estimator.
Update TrainValidationSplit & CrossValidator to use fitMultiple method.
----
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #20095: [SPARK-22126][ML] Added fitMultiple method with d...
Posted by WeichenXu123 <gi...@git.apache.org>.
Github user WeichenXu123 commented on a diff in the pull request:
https://github.com/apache/spark/pull/20095#discussion_r158930992
--- Diff: mllib/src/main/scala/org/apache/spark/ml/Estimator.scala ---
@@ -79,7 +82,51 @@ abstract class Estimator[M <: Model[M]] extends PipelineStage {
*/
@Since("2.0.0")
def fit(dataset: Dataset[_], paramMaps: Array[ParamMap]): Seq[M] = {
- paramMaps.map(fit(dataset, _))
+ val modelIter = fitMultiple(dataset, paramMaps)
+ val models = paramMaps.map{ _ =>
+ val (index, model) = modelIter.next()
+ (index.toInt, model)
+ }
+ paramMaps.indices.map(models.toMap)
+}
+
+ /**
+ * Fits multiple models to the input data with multiple sets of parameters. The default
+ * implementation calls `fit` once for each call to the iterator's `next` method. Subclasses
+ * could override this to optimize multi-model training.
+ *
+ * @param dataset input dataset
+ * @param paramMaps An array of parameter maps.
+ * These values override any specified in this Estimator's embedded ParamMap.
+ * @return An iterator which produces one model per call to `next`. The models may be produced in
+ * a different order than the order of the parameters in paramMap. The next method of
+ * the iterator will return a tuple of the form `(index, model)` where model corresponds
+ * to `paramMaps(index)`. This Iterator should be thread safe, meaning concurrent calls
+ * to `next` should always produce unique values of `index`.
+ *
+ * :: Experimental ::
+ */
+ @Experimental
+ @Since("2.3.0")
+ def fitMultiple(
+ dataset: Dataset[_],
+ paramMaps: Array[ParamMap]): JIterator[(Integer, M)] = {
+
+ val numModel = paramMaps.length
+ val counter = new AtomicInteger(0)
+ new JIterator[(Integer, M)] {
+ def next(): (Integer, M) = {
+ val index = counter.getAndIncrement()
+ if (index < numModel) {
+ (index, fit(dataset, paramMaps(index)))
+ } else {
+ counter.set(numModel)
+ throw new NoSuchElementException("Iterator finished.")
+ }
+ }
+
+ override def hasNext: Boolean = counter.get() < numModel
--- End diff --
Suppose we have 2 threads, and at the time the iterator remaining only one element, the 2 threads both call `hasNext` first, and all passed, and then they both call `next`, then one of the thread will throw exception.
Is this the expected activity ? `hasNext` return `true` but `next` is possible to fail. It seems to break the iterator API ?
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #20095: [SPARK-22126][ML] Added fitMultiple method with d...
Posted by MrBago <gi...@git.apache.org>.
Github user MrBago commented on a diff in the pull request:
https://github.com/apache/spark/pull/20095#discussion_r159006471
--- Diff: mllib/src/main/scala/org/apache/spark/ml/Estimator.scala ---
@@ -79,7 +82,51 @@ abstract class Estimator[M <: Model[M]] extends PipelineStage {
*/
@Since("2.0.0")
def fit(dataset: Dataset[_], paramMaps: Array[ParamMap]): Seq[M] = {
- paramMaps.map(fit(dataset, _))
+ val modelIter = fitMultiple(dataset, paramMaps)
+ val models = paramMaps.map{ _ =>
+ val (index, model) = modelIter.next()
+ (index.toInt, model)
+ }
+ paramMaps.indices.map(models.toMap)
+}
+
+ /**
+ * Fits multiple models to the input data with multiple sets of parameters. The default
+ * implementation calls `fit` once for each call to the iterator's `next` method. Subclasses
+ * could override this to optimize multi-model training.
+ *
+ * @param dataset input dataset
+ * @param paramMaps An array of parameter maps.
+ * These values override any specified in this Estimator's embedded ParamMap.
+ * @return An iterator which produces one model per call to `next`. The models may be produced in
+ * a different order than the order of the parameters in paramMap. The next method of
+ * the iterator will return a tuple of the form `(index, model)` where model corresponds
+ * to `paramMaps(index)`. This Iterator should be thread safe, meaning concurrent calls
+ * to `next` should always produce unique values of `index`.
+ *
+ * :: Experimental ::
+ */
+ @Experimental
+ @Since("2.3.0")
+ def fitMultiple(
+ dataset: Dataset[_],
+ paramMaps: Array[ParamMap]): JIterator[(Integer, M)] = {
+
+ val numModel = paramMaps.length
+ val counter = new AtomicInteger(0)
+ new JIterator[(Integer, M)] {
+ def next(): (Integer, M) = {
+ val index = counter.getAndIncrement()
+ if (index < numModel) {
+ (index, fit(dataset, paramMaps(index)))
+ } else {
+ counter.set(numModel)
+ throw new NoSuchElementException("Iterator finished.")
+ }
+ }
+
+ override def hasNext: Boolean = counter.get() < numModel
--- End diff --
This is true for any concurrent class with multiple methods, for example take `scala.collections.concurrent.Map`. If I were to write `if (! myMap.contains(key)) { myMap += (key, value) }` I could not guarantee that `key` was not added between my calls to `contain` & `+=`.
In the multithreaded case folks will need to do something like this:
```
try {
while (true) {
val next = iter.next()
}
} catch (NoSuchElementException)
```
or simply count the number of calls to `next` and ensure that it's equal to the number of `paramMaps` passes to `fitMultiple` (that's mostly what I do in this PR).
We need to define `hasNext` to implement the Java Iterator interface and it could be useful in the single threaded case, but we could drop `hasNext` if using a Java Iterator doesn't feel important.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #20095: [SPARK-22126][ML] Added fitMultiple method with default ...
Posted by MrBago <gi...@git.apache.org>.
Github user MrBago commented on the issue:
https://github.com/apache/spark/pull/20095
@jkbradley I pushed changes in response to your comments. I think we should split the `TrainValidationSplit` memory split into another PR, I may have time to work on it tomorrow.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #20095: [SPARK-22126][ML] Added fitMultiple method with default ...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/20095
**[Test build #85441 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85441/testReport)** for PR 20095 at commit [`1992931`](https://github.com/apache/spark/commit/1992931d08b65046d63090760297cc7d41e7373a).
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #20095: [SPARK-22126][ML] Added fitMultiple method with d...
Posted by WeichenXu123 <gi...@git.apache.org>.
Github user WeichenXu123 commented on a diff in the pull request:
https://github.com/apache/spark/pull/20095#discussion_r186381507
--- Diff: mllib/src/main/scala/org/apache/spark/ml/Estimator.scala ---
@@ -79,7 +82,52 @@ abstract class Estimator[M <: Model[M]] extends PipelineStage {
*/
@Since("2.0.0")
def fit(dataset: Dataset[_], paramMaps: Array[ParamMap]): Seq[M] = {
--- End diff --
Seemingly we should use the following note to deprecate it:
"""
.. note:: Deprecated in 2.3.0. Use :func:`Estimator.fitMultiple` instead.
"""
and like other places, add a warning
```
warnings.warn("Deprecated in 2.3.0. Use Estimator.fitMultiple instead.", DeprecationWarning)
```
?
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #20095: [SPARK-22126][ML] Added fitMultiple method with d...
Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:
https://github.com/apache/spark/pull/20095#discussion_r159131555
--- Diff: mllib/src/main/scala/org/apache/spark/ml/Estimator.scala ---
@@ -79,7 +82,52 @@ abstract class Estimator[M <: Model[M]] extends PipelineStage {
*/
@Since("2.0.0")
def fit(dataset: Dataset[_], paramMaps: Array[ParamMap]): Seq[M] = {
--- End diff --
Let's deprecate this method, say it will be removed in the 3.0.0 release, and tell users and developers to use fitMultiple instead.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #20095: [SPARK-22126][ML] Added fitMultiple method with d...
Posted by WeichenXu123 <gi...@git.apache.org>.
Github user WeichenXu123 commented on a diff in the pull request:
https://github.com/apache/spark/pull/20095#discussion_r158931079
--- Diff: mllib/src/main/scala/org/apache/spark/ml/Estimator.scala ---
@@ -79,7 +82,51 @@ abstract class Estimator[M <: Model[M]] extends PipelineStage {
*/
@Since("2.0.0")
def fit(dataset: Dataset[_], paramMaps: Array[ParamMap]): Seq[M] = {
- paramMaps.map(fit(dataset, _))
+ val modelIter = fitMultiple(dataset, paramMaps)
+ val models = paramMaps.map{ _ =>
+ val (index, model) = modelIter.next()
+ (index.toInt, model)
+ }
+ paramMaps.indices.map(models.toMap)
+}
+
+ /**
+ * Fits multiple models to the input data with multiple sets of parameters. The default
+ * implementation calls `fit` once for each call to the iterator's `next` method. Subclasses
+ * could override this to optimize multi-model training.
+ *
+ * @param dataset input dataset
+ * @param paramMaps An array of parameter maps.
+ * These values override any specified in this Estimator's embedded ParamMap.
+ * @return An iterator which produces one model per call to `next`. The models may be produced in
+ * a different order than the order of the parameters in paramMap. The next method of
+ * the iterator will return a tuple of the form `(index, model)` where model corresponds
+ * to `paramMaps(index)`. This Iterator should be thread safe, meaning concurrent calls
+ * to `next` should always produce unique values of `index`.
+ *
+ * :: Experimental ::
+ */
+ @Experimental
+ @Since("2.3.0")
+ def fitMultiple(
+ dataset: Dataset[_],
+ paramMaps: Array[ParamMap]): JIterator[(Integer, M)] = {
+
+ val numModel = paramMaps.length
+ val counter = new AtomicInteger(0)
+ new JIterator[(Integer, M)] {
+ def next(): (Integer, M) = {
+ val index = counter.getAndIncrement()
+ if (index < numModel) {
+ (index, fit(dataset, paramMaps(index)))
+ } else {
+ counter.set(numModel)
--- End diff --
This `counter.set` seems to be meaningless.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #20095: [SPARK-22126][ML] Added fitMultiple method with default ...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/20095
Merged build finished. Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #20095: [SPARK-22126][ML] Added fitMultiple method with default ...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/20095
**[Test build #85489 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85489/testReport)** for PR 20095 at commit [`cf89704`](https://github.com/apache/spark/commit/cf89704252e2436ee5333db7fdf6d55ffc094634).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #20095: [SPARK-22126][ML] Added fitMultiple method with default ...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/20095
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85489/
Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #20095: [SPARK-22126][ML] Added fitMultiple method with d...
Posted by MrBago <gi...@git.apache.org>.
Github user MrBago commented on a diff in the pull request:
https://github.com/apache/spark/pull/20095#discussion_r159011656
--- Diff: mllib/src/main/scala/org/apache/spark/ml/Estimator.scala ---
@@ -79,7 +82,51 @@ abstract class Estimator[M <: Model[M]] extends PipelineStage {
*/
@Since("2.0.0")
def fit(dataset: Dataset[_], paramMaps: Array[ParamMap]): Seq[M] = {
- paramMaps.map(fit(dataset, _))
+ val modelIter = fitMultiple(dataset, paramMaps)
+ val models = paramMaps.map{ _ =>
+ val (index, model) = modelIter.next()
+ (index.toInt, model)
+ }
+ paramMaps.indices.map(models.toMap)
+}
+
+ /**
+ * Fits multiple models to the input data with multiple sets of parameters. The default
+ * implementation calls `fit` once for each call to the iterator's `next` method. Subclasses
+ * could override this to optimize multi-model training.
+ *
+ * @param dataset input dataset
+ * @param paramMaps An array of parameter maps.
+ * These values override any specified in this Estimator's embedded ParamMap.
+ * @return An iterator which produces one model per call to `next`. The models may be produced in
+ * a different order than the order of the parameters in paramMap. The next method of
+ * the iterator will return a tuple of the form `(index, model)` where model corresponds
+ * to `paramMaps(index)`. This Iterator should be thread safe, meaning concurrent calls
+ * to `next` should always produce unique values of `index`.
+ *
+ * :: Experimental ::
+ */
+ @Experimental
+ @Since("2.3.0")
+ def fitMultiple(
+ dataset: Dataset[_],
+ paramMaps: Array[ParamMap]): JIterator[(Integer, M)] = {
+
+ val numModel = paramMaps.length
+ val counter = new AtomicInteger(0)
+ new JIterator[(Integer, M)] {
+ def next(): (Integer, M) = {
+ val index = counter.getAndIncrement()
+ if (index < numModel) {
+ (index, fit(dataset, paramMaps(index)))
+ } else {
+ counter.set(numModel)
--- End diff --
I've updated this part to use `compareAndSet` because it's more correct and it makes the intentions here clear :).
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #20095: [SPARK-22126][ML] Added fitMultiple method with default ...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/20095
**[Test build #85441 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85441/testReport)** for PR 20095 at commit [`1992931`](https://github.com/apache/spark/commit/1992931d08b65046d63090760297cc7d41e7373a).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #20095: [SPARK-22126][ML] Added fitMultiple method with default ...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/20095
**[Test build #85488 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85488/testReport)** for PR 20095 at commit [`aab321b`](https://github.com/apache/spark/commit/aab321bccdadbb2255530ab8d594554ae30923d5).
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #20095: [SPARK-22126][ML] Added fitMultiple method with default ...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/20095
Merged build finished. Test FAILed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #20095: [SPARK-22126][ML] Added fitMultiple method with default ...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/20095
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85572/
Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #20095: [SPARK-22126][ML] Added fitMultiple method with default ...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/20095
**[Test build #85571 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85571/testReport)** for PR 20095 at commit [`4d10c19`](https://github.com/apache/spark/commit/4d10c19e004e6f7b265f7028accbda6d83af8ae4).
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #20095: [SPARK-22126][ML] Added fitMultiple method with default ...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/20095
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85441/
Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #20095: [SPARK-22126][ML] Added fitMultiple method with default ...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/20095
**[Test build #85572 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85572/testReport)** for PR 20095 at commit [`8c7c8e3`](https://github.com/apache/spark/commit/8c7c8e3e14c6c2d7f64c5a3f54c129e794176672).
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #20095: [SPARK-22126][ML] Added fitMultiple method with default ...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/20095
Merged build finished. Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #20095: [SPARK-22126][ML] Added fitMultiple method with default ...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/20095
Build finished. Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #20095: [SPARK-22126][ML] Added fitMultiple method with default ...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/20095
**[Test build #85488 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85488/testReport)** for PR 20095 at commit [`aab321b`](https://github.com/apache/spark/commit/aab321bccdadbb2255530ab8d594554ae30923d5).
* This patch **fails to build**.
* This patch merges cleanly.
* This patch adds no public classes.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #20095: [SPARK-22126][ML] Added fitMultiple method with d...
Posted by WeichenXu123 <gi...@git.apache.org>.
Github user WeichenXu123 commented on a diff in the pull request:
https://github.com/apache/spark/pull/20095#discussion_r158929523
--- Diff: mllib/src/main/scala/org/apache/spark/ml/Estimator.scala ---
@@ -79,7 +82,51 @@ abstract class Estimator[M <: Model[M]] extends PipelineStage {
*/
@Since("2.0.0")
def fit(dataset: Dataset[_], paramMaps: Array[ParamMap]): Seq[M] = {
- paramMaps.map(fit(dataset, _))
+ val modelIter = fitMultiple(dataset, paramMaps)
+ val models = paramMaps.map{ _ =>
--- End diff --
style: blank `.map {...`
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #20095: [SPARK-22126][ML] Added fitMultiple method with default ...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/20095
**[Test build #85572 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85572/testReport)** for PR 20095 at commit [`8c7c8e3`](https://github.com/apache/spark/commit/8c7c8e3e14c6c2d7f64c5a3f54c129e794176672).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #20095: [SPARK-22126][ML] Added fitMultiple method with default ...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/20095
Merged build finished. Test FAILed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #20095: [SPARK-22126][ML] Added fitMultiple method with default ...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/20095
**[Test build #85489 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85489/testReport)** for PR 20095 at commit [`cf89704`](https://github.com/apache/spark/commit/cf89704252e2436ee5333db7fdf6d55ffc094634).
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #20095: [SPARK-22126][ML] Added fitMultiple method with default ...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/20095
**[Test build #85487 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85487/testReport)** for PR 20095 at commit [`0b85a60`](https://github.com/apache/spark/commit/0b85a603f5ce12170f4fcda62ffc6280ad70caab).
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #20095: [SPARK-22126][ML] Added fitMultiple method with default ...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/20095
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/2879/
Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #20095: [SPARK-22126][ML] Added fitMultiple method with default ...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/20095
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85487/
Test FAILed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #20095: [SPARK-22126][ML] Added fitMultiple method with default ...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/20095
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85488/
Test FAILed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #20095: [SPARK-22126][ML] Added fitMultiple method with default ...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/20095
**[Test build #85487 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85487/testReport)** for PR 20095 at commit [`0b85a60`](https://github.com/apache/spark/commit/0b85a603f5ce12170f4fcda62ffc6280ad70caab).
* This patch **fails to build**.
* This patch merges cleanly.
* This patch adds no public classes.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #20095: [SPARK-22126][ML] Added fitMultiple method with d...
Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:
https://github.com/apache/spark/pull/20095#discussion_r159127966
--- Diff: mllib/src/main/scala/org/apache/spark/ml/Estimator.scala ---
@@ -79,7 +82,52 @@ abstract class Estimator[M <: Model[M]] extends PipelineStage {
*/
@Since("2.0.0")
def fit(dataset: Dataset[_], paramMaps: Array[ParamMap]): Seq[M] = {
- paramMaps.map(fit(dataset, _))
+ val modelIter = fitMultiple(dataset, paramMaps)
+ val models = paramMaps.map { _ =>
+ val (index, model) = modelIter.next()
+ (index.toInt, model)
+ }
+ paramMaps.indices.map(models.toMap)
+}
--- End diff --
style: indentation
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #20095: [SPARK-22126][ML] Added fitMultiple method with default ...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/20095
Merged build finished. Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #20095: [SPARK-22126][ML] Added fitMultiple method with default ...
Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:
https://github.com/apache/spark/pull/20095
ping @MrBago
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #20095: [SPARK-22126][ML] Added fitMultiple method with d...
Posted by MrBago <gi...@git.apache.org>.
Github user MrBago commented on a diff in the pull request:
https://github.com/apache/spark/pull/20095#discussion_r159007817
--- Diff: mllib/src/main/scala/org/apache/spark/ml/Estimator.scala ---
@@ -79,7 +82,51 @@ abstract class Estimator[M <: Model[M]] extends PipelineStage {
*/
@Since("2.0.0")
def fit(dataset: Dataset[_], paramMaps: Array[ParamMap]): Seq[M] = {
- paramMaps.map(fit(dataset, _))
+ val modelIter = fitMultiple(dataset, paramMaps)
+ val models = paramMaps.map{ _ =>
+ val (index, model) = modelIter.next()
+ (index.toInt, model)
+ }
+ paramMaps.indices.map(models.toMap)
+}
+
+ /**
+ * Fits multiple models to the input data with multiple sets of parameters. The default
+ * implementation calls `fit` once for each call to the iterator's `next` method. Subclasses
+ * could override this to optimize multi-model training.
+ *
+ * @param dataset input dataset
+ * @param paramMaps An array of parameter maps.
+ * These values override any specified in this Estimator's embedded ParamMap.
+ * @return An iterator which produces one model per call to `next`. The models may be produced in
+ * a different order than the order of the parameters in paramMap. The next method of
+ * the iterator will return a tuple of the form `(index, model)` where model corresponds
+ * to `paramMaps(index)`. This Iterator should be thread safe, meaning concurrent calls
+ * to `next` should always produce unique values of `index`.
+ *
+ * :: Experimental ::
+ */
+ @Experimental
+ @Since("2.3.0")
+ def fitMultiple(
+ dataset: Dataset[_],
+ paramMaps: Array[ParamMap]): JIterator[(Integer, M)] = {
+
+ val numModel = paramMaps.length
+ val counter = new AtomicInteger(0)
+ new JIterator[(Integer, M)] {
+ def next(): (Integer, M) = {
+ val index = counter.getAndIncrement()
+ if (index < numModel) {
+ (index, fit(dataset, paramMaps(index)))
+ } else {
+ counter.set(numModel)
--- End diff --
I don't think it's very important but I'm setting the counter here because on the call to `val index = counter.getAndIncrement()` the counter gets incremented even if we throw an exception. Eventually the counter "could overflow" ..., I don't think it's a major concern but that's why the set is here.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org