You are viewing a plain text version of this content. The canonical link for it is here.

Posted to reviews@spark.apache.org by MrBago <gi...@git.apache.org> on 2017/12/22 21:55:45 UTC

[GitHub] spark pull request #20058: [SPARK-22126][ML][PySpark] Pyspark portion of the...

GitHub user MrBago opened a pull request:

    https://github.com/apache/spark/pull/20058

    [SPARK-22126][ML][PySpark] Pyspark portion of the fit-multiple API

    ## What changes were proposed in this pull request?
    
    Adding fitMultiple API to `Estimator` with default implementation. Also update have ml.tuning meta-estimators use this API.
    
    ## How was this patch tested?
    
    Unit tests.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/MrBago/spark python-fitMultiple

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/20058.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #20058
    
----
commit 15e9c338f5869c519ab13f6612fb89bc8c200bdb
Author: Bago Amirbekian <ba...@...>
Date:   2017-12-18T18:52:16Z

    Add fitMultiple method to Estimator.

commit fdef9d5bfc75b26200f6ff3c78fee044bc64cb66
Author: Bago Amirbekian <ba...@...>
Date:   2017-12-19T00:45:42Z

    A test, some docs, and python2/3 compatibility.

----


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20058: [SPARK-22126][ML][PySpark] Pyspark portion of the fit-mu...

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on the issue:

    https://github.com/apache/spark/pull/20058
  
    reviewing now


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20058: [SPARK-22126][ML][PySpark] Pyspark portion of the fit-mu...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20058
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85443/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #20058: [SPARK-22126][ML][PySpark] Pyspark portion of the...

Posted by MrBago <gi...@git.apache.org>.

Github user MrBago commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20058#discussion_r159004397
  
    --- Diff: python/pyspark/ml/base.py ---
    @@ -47,6 +74,24 @@ def _fit(self, dataset):
             """
             raise NotImplementedError()
     
    +    @since("2.3.0")
    +    def fitMultiple(self, dataset, params):
    +        """
    +        Fits a model to the input dataset for each param map in params.
    +
    +        :param dataset: input dataset, which is an instance of :py:class:`pyspark.sql.DataFrame`.
    +        :param params: A list/tuple of param maps.
    --- End diff --
    
    I changed the docstring to `Sequence` instead of `list/tuple`, is that ok? Do you want to explicitly restrict the input to be a list or tuple?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #20058: [SPARK-22126][ML][PySpark] Pyspark portion of the...

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20058#discussion_r158882154
  
    --- Diff: python/pyspark/ml/base.py ---
    @@ -47,6 +74,24 @@ def _fit(self, dataset):
             """
             raise NotImplementedError()
     
    +    @since("2.3.0")
    +    def fitMultiple(self, dataset, params):
    +        """
    +        Fits a model to the input dataset for each param map in params.
    +
    +        :param dataset: input dataset, which is an instance of :py:class:`pyspark.sql.DataFrame`.
    +        :param params: A list/tuple of param maps.
    +        :return: A thread safe iterable which contains one model for each param map. Each
    +                 call to `next(modelIterator)` will return `(index, model)` where model was fit
    +                 using `params[index]`. Params maps may be fit in an order different than their
    +                 order in params.
    +
    +        .. note:: Experimental
    +        """
    +        def fitSingleModel(index):
    +            return self.fit(dataset, params[index])
    --- End diff --
    
    Shall we make a copy of the Estimator before defining fitSingleModel to be extra safe (in case some other thread modifies the Params in this Estimator before a call to fit()?  You can do ```self.copy()``` beforehand to get a copy.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20058: [SPARK-22922][ML][PySpark] Pyspark portion of the fit-mu...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20058
  
    **[Test build #85529 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85529/testReport)** for PR 20058 at commit [`c44db97`](https://github.com/apache/spark/commit/c44db979e83c0c2e49df84e9e3e0fd375748d7e3).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20058: [SPARK-22126][ML][PySpark] Pyspark portion of the fit-mu...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20058
  
    **[Test build #85319 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85319/testReport)** for PR 20058 at commit [`49c8332`](https://github.com/apache/spark/commit/49c833214e234144f6606e4f8bd53e320a268150).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20058: [SPARK-22126][ML][PySpark] Pyspark portion of the fit-mu...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20058
  
    **[Test build #85483 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85483/testReport)** for PR 20058 at commit [`fe3d6bd`](https://github.com/apache/spark/commit/fe3d6bddc3e9e50febf706d7f22007b1e0d58de3).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #20058: [SPARK-22126][ML][PySpark] Pyspark portion of the...

Posted by MrBago <gi...@git.apache.org>.

Github user MrBago commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20058#discussion_r159004318
  
    --- Diff: python/pyspark/ml/base.py ---
    @@ -18,13 +18,40 @@
     from abc import ABCMeta, abstractmethod
     
     import copy
    +import threading
     
     from pyspark import since
    -from pyspark.ml.param import Params
     from pyspark.ml.param.shared import *
     from pyspark.ml.common import inherit_doc
     from pyspark.sql.functions import udf
    -from pyspark.sql.types import StructField, StructType, DoubleType
    +from pyspark.sql.types import StructField, StructType
    +
    +
    +class FitMutlipleIterator(object):
    --- End diff --
    
    I'm open to this, but I didn't initially do it this way because I've been bit by nested classes in python before. There are subtle issues with nested classes in python. The one that comes to mind is serialization (which isn't an issue here) but that's not the only one.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #20058: [SPARK-22126][ML][PySpark] Pyspark portion of the...

Posted by MrBago <gi...@git.apache.org>.

Github user MrBago commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20058#discussion_r159023958
  
    --- Diff: python/pyspark/ml/base.py ---
    @@ -47,6 +86,28 @@ def _fit(self, dataset):
             """
             raise NotImplementedError()
     
    +    @since("2.3.0")
    +    def fitMultiple(self, dataset, params):
    +        """
    +        Fits a model to the input dataset for each param map in params.
    +
    +        :param dataset: input dataset, which is an instance of :py:class:`pyspark.sql.DataFrame`.
    +        :param params: A Sequence of param maps.
    +        :return: A thread safe iterable which contains one model for each param map. Each
    +                 call to `next(modelIterator)` will return `(index, model)` where model was fit
    +                 using `params[index]`. Params maps may be fit in an order different than their
    +                 order in params.
    +
    +        .. note:: DeveloperApi
    +        .. note:: Experimental
    +        """
    +        estimator = self.copy()
    +
    +        def fitSingleModel(index):
    +            return estimator.fit(dataset, params[index])
    +
    +        return FitMultipleIterator(fitSingleModel, len(params))
    --- End diff --
    
    The idea is you should be able to do something like this:
    
    ```
    pool = ...
    modelIter = estimator.fitMultiple(params)
    rng = range(len(params))
    for index, model in pool.imap_unordered(lambda _: next(modelIter), rng):
        pass
    ```
    That's pretty much how I've set up corss validator to use it, https://github.com/apache/spark/pull/20058/files/fe3d6bddc3e9e50febf706d7f22007b1e0d58de3#diff-cbc8c36bfdd245e4e4d5bd27f9b95359R292
    
    The reason for set it up this way is so that, when appropriate, Estimators can implement their own optimized `fitMultiple` methods that just need to return an "iterator", A class with `__iter__` and `__next__`. For examples models that use `maxIter` and `maxDepth` params.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20058: [SPARK-22126][ML][PySpark] Pyspark portion of the fit-mu...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20058
  
    **[Test build #85481 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85481/testReport)** for PR 20058 at commit [`209278d`](https://github.com/apache/spark/commit/209278dceff39634bfba5b930e928a0e0efe092e).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #20058: [SPARK-22922][ML][PySpark] Pyspark portion of the...

Posted by asfgit <gi...@git.apache.org>.

Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/20058


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #20058: [SPARK-22126][ML][PySpark] Pyspark portion of the...

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20058#discussion_r159021231
  
    --- Diff: python/pyspark/ml/base.py ---
    @@ -47,6 +74,24 @@ def _fit(self, dataset):
             """
             raise NotImplementedError()
     
    +    @since("2.3.0")
    +    def fitMultiple(self, dataset, params):
    +        """
    +        Fits a model to the input dataset for each param map in params.
    +
    +        :param dataset: input dataset, which is an instance of :py:class:`pyspark.sql.DataFrame`.
    +        :param params: A list/tuple of param maps.
    --- End diff --
    
    Is there another Sequence type this could be other than list or tuple?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #20058: [SPARK-22126][ML][PySpark] Pyspark portion of the...

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20058#discussion_r158881984
  
    --- Diff: python/pyspark/ml/base.py ---
    @@ -47,6 +74,24 @@ def _fit(self, dataset):
             """
             raise NotImplementedError()
     
    +    @since("2.3.0")
    +    def fitMultiple(self, dataset, params):
    +        """
    +        Fits a model to the input dataset for each param map in params.
    +
    +        :param dataset: input dataset, which is an instance of :py:class:`pyspark.sql.DataFrame`.
    +        :param params: A list/tuple of param maps.
    +        :return: A thread safe iterable which contains one model for each param map. Each
    +                 call to `next(modelIterator)` will return `(index, model)` where model was fit
    +                 using `params[index]`. Params maps may be fit in an order different than their
    +                 order in params.
    +
    +        .. note:: Experimental
    --- End diff --
    
    Let's use ```.. note:: DeveloperApi``` too.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #20058: [SPARK-22922][ML][PySpark] Pyspark portion of the...

Posted by MrBago <gi...@git.apache.org>.

Github user MrBago commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20058#discussion_r159105682
  
    --- Diff: python/pyspark/ml/base.py ---
    @@ -47,6 +86,28 @@ def _fit(self, dataset):
             """
             raise NotImplementedError()
     
    +    @since("2.3.0")
    +    def fitMultiple(self, dataset, params):
    --- End diff --
    
    I made this change.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #20058: [SPARK-22922][ML][PySpark] Pyspark portion of the...

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20058#discussion_r159028207
  
    --- Diff: python/pyspark/ml/base.py ---
    @@ -47,6 +86,28 @@ def _fit(self, dataset):
             """
             raise NotImplementedError()
     
    +    @since("2.3.0")
    +    def fitMultiple(self, dataset, params):
    --- End diff --
    
    That's a good point that we could rename "params" to be clearer.  How about "paramMaps"?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20058: [SPARK-22126][ML][PySpark] Pyspark portion of the fit-mu...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20058
  
    **[Test build #85318 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85318/testReport)** for PR 20058 at commit [`fdef9d5`](https://github.com/apache/spark/commit/fdef9d5bfc75b26200f6ff3c78fee044bc64cb66).
     * This patch **fails Python style tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #20058: [SPARK-22126][ML][PySpark] Pyspark portion of the...

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20058#discussion_r159021297
  
    --- Diff: python/pyspark/ml/base.py ---
    @@ -47,6 +86,28 @@ def _fit(self, dataset):
             """
             raise NotImplementedError()
     
    +    @since("2.3.0")
    +    def fitMultiple(self, dataset, params):
    --- End diff --
    
    Check out the discussion on the JIRA and the linked design doc.  Basically, we need the same argument types but different return types from what the current fit() method provides.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #20058: [SPARK-22126][ML][PySpark] Pyspark portion of the...

Posted by holdenk <gi...@git.apache.org>.

Github user holdenk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20058#discussion_r159020468
  
    --- Diff: python/pyspark/ml/base.py ---
    @@ -47,6 +86,28 @@ def _fit(self, dataset):
             """
             raise NotImplementedError()
     
    +    @since("2.3.0")
    +    def fitMultiple(self, dataset, params):
    +        """
    +        Fits a model to the input dataset for each param map in params.
    +
    +        :param dataset: input dataset, which is an instance of :py:class:`pyspark.sql.DataFrame`.
    +        :param params: A Sequence of param maps.
    +        :return: A thread safe iterable which contains one model for each param map. Each
    +                 call to `next(modelIterator)` will return `(index, model)` where model was fit
    +                 using `params[index]`. Params maps may be fit in an order different than their
    +                 order in params.
    +
    +        .. note:: DeveloperApi
    +        .. note:: Experimental
    +        """
    +        estimator = self.copy()
    +
    +        def fitSingleModel(index):
    +            return estimator.fit(dataset, params[index])
    +
    +        return FitMultipleIterator(fitSingleModel, len(params))
    --- End diff --
    
    So whats the benefit of `FitMultipleIterator` v.s. using `imap_unordered`?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20058: [SPARK-22126][ML][PySpark] Pyspark portion of the fit-mu...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20058
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20058: [SPARK-22922][ML][PySpark] Pyspark portion of the fit-mu...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20058
  
    **[Test build #85529 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85529/testReport)** for PR 20058 at commit [`c44db97`](https://github.com/apache/spark/commit/c44db979e83c0c2e49df84e9e3e0fd375748d7e3).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `class _FitMultipleIterator(object):`


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20058: [SPARK-22126][ML][PySpark] Pyspark portion of the fit-mu...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20058
  
    **[Test build #85443 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85443/testReport)** for PR 20058 at commit [`d73af1f`](https://github.com/apache/spark/commit/d73af1ffb409b35788be927fe56dd8aaf03b4f10).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20058: [SPARK-22126][ML][PySpark] Pyspark portion of the fit-mu...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20058
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85319/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20058: [SPARK-22126][ML][PySpark] Pyspark portion of the fit-mu...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20058
  
    **[Test build #85443 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85443/testReport)** for PR 20058 at commit [`d73af1f`](https://github.com/apache/spark/commit/d73af1ffb409b35788be927fe56dd8aaf03b4f10).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20058: [SPARK-22126][ML][PySpark] Pyspark portion of the fit-mu...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20058
  
    **[Test build #85318 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85318/testReport)** for PR 20058 at commit [`fdef9d5`](https://github.com/apache/spark/commit/fdef9d5bfc75b26200f6ff3c78fee044bc64cb66).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20058: [SPARK-22126][ML][PySpark] Pyspark portion of the fit-mu...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20058
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #20058: [SPARK-22126][ML][PySpark] Pyspark portion of the...

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20058#discussion_r158882354
  
    --- Diff: python/pyspark/ml/tuning.py ---
    @@ -31,6 +31,17 @@
                'TrainValidationSplitModel']
     
     
    +def parallelFitTasks(est, train, eva, validation, epm):
    --- End diff --
    
    How about a brief doc string?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #20058: [SPARK-22126][ML][PySpark] Pyspark portion of the...

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20058#discussion_r158881392
  
    --- Diff: python/pyspark/ml/base.py ---
    @@ -18,13 +18,40 @@
     from abc import ABCMeta, abstractmethod
     
     import copy
    +import threading
     
     from pyspark import since
    -from pyspark.ml.param import Params
     from pyspark.ml.param.shared import *
     from pyspark.ml.common import inherit_doc
     from pyspark.sql.functions import udf
    -from pyspark.sql.types import StructField, StructType, DoubleType
    +from pyspark.sql.types import StructField, StructType
    +
    +
    +class FitMutlipleIterator(object):
    +    """
    +    Used by default implementation of Estimator.fitMultiple to produce models in a thread safe
    +    iterator.
    --- End diff --
    
    It'd be nice to document what fitSingleModel should do, plus what the iterator returns.
    
    nit: How about renaming numModel -> numModels ?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20058: [SPARK-22126][ML][PySpark] Pyspark portion of the fit-mu...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20058
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85481/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20058: [SPARK-22922][ML][PySpark] Pyspark portion of the fit-mu...

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on the issue:

    https://github.com/apache/spark/pull/20058
  
    LGTM
    I'll merge this b/c of the time pressure for 2.3, but @holdenk  please follow up if you have more comments on this.
    Thanks!


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #20058: [SPARK-22126][ML][PySpark] Pyspark portion of the...

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20058#discussion_r158882626
  
    --- Diff: python/pyspark/ml/base.py ---
    @@ -47,6 +74,24 @@ def _fit(self, dataset):
             """
             raise NotImplementedError()
     
    +    @since("2.3.0")
    +    def fitMultiple(self, dataset, params):
    +        """
    +        Fits a model to the input dataset for each param map in params.
    +
    +        :param dataset: input dataset, which is an instance of :py:class:`pyspark.sql.DataFrame`.
    +        :param params: A list/tuple of param maps.
    --- End diff --
    
    Let's explicitly check that this is a list or tuple and throw a good error message if not.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #20058: [SPARK-22126][ML][PySpark] Pyspark portion of the...

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20058#discussion_r158880831
  
    --- Diff: python/pyspark/ml/tests.py ---
    @@ -2359,6 +2359,21 @@ def test_unary_transformer_transform(self):
                 self.assertEqual(res.input + shiftVal, res.output)
     
     
    +class TestFit(unittest.TestCase):
    --- End diff --
    
    nit: How about EstimatorTest since this is testing part of the Estimator API?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20058: [SPARK-22922][ML][PySpark] Pyspark portion of the fit-mu...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20058
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20058: [SPARK-22922][ML][PySpark] Pyspark portion of the fit-mu...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20058
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85529/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #20058: [SPARK-22126][ML][PySpark] Pyspark portion of the...

Posted by holdenk <gi...@git.apache.org>.

Github user holdenk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20058#discussion_r159020312
  
    --- Diff: python/pyspark/ml/base.py ---
    @@ -47,6 +86,28 @@ def _fit(self, dataset):
             """
             raise NotImplementedError()
     
    +    @since("2.3.0")
    +    def fitMultiple(self, dataset, params):
    --- End diff --
    
    So in Scala Spark we use the `fit` function rather than separate functions. Also the `params` name is different than the Scala one. Any reason for the difference?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #20058: [SPARK-22922][ML][PySpark] Pyspark portion of the...

Posted by MrBago <gi...@git.apache.org>.

Github user MrBago commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20058#discussion_r159024163
  
    --- Diff: python/pyspark/ml/base.py ---
    @@ -47,6 +86,28 @@ def _fit(self, dataset):
             """
             raise NotImplementedError()
     
    +    @since("2.3.0")
    +    def fitMultiple(self, dataset, params):
    --- End diff --
    
    We couldn't use `fit` because it's going to have the same signature as the existing `fit` method but return a different type, (Iterator[(Int, Model)] instead of Seq[Model]). I was trying to be consistent with Estimator.fit which uses the name `params` which is different than the name of the same argument in Scala :/. Happy to change it.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #20058: [SPARK-22922][ML][PySpark] Pyspark portion of the...

Posted by MrBago <gi...@git.apache.org>.

Github user MrBago commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20058#discussion_r159105728
  
    --- Diff: python/pyspark/ml/base.py ---
    @@ -18,13 +18,40 @@
     from abc import ABCMeta, abstractmethod
     
     import copy
    +import threading
     
     from pyspark import since
    -from pyspark.ml.param import Params
     from pyspark.ml.param.shared import *
     from pyspark.ml.common import inherit_doc
     from pyspark.sql.functions import udf
    -from pyspark.sql.types import StructField, StructType, DoubleType
    +from pyspark.sql.types import StructField, StructType
    +
    +
    +class FitMutlipleIterator(object):
    --- End diff --
    
    @jkbradley @WeichenXu123 I made `FitMultipleIterator` a private class, is that good enough or should I make it internal to the `fitMultiple` method?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20058: [SPARK-22126][ML][PySpark] Pyspark portion of the fit-mu...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20058
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20058: [SPARK-22126][ML][PySpark] Pyspark portion of the fit-mu...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20058
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #20058: [SPARK-22126][ML][PySpark] Pyspark portion of the...

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20058#discussion_r158881199
  
    --- Diff: python/pyspark/ml/base.py ---
    @@ -18,13 +18,40 @@
     from abc import ABCMeta, abstractmethod
     
     import copy
    +import threading
     
     from pyspark import since
    -from pyspark.ml.param import Params
     from pyspark.ml.param.shared import *
     from pyspark.ml.common import inherit_doc
     from pyspark.sql.functions import udf
    -from pyspark.sql.types import StructField, StructType, DoubleType
    +from pyspark.sql.types import StructField, StructType
    +
    +
    +class FitMutlipleIterator(object):
    --- End diff --
    
    typo: Mutliple -> Multiple


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20058: [SPARK-22126][ML][PySpark] Pyspark portion of the fit-mu...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20058
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85318/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #20058: [SPARK-22126][ML][PySpark] Pyspark portion of the...

Posted by WeichenXu123 <gi...@git.apache.org>.

Github user WeichenXu123 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20058#discussion_r158932419
  
    --- Diff: python/pyspark/ml/base.py ---
    @@ -18,13 +18,40 @@
     from abc import ABCMeta, abstractmethod
     
     import copy
    +import threading
     
     from pyspark import since
    -from pyspark.ml.param import Params
     from pyspark.ml.param.shared import *
     from pyspark.ml.common import inherit_doc
     from pyspark.sql.functions import udf
    -from pyspark.sql.types import StructField, StructType, DoubleType
    +from pyspark.sql.types import StructField, StructType
    +
    +
    +class FitMutlipleIterator(object):
    --- End diff --
    
    What about change this `FitMutlipleIterator` class to be an inner class in default implementation method `fitMultiple` ? I think put it outside will be no other usage.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20058: [SPARK-22126][ML][PySpark] Pyspark portion of the fit-mu...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20058
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85483/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20058: [SPARK-22126][ML][PySpark] Pyspark portion of the fit-mu...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20058
  
    **[Test build #85483 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85483/testReport)** for PR 20058 at commit [`fe3d6bd`](https://github.com/apache/spark/commit/fe3d6bddc3e9e50febf706d7f22007b1e0d58de3).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `class FitMultipleIterator(object):`


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20058: [SPARK-22126][ML][PySpark] Pyspark portion of the fit-mu...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20058
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20058: [SPARK-22126][ML][PySpark] Pyspark portion of the fit-mu...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20058
  
    **[Test build #85319 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85319/testReport)** for PR 20058 at commit [`49c8332`](https://github.com/apache/spark/commit/49c833214e234144f6606e4f8bd53e320a268150).
     * This patch **fails Python style tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20058: [SPARK-22126][ML][PySpark] Pyspark portion of the fit-mu...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20058
  
    **[Test build #85481 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85481/testReport)** for PR 20058 at commit [`209278d`](https://github.com/apache/spark/commit/209278dceff39634bfba5b930e928a0e0efe092e).
     * This patch **fails PySpark unit tests**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `class FitMultipleIterator(object):`


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org