You are viewing a plain text version of this content. The canonical link for it is here.

Posted to reviews@spark.apache.org by BryanCutler <gi...@git.apache.org> on 2016/04/30 00:23:29 UTC

[GitHub] spark pull request: [SPARK-15018][PYSPARK][ML] Fixed bug causing e...

GitHub user BryanCutler opened a pull request:

    https://github.com/apache/spark/pull/12790

    [SPARK-15018][PYSPARK][ML] Fixed bug causing error if PySpark Pipeline used without stages

    ## What changes were proposed in this pull request?
    
    When fitting a PySpark Pipeline with no stages, it should work as an identity transformer. Instead a NoneType error is thrown because the `stage` param does not get set to a default empty list.  This change fixes this error by setting and getting the param as a default and properly changing the kwargs to an empty list if `None` is passed in. 
    
    
    ## How was this patch tested?
    Added new unit tests to verify an empty Pipeline acts as an identity transformer
    
    


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/BryanCutler/spark pipeline-identity-SPARK-15018

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/12790.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #12790
    
----
commit e26de7eaf699951dd645922156b0db6c044086cf
Author: Bryan Cutler <cu...@gmail.com>
Date:   2016-04-29T22:15:19Z

    Fixed bug causing error if PySpark Pipeline used without stages

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-15018][PYSPARK][ML] Fixed bug causing e...

Posted by yanboliang <gi...@git.apache.org>.

Github user yanboliang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12790#discussion_r63727210
  
    --- Diff: python/pyspark/ml/pipeline.py ---
    @@ -89,9 +87,9 @@ def setParams(self, stages=None):
             setParams(self, stages=None)
             Sets params for Pipeline.
             """
    -        if stages is None:
    -            stages = []
             kwargs = self.setParams._input_kwargs
    +        if stages is None:
    +            kwargs['stages'] = []
    --- End diff --
    
    The logic behind this code
    ```
    if stages is None:
                stages = []
    ```
    is we can not make mutable values as default arguments (you can refer #8110).
    So the cause of this bug is we should pass ```[]``` rather than ```None``` to ```self._paramMap[self.stages]```.
    I have run the following code:
    ```
    >>> p = Pipeline()
    >>> p.fit(training)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/Users/yliang/data/trunk1/spark/python/pyspark/ml/base.py", line 64, in fit
        return self._fit(dataset)
      File "/Users/yliang/data/trunk1/spark/python/pyspark/ml/pipeline.py", line 98, in _fit
        for stage in stages:
    TypeError: 'NoneType' object is not iterable
    >>> p.getStages()
    >>> p.setStages([])
    Pipeline_40bc8371599065c2f33d
    >>> p.getStages()
    []
    >>> p.fit(training)
    PipelineModel_48a49d414b2eac1eac36
    ```



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-15018][PYSPARK][ML] Fixed bug causing e...

Posted by BryanCutler <gi...@git.apache.org>.

Github user BryanCutler commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12790#discussion_r62377335
  
    --- Diff: python/pyspark/ml/pipeline.py ---
    @@ -89,9 +87,9 @@ def setParams(self, stages=None):
             setParams(self, stages=None)
             Sets params for Pipeline.
             """
    -        if stages is None:
    -            stages = []
             kwargs = self.setParams._input_kwargs
    +        if stages is None:
    +            kwargs['stages'] = []
    --- End diff --
    
    maybe this would be better?
    ```python
    kwargs[Pipeline.stages.name] = []
    ```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #12790: [SPARK-15018][PYSPARK][ML] Improve handling of PySpark P...

Posted by BryanCutler <gi...@git.apache.org>.

Github user BryanCutler commented on the issue:

    https://github.com/apache/spark/pull/12790
  
    @yanboliang, I updated titles and descriptions here and in the JIRA.  Thanks!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #12790: [SPARK-15018][PYSPARK][ML] Fixed bug causing error if Py...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/12790
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/63239/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #12790: [SPARK-15018][PYSPARK][ML] Fixed bug causing erro...

Posted by MechCoder <gi...@git.apache.org>.

Github user MechCoder commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12790#discussion_r75023876
  
    --- Diff: python/pyspark/ml/tests.py ---
    @@ -230,6 +230,15 @@ def test_pipeline(self):
             self.assertEqual(5, transformer3.dataset_index)
             self.assertEqual(6, dataset.index)
     
    +    def test_identity_pipeline(self):
    +        dataset = MockDataset()
    +
    +        def doTransform(pipeline):
    +            pipeline_model = pipeline.fit(dataset)
    +            return pipeline_model.transform(dataset)
    +        self.assertEqual(dataset.index, doTransform(Pipeline()).index)
    +        self.assertEqual(dataset.index, doTransform(Pipeline(stages=[])).index)
    --- End diff --
    
    Should we also check that `setParams(stages=[])` and `Pipeline().getStages()` return the expected value?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #12790: [SPARK-15018][PYSPARK][ML] Fixed bug causing error if Py...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/12790
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-15018][PYSPARK][ML] Fixed bug causing e...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12790#issuecomment-215899867
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/57373/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #12790: [SPARK-15018][PYSPARK][ML] Fixed bug causing erro...

Posted by BryanCutler <gi...@git.apache.org>.

Github user BryanCutler commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12790#discussion_r75229691
  
    --- Diff: python/pyspark/ml/tests.py ---
    @@ -230,6 +230,15 @@ def test_pipeline(self):
             self.assertEqual(5, transformer3.dataset_index)
             self.assertEqual(6, dataset.index)
     
    +    def test_identity_pipeline(self):
    +        dataset = MockDataset()
    +
    +        def doTransform(pipeline):
    +            pipeline_model = pipeline.fit(dataset)
    +            return pipeline_model.transform(dataset)
    +        self.assertEqual(dataset.index, doTransform(Pipeline()).index)
    +        self.assertEqual(dataset.index, doTransform(Pipeline(stages=[])).index)
    --- End diff --
    
    Maybe, but these are no different than any other param and we usually don't test that we can get/set each one.  I think it's safe to assume that `setParams` will set the given value and `getStages` will return the value set or default.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #12790: [SPARK-15018][PYSPARK][ML] Fixed bug causing error if Py...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/12790
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/64018/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #12790: [SPARK-15018][PYSPARK][ML] Fixed bug causing error if Py...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/12790
  
    **[Test build #64009 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64009/consoleFull)** for PR 12790 at commit [`43b08b6`](https://github.com/apache/spark/commit/43b08b60f0a22f5e2b6c0058a3ab13eaf5869022).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #12790: [SPARK-15018][PYSPARK][ML] Fixed bug causing error if Py...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/12790
  
    **[Test build #60962 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/60962/consoleFull)** for PR 12790 at commit [`e1df580`](https://github.com/apache/spark/commit/e1df580a3e96bf791d78e911b124e180536d2937).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #12790: [SPARK-15018][PYSPARK][ML] Fixed bug causing error if Py...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/12790
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #12790: [SPARK-15018][PYSPARK][ML] Fixed bug causing error if Py...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/12790
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/63980/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-15018][PYSPARK][ML] Fixed bug causing e...

Posted by BryanCutler <gi...@git.apache.org>.

Github user BryanCutler commented on the pull request:

    https://github.com/apache/spark/pull/12790#issuecomment-215897263
  
    cc @jkbradley @MLnick 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #12790: [SPARK-15018][PYSPARK][ML] Fixed bug causing error if Py...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/12790
  
    **[Test build #63980 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/63980/consoleFull)** for PR 12790 at commit [`e1df580`](https://github.com/apache/spark/commit/e1df580a3e96bf791d78e911b124e180536d2937).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #12790: [SPARK-15018][PYSPARK][ML] Fixed bug causing error if Py...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/12790
  
    **[Test build #63239 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/63239/consoleFull)** for PR 12790 at commit [`e1df580`](https://github.com/apache/spark/commit/e1df580a3e96bf791d78e911b124e180536d2937).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-15018][PYSPARK][ML] Fixed bug causing e...

Posted by BryanCutler <gi...@git.apache.org>.

Github user BryanCutler commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12790#discussion_r63561746
  
    --- Diff: python/pyspark/ml/pipeline.py ---
    @@ -89,9 +87,9 @@ def setParams(self, stages=None):
             setParams(self, stages=None)
             Sets params for Pipeline.
             """
    -        if stages is None:
    -            stages = []
             kwargs = self.setParams._input_kwargs
    +        if stages is None:
    +            kwargs['stages'] = []
    --- End diff --
    
    ping @MLnick , have any thoughts about this?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #12790: [SPARK-15018][PYSPARK][ML] Fixed bug causing error if Py...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/12790
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #12790: [SPARK-15018][PYSPARK][ML] Fixed bug causing erro...

Posted by BryanCutler <gi...@git.apache.org>.

Github user BryanCutler commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12790#discussion_r75229100
  
    --- Diff: python/pyspark/ml/pipeline.py ---
    @@ -57,9 +57,8 @@ def __init__(self, stages=None):
             """
             __init__(self, stages=None)
             """
    -        if stages is None:
    -            stages = []
             super(Pipeline, self).__init__()
    +        self._setDefault(stages=[])
    --- End diff --
    
    Sure


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-15018][PYSPARK][ML] Fixed bug causing e...

Posted by BryanCutler <gi...@git.apache.org>.

Github user BryanCutler commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12790#discussion_r62393201
  
    --- Diff: python/pyspark/ml/pipeline.py ---
    @@ -89,9 +87,9 @@ def setParams(self, stages=None):
             setParams(self, stages=None)
             Sets params for Pipeline.
             """
    -        if stages is None:
    -            stages = []
             kwargs = self.setParams._input_kwargs
    +        if stages is None:
    +            kwargs['stages'] = []
    --- End diff --
    
    Well if you actually call `setParams` with `stages=None` like the function shows as the default, it seems like it should work, but fails when `_fit` is called.  Maybe people would know not to do this and it's not really a problem?



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #12790: [SPARK-15018][PYSPARK][ML] Fixed bug causing erro...

Posted by MechCoder <gi...@git.apache.org>.

Github user MechCoder commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12790#discussion_r75390810
  
    --- Diff: python/pyspark/status.py ---
    @@ -83,6 +85,8 @@ def getJobInfo(self, jobId):
             job = self._jtracker.getJobInfo(jobId)
             if job is not None:
                 return SparkJobInfo(jobId, job.stageIds(), str(job.status()))
    +        else:
    --- End diff --
    
    Python returns None by default ;)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #12790: [SPARK-15018][PYSPARK][ML] Fixed bug causing error if Py...

Posted by BryanCutler <gi...@git.apache.org>.

Github user BryanCutler commented on the issue:

    https://github.com/apache/spark/pull/12790
  
    Yeah, there is no default value in the Scala side.  So I guess the thing to do here is to just give a better error message in Python when the param is not set and document the behavior a little better in the pydocs.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #12790: [SPARK-15018][PYSPARK][ML] Fixed bug causing error if Py...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/12790
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/60962/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #12790: [SPARK-15018][PYSPARK][ML] Fixed bug causing error if Py...

Posted by BryanCutler <gi...@git.apache.org>.

Github user BryanCutler commented on the issue:

    https://github.com/apache/spark/pull/12790
  
    Bumping this, there has been some discussion but hopefully we can get a consensus on constructing a pipeline with no stages (should be an identity trans).  Looking at the current API, it seems like the following should be supported:
    
    1) `pl = Pipeline(stages=[])`
    2) `pl = Pipeline(stages=None)`
    3) `pl = Pipeline()`
    
    However, only 1) works, while the others will lead to a cryptic error when calling `fit`.  I think 1) & 3) should be supported, I am indifferent about 2).  Regardless of what's decided, this code should be cleaned up because the current checks don't do anything, there is no default value, and `getStages` is missing a return path.
    
    cc @mengxr @jkbradley @MLnick 



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #12790: [SPARK-15018][PYSPARK][ML] Improve handling of Py...

Posted by asfgit <gi...@git.apache.org>.

Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/12790


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #12790: [SPARK-15018][PYSPARK][ML] Fixed bug causing error if Py...

Posted by BryanCutler <gi...@git.apache.org>.

Github user BryanCutler commented on the issue:

    https://github.com/apache/spark/pull/12790
  
    @yanboliang and @MechCoder , I updated this to remove the default value for the `stages` param to be consistent with Scala and also made the pydocs a little more clear that the param needs to be a list and could be empty.
    
    Now if `getStages()` is called and the param is not set, it will result in a `KeyError` and indicate the `stages` param, instead of just giving "TypeError: 'NoneType' object is not iterable" as before.  This is consistent with other PySpark params also.  


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #12790: [SPARK-15018][PYSPARK][ML] Fixed bug causing error if Py...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/12790
  
    **[Test build #60962 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/60962/consoleFull)** for PR 12790 at commit [`e1df580`](https://github.com/apache/spark/commit/e1df580a3e96bf791d78e911b124e180536d2937).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-15018][PYSPARK][ML] Fixed bug causing e...

Posted by yanboliang <gi...@git.apache.org>.

Github user yanboliang commented on the pull request:

    https://github.com/apache/spark/pull/12790#issuecomment-220066125
  
    The logic behind this code
    ```
    if stages is None:
                stages = []
    ```
    is we can not make mutable values as default arguments (you can refer #8110).
    So the cause of this bug is we should pass ```[]``` rather than ```None``` to ```self._paramMap[self.stages]```.
    I have run the following code:
    ```
    >>> p = Pipeline()
    >>> p.fit(training)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/Users/yliang/data/trunk1/spark/python/pyspark/ml/base.py", line 64, in fit
        return self._fit(dataset)
      File "/Users/yliang/data/trunk1/spark/python/pyspark/ml/pipeline.py", line 98, in _fit
        for stage in stages:
    TypeError: 'NoneType' object is not iterable
    >>> p.getStages()
    >>> p.setStages([])
    Pipeline_40bc8371599065c2f33d
    >>> p.getStages()
    []
    >>> p.fit(training)
    PipelineModel_48a49d414b2eac1eac36
    ```



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-15018][PYSPARK][ML] Fixed bug causing e...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12790#issuecomment-215899798
  
    **[Test build #57373 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/57373/consoleFull)** for PR 12790 at commit [`e26de7e`](https://github.com/apache/spark/commit/e26de7eaf699951dd645922156b0db6c044086cf).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-15018][PYSPARK][ML] Fixed bug causing e...

Posted by BryanCutler <gi...@git.apache.org>.

Github user BryanCutler commented on the pull request:

    https://github.com/apache/spark/pull/12790#issuecomment-220106834
  
    Thanks @yanboliang .  I just want to point out that we probably don't want to use the logic 
    `if stages is None:` because that might end up setting stages to the default unintentionally.  For example, if another param 'foo' was added to `Pipeline` so that now setParams looks like this
    ```python
    def setParams(stages=None, foo=bar):
        ...
    ```
    then calling it like this `p.setParams(foo=baz)` would also cause that logic to be true and then set `stages` to the default even though the user just wanted to set `foo`
    
    Also, to clarify what is happening when the an empty `Pipeline` is constructed and then `fit` like this
    ```python
    >>> p = Pipeline()
    >>> p.fit(training)
    ```
    the code 
    ```python
    if stages is None:
                stages = []
    ```
    actually does nothing since it is just a local variable and does not modify `kwargs`, so the `stages` param never even gets set.  Then if `getStages()` is called, there is no return path since `stages` was never added to `_paramMap`, which is why `fit()` fails when trying to iterate over a `NoneType`


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #12790: [SPARK-15018][PYSPARK][ML] Fixed bug causing error if Py...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/12790
  
    **[Test build #63980 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/63980/consoleFull)** for PR 12790 at commit [`e1df580`](https://github.com/apache/spark/commit/e1df580a3e96bf791d78e911b124e180536d2937).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #12790: [SPARK-15018][PYSPARK][ML] Fixed bug causing error if Py...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/12790
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/64009/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #12790: [SPARK-15018][PYSPARK][ML] Fixed bug causing error if Py...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/12790
  
    **[Test build #63239 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/63239/consoleFull)** for PR 12790 at commit [`e1df580`](https://github.com/apache/spark/commit/e1df580a3e96bf791d78e911b124e180536d2937).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #12790: [SPARK-15018][PYSPARK][ML] Fixed bug causing error if Py...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/12790
  
    **[Test build #64009 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64009/consoleFull)** for PR 12790 at commit [`43b08b6`](https://github.com/apache/spark/commit/43b08b60f0a22f5e2b6c0058a3ab13eaf5869022).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #12790: [SPARK-15018][PYSPARK][ML] Fixed bug causing error if Py...

Posted by yanboliang <gi...@git.apache.org>.

Github user yanboliang commented on the issue:

    https://github.com/apache/spark/pull/12790
  
    @BryanCutler I found when fitting a Scala Pipeline with no stages,
    ```
    val pipeline = new Pipeline()
    val model = pipeline.fit(df)
    ```
    it throw exceptions:
    ```
    Failed to find a default value for stages
    java.util.NoSuchElementException: Failed to find a default value for stages
    	at org.apache.spark.ml.param.Params$$anonfun$getOrDefault$2.apply(params.scala:660)
    	at org.apache.spark.ml.param.Params$$anonfun$getOrDefault$2.apply(params.scala:660)
    	at scala.Option.getOrElse(Option.scala:121)
    	at org.apache.spark.ml.param.Params$class.getOrDefault(params.scala:659)
    	at org.apache.spark.ml.PipelineStage.getOrDefault(Pipeline.scala:42)
    	at org.apache.spark.ml.param.Params$class.$(params.scala:664)
    	at org.apache.spark.ml.PipelineStage.$(Pipeline.scala:42)
    	at org.apache.spark.ml.Pipeline.transformSchema(Pipeline.scala:177)
    	at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:70)
    	at org.apache.spark.ml.Pipeline.fit(Pipeline.scala:132)
    ```
    I'm more prefer to make Scala and Python have consistent behavior.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #12790: [SPARK-15018][PYSPARK][ML] Fixed bug causing error if Py...

Posted by MLnick <gi...@git.apache.org>.

Github user MLnick commented on the issue:

    https://github.com/apache/spark/pull/12790
  
    Actually the behavior in Scala is consistent - passing in `Array()` or `Array.empty()` you are passing in `Array[Nothing]` so that is why there is a `ClassCastException`.
    
    Passing in an empty array of the correct type works as an identity transformer, as documented:
    
    ```
    scala> df.show
    +---+----+
    | id|text|
    +---+----+
    |  1| foo|
    |  2| bar|
    |  3| baz|
    +---+----+
    
    scala> val p = new Pipeline()
    p: org.apache.spark.ml.Pipeline = pipeline_0778ab6ded4f
    
    scala> p.setStages(Array[PipelineStage]())
    res20: p.type = pipeline_0778ab6ded4f
    
    scala> p.fit(df).transform(df).show
    +---+----+
    | id|text|
    +---+----+
    |  1| foo|
    |  2| bar|
    |  3| baz|
    +---+----+
    ```
    
    Personally, I find it weird to allow empty stages - what use is an identity transformer? I'd prefer to throw an error during `setStages` if it's empty. But perhaps not worth changing ...
    
    If we don't make that change, we could make it on the Scala side that if an array is passed that is empty, set stages to a new `Array[PipelineStage]`?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #12790: [SPARK-15018][PYSPARK][ML] Fixed bug causing error if Py...

Posted by yanboliang <gi...@git.apache.org>.

Github user yanboliang commented on the issue:

    https://github.com/apache/spark/pull/12790
  
    Jenkins, test this please.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #12790: [SPARK-15018][PYSPARK][ML] Fixed bug causing error if Py...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/12790
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #12790: [SPARK-15018][PYSPARK][ML] Fixed bug causing error if Py...

Posted by BryanCutler <gi...@git.apache.org>.

Github user BryanCutler commented on the issue:

    https://github.com/apache/spark/pull/12790
  
    Thanks for the review @MechCoder!  Do you have any preference for what should be supported from this comment (above) https://github.com/apache/spark/pull/12790#issuecomment-227580800)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #12790: [SPARK-15018][PYSPARK][ML] Fixed bug causing error if Py...

Posted by MechCoder <gi...@git.apache.org>.

Github user MechCoder commented on the issue:

    https://github.com/apache/spark/pull/12790
  
    If that's the case then the piece of documentation that promises the Pipeline to behave as an identity transformer when no stages are used, has to be changed (removed).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #12790: [SPARK-15018][PYSPARK][ML] Fixed bug causing error if Py...

Posted by MechCoder <gi...@git.apache.org>.

Github user MechCoder commented on the issue:

    https://github.com/apache/spark/pull/12790
  
    LGTM: cc @yanboliang @srowen 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #12790: [SPARK-15018][PYSPARK][ML] Fixed bug causing erro...

Posted by MechCoder <gi...@git.apache.org>.

Github user MechCoder commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12790#discussion_r75024959
  
    --- Diff: python/pyspark/ml/pipeline.py ---
    @@ -57,9 +57,8 @@ def __init__(self, stages=None):
             """
             __init__(self, stages=None)
             """
    -        if stages is None:
    -            stages = []
             super(Pipeline, self).__init__()
    +        self._setDefault(stages=[])
    --- End diff --
    
    Could you add a comment on why this is being done for future reference?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #12790: [SPARK-15018][PYSPARK][ML] Fixed bug causing error if Py...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/12790
  
    **[Test build #64018 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64018/consoleFull)** for PR 12790 at commit [`6bee0a6`](https://github.com/apache/spark/commit/6bee0a6f95e60854f692cda9b69494cc1bb72c38).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-15018][PYSPARK][ML] Fixed bug causing e...

Posted by holdenk <gi...@git.apache.org>.

Github user holdenk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12790#discussion_r62543286
  
    --- Diff: python/pyspark/ml/pipeline.py ---
    @@ -89,9 +87,9 @@ def setParams(self, stages=None):
             setParams(self, stages=None)
             Sets params for Pipeline.
             """
    -        if stages is None:
    -            stages = []
             kwargs = self.setParams._input_kwargs
    +        if stages is None:
    +            kwargs['stages'] = []
    --- End diff --
    
    Yah I guess since it was previously being checked it might be good to check (although switching it to only do this when explicitly set to none as your editted comment does probably makes sense). Maybe @sethah and or @davies / @yanboliang can weigh in?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-15018][PYSPARK][ML] Fixed bug causing e...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12790#issuecomment-215899866
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-15018][PYSPARK][ML] Fixed bug causing e...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12790#issuecomment-215897850
  
    **[Test build #57373 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/57373/consoleFull)** for PR 12790 at commit [`e26de7e`](https://github.com/apache/spark/commit/e26de7eaf699951dd645922156b0db6c044086cf).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #12790: [SPARK-15018][PYSPARK][ML] Fixed bug causing error if Py...

Posted by MechCoder <gi...@git.apache.org>.

Github user MechCoder commented on the issue:

    https://github.com/apache/spark/pull/12790
  
    Awesome! Thansk!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #12790: [SPARK-15018][PYSPARK][ML] Improve handling of PySpark P...

Posted by yanboliang <gi...@git.apache.org>.

Github user yanboliang commented on the issue:

    https://github.com/apache/spark/pull/12790
  
    LGTM, merged into master. Thanks!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #12790: [SPARK-15018][PYSPARK][ML] Fixed bug causing error if Py...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/12790
  
    **[Test build #64018 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64018/consoleFull)** for PR 12790 at commit [`6bee0a6`](https://github.com/apache/spark/commit/6bee0a6f95e60854f692cda9b69494cc1bb72c38).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #12790: [SPARK-15018][PYSPARK][ML] Fixed bug causing erro...

Posted by BryanCutler <gi...@git.apache.org>.

Github user BryanCutler commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12790#discussion_r75392014
  
    --- Diff: python/pyspark/status.py ---
    @@ -83,6 +85,8 @@ def getJobInfo(self, jobId):
             job = self._jtracker.getJobInfo(jobId)
             if job is not None:
                 return SparkJobInfo(jobId, job.stageIds(), str(job.status()))
    +        else:
    --- End diff --
    
    Yeah, I know...  Hmmm, this is from something else, not sure how it got mixed in.  Let me remove it.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #12790: [SPARK-15018][PYSPARK][ML] Improve handling of PySpark P...

Posted by MechCoder <gi...@git.apache.org>.

Github user MechCoder commented on the issue:

    https://github.com/apache/spark/pull/12790
  
    Yes, I agree that allowing steps to be an empty Sequence or a list in a Pipeline is non-intuitive but I'm fine with allowing that corner case.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-15018][PYSPARK][ML] Fixed bug causing e...

Posted by yanboliang <gi...@git.apache.org>.

Github user yanboliang commented on the pull request:

    https://github.com/apache/spark/pull/12790#issuecomment-220333799
  
    @BryanCutler Thanks for your clarifying, good point and your proposal makes sense. The only left issue is whether we can figure out a better way to handle ```setParams``` with more than one params but not all arguments are specified by users.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-15018][PYSPARK][ML] Fixed bug causing e...

Posted by holdenk <gi...@git.apache.org>.

Github user holdenk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12790#discussion_r62380680
  
    --- Diff: python/pyspark/ml/pipeline.py ---
    @@ -89,9 +87,9 @@ def setParams(self, stages=None):
             setParams(self, stages=None)
             Sets params for Pipeline.
             """
    -        if stages is None:
    -            stages = []
             kwargs = self.setParams._input_kwargs
    +        if stages is None:
    +            kwargs['stages'] = []
    --- End diff --
    
    Do we need this? Now that we've got `getOrDefault` inside of `getStages` it shouldn't be necessary provided we use the get function.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-15018][PYSPARK][ML] Fixed bug causing e...

Posted by holdenk <gi...@git.apache.org>.

Github user holdenk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12790#discussion_r62393609
  
    --- Diff: python/pyspark/ml/pipeline.py ---
    @@ -89,9 +87,9 @@ def setParams(self, stages=None):
             setParams(self, stages=None)
             Sets params for Pipeline.
             """
    -        if stages is None:
    -            stages = []
             kwargs = self.setParams._input_kwargs
    +        if stages is None:
    +            kwargs['stages'] = []
    --- End diff --
    
    So almost all of our things explode if you call setParams with something set to None - it gets turned into null on  the Java side and that tends to explode. We could avoid copying None params over but we might want to do that more generally than just here (and I'd probably do that in a separate PR since I'm not super sure that is the right thing to do or if we just expect people to not use None unless they really explicitly mean it).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #12790: [SPARK-15018][PYSPARK][ML] Fixed bug causing error if Py...

Posted by yanboliang <gi...@git.apache.org>.

Github user yanboliang commented on the issue:

    https://github.com/apache/spark/pull/12790
  
    @BryanCutler @MechCoder The current fix of removing the default value for the ```stages``` param is OK for me. But we also should discuss the behavior of ```stages=[]``` which is inconsistent between Scala and Python. In the Scala side, if we set ```stages``` with ```Array.empty()``` or ```Array()```, it will throw exception rather than acting as a identity transformer.
    ```
    [Ljava.lang.Object; cannot be cast to [Lorg.apache.spark.ml.PipelineStage;
    java.lang.ClassCastException: [Ljava.lang.Object; cannot be cast to [Lorg.apache.spark.ml.PipelineStage;
    ```
    I think we should also make it consistent between Scala and Python. 
    Meanwhile, would you mind updating the title of the JIRA and PR, as well as the description, to reflect the current fix of this bug? Thanks!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-15018][PYSPARK][ML] Fixed bug causing e...

Posted by BryanCutler <gi...@git.apache.org>.

Github user BryanCutler commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12790#discussion_r62540975
  
    --- Diff: python/pyspark/ml/pipeline.py ---
    @@ -89,9 +87,9 @@ def setParams(self, stages=None):
             setParams(self, stages=None)
             Sets params for Pipeline.
             """
    -        if stages is None:
    -            stages = []
             kwargs = self.setParams._input_kwargs
    +        if stages is None:
    +            kwargs['stages'] = []
    --- End diff --
    
    Yeah, I wouldn't just expect any Param to handle a value of `None`, but since it's in the function signature like that, it looks like it's a valid value.  Also, it seems like previously it was trying to check for this, so it might be good to keep.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #12790: [SPARK-15018][PYSPARK][ML] Fixed bug causing error if Py...

Posted by yanboliang <gi...@git.apache.org>.

Github user yanboliang commented on the issue:

    https://github.com/apache/spark/pull/12790
  
    @MLnick Thanks for clarification. Then I think current fix is OK, since Scala is strongly typed, we can regard ```Array()``` or ```Array.empty``` as wrong input and throw exception. It's not necessary to have same semantics at Python side for this issue. After updating the title of the JIRA and PR, as well as the description, then this can be get in. Thanks!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #12790: [SPARK-15018][PYSPARK][ML] Fixed bug causing error if Py...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/12790
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org