You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by BryanCutler <gi...@git.apache.org> on 2016/03/23 01:25:33 UTC

[GitHub] spark pull request: [SPARK-14087][PySpark][ML] [WIP] PySpark JavaM...

GitHub user BryanCutler opened a pull request:

    https://github.com/apache/spark/pull/11906

    [SPARK-14087][PySpark][ML] [WIP] PySpark JavaModel Param ownership error

    When a PySpark model is created after fitting data, its UID is initialized to the parent estimator's value. Before this assignment, any params defined in the model are copied from the object to the class in Params._copy_params() and assigned a different parent UID. This causes PySpark to think the params are not owned by the model and can lead to a ValueError raised from Params._shouldOwn
    
    *** Still thinking of a possible test for this

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/BryanCutler/spark pyspark-init-uid-SPARK-14087

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/11906.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #11906
    
----
commit 276dc77c7c35769c2b65fabc3fb83142d2765fd4
Author: Bryan Cutler <cu...@gmail.com>
Date:   2016-03-23T00:16:19Z

    [SPARK-14087] Allow JavaModel to set an initial UID for when it has a parent estimator

commit 518f2e54bb18455f5f63255bceb7bce607504732
Author: Bryan Cutler <cu...@gmail.com>
Date:   2016-03-23T00:20:21Z

    fix lint error

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14087][PySpark][ML] [WIP] PySpark JavaM...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/11906#issuecomment-205410451
  
    I feel like JavaModel's init method should get a UID if one does not yet exist and pass that it its parent classes' init methods as needed.  Eventually, Python users might construct models by hand and want to specify UIDs, and this setup should support that better.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14087][PySpark][ML] [WIP] PySpark JavaM...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/11906#issuecomment-205409129
  
    Early comment: Could you please add a unit test, and make sure the test fails before your patch?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14087][PySpark][ML] [WIP] PySpark JavaM...

Posted by BryanCutler <gi...@git.apache.org>.
Github user BryanCutler commented on the pull request:

    https://github.com/apache/spark/pull/11906#issuecomment-205554193
  
    Yeah, with your suggestion I could proceed with SPARK-13967 and add the binary toggle to PySpark CountVectorizer.  I'll just need to fix up a few things and change the tests a little.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14087][PySpark][ML] [WIP] PySpark JavaM...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/11906#issuecomment-200465133
  
    **[Test build #53951 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/53951/consoleFull)** for PR 11906 at commit [`efdd882`](https://github.com/apache/spark/commit/efdd882187124d0df9339a72160344f0d91fb59f).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14087][PySpark][ML] [WIP] PySpark JavaM...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/11906#issuecomment-200096268
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14087][PySpark][ML] [WIP] PySpark JavaM...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/11906#issuecomment-205538978
  
    Ohhh, I see.  I didn't read it carefully enough.  I think the real issue is that CountVectorizer needs the Param as well.  We need all Estimators to contain the Model Params so that users can configure the whole Pipeline/Estimator before running fit. I'll create a JIRA for that.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14087][PySpark][ML] [WIP] PySpark JavaM...

Posted by BryanCutler <gi...@git.apache.org>.
Github user BryanCutler commented on the pull request:

    https://github.com/apache/spark/pull/11906#issuecomment-205545296
  
    Hmmm, I'm not sure if that would completely solve this.. The scala version of `CountVectorizerModel` allows to construct it from a "vocab" set, instead of fitting the estimator.  So when I tried that on the Python side, I needed to define an "inputCol" param.  After adding the `HasInputCol` mixin to the model class, this ownership problem happened.
    
    I think there still needs to be some way to override the default `JavaModel` UID before calling the parent `__init__()`


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14087][PySpark][ML] [WIP] PySpark JavaM...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/11906#issuecomment-200092703
  
    **[Test build #53853 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/53853/consoleFull)** for PR 11906 at commit [`518f2e5`](https://github.com/apache/spark/commit/518f2e54bb18455f5f63255bceb7bce607504732).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14087][PySpark][ML] [WIP] PySpark JavaM...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/11906#issuecomment-205575414
  
    Please do, thanks


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14087][PySpark][ML] [WIP] PySpark JavaM...

Posted by BryanCutler <gi...@git.apache.org>.
Github user BryanCutler commented on the pull request:

    https://github.com/apache/spark/pull/11906#issuecomment-205555108
  
    Do you mind if I go ahead and open a JIRA to construct `CountVectorizerModel` from vocab in PySpark?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14087][PySpark][ML] [WIP] PySpark JavaM...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/11906#issuecomment-200465342
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14087][PySpark][ML] [WIP] PySpark JavaM...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/11906#issuecomment-200096271
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/53853/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #11906: [SPARK-14087][PySpark][ML] [WIP] PySpark JavaMode...

Posted by BryanCutler <gi...@git.apache.org>.
Github user BryanCutler closed the pull request at:

    https://github.com/apache/spark/pull/11906


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14087][PySpark][ML] [WIP] PySpark JavaM...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/11906#issuecomment-200465348
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/53951/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14087][PySpark][ML] [WIP] PySpark JavaM...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/11906#issuecomment-200096161
  
    **[Test build #53853 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/53853/consoleFull)** for PR 11906 at commit [`518f2e5`](https://github.com/apache/spark/commit/518f2e54bb18455f5f63255bceb7bce607504732).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14087][PySpark][ML] [WIP] PySpark JavaM...

Posted by BryanCutler <gi...@git.apache.org>.
Github user BryanCutler commented on the pull request:

    https://github.com/apache/spark/pull/11906#issuecomment-205537342
  
    > How did you uncover this issue in the first place? The code you used then should be usable to create a unit test for this.
    
    I tried to explain in the JIRA, but I probably should have done so here also..
    For SPARK-13967, I was adding a binary flag `Param` to the `CountVectorizerModel` and when the model is created from the estimator, this problem comes up and an exception is raised.
    
    As far as I can tell, this is only a problem when a `Param` is defined in the PySpark model only and it is created by an JavaWrapper estimator.  I don't see that happening anywhere currently, so I haven't come up with a unit test.  I could maybe include a temporary version of the `CountVectorizerModel` that I was working on, just for testing purposes until I do a PR for SPARK-13967, would that be ok?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14087][PySpark][ML] [WIP] PySpark JavaM...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/11906#issuecomment-205554581
  
    OK thanks!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14087][PySpark][ML] [WIP] PySpark JavaM...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/11906#issuecomment-200458726
  
    **[Test build #53951 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/53951/consoleFull)** for PR 11906 at commit [`efdd882`](https://github.com/apache/spark/commit/efdd882187124d0df9339a72160344f0d91fb59f).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #11906: [SPARK-14087][PySpark][ML] [WIP] PySpark JavaModel Param...

Posted by BryanCutler <gi...@git.apache.org>.
Github user BryanCutler commented on the issue:

    https://github.com/apache/spark/pull/11906
  
    closing, no longer an issue now that resetUid() is called


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14087][PySpark][ML] [WIP] PySpark JavaM...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/11906#issuecomment-205549810
  
    I see.  It sounds like my suggestion would let us add the binary toggle param.  However, to add a constructor for CountVectorizerModel taking a vocab, we would need to address this issue.  I agree we should support taking a vocab, but can we separate it out as a future task?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org