Posted to reviews@spark.apache.org by evanyc15 <gi...@git.apache.org> on 2015/12/11 23:45:38 UTC
[GitHub] spark pull request: [SPARK-10931][PYSPARK][ML] PySpark ML Models s...
GitHub user evanyc15 opened a pull request:
https://github.com/apache/spark/pull/10270
[SPARK-10931][PYSPARK][ML] PySpark ML Models should contain Param values
PySpark spark.ml Models are generally wrappers around Java objects and do not even contain Param values. This JIRA is for copying the Param values from the Estimator to the model.
This can likely be solved by modifying Estimator.fit to copy Param values, but should also include proper unit tests.
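The idea of having `fit` copy Param values onto the returned model can be sketched roughly as below. This is only an illustration under stated assumptions: `ToyParams`, `ToyEstimator`, `ToyModel`, `set_param`, and `get_param` are hypothetical stand-ins written in plain Python, not the actual pyspark.ml classes or the code in this pull request.

```python
class ToyParams:
    """Toy stand-in for a Params container; NOT the pyspark.ml class."""

    def __init__(self):
        self._param_values = {}

    def set_param(self, name, value):
        self._param_values[name] = value
        return self

    def get_param(self, name):
        return self._param_values[name]


class ToyModel(ToyParams):
    """Result of fitting; starts with no param values of its own."""


class ToyEstimator(ToyParams):
    def _fit(self, dataset):
        # A real estimator would train here; the toy returns an empty model.
        return ToyModel()

    def fit(self, dataset):
        model = self._fit(dataset)
        # Copy every param value set on the estimator onto the fitted model.
        model._param_values.update(self._param_values)
        return model
```

With this shape, `ToyEstimator().set_param("maxIter", 5).fit([])` returns a model whose `get_param("maxIter")` is `5`, and the model keeps its own dictionary rather than sharing the estimator's.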
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/evanyc15/spark SPARK-10931-pyspark-mllib
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/10270.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #10270
----
commit 53062d1edc08bf89b7cdb46969c182aa0f26dbe4
Author: Evan Chen <ch...@us.ibm.com>
Date: 2015-11-19T03:54:57Z
Copied parameters over from Estimator to Transformer
commit f0b124a1f67037f854d1e7891091ba4d1cdcecc8
Author: Evan Chen <ch...@us.ibm.com>
Date: 2015-11-24T00:44:53Z
Estimator UID is being copied correctly to the Transformer model objects and params now, working on Doctests
commit 1c5a791775f7f078b3a488c5ea88beed29c2a8d7
Author: Evan Chen <ch...@us.ibm.com>
Date: 2015-11-25T00:16:32Z
Changed the way parameters are copied from the Estimator to Transformer
commit 332cc670b61c5bd19cb5cea705a307440fc92868
Author: Evan Chen <ch...@us.ibm.com>
Date: 2015-12-01T22:51:24Z
Checkpoint, switching back to inheritance method
commit 07fbbfd91692ecb61b0e8659ee296dfaf3150f13
Author: Evan Chen <ch...@us.ibm.com>
Date: 2015-12-02T00:54:41Z
Working on DocTests
commit d86e1dfb33aadfae3a151edf0ceaa6593cfa074e
Author: Evan Chen <ch...@us.ibm.com>
Date: 2015-12-03T02:07:05Z
Implemented Doctests for Recommendation, Clustering, Classification (except RandomForestClassifier), Evaluation, Tuning, Regression (except RandomRegression)
commit a5902cfc6622eb4c6c5d83a489f6693b08f04518
Author: Evan Chen <ch...@us.ibm.com>
Date: 2015-12-04T23:20:42Z
Ready for Code Review
commit 24dd45a30b75c9b7e33edf37993b2277f5cbe606
Author: Evan Chen <ch...@us.ibm.com>
Date: 2015-12-11T01:35:40Z
Code Review changeset #1
----
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request: [SPARK-10931][PYSPARK][ML] PySpark ML Models s...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/10270#issuecomment-206541387
Build finished. Test FAILed.
[GitHub] spark pull request: [SPARK-10931][PYSPARK][ML] PySpark ML Models s...
Posted by evanyc15 <gi...@git.apache.org>.
Github user evanyc15 commented on the pull request:
https://github.com/apache/spark/pull/10270#issuecomment-164570713
Hey all,
I've resolved the merge conflicts.
Thanks,
[GitHub] spark pull request: [SPARK-10931][PYSPARK][ML] PySpark ML Models s...
Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on the pull request:
https://github.com/apache/spark/pull/10270#issuecomment-205972464
add to whitelist
[GitHub] spark pull request #10270: [SPARK-10931][PYSPARK][ML] PySpark ML Models shou...
Posted by evanyc15 <gi...@git.apache.org>.
Github user evanyc15 closed the pull request at:
https://github.com/apache/spark/pull/10270
[GitHub] spark pull request: [SPARK-10931][PYSPARK][ML] PySpark ML Models s...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/10270#issuecomment-206539750
**[Test build #55147 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/55147/consoleFull)** for PR 10270 at commit [`b4890cb`](https://github.com/apache/spark/commit/b4890cb0980688f006662895760d422ba4f619fb).
[GitHub] spark pull request: [SPARK-10931][PYSPARK][ML] PySpark ML Models s...
Posted by evanyc15 <gi...@git.apache.org>.
Github user evanyc15 commented on the pull request:
https://github.com/apache/spark/pull/10270#issuecomment-217942891
Jenkins, test this please.
[GitHub] spark pull request: [SPARK-10931][PYSPARK][ML] PySpark ML Models s...
Posted by sethah <gi...@git.apache.org>.
Github user sethah commented on the pull request:
https://github.com/apache/spark/pull/10270#issuecomment-175302328
This approach only copies shared params like `HasFeaturesCol` through class inheritance. Parameters defined in the actual model itself are not copied, so this patch won't mimic the Scala side exactly. For example, the `NaiveBayes` smoothing parameter is not copied to the Naive Bayes model. It seems like we should copy all params from the estimator to the model, not just shared ones. Doing so likely requires manually copying those parameters through the model constructor or a similar approach. The other approach, proposed by @jkbradley in [the JIRA](https://issues.apache.org/jira/browse/SPARK-10931), is to override `getattr`, but it has the problem of not generating docs.
I like the latter approach but the lack of doc generation may be a deal breaker. cc @holdenk
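The two approaches contrasted in this comment can be sketched in plain Python as below. Everything here is hypothetical: `ToyHasFeaturesCol`, `ToyNaiveBayes`, `InheritingModel`, and `DelegatingModel` are toy stand-ins, not pyspark.ml classes, and the sketch assumes the `getattr` override mentioned above refers to Python's `__getattr__` hook.

```python
class ToyHasFeaturesCol:
    """Toy shared-param mixin (hypothetical, not pyspark.ml.param.shared)."""

    def getFeaturesCol(self):
        """Gets the value of featuresCol."""
        return self._values["featuresCol"]


class ToyNaiveBayes(ToyHasFeaturesCol):
    def __init__(self, featuresCol="features", smoothing=1.0):
        self._values = {"featuresCol": featuresCol, "smoothing": smoothing}

    def getSmoothing(self):
        """Gets the value of smoothing (declared on the estimator only)."""
        return self._values["smoothing"]


class InheritingModel(ToyHasFeaturesCol):
    """Approach 1: the model only inherits the shared mixins."""

    def __init__(self, estimator):
        self._values = {"featuresCol": estimator.getFeaturesCol()}


class DelegatingModel:
    """Approach 2: unknown attributes are forwarded to the parent estimator."""

    def __init__(self, estimator):
        self._parent = estimator

    def __getattr__(self, name):
        # Only called when normal lookup fails. Guard against recursion if
        # _parent has not been set yet.
        if name == "_parent":
            raise AttributeError(name)
        return getattr(self._parent, name)
```

In this sketch `InheritingModel` has `getFeaturesCol` but no `getSmoothing`, which mirrors the missing estimator-specific param. `DelegatingModel` instances answer `getSmoothing()`, yet the class itself has no such attribute, so a doc tool that inspects the class would find nothing to document.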
[GitHub] spark issue #10270: [SPARK-10931][PYSPARK][ML] PySpark ML Models should cont...
Posted by evanyc15 <gi...@git.apache.org>.
Github user evanyc15 commented on the issue:
https://github.com/apache/spark/pull/10270
Updated branch to the newest Spark changes
[GitHub] spark pull request: [SPARK-10931][PYSPARK][ML] PySpark ML Models s...
Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on the pull request:
https://github.com/apache/spark/pull/10270#issuecomment-215519160
OK, thanks!
[GitHub] spark pull request: [SPARK-10931][PYSPARK][ML] PySpark ML Models s...
Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on the pull request:
https://github.com/apache/spark/pull/10270#issuecomment-212652093
It will be great to get this into 2.0. I still haven't found time to test solutions which avoid copying a lot of code, but I hope to soon.
[GitHub] spark issue #10270: [SPARK-10931][PYSPARK][ML] PySpark ML Models should cont...
Posted by evanyc15 <gi...@git.apache.org>.
Github user evanyc15 commented on the issue:
https://github.com/apache/spark/pull/10270
Yeah, I'll take a look. Thanks
[GitHub] spark pull request: [SPARK-10931][PYSPARK][ML] PySpark ML Models s...
Posted by holdenk <gi...@git.apache.org>.
Github user holdenk commented on the pull request:
https://github.com/apache/spark/pull/10270#issuecomment-164070628
Just a heads up that this has merge conflicts that need to be resolved so Jenkins can run the tests on it.
[GitHub] spark pull request: [SPARK-10931][PYSPARK][ML] PySpark ML Models s...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/10270#issuecomment-206541367
**[Test build #55147 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/55147/consoleFull)** for PR 10270 at commit [`b4890cb`](https://github.com/apache/spark/commit/b4890cb0980688f006662895760d422ba4f619fb).
* This patch **fails R style tests**.
* This patch **does not merge cleanly**.
* This patch adds no public classes.
[GitHub] spark issue #10270: [SPARK-10931][PYSPARK][ML] PySpark ML Models should cont...
Posted by evanyc15 <gi...@git.apache.org>.
Github user evanyc15 commented on the issue:
https://github.com/apache/spark/pull/10270
New Pull Request #14653 created for this.
[GitHub] spark pull request: [SPARK-10931][PYSPARK][ML] PySpark ML Models s...
Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on the pull request:
https://github.com/apache/spark/pull/10270#issuecomment-214882026
@evanyc15 Let me know if you'd like help getting this PR merged. I know you waited a long time. Thanks!
[GitHub] spark pull request: [SPARK-10931][PYSPARK][ML] PySpark ML Models s...
Posted by evanyc15 <gi...@git.apache.org>.
Github user evanyc15 commented on the pull request:
https://github.com/apache/spark/pull/10270#issuecomment-218025352
CC @MLnick. Can you please review my code?
Thank you
[GitHub] spark pull request: [SPARK-10931][PYSPARK][ML] PySpark ML Models s...
Posted by evanyc15 <gi...@git.apache.org>.
Github user evanyc15 commented on the pull request:
https://github.com/apache/spark/pull/10270#issuecomment-217027553
Hey Joseph,
Just pushed in the new changes.
Thank you
[GitHub] spark pull request: [SPARK-10931][PYSPARK][ML] PySpark ML Models s...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/10270#issuecomment-164070046
Can one of the admins verify this patch?
[GitHub] spark pull request: [SPARK-10931][PYSPARK][ML] PySpark ML Models s...
Posted by MLnick <gi...@git.apache.org>.
Github user MLnick commented on the pull request:
https://github.com/apache/spark/pull/10270#issuecomment-218984076
@evanyc15 this won't make 2.0, so I'll take a look after the 2.0 release. By the way, you need to rebase onto master in order to avoid pulling all of those unrelated commits into the PR.
[GitHub] spark pull request: [SPARK-10931][PYSPARK][ML] PySpark ML Models s...
Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on the pull request:
https://github.com/apache/spark/pull/10270#issuecomment-206538968
Jenkins test this please
[GitHub] spark pull request: [SPARK-10931][PYSPARK][ML] PySpark ML Models s...
Posted by evanyc15 <gi...@git.apache.org>.
Github user evanyc15 commented on the pull request:
https://github.com/apache/spark/pull/10270#issuecomment-215257517
Hey Joseph,
I'll work on getting the merge conflicts resolved.
Thank you
[GitHub] spark issue #10270: [SPARK-10931][PYSPARK][ML] PySpark ML Models should cont...
Posted by vanzin <gi...@git.apache.org>.
Github user vanzin commented on the issue:
https://github.com/apache/spark/pull/10270
@evanyc15 the PR looks really messed up now. Can you fix it, or close it and open a new one please? thanks
[GitHub] spark pull request: [SPARK-10931][PYSPARK][ML] PySpark ML Models s...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/10270#issuecomment-206541390
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/55147/
Test FAILed.
[GitHub] spark pull request: [SPARK-10931][PYSPARK][ML] PySpark ML Models s...
Posted by holdenk <gi...@git.apache.org>.
Github user holdenk commented on the pull request:
https://github.com/apache/spark/pull/10270#issuecomment-175306040
@jkbradley is probably the best person to check with; I'm not sure how important the pydoc generation would be for the models. The approach often taken in the Scala code is to make a shared Params class that both the model and the estimator inherit from. That seems like it might offer some nice tradeoffs (we don't have to duplicate the list of params everywhere, custom params can be copied, and doc gen works), but it does add some extra code per model, so if we don't care about the pydoc that much it might not be a good solution.
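A rough sketch of the shared Params class idea described above, assuming a plain-Python setting. The names `_SketchNaiveBayesParams`, `SketchNaiveBayes`, and `SketchNaiveBayesModel` are hypothetical and chosen only for illustration; this is not the pyspark.ml implementation or the code in this pull request.

```python
class _SketchNaiveBayesParams:
    """Hypothetical shared base declaring NaiveBayes-specific params once."""

    def __init__(self):
        self._values = {"smoothing": 1.0}  # default value

    def getSmoothing(self):
        """Gets the value of smoothing or its default value."""
        return self._values["smoothing"]


class SketchNaiveBayesModel(_SketchNaiveBayesParams):
    """Inherits the getter and its docstring, so doc tools can see it."""


class SketchNaiveBayes(_SketchNaiveBayesParams):
    def setSmoothing(self, value):
        self._values["smoothing"] = value
        return self

    def fit(self, dataset):
        # Training is omitted; the sketch only shows the param hand-off.
        model = SketchNaiveBayesModel()
        model._values.update(self._values)
        return model
```

Here the getter is written once in the shared base, the model class exposes it as a real documented method, and the setter stays on the estimator only. The cost is the extra per-algorithm base class that the comment mentions.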
[GitHub] spark pull request: [SPARK-10931][PYSPARK][ML] PySpark ML Models s...
Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on the pull request:
https://github.com/apache/spark/pull/10270#issuecomment-203673687
Sorry for the slow response on this. I definitely would want the doc available. This will depend on the decision made on [https://issues.apache.org/jira/browse/SPARK-14033], so I'll hold off on a detailed review. But in the meantime, I'll play around with possible solutions.
[GitHub] spark pull request: [SPARK-10931][PYSPARK][ML] PySpark ML Models s...
Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on the pull request:
https://github.com/apache/spark/pull/10270#issuecomment-205408470
We decided not to do SPARK-14033, so this can proceed. I'll try a few possibilities, but let me know if you have updates!