You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by willb <gi...@git.apache.org> on 2014/07/07 23:12:42 UTC
[GitHub] spark pull request: spark-729: predictable closure capture
GitHub user willb opened a pull request:
https://github.com/apache/spark/pull/1322
spark-729: predictable closure capture
SPARK-729 concerns when free variables in closure arguments to transformations are captured. Currently, it is possible for closures to get the environment in which they are serialized (not the environment in which they are created). This PR causes free variables in closure arguments to RDD transformations to be captured at closure creation time by modifying `ClosureCleaner` to serialize and deserialize its argument.
This PR is based on #189 (which is closed) but has fixes to work with some changes in 1.0. In particular, it ensures that the cloned `Broadcast` objects produced by closure capture are registered with `ContextCleaner` so that broadcast variables won't become invalid simply because variable capture (implemented this way) causes strong references to the original broadcast variables to go away.
(See #189 for additional discussion and background.)
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/willb/spark spark-729
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/1322.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #1322
----
commit c24b7c8e92c035cd5b48acf11fb29dc060f68fca
Author: William Benton <wi...@redhat.com>
Date: 2014-07-04T16:26:59Z
Added reference counting for Broadcasts.
commit b10f613b6debe78ec1a1d53dd7f28168f8adb359
Author: William Benton <wi...@redhat.com>
Date: 2014-07-07T17:52:26Z
Added ContextCleaner.withCurrentCleaner
This method allows code that needs access to the currently-active
ContextCleaner to access it via a DynamicVariable.
commit 14e074a59616732d422454c1ce881bc9b29060cd
Author: William Benton <wi...@redhat.com>
Date: 2014-03-18T14:55:57Z
Added tests for variable capture in closures
The two tests added to ClosureCleanerSuite ensure that variable values
are captured at RDD definition time, not at job-execution time.
commit 4e026a9d130c6fc92ce4cd73a68de013ed7aee5d
Author: William Benton <wi...@redhat.com>
Date: 2014-03-20T15:48:17Z
Predictable closure environment capture
The environments of serializable closures are now captured as
part of closure cleaning. Since we already proactively check most
closures for serializability, ClosureCleaner.clean now returns
the result of deserializing the serialized version of the cleaned
closure.
Conflicts:
core/src/main/scala/org/apache/spark/SparkContext.scala
commit 39960620eb9c37f92ade2c35e2d5402cad6dd686
Author: William Benton <wi...@redhat.com>
Date: 2014-03-26T04:45:45Z
Skip proactive closure capture for runJob
There are two possible cases for runJob calls: either they are called
by RDD action methods from inside Spark or they are called from client
code. There's no need to proactively check the closure argument to
runJob for serializability or force variable capture in either case:
1. if they are called by RDD actions, their closure arguments consist
of mapping an already-serializable closure (with an already-frozen
environment) to each element in the RDD;
2. in both cases, the closure is about to execute and thus the benefit
of proactively checking for serializability (or ensuring immediate
variable capture) is nonexistent.
(Note that ensuring capture via serializability on closure arguments to
runJob also causes pyspark accumulators to fail to update.)
Conflicts:
core/src/main/scala/org/apache/spark/SparkContext.scala
commit f4ed7535a1a1af0a518d4b7c9585073026a745c1
Author: William Benton <wi...@redhat.com>
Date: 2014-03-26T16:31:56Z
Split closure-serializability failure tests
This splits the test identifying expected failures due to
closure serializability into three cases.
commit b507dd85b0ea1595ed216ffd8226964ba671676c
Author: William Benton <wi...@redhat.com>
Date: 2014-04-04T21:39:55Z
Fixed style issues in tests
Conflicts:
core/src/test/scala/org/apache/spark/serializer/ProactiveClosureSerializationSuite.scala
commit 5284569120cff51b3d2253e03b435047258798a0
Author: William Benton <wi...@redhat.com>
Date: 2014-04-04T22:15:50Z
Stylistic changes and cleanups
Conflicts:
core/src/main/scala/org/apache/spark/SparkContext.scala
commit d6d49304f543b5435b3062558f80ea61d2e1a757
Author: William Benton <wi...@redhat.com>
Date: 2014-05-02T14:23:49Z
Removed proactive closure serialization from DStream
Conflicts:
streaming/src/main/scala/org/apache/spark/streaming/dstream/DStream.scala
commit c052a63dba60b62325abb7ad541fe3d3f1a6e7d0
Author: William Benton <wi...@redhat.com>
Date: 2014-07-07T20:47:41Z
Support tracking clones of broadcast variables.
----
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
[GitHub] spark pull request: spark-729: predictable closure capture
Posted by mateiz <gi...@git.apache.org>.
Github user mateiz commented on the pull request:
https://github.com/apache/spark/pull/1322#issuecomment-53515447
BTW @willb, if this is not ready, do you mind closing the PR and resending when it is? We'd like to minimize the number of open PRs that aren't actively being reviewed.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request: spark-729: predictable closure capture
Posted by willb <gi...@git.apache.org>.
Github user willb commented on the pull request:
https://github.com/apache/spark/pull/1322#issuecomment-49913436
@mateiz, I think there's a memory blowup somewhere in this patch as it is and am trying to track it down. Coincidentally, it's what I was switching context back to when I saw this comment.
@rxin, can you point me to the broadcast change so I can track it?
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
[GitHub] spark pull request: spark-729: predictable closure capture
Posted by mateiz <gi...@git.apache.org>.
Github user mateiz commented on the pull request:
https://github.com/apache/spark/pull/1322#issuecomment-53948235
Alright, feel free to describe this on the JIRA too if you'd like input.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request: spark-729: predictable closure capture
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/1322#issuecomment-48250514
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16383/
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
[GitHub] spark pull request: spark-729: predictable closure capture
Posted by willb <gi...@git.apache.org>.
Github user willb closed the pull request at:
https://github.com/apache/spark/pull/1322
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request: spark-729: predictable closure capture
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/1322#issuecomment-49913556
QA tests have started for PR 1322. This patch DID NOT merge cleanly! <br>View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17050/consoleFull
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
[GitHub] spark pull request: spark-729: predictable closure capture
Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on the pull request:
https://github.com/apache/spark/pull/1322#issuecomment-49925954
Matei was referring to https://github.com/apache/spark/pull/1498
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
[GitHub] spark pull request: spark-729: predictable closure capture
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/1322#issuecomment-48250512
Merged build finished.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
[GitHub] spark pull request: spark-729: predictable closure capture
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/1322#issuecomment-48242207
Merged build started.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
[GitHub] spark pull request: spark-729: predictable closure capture
Posted by mateiz <gi...@git.apache.org>.
Github user mateiz commented on the pull request:
https://github.com/apache/spark/pull/1322#issuecomment-49949418
Alright, just ping me (with @mateiz) when you think it's ready.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
[GitHub] spark pull request: spark-729: predictable closure capture
Posted by willb <gi...@git.apache.org>.
Github user willb commented on the pull request:
https://github.com/apache/spark/pull/1322#issuecomment-53903366
@mateiz sure; I've tracked down the problem but am a bit stumped by how to fix it. I'll reopen when I have a solution.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request: spark-729: predictable closure capture
Posted by mateiz <gi...@git.apache.org>.
Github user mateiz commented on the pull request:
https://github.com/apache/spark/pull/1322#issuecomment-49913042
Jenkins, add to whitelist and test this please
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
[GitHub] spark pull request: spark-729: predictable closure capture
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/1322#issuecomment-48242186
Merged build triggered.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
[GitHub] spark pull request: spark-729: predictable closure capture
Posted by willb <gi...@git.apache.org>.
Github user willb commented on the pull request:
https://github.com/apache/spark/pull/1322#issuecomment-49926246
Thanks, @rxin
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
[GitHub] spark pull request: spark-729: predictable closure capture
Posted by mateiz <gi...@git.apache.org>.
Github user mateiz commented on the pull request:
https://github.com/apache/spark/pull/1322#issuecomment-49913101
@rxin you may want to look at this with your broadcast change
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---