You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by willb <gi...@git.apache.org> on 2014/07/07 23:12:42 UTC

[GitHub] spark pull request: spark-729: predictable closure capture

GitHub user willb opened a pull request:

    https://github.com/apache/spark/pull/1322

    spark-729:  predictable closure capture

    SPARK-729 concerns when free variables in closure arguments to transformations are captured. Currently, it is possible for closures to get the environment in which they are serialized (not the environment in which they are created).  This PR causes free variables in closure arguments to RDD transformations to be captured at closure creation time by modifying `ClosureCleaner` to serialize and deserialize its argument.
    
    This PR is based on #189 (which is closed) but has fixes to work with some changes in 1.0.  In particular, it ensures that the cloned `Broadcast` objects produced by closure capture are registered with `ContextCleaner` so that broadcast variables won't become invalid simply because variable capture (implemented this way) causes strong references to the original broadcast variables to go away.
    
    (See #189 for additional discussion and background.)

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/willb/spark spark-729

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/1322.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #1322
    
----
commit c24b7c8e92c035cd5b48acf11fb29dc060f68fca
Author: William Benton <wi...@redhat.com>
Date:   2014-07-04T16:26:59Z

    Added reference counting for Broadcasts.

commit b10f613b6debe78ec1a1d53dd7f28168f8adb359
Author: William Benton <wi...@redhat.com>
Date:   2014-07-07T17:52:26Z

    Added ContextCleaner.withCurrentCleaner
    
    This method allows code that needs access to the currently-active
    ContextCleaner to access it via a DynamicVariable.

commit 14e074a59616732d422454c1ce881bc9b29060cd
Author: William Benton <wi...@redhat.com>
Date:   2014-03-18T14:55:57Z

    Added tests for variable capture in closures
    
    The two tests added to ClosureCleanerSuite ensure that variable values
    are captured at RDD definition time, not at job-execution time.

commit 4e026a9d130c6fc92ce4cd73a68de013ed7aee5d
Author: William Benton <wi...@redhat.com>
Date:   2014-03-20T15:48:17Z

    Predictable closure environment capture
    
    The environments of serializable closures are now captured as
    part of closure cleaning. Since we already proactively check most
    closures for serializability, ClosureCleaner.clean now returns
    the result of deserializing the serialized version of the cleaned
    closure.
    
    Conflicts:
    	core/src/main/scala/org/apache/spark/SparkContext.scala

commit 39960620eb9c37f92ade2c35e2d5402cad6dd686
Author: William Benton <wi...@redhat.com>
Date:   2014-03-26T04:45:45Z

    Skip proactive closure capture for runJob
    
    There are two possible cases for runJob calls: either they are called
    by RDD action methods from inside Spark or they are called from client
    code. There's no need to proactively check the closure argument to
    runJob for serializability or force variable capture in either case:
    
    1. if they are called by RDD actions, their closure arguments consist
    of mapping an already-serializable closure (with an already-frozen
    environment) to each element in the RDD;
    
    2. in both cases, the closure is about to execute and thus the benefit
    of proactively checking for serializability (or ensuring immediate
    variable capture) is nonexistent.
    
    (Note that ensuring capture via serializability on closure arguments to
    runJob also causes pyspark accumulators to fail to update.)
    
    Conflicts:
    	core/src/main/scala/org/apache/spark/SparkContext.scala

commit f4ed7535a1a1af0a518d4b7c9585073026a745c1
Author: William Benton <wi...@redhat.com>
Date:   2014-03-26T16:31:56Z

    Split closure-serializability failure tests
    
    This splits the test identifying expected failures due to
    closure serializability into three cases.

commit b507dd85b0ea1595ed216ffd8226964ba671676c
Author: William Benton <wi...@redhat.com>
Date:   2014-04-04T21:39:55Z

    Fixed style issues in tests
    
    Conflicts:
    	core/src/test/scala/org/apache/spark/serializer/ProactiveClosureSerializationSuite.scala

commit 5284569120cff51b3d2253e03b435047258798a0
Author: William Benton <wi...@redhat.com>
Date:   2014-04-04T22:15:50Z

    Stylistic changes and cleanups
    
    Conflicts:
    	core/src/main/scala/org/apache/spark/SparkContext.scala

commit d6d49304f543b5435b3062558f80ea61d2e1a757
Author: William Benton <wi...@redhat.com>
Date:   2014-05-02T14:23:49Z

    Removed proactive closure serialization from DStream
    
    Conflicts:
    	streaming/src/main/scala/org/apache/spark/streaming/dstream/DStream.scala

commit c052a63dba60b62325abb7ad541fe3d3f1a6e7d0
Author: William Benton <wi...@redhat.com>
Date:   2014-07-07T20:47:41Z

    Support tracking clones of broadcast variables.

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: spark-729: predictable closure capture

Posted by mateiz <gi...@git.apache.org>.
Github user mateiz commented on the pull request:

    https://github.com/apache/spark/pull/1322#issuecomment-53515447
  
    BTW @willb, if this is not ready, do you mind closing the PR and resending when it is? We'd like to minimize the number of open PRs that aren't actively being reviewed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: spark-729: predictable closure capture

Posted by willb <gi...@git.apache.org>.
Github user willb commented on the pull request:

    https://github.com/apache/spark/pull/1322#issuecomment-49913436
  
    @mateiz, I think there's a memory blowup somewhere in this patch as it is and am trying to track it down.  Coincidentally, it's what I was switching context back to when I saw this comment.
    
    @rxin, can you point me to the broadcast change so I can track it?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: spark-729: predictable closure capture

Posted by mateiz <gi...@git.apache.org>.
Github user mateiz commented on the pull request:

    https://github.com/apache/spark/pull/1322#issuecomment-53948235
  
    Alright, feel free to describe this on the JIRA too if you'd like input.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: spark-729: predictable closure capture

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/1322#issuecomment-48250514
  
    
    Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16383/


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: spark-729: predictable closure capture

Posted by willb <gi...@git.apache.org>.
Github user willb closed the pull request at:

    https://github.com/apache/spark/pull/1322


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: spark-729: predictable closure capture

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1322#issuecomment-49913556
  
    QA tests have started for PR 1322. This patch DID NOT merge cleanly! <br>View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17050/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: spark-729: predictable closure capture

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on the pull request:

    https://github.com/apache/spark/pull/1322#issuecomment-49925954
  
    Matei was  referring to https://github.com/apache/spark/pull/1498


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: spark-729: predictable closure capture

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/1322#issuecomment-48250512
  
    Merged build finished. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: spark-729: predictable closure capture

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/1322#issuecomment-48242207
  
    Merged build started. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: spark-729: predictable closure capture

Posted by mateiz <gi...@git.apache.org>.
Github user mateiz commented on the pull request:

    https://github.com/apache/spark/pull/1322#issuecomment-49949418
  
    Alright, just ping me (with @mateiz) when you think it's ready.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: spark-729: predictable closure capture

Posted by willb <gi...@git.apache.org>.
Github user willb commented on the pull request:

    https://github.com/apache/spark/pull/1322#issuecomment-53903366
  
    @mateiz sure; I've tracked down the problem but am a bit stumped by how to fix it.  I'll reopen when I have a solution.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: spark-729: predictable closure capture

Posted by mateiz <gi...@git.apache.org>.
Github user mateiz commented on the pull request:

    https://github.com/apache/spark/pull/1322#issuecomment-49913042
  
    Jenkins, add to whitelist and test this please


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: spark-729: predictable closure capture

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/1322#issuecomment-48242186
  
     Merged build triggered. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: spark-729: predictable closure capture

Posted by willb <gi...@git.apache.org>.
Github user willb commented on the pull request:

    https://github.com/apache/spark/pull/1322#issuecomment-49926246
  
    Thanks, @rxin


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: spark-729: predictable closure capture

Posted by mateiz <gi...@git.apache.org>.
Github user mateiz commented on the pull request:

    https://github.com/apache/spark/pull/1322#issuecomment-49913101
  
    @rxin you may want to look at this with your broadcast change


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---