Posted to reviews@spark.apache.org by willb <gi...@git.apache.org> on 2014/03/20 22:29:45 UTC

[GitHub] spark pull request: SPARK-729: Closures not always serialized at c...

GitHub user willb opened a pull request:

    https://github.com/apache/spark/pull/189

    SPARK-729:  Closures not always serialized at capture time

    [SPARK-729](https://spark-project.atlassian.net/browse/SPARK-729) concerns when free variables in closure arguments to transformations are captured.  Currently, it is possible for closures to get the environment in which they are serialized (not the environment in which they are created).  There are a few possible approaches to solving this problem and this PR will discuss some of them.  The approach I took has the advantage of being simple, obviously correct, and minimally-invasive, but it preserves something that has been bothering me about Spark's closure handling, so I'd like to discuss an alternative and get some feedback on whether or not it is worth pursuing.
    
    ## What I did
    
    The basic approach I took depends on the work I did for #143, and so this PR is based atop that.  Specifically: #143 modifies `ClosureCleaner.clean` to preemptively determine whether or not closures are serializable immediately upon closure cleaning (rather than waiting for a job involving that closure to be scheduled).  Thus non-serializable closure exceptions will be triggered at the line defining the closure rather than where the closure is used.
    
    Since the easiest way to determine whether or not a closure is serializable is to attempt to serialize it, the code in #143 creates a serialized closure as part of `ClosureCleaner.clean`.  `clean` currently modifies its argument in place, and the method in `SparkContext` that wraps it returns a value (a reference to the modified-in-place argument).  This branch modifies `ClosureCleaner.clean` so that it returns a value:  if it is cleaning a serializable closure, it returns the result of deserializing its serialized argument, and therefore returns a closure with an environment captured at cleaning time.  `SparkContext.clean` then returns the result of `ClosureCleaner.clean`, rather than a reference to its modified-in-place argument.
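
    To make the capture step concrete, here is a minimal, self-contained sketch of the serialize-then-deserialize trick.  The helper name `captureNow` is illustrative only; the actual branch does this inside `ClosureCleaner.clean` via Spark's closure serializer, while this sketch uses plain Java serialization:

    ```scala
    import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}

    object CaptureSketch {
      def captureNow[T <: AnyRef](func: T): T = {
        val buffer = new ByteArrayOutputStream()
        val out = new ObjectOutputStream(buffer)
        out.writeObject(func)  // throws NotSerializableException for closures that can't be shipped
        out.close()
        val in = new ObjectInputStream(new ByteArrayInputStream(buffer.toByteArray))
        // the deserialized copy has its environment frozen as of this call
        val copy = in.readObject().asInstanceOf[T]
        in.close()
        copy
      }
    }
    ```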
    
    I've added tests for this behavior (777a1bc).  The pull request as it stands, given the changes in #143, is nearly trivial.  There is some overhead from deserializing the closure, but it is minimal and the benefit of obvious operational correctness (vs. a more sophisticated but harder-to-validate transformation in `ClosureCleaner`) seems pretty important.  I think this is a fine way to solve this problem, but it's not perfect.
    
    ## What we might want to do
    
    The thing that has been bothering me about Spark's handling of closures is that it seems like we should be able to statically ensure that cleaning and serialization happen exactly once for a given closure.  If we serialize a closure in order to determine whether or not it is serializable, we should be able to hang on to the generated byte buffer and use it instead of re-serializing the closure later.  By replacing closures with instances of a sum type that encodes whether or not a closure has been cleaned or serialized, we could handle clean, to-be-cleaned, and serialized closures separately with case matches.  Here's a somewhat-concrete sketch (taken from my git stash) of what this might look like:
    
    ```scala
    package org.apache.spark.util

    import java.nio.ByteBuffer

    // A closure is in exactly one of three states: raw, cleaned, or serialized.
    sealed abstract class ClosureBox[T] { def func: T }
    final case class RawClosure[T](func: T) extends ClosureBox[T]
    final case class CleanedClosure[T](func: T) extends ClosureBox[T]
    final case class SerializedClosure[T](func: T, bytebuf: ByteBuffer) extends ClosureBox[T]

    object ClosureBoxImplicits {
      // Lift a bare function into the "raw" state.
      implicit def closureBoxFromFunc[T <: AnyRef](fun: T): RawClosure[T] = new RawClosure[T](fun)
    }
    ```
    
    With these types declared, we'd be able to change `ClosureCleaner.clean` to take a `ClosureBox[T=>U]` (possibly generated by implicit conversion) and return a `ClosureBox[T=>U]` (either a `CleanedClosure[T=>U]` or a `SerializedClosure[T=>U]`, depending on whether or not serializability-checking was enabled) instead of a `T=>U`.  A case match could thus short-circuit cleaning or serializing closures that had already been cleaned or serialized (both in `ClosureCleaner` and in the closure serializer).  Cleaned-and-serialized closures would be represented by a boxed tuple of the original closure and a serialized copy (complete with an environment quiesced at transformation time).  Additional implicit conversions could convert from `ClosureBox` instances to the underlying function type where appropriate.  Tracking this sort of state in the type system seems like the right thing to do to me.
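
    A sketch of how that dispatch might look, reusing the `ClosureBox` types above (the object and helper names here are hypothetical, not part of any patch):

    ```scala
    object ClosureBoxCleaner {
      def clean[T <: AnyRef](boxed: ClosureBox[T]): ClosureBox[T] = boxed match {
        case RawClosure(f)               => cleanAndMaybeSerialize(f)
        case c @ CleanedClosure(_)       => c  // already cleaned: short-circuit
        case s @ SerializedClosure(_, _) => s  // already serialized: reuse the bytes
      }

      // Stand-in for the real cleaning logic; with serializability checking
      // enabled this would return a SerializedClosure instead.
      private def cleanAndMaybeSerialize[T <: AnyRef](f: T): ClosureBox[T] =
        CleanedClosure(f)
    }
    ```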
    
    ### Why we might not want to do that
    
    _It's pretty invasive._  Every function type used by every `RDD` subclass would have to change to reflect that it expects a `ClosureBox[T=>U]` instead of a `T=>U`.  This obscures what's going on and is more than a little ugly.  Although I really like the idea of using the type system to enforce the clean-or-serialize-once discipline, it might not be worth adding another layer of types (even if we could hide some of the extra boilerplate with judicious application of implicit conversions).
    
    _It statically guarantees a property whose absence is unlikely to cause any serious problems as it stands._  It appears that all closures are currently dynamically cleaned once and it's not obvious that repeated closure-cleaning is likely to be a problem in the future.  Furthermore, serializing closures is relatively cheap, so doing it once to check for serialization and once again to actually ship them across the wire doesn't seem like a big deal.
    
    Taken together, these seem like a high price to pay for statically guaranteeing that closures are operated upon only once.
    
    ## Other possibilities
    
    I felt like the serialize-and-deserialize approach was best due to its obvious simplicity.  But it would be possible to do a more sophisticated transformation within `ClosureCleaner.clean`.  It might also be possible for `clean` to modify its argument in such a way that whether or not a given closure had been cleaned would be apparent upon inspection; this would buy us some of the operational benefits of the `ClosureBox` approach but not the static cleanliness.
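
    One dynamic variation on that idea, tracking cleaned closures by identity in a side table rather than modifying them, might look like the following (purely illustrative; nothing like this is in the patch):

    ```scala
    import java.util.{Collections, IdentityHashMap}

    object CleanedRegistry {
      // Identity-keyed so that two structurally equal closures stay distinct.
      // (A real implementation would also need synchronization.)
      private val seen =
        Collections.newSetFromMap(new IdentityHashMap[AnyRef, java.lang.Boolean]())

      def markCleaned(f: AnyRef): Unit = { seen.add(f) }
      def isCleaned(f: AnyRef): Boolean = seen.contains(f)
    }
    ```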
    
    I'm interested in any feedback or discussion on whether or not the problems with the type-based approach indeed outweigh its advantages, as well as on other approaches to this issue and to closure handling in general.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/willb/spark spark-729

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/189.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #189
    
----
commit 21b4b063372bfbdd289ff770d3d38cb4453e7ca6
Author: William Benton <wi...@redhat.com>
Date:   2014-03-13T02:56:32Z

    Test cases for SPARK-897.
    
    Tests to make sure that passing an unserializable closure
    to a transformation fails fast.

commit d8df3dbbed3e1c9e16a2c2002c04c2d2e6a0b2e2
Author: William Benton <wi...@redhat.com>
Date:   2014-03-13T19:40:42Z

    Adds proactive closure-serializability checking
    
    ClosureCleaner.clean now checks to ensure that its closure argument
    is serializable by default and throws a SparkException with the
    underlying NotSerializableException in the detail message otherwise.
    As a result, transformation invocations with unserializable closures
    will fail at their call sites rather than when they actually execute.
    
    ClosureCleaner.clean now takes a second boolean argument; pass false
    to disable serializability-checking behavior at call sites where this
    behavior isn't desired.

commit d5947b33f8e22ee498473abd59795d4f15a7b198
Author: William Benton <wi...@redhat.com>
Date:   2014-03-14T16:40:56Z

    Ensure assertions in Graph.apply are asserted.
    
    The Graph.apply test in GraphSuite had some assertions in a closure in
    a graph transformation. This caused two problems:
    
    1. because assert() was called, test classes were reachable from the
    closures, which made them not serializable, and
    
    2. (more importantly) these assertions never actually executed, since
    they occurred within a lazy map()
    
    This commit simply changes the Graph.apply test to collect the graph
    triplets so it can assert about each triplet from a map method.

commit 4ecf84100e22224aed204c3c4251c6ab20ff8bf6
Author: William Benton <wi...@redhat.com>
Date:   2014-03-14T17:33:33Z

    Make proactive serializability checking optional.
    
    SparkContext.clean uses ClosureCleaner's proactive serializability
    checking by default.  This commit adds an overloaded clean method
    to SparkContext that allows clients to specify that serializability
    checking should not occur as part of closure cleaning.

commit d6e8dd6469ef24ee0631d7c5bb424498715c59f5
Author: William Benton <wi...@redhat.com>
Date:   2014-03-14T17:34:42Z

    Don't check serializability of DStream transforms.
    
    Since the DStream is reachable from within these closures, they aren't
    checkable by the straightforward technique of passing them to the
    closure serializer.

commit 9c5cf90996bd7f9b8afbfa05e539cf68e2e48ee4
Author: William Benton <wi...@redhat.com>
Date:   2014-03-18T01:12:46Z

    Remove closure-serializability test in DAGScheduler
    
    Since all closures that will be marshaled via the DAGScheduler will
    already be checked to ensure serializability at transformation or
    action invocation by ClosureCleaner.clean, there is no need to check
    these for serializability again when jobs are submitted.

commit b276f0ed2fc619edbb3a48617bed8de8f84beb6b
Author: William Benton <wi...@redhat.com>
Date:   2014-03-18T14:55:57Z

    Added tests for variable capture in closures
    
    The two tests added to ClosureCleanerSuite ensure that variable values
    are captured at RDD definition time, not at job-execution time.

commit c73efa8ae9a83b4d62ae37a763a6454d51d3d359
Author: William Benton <wi...@redhat.com>
Date:   2014-03-20T15:48:17Z

    Predictable closure environment capture
    
    The environments of serializable closures are now captured as
    part of closure cleaning. Since we already proactively check most
    closures for serializability, ClosureCleaner.clean now returns
    the result of deserializing the serialized version of the cleaned
    closure.

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-729: Closures not always serialized at c...

Posted by mateiz <gi...@git.apache.org>.
Github user mateiz commented on the pull request:

    https://github.com/apache/spark/pull/189#issuecomment-39489148
  
    Jenkins, test this please



[GitHub] spark pull request: SPARK-729: Closures not always serialized at c...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/189#issuecomment-38226711
  
     Merged build triggered.



[GitHub] spark pull request: SPARK-729: Closures not always serialized at c...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/189#issuecomment-38329369
  
    One or more automated tests failed
    Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13328/



[GitHub] spark pull request: SPARK-729: Closures not always serialized at c...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/189#issuecomment-38726848
  
     Merged build triggered.



[GitHub] spark pull request: SPARK-729: Closures not always serialized at c...

Posted by willb <gi...@git.apache.org>.
Github user willb commented on the pull request:

    https://github.com/apache/spark/pull/189#issuecomment-38839409
  
    I have rebased this branch to remove the commit that took out the serializability check in `DAGScheduler`.



[GitHub] spark pull request: SPARK-729: Closures not always serialized at c...

Posted by mateiz <gi...@git.apache.org>.
Github user mateiz commented on the pull request:

    https://github.com/apache/spark/pull/189#issuecomment-39489309
  
    Hey Will, sorry for the really late reply, still interested in having this go in, at least with the simple version. I've just been swamped with other reviews.



[GitHub] spark pull request: SPARK-729: Closures not always serialized at c...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/189#issuecomment-38329365
  
    Merged build finished.



[GitHub] spark pull request: SPARK-729: Closures not always serialized at c...

Posted by willb <gi...@git.apache.org>.
Github user willb commented on the pull request:

    https://github.com/apache/spark/pull/189#issuecomment-39610969
  
    (e.g. here:  https://github.com/willb/spark/commit/12c63a7e03bce359fd7eb7faf0a054bd32f85824#diff-f949ef08cc8a2b36861af3beb4309a88R161)



[GitHub] spark pull request: SPARK-729: Closures not always serialized at c...

Posted by willb <gi...@git.apache.org>.
Github user willb commented on the pull request:

    https://github.com/apache/spark/pull/189#issuecomment-38340085
  
    The Python accumulators still work on master and with the changes from #143 (which serializes the closure at closure cleaning but doesn't use the serialized value for anything).
    
    As far as I can tell, accumulators in general still work after this change; at least, all of the non-pyspark tests that use accumulators still work after this change (I have been re-running `JavaAPISuite`, `AccumulatorSuite`, `DistributedSuite`, `ReplSuite`, and `AsyncRDDActionsSuite` individually while experimenting with this).



[GitHub] spark pull request: SPARK-729: Closures not always serialized at c...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/189#issuecomment-38835891
  
     Merged build triggered. One or more automated tests failed



[GitHub] spark pull request: SPARK-729: Closures not always serialized at c...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/189



[GitHub] spark pull request: SPARK-729: Closures not always serialized at c...

Posted by mateiz <gi...@git.apache.org>.
Github user mateiz commented on a diff in the pull request:

    https://github.com/apache/spark/pull/189#discussion_r11314584
  
    --- Diff: core/src/main/scala/org/apache/spark/SparkContext.scala ---
    @@ -1031,9 +1033,16 @@ class SparkContext(
        * Clean a closure to make it ready to serialized and send to tasks
        * (removes unreferenced variables in $outer's, updates REPL variables)
        */
    -  private[spark] def clean[F <: AnyRef](f: F): F = {
    -    ClosureCleaner.clean(f)
    -    f
    +  private[spark] def clean[F <: AnyRef : ClassTag](f: F): F = {
    +    clean(f, true)
    +  }
    +
    +  /**
    +   * Clean a closure to make it ready to serialized and send to tasks
    +   * (removes unreferenced variables in $outer's, updates REPL variables)
    +   */
    +  private[spark] def clean[F <: AnyRef : ClassTag](f: F, checkSerializable: Boolean): F = {
    +    ClosureCleaner.clean(f, checkSerializable)
       }
    --- End diff --
    
    Instead of adding two methods, just add a default argument to the first one (`captureNow: Boolean = true`).



[GitHub] spark pull request: SPARK-729: Closures not always serialized at c...

Posted by mateiz <gi...@git.apache.org>.
Github user mateiz commented on the pull request:

    https://github.com/apache/spark/pull/189#issuecomment-39612630
  
    Oh never mind, I see that `ensureSerializable` *is* actually cloning it. I had misunderstood (partly due to its name) and thought it was returning the original function. In that case the PR is doing the right thing, but perhaps we can improve the explanation and names.



[GitHub] spark pull request: SPARK-729: Closures not always serialized at c...

Posted by willb <gi...@git.apache.org>.
Github user willb commented on the pull request:

    https://github.com/apache/spark/pull/189#issuecomment-39617193
  
    OK, I've made the changes and will push updates after re-running tests locally.  (I'll also follow up on the Python accumulators.)



[GitHub] spark pull request: SPARK-729: Closures not always serialized at c...

Posted by mateiz <gi...@git.apache.org>.
Github user mateiz commented on a diff in the pull request:

    https://github.com/apache/spark/pull/189#discussion_r11314441
  
    --- Diff: core/src/main/scala/org/apache/spark/SparkContext.scala ---
    @@ -896,7 +896,9 @@ class SparkContext(
           require(p >= 0 && p < rdd.partitions.size, s"Invalid partition requested: $p")
         }
         val callSite = getCallSite
    -    val cleanedFunc = clean(func)
    +    // There's no need to check this function for serializability,
    +    // since it will be run right away.
    +    val cleanedFunc = clean(func, false)
    --- End diff --
    
    IMO you might as well clone it here too, because it could be modified in other threads while the job is running.



[GitHub] spark pull request: SPARK-729: Closures not always serialized at c...

Posted by willb <gi...@git.apache.org>.
Github user willb commented on the pull request:

    https://github.com/apache/spark/pull/189#issuecomment-39812228
  
    I'm still tracking down exactly where this problem is coming from, but here's a little more detail on what's going wrong with Python accumulators when the closure passed to `runJob` is captured.  In this case, `CompletionEvent` objects have  `accumUpdates` fields that map from accumulator IDs to empty lists.  As far as I can tell, the same byte arrays for accumulator updates are getting passed over the wire in both cases.



[GitHub] spark pull request: SPARK-729: Closures not always serialized at c...

Posted by mateiz <gi...@git.apache.org>.
Github user mateiz commented on the pull request:

    https://github.com/apache/spark/pull/189#issuecomment-39613507
  
    It's fine to add a commit. Our merge script will rebase your branch and squash it into a single patch at the end.
    
    For the Python accumulators, it would be good to understand why it happened. We can leave serializing out of runJob but perhaps we can also modify the Python accumulator class to fix it.



[GitHub] spark pull request: SPARK-729: Closures not always serialized at c...

Posted by willb <gi...@git.apache.org>.
Github user willb commented on the pull request:

    https://github.com/apache/spark/pull/189#issuecomment-39144319
  
    It looks like this Travis error is the same one others have seen on the dev list -- that is, the hive test is timing out.



[GitHub] spark pull request: SPARK-729: Closures not always serialized at c...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/189#issuecomment-38231652
  
    Merged build finished.



[GitHub] spark pull request: SPARK-729: Closures not always serialized at c...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/189#issuecomment-39626034
  
    Merged build started. 



[GitHub] spark pull request: SPARK-729: Closures not always serialized at c...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/189#issuecomment-39489560
  
    Merged build started. 



[GitHub] spark pull request: SPARK-729: Closures not always serialized at c...

Posted by willb <gi...@git.apache.org>.
Github user willb commented on the pull request:

    https://github.com/apache/spark/pull/189#issuecomment-39490073
  
    Thanks for taking another look, Matei!  I know there's a lot of stuff to get in before the merge window closes and appreciate the update.



[GitHub] spark pull request: SPARK-729: Closures not always serialized at c...

Posted by willb <gi...@git.apache.org>.
Github user willb commented on the pull request:

    https://github.com/apache/spark/pull/189#issuecomment-39613329
  
    @mateiz, these naming and stylistic suggestions make sense; thanks!
    
    Cloning the closure in `runJob` is what caused Python accumulators to stop working.  I will have to check over my notes to see if I can remember why.



[GitHub] spark pull request: SPARK-729: Closures not always serialized at c...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/189#issuecomment-39627122
  
    Merged build finished. All automated tests passed.



[GitHub] spark pull request: SPARK-729: Closures not always serialized at c...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/189#issuecomment-39496207
  
    All automated tests passed.
    Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13730/



[GitHub] spark pull request: SPARK-729: Closures not always serialized at c...

Posted by mateiz <gi...@git.apache.org>.
Github user mateiz commented on the pull request:

    https://github.com/apache/spark/pull/189#issuecomment-38337889
  
    Interesting, have you tried them without your change too? The PySpark accumulators are backed by Java ones, but maybe there's a problem with these being serialized too early or something. Do Java accumulators in general still work?



[GitHub] spark pull request: SPARK-729: Closures not always serialized at c...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/189#issuecomment-38324080
  
    Merged build started.



[GitHub] spark pull request: SPARK-729: Closures not always serialized at c...

Posted by mateiz <gi...@git.apache.org>.
Github user mateiz commented on the pull request:

    https://github.com/apache/spark/pull/189#issuecomment-40035930
  
    Thanks Will. As per our email discussion, I've merged this in as is, and we can update the call in runJob once we figure out the Python issue.



[GitHub] spark pull request: SPARK-729: Closures not always serialized at c...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/189#issuecomment-38226712
  
    Merged build started.



[GitHub] spark pull request: SPARK-729: Closures not always serialized at c...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/189#issuecomment-39626030
  
     Merged build triggered. 



[GitHub] spark pull request: SPARK-729: Closures not always serialized at c...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/189#issuecomment-38733405
  
    Merged build finished.



[GitHub] spark pull request: SPARK-729: Closures not always serialized at c...

Posted by willb <gi...@git.apache.org>.
Github user willb commented on the pull request:

    https://github.com/apache/spark/pull/189#issuecomment-40086624
  
    OK, I found a couple of failures after rebasing.  I'll look at this today and figure out where things are going wrong.  Thanks, and sorry for the confusion!



[GitHub] spark pull request: SPARK-729: Closures not always serialized at c...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/189#issuecomment-39627124
  
    All automated tests passed.
    Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13789/



[GitHub] spark pull request: SPARK-729: Closures not always serialized at c...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/189#issuecomment-39489550
  
     Merged build triggered. 



[GitHub] spark pull request: SPARK-729: Closures not always serialized at c...

Posted by willb <gi...@git.apache.org>.
Github user willb commented on the pull request:

    https://github.com/apache/spark/pull/189#issuecomment-38738160
  
    In the discussion on #143, we were talking about whether or not it made sense to omit the closure serializability check in `DAGScheduler`.  Since this branch, which combines serializability checking and variable capture, removes serializability checking for the closure argument to `SparkContext.runJob`, I think it makes sense to either (1) re-introduce the check in `DAGScheduler`, or (2) add another option to the internal `clean` methods (in `SparkContext` and `ClosureCleaner`) allowing checking for serializability without forcing variable capture.
    
    I am inclined to prefer the first option but am interested in other opinions on the matter.
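
    For concreteness, here is a hedged sketch of what option (2) might look like; the flag names are hypothetical and the bodies are placeholders, not code from this branch:

    ```scala
    import scala.reflect.ClassTag

    object CleanOptionTwoSketch {
      def clean[F <: AnyRef : ClassTag](
          f: F,
          checkSerializable: Boolean = true,
          captureNow: Boolean = true): F = {
        // ... ordinary closure cleaning would happen here ...
        if (checkSerializable && captureNow) {
          cloneViaSerializing(f)    // validate and capture the environment
        } else if (checkSerializable) {
          cloneViaSerializing(f)    // validate only...
          f                         // ...but hand back the original closure
        } else {
          f
        }
      }

      // Stand-in for a serialize-then-deserialize round trip.
      private def cloneViaSerializing[F <: AnyRef : ClassTag](f: F): F = f
    }
    ```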



[GitHub] spark pull request: SPARK-729: Closures not always serialized at c...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/189#issuecomment-38881864
  
    Can one of the admins verify this patch?



[GitHub] spark pull request: SPARK-729: Closures not always serialized at c...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/189#issuecomment-38733407
  
    All automated tests passed.
    Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13475/



[GitHub] spark pull request: SPARK-729: Closures not always serialized at c...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/189#issuecomment-38658563
  
    One or more automated tests failed
    Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13459/



[GitHub] spark pull request: SPARK-729: Closures not always serialized at c...

Posted by willb <gi...@git.apache.org>.
Github user willb commented on the pull request:

    https://github.com/apache/spark/pull/189#issuecomment-39626070
  
    Matei, my latest commit addresses the style and naming issues; I'll need to dig in to the cause of the Python issue some more over the weekend.  Thanks again for your feedback!



[GitHub] spark pull request: SPARK-729: Closures not always serialized at c...

Posted by pwendell <gi...@git.apache.org>.
Github user pwendell commented on the pull request:

    https://github.com/apache/spark/pull/189#issuecomment-40121349
  
    @willb no problem - it was our fault for not bouncing the tests. Do you mind re-opening this?



[GitHub] spark pull request: SPARK-729: Closures not always serialized at c...

Posted by mateiz <gi...@git.apache.org>.
Github user mateiz commented on a diff in the pull request:

    https://github.com/apache/spark/pull/189#discussion_r11314422
  
    --- Diff: core/src/main/scala/org/apache/spark/util/ClosureCleaner.scala ---
    @@ -101,7 +105,7 @@ private[spark] object ClosureCleaner extends Logging {
         }
       }
       
    -  def clean(func: AnyRef) {
    +  def clean[F <: AnyRef : ClassTag](func: F, checkSerializable: Boolean = true): F = {
    --- End diff --
    
    We may want to rename `checkSerializable` to `captureNow` to make it clear that we're not just checking for serializability, we're serializing and returning the current version of the function.



[GitHub] spark pull request: SPARK-729: Closures not always serialized at c...

Posted by pwendell <gi...@git.apache.org>.
Github user pwendell commented on the pull request:

    https://github.com/apache/spark/pull/189#issuecomment-40057817
  
    Hey @mateiz @willb @andrewor14 @mengxr,
    
    I've been trying to investigate a bunch of Jenkins failures in other pull requests and I believe they might be a result of this patch being merged in. I've reverted this patch in master as the next step. Tomorrow we can see if this fixes the issues. If it doesn't, apologies for the false warning. Note that this means this pull request will need to be re-opened.




[GitHub] spark pull request: SPARK-729: Closures not always serialized at c...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/189#issuecomment-38835915
  
    Merged build started. One or more automated tests failed



[GitHub] spark pull request: SPARK-729: Closures not always serialized at c...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/189#issuecomment-38652915
  
    Merged build started.



[GitHub] spark pull request: SPARK-729: Closures not always serialized at c...

Posted by mateiz <gi...@git.apache.org>.
Github user mateiz commented on a diff in the pull request:

    https://github.com/apache/spark/pull/189#discussion_r11314401
  
    --- Diff: core/src/main/scala/org/apache/spark/util/ClosureCleaner.scala ---
    @@ -150,6 +154,21 @@ private[spark] object ClosureCleaner extends Logging {
           field.setAccessible(true)
           field.set(func, outer)
         }
    +    
    +    if (checkSerializable) {
    +      ensureSerializable(func)
    +    } else {
    +      func
    +    }
    +  }
    +
    +  private def ensureSerializable[T: ClassTag](func: T) = {
    --- End diff --
    
    I'd rename this to `cloneViaSerializing`. Also add an explicit return type to the function (not just `= { ...`, but `: T = { ...`).



[GitHub] spark pull request: SPARK-729: Closures not always serialized at c...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on the pull request:

    https://github.com/apache/spark/pull/189#issuecomment-39612740
  
    Yes - that would be good. Thanks for clarifying. 



[GitHub] spark pull request: SPARK-729: Closures not always serialized at c...

Posted by willb <gi...@git.apache.org>.
Github user willb commented on the pull request:

    https://github.com/apache/spark/pull/189#issuecomment-40122237
  
    Not at all; I'll submit a new PR after I have everything passing locally.  (It doesn't look like it's possible to reopen PRs that have been merged.)  Thanks again.



[GitHub] spark pull request: SPARK-729: Closures not always serialized at c...

Posted by willb <gi...@git.apache.org>.
Github user willb commented on the pull request:

    https://github.com/apache/spark/pull/189#issuecomment-40070735
  
    @pwendell That's odd, since I didn't see any failures locally either.  I haven't rebased this branch in a while, though, so I can check again.  Can you send me a link to the failing tests?



[GitHub] spark pull request: SPARK-729: Closures not always serialized at c...

Posted by pwendell <gi...@git.apache.org>.
Github user pwendell commented on the pull request:

    https://github.com/apache/spark/pull/189#issuecomment-40060812
  
    Okay - confirmed this was the issue. There was a brief period where Jenkins was giving false positives on tests; the test result on this PR from 6 days ago was likely during that period.



[GitHub] spark pull request: SPARK-729: Closures not always serialized at c...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/189#issuecomment-39496205
  
    Merged build finished. All automated tests passed.



[GitHub] spark pull request: SPARK-729: Closures not always serialized at c...

Posted by mateiz <gi...@git.apache.org>.
Github user mateiz commented on the pull request:

    https://github.com/apache/spark/pull/189#issuecomment-39612395
  
    By "clone", I mean serialize it to a byte array, then read it back as an Object from that byte array. That's going to clone it if it's Serializable.
    
    @willb in that test, are you checking the return value? Or just printing it somewhere?



[GitHub] spark pull request: SPARK-729: Closures not always serialized at c...

Posted by mateiz <gi...@git.apache.org>.
Github user mateiz commented on the pull request:

    https://github.com/apache/spark/pull/189#issuecomment-39613300
  
    Alright, made a few comments throughout the code, but it should be good to go with these changes. Thanks for taking a look at this.



[GitHub] spark pull request: SPARK-729: Closures not always serialized at c...

Posted by mateiz <gi...@git.apache.org>.
Github user mateiz commented on the pull request:

    https://github.com/apache/spark/pull/189#issuecomment-39612508
  
    Actually I see you are, but the behavior may be because your `zero` is outside the `withSpark` value, which is also a closure.



[GitHub] spark pull request: SPARK-729: Closures not always serialized at c...

Posted by willb <gi...@git.apache.org>.
Github user willb commented on the pull request:

    https://github.com/apache/spark/pull/189#issuecomment-39610852
  
    So I just added this to `ClosureCleanerSuite` and I'm not seeing the same behavior.   (I already had added a captured-field test to `ClosureCleanerSuite`.)  I don't have a test that depends on deep copying of referenced objects with mutable state, though; all of my tests have objects with mutable state that is of JVM primitive types.



[GitHub] spark pull request: SPARK-729: Closures not always serialized at c...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/189#issuecomment-38843241
  
    All automated tests passed.
    Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13522/



[GitHub] spark pull request: SPARK-729: Closures not always serialized at c...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/189#issuecomment-38658561
  
    Merged build finished.



[GitHub] spark pull request: SPARK-729: Closures not always serialized at c...

Posted by mateiz <gi...@git.apache.org>.
Github user mateiz commented on the pull request:

    https://github.com/apache/spark/pull/189#issuecomment-39604918
  
    Hey Will, I'd actually go even further on this one and *clone* the closure by serializing it when sc.clean() is called. That way it captures exactly the values of the variables it had when it was passed to Spark, not potentially changed values later.
    
    For example, consider something like this:
    ```scala
    val rdd = sc.parallelize(1 to 10)
    val data = Array(1, 2, 3)
    val mapped = rdd.map(x => data(0))
    data(0) = 4
    mapped.first
    ```
    
    Under the current version of Spark, as well as with this patch, this prints out "4", even though we called map() when data(0) was 1. Is this the behavior we want?
    
    I can see this being too big a change for some programs, in which case we could leave it to just check for serializability in 1.0, and make this change later if it takes some further consideration. But it's worth thinking about. CCing @pwendell, @joshrosen, @rxin, @aarondav.



[GitHub] spark pull request: SPARK-729: Closures not always serialized at c...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on the pull request:

    https://github.com/apache/spark/pull/189#issuecomment-39611637
  
    It is hard to guarantee that clone will work correctly for all user-defined objects, so I am not sure it is a good idea to always clone.



[GitHub] spark pull request: SPARK-729: Closures not always serialized at c...

Posted by mateiz <gi...@git.apache.org>.
Github user mateiz commented on a diff in the pull request:

    https://github.com/apache/spark/pull/189#discussion_r11314506
  
    --- Diff: core/src/test/scala/org/apache/spark/serializer/ProactiveClosureSerializationSuite.scala ---
    @@ -0,0 +1,79 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.serializer;
    +
    +import java.io.NotSerializableException
    +
    +import org.scalatest.FunSuite
    +
    +import org.apache.spark.rdd.RDD
    +import org.apache.spark.SparkException
    +import org.apache.spark.SharedSparkContext
    +
    +/* A trivial (but unserializable) container for trivial functions */
    +class UnserializableClass {
    +  def op[T](x: T) = x.toString
    +  
    +  def pred[T](x: T) = x.toString.length % 2 == 0
    +}
    +
    +class ProactiveClosureSerializationSuite extends FunSuite with SharedSparkContext {
    +
    +  def fixture = (sc.parallelize(0 until 1000).map(_.toString), new UnserializableClass)
    +
    +  test("throws expected serialization exceptions on actions") {
    +    val (data, uc) = fixture
    +      
    +    val ex = intercept[SparkException] {
    +      data.map(uc.op(_)).count
    +    }
    +        
    +    assert(ex.getMessage.matches(".*Task not serializable.*"))
    +  }
    +
    +  // There is probably a cleaner way to eliminate boilerplate here, but we're
    +  // iterating over a map from transformation names to functions that perform that
    +  // transformation on a given RDD, creating one test case for each
    +  
    +  for (transformation <- 
    +      Map("map" -> xmap _, "flatMap" -> xflatMap _, "filter" -> xfilter _, "mapWith" -> xmapWith _,
    +          "mapPartitions" -> xmapPartitions _, "mapPartitionsWithIndex" -> xmapPartitionsWithIndex _,
    +          "mapPartitionsWithContext" -> xmapPartitionsWithContext _, "filterWith" -> xfilterWith _)) {
    +    val (name, xf) = transformation
    +    
    +    test(s"$name transformations throw proactive serialization exceptions") {
    +      val (data, uc) = fixture
    +      
    +      val ex = intercept[SparkException] {
    +        xf(data, uc)
    +      }
    +
    +      assert(ex.getMessage.matches(".*Task not serializable.*"), s"RDD.$name doesn't proactively throw NotSerializableException")
    +    }
    +  }
    +  
    +  private def xmap(x: RDD[String], uc: UnserializableClass): RDD[String] = x.map(y=>uc.op(y))
    +  private def xmapWith(x: RDD[String], uc: UnserializableClass): RDD[String] = x.mapWith(x => x.toString)((x,y)=>x + uc.op(y))
    +  private def xflatMap(x: RDD[String], uc: UnserializableClass): RDD[String] = x.flatMap(y=>Seq(uc.op(y)))
    +  private def xfilter(x: RDD[String], uc: UnserializableClass): RDD[String] = x.filter(y=>uc.pred(y))
    +  private def xfilterWith(x: RDD[String], uc: UnserializableClass): RDD[String] = x.filterWith(x => x.toString)((x,y)=>uc.pred(y))
    +  private def xmapPartitions(x: RDD[String], uc: UnserializableClass): RDD[String] = x.mapPartitions(_.map(y=>uc.op(y)))
    +  private def xmapPartitionsWithIndex(x: RDD[String], uc: UnserializableClass): RDD[String] = x.mapPartitionsWithIndex((_, it) => it.map(y=>uc.op(y)))
    +  private def xmapPartitionsWithContext(x: RDD[String], uc: UnserializableClass): RDD[String] = x.mapPartitionsWithContext((_, it) => it.map(y=>uc.op(y)))
    --- End diff --
    
    Split these guys onto multiple lines and make sure to follow the code style in terms of spaces around operators, etc. For example:
    
    ```scala
    private def xmap(x: RDD[String], uc: UnserializableClass): RDD[String] = 
      x.map(y => uc.op(y))
    
    private def xmapWith(x: RDD[String], uc: UnserializableClass): RDD[String] =
      x.mapWith(x => x.toString)((x, y) => x + uc.op(y))
    
    ...
    ```
    
    Also, since this is in a test, there's no need to write `private`, and probably no need to start these things with `x`.


---

[GitHub] spark pull request: SPARK-729: Closures not always serialized at c...

Posted by willb <gi...@git.apache.org>.
Github user willb commented on the pull request:

    https://github.com/apache/spark/pull/189#issuecomment-38330588
  
    @mateiz, I was pretty confused about this, but it looks like the Python accumulator tests are what's failing. I'm not super-familiar with PySpark yet, but I'm trying to figure out where things are going wrong. (If you have any advice on where I should start looking, I'd appreciate it!)


---

[GitHub] spark pull request: SPARK-729: Closures not always serialized at c...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/189#issuecomment-38726850
  
    Merged build started.


---

[GitHub] spark pull request: SPARK-729: Closures not always serialized at c...

Posted by willb <gi...@git.apache.org>.
Github user willb commented on the pull request:

    https://github.com/apache/spark/pull/189#issuecomment-39613399
  
    (BTW, as a matter of style, is it better to rebase my branch or add another commit that makes the change?)


---

[GitHub] spark pull request: SPARK-729: Closures not always serialized at c...

Posted by mateiz <gi...@git.apache.org>.
Github user mateiz commented on the pull request:

    https://github.com/apache/spark/pull/189#issuecomment-38322909
  
    Jenkins, retest this please


---

[GitHub] spark pull request: SPARK-729: Closures not always serialized at c...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/189#issuecomment-38231654
  
    One or more automated tests failed
    Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13308/


---

[GitHub] spark pull request: SPARK-729: Closures not always serialized at c...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/189#issuecomment-38324078
  
    Merged build triggered.


---

[GitHub] spark pull request: SPARK-729: Closures not always serialized at c...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/189#issuecomment-38843239
  
    Merged build finished. All automated tests passed.


---

[GitHub] spark pull request: SPARK-729: Closures not always serialized at c...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/189#issuecomment-38652914
  
    Merged build triggered.


---

[GitHub] spark pull request: SPARK-729: Closures not always serialized at c...

Posted by mateiz <gi...@git.apache.org>.
Github user mateiz commented on the pull request:

    https://github.com/apache/spark/pull/189#issuecomment-39609855
  
    (Note that the same example works if `data` is a simple variable, e.g. `var data = 7`, but I wanted to make it obvious that there's an object there whose field you're modifying.)

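    For reference, a sketch of the simple-variable variant described above, under the same assumptions as the earlier example:

    ```scala
    val rdd = sc.parallelize(1 to 10)
    var data = 7
    val mapped = rdd.map(x => data)
    data = 8
    mapped.first  // returns 8 today, because the captured variable is only
                  // serialized when the job actually runs
    ```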

---