You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by andrewor14 <gi...@git.apache.org> on 2015/04/29 21:57:13 UTC

[GitHub] spark pull request: [SPARK-7237] Many user provided closures are n...

GitHub user andrewor14 opened a pull request:

    https://github.com/apache/spark/pull/5787

    [SPARK-7237] Many user provided closures are not actually cleaned

    Note: ~110 lines are tests.
    
    In a nutshell, we never cleaned closures provided through the following operations:
    - sortBy
    - keyBy
    - mapPartitions
    - mapPartitionsWithIndex
    - aggregateByKey
    - foldByKey
    - foreachAsync
    - one of the aliases for runJob
    
    For more details on a reproduction and why they were not cleaned, please see [SPARK-7237](https://issues.apache.org/jira/browse/SPARK-7237).

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/andrewor14/spark clean-more

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/5787.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #5787
    
----
commit 19e33b4f6e3e8534fa6f7b63de458a0ee11b14d6
Author: Andrew Or <an...@databricks.com>
Date:   2015-04-29T19:42:38Z

    Add tests for all public RDD APIs that take in closures
    
    Tests should fail as of this commit because the issue hasn't been
    fixed yet.

commit 9ac5f9b818713f8a7699f44994e4dbab2ebc9dba
Author: Andrew Or <an...@databricks.com>
Date:   2015-04-29T19:51:23Z

    Clean closures that are not currently cleaned
    
    Now the test added in the previous commit passes!

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-7237] Many user provided closures are n...

Posted by zsxwing <gi...@git.apache.org>.
Github user zsxwing commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5787#discussion_r29396990
  
    --- Diff: core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala ---
    @@ -131,7 +131,10 @@ class PairRDDFunctions[K, V](self: RDD[(K, V)])
         lazy val cachedSerializer = SparkEnv.get.serializer.newInstance()
         val createZero = () => cachedSerializer.deserialize[U](ByteBuffer.wrap(zeroArray))
     
    -    combineByKey[U]((v: V) => seqOp(createZero(), v), seqOp, combOp, partitioner)
    +    val cleanedSeqOp = self.context.clean(seqOp)
    +    val cleanedCombOp = self.context.clean(combOp)
    +    combineByKey[U](
    +      (v: V) => cleanedSeqOp(createZero(), v), cleanedSeqOp, cleanedCombOp, partitioner)
    --- End diff --
    
    No sure if needs to clean `seqOp`. Since `seqOp` will be passed to `combineByKey`,  looks combineByKey will clean it.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-7237] Many user provided closures are n...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/5787#issuecomment-97635393
  
      [Test build #31359 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/31359/consoleFull) for   PR 5787 at commit [`6498f44`](https://github.com/apache/spark/commit/6498f4427667bf3e956e9e5903c80742a511111d).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.
     * This patch does not change any dependencies.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-7237] Many user provided closures are n...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/5787#issuecomment-97573619
  
      [Test build #31318 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/31318/consoleFull) for   PR 5787 at commit [`e83699e`](https://github.com/apache/spark/commit/e83699ef94a47b17ee4ea7fe07678b3b48633c89).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-7237] Many user provided closures are n...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/5787#issuecomment-97595157
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-7237] Many user provided closures are n...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/5787#issuecomment-97608706
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/31328/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-7237] Many user provided closures are n...

Posted by andrewor14 <gi...@git.apache.org>.
Github user andrewor14 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5787#discussion_r29397858
  
    --- Diff: core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala ---
    @@ -131,7 +131,10 @@ class PairRDDFunctions[K, V](self: RDD[(K, V)])
         lazy val cachedSerializer = SparkEnv.get.serializer.newInstance()
         val createZero = () => cachedSerializer.deserialize[U](ByteBuffer.wrap(zeroArray))
     
    -    combineByKey[U]((v: V) => seqOp(createZero(), v), seqOp, combOp, partitioner)
    +    val cleanedSeqOp = self.context.clean(seqOp)
    +    val cleanedCombOp = self.context.clean(combOp)
    +    combineByKey[U](
    +      (v: V) => cleanedSeqOp(createZero(), v), cleanedSeqOp, cleanedCombOp, partitioner)
    --- End diff --
    
    Yes, but it doesn't hurt to clean it again anyway I don't think. Otherwise we have to track where the closures end up being passed to and rely on that method not changing in behavior in the future.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-7237] Many user provided closures are n...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/5787#issuecomment-97608705
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-7237] Many user provided closures are n...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5787#discussion_r29398986
  
    --- Diff: core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala ---
    @@ -131,7 +131,10 @@ class PairRDDFunctions[K, V](self: RDD[(K, V)])
         lazy val cachedSerializer = SparkEnv.get.serializer.newInstance()
         val createZero = () => cachedSerializer.deserialize[U](ByteBuffer.wrap(zeroArray))
     
    -    combineByKey[U]((v: V) => seqOp(createZero(), v), seqOp, combOp, partitioner)
    +    val cleanedSeqOp = self.context.clean(seqOp)
    +    val cleanedCombOp = self.context.clean(combOp)
    +    combineByKey[U](
    +      (v: V) => cleanedSeqOp(createZero(), v), cleanedSeqOp, cleanedCombOp, partitioner)
    --- End diff --
    
    You are already tracking it ain't you? Since reduceBykey is not cleaning ...


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-7237] Many user provided closures are n...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/5787#issuecomment-97621338
  
     Merged build triggered.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-7237] Many user provided closures are n...

Posted by zsxwing <gi...@git.apache.org>.
Github user zsxwing commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5787#discussion_r29395906
  
    --- Diff: core/src/main/scala/org/apache/spark/SparkContext.scala ---
    @@ -1686,7 +1687,8 @@ class SparkContext(config: SparkConf) extends Logging with ExecutorAllocationCli
         val callSite = getCallSite
         logInfo("Starting job: " + callSite.shortForm)
         val start = System.nanoTime
    -    val result = dagScheduler.runApproximateJob(rdd, func, evaluator, callSite, timeout,
    +    val cleanedFunc = clean(func)
    +    val result = dagScheduler.runApproximateJob(rdd, cleanedFunc, evaluator, callSite, timeout,
    --- End diff --
    
    No need to clean here.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-7237] Many user provided closures are n...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/5787#issuecomment-97577517
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/31318/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-7237] Many user provided closures are n...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/5787#issuecomment-97565439
  
    Merged build started.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-7237] Many user provided closures are n...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/5787#issuecomment-97573067
  
     Merged build triggered.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-7237] Many user provided closures are n...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/5787#issuecomment-97577494
  
      [Test build #31318 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/31318/consoleFull) for   PR 5787 at commit [`e83699e`](https://github.com/apache/spark/commit/e83699ef94a47b17ee4ea7fe07678b3b48633c89).
     * This patch **fails MiMa tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.
     * This patch does not change any dependencies.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-7237] Many user provided closures are n...

Posted by zsxwing <gi...@git.apache.org>.
Github user zsxwing commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5787#discussion_r29396908
  
    --- Diff: core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala ---
    @@ -177,7 +180,8 @@ class PairRDDFunctions[K, V](self: RDD[(K, V)])
         lazy val cachedSerializer = SparkEnv.get.serializer.newInstance()
         val createZero = () => cachedSerializer.deserialize[V](ByteBuffer.wrap(zeroArray))
     
    -    combineByKey[V]((v: V) => func(createZero(), v), func, func, partitioner)
    +    val cleanedFunc = self.context.clean(func)
    +    combineByKey[V]((v: V) => cleanedFunc(createZero(), v), cleanedFunc, cleanedFunc, partitioner)
    --- End diff --
    
    No sure if needs to clean `func`. Since `func` will be passed to `combineByKey`, `combineByKey` will clean it.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-7237] Many user provided closures are n...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/5787#issuecomment-97577514
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-7237] Many user provided closures are n...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/5787#issuecomment-97565922
  
      [Test build #31314 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/31314/consoleFull) for   PR 5787 at commit [`8ac3074`](https://github.com/apache/spark/commit/8ac3074282dcb222a442e0d649256806bcf1f0c2).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-7237] Many user provided closures are n...

Posted by zsxwing <gi...@git.apache.org>.
Github user zsxwing commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5787#discussion_r29396836
  
    --- Diff: core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala ---
    @@ -131,7 +131,10 @@ class PairRDDFunctions[K, V](self: RDD[(K, V)])
         lazy val cachedSerializer = SparkEnv.get.serializer.newInstance()
         val createZero = () => cachedSerializer.deserialize[U](ByteBuffer.wrap(zeroArray))
     
    -    combineByKey[U]((v: V) => seqOp(createZero(), v), seqOp, combOp, partitioner)
    +    val cleanedSeqOp = self.context.clean(seqOp)
    +    val cleanedCombOp = self.context.clean(combOp)
    +    combineByKey[U](
    +      (v: V) => cleanedSeqOp(createZero(), v), cleanedSeqOp, cleanedCombOp, partitioner)
    --- End diff --
    
    No need to clean `combOp`. `combineByKey` will clean it.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-7237] Many user provided closures are n...

Posted by zsxwing <gi...@git.apache.org>.
Github user zsxwing commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5787#discussion_r29395964
  
    --- Diff: core/src/main/scala/org/apache/spark/rdd/AsyncRDDActions.scala ---
    @@ -119,7 +119,8 @@ class AsyncRDDActions[T: ClassTag](self: RDD[T]) extends Serializable with Loggi
        * Applies a function f to each partition of this RDD.
        */
       def foreachPartitionAsync(f: Iterator[T] => Unit): FutureAction[Unit] = {
    -    self.context.submitJob[T, Unit, Unit](self, f, Range(0, self.partitions.length),
    +    val cleanedF = self.context.clean(f)
    +    self.context.submitJob[T, Unit, Unit](self, cleanedF, Range(0, self.partitions.length),
    --- End diff --
    
    No need to clean here.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-7237] Many user provided closures are n...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/5787#issuecomment-97635402
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/31359/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-7237] Many user provided closures are n...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/5787#issuecomment-97585874
  
    Merged build started.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-7237] Many user provided closures are n...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/5787#issuecomment-97621496
  
      [Test build #31359 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/31359/consoleFull) for   PR 5787 at commit [`6498f44`](https://github.com/apache/spark/commit/6498f4427667bf3e956e9e5903c80742a511111d).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-7237] Many user provided closures are n...

Posted by zsxwing <gi...@git.apache.org>.
Github user zsxwing commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5787#discussion_r29447824
  
    --- Diff: core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala ---
    @@ -131,7 +131,10 @@ class PairRDDFunctions[K, V](self: RDD[(K, V)])
         lazy val cachedSerializer = SparkEnv.get.serializer.newInstance()
         val createZero = () => cachedSerializer.deserialize[U](ByteBuffer.wrap(zeroArray))
     
    -    combineByKey[U]((v: V) => seqOp(createZero(), v), seqOp, combOp, partitioner)
    +    val cleanedSeqOp = self.context.clean(seqOp)
    +    val cleanedCombOp = self.context.clean(combOp)
    +    combineByKey[U](
    +      (v: V) => cleanedSeqOp(createZero(), v), cleanedSeqOp, cleanedCombOp, partitioner)
    --- End diff --
    
    I think we have a rule here implicitly. All public methods in RDD(PairRDDFunctions, ...) that accepts closure arguments should be responsible for cleaning the closures. 
    
    So we don't need to track where the closures end up. Just check if the closure argument is passed to a public method. If so, we don't need to clean it. Otherwise, it should be cleaned.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-7237] Many user provided closures are n...

Posted by andrewor14 <gi...@git.apache.org>.
Github user andrewor14 commented on the pull request:

    https://github.com/apache/spark/pull/5787#issuecomment-97585346
  
    Um, retest this please


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-7237] Many user provided closures are n...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/5787#issuecomment-97635401
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-7237] Many user provided closures are n...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/5787#issuecomment-97586184
  
      [Test build #31328 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/31328/consoleFull) for   PR 5787 at commit [`e83699e`](https://github.com/apache/spark/commit/e83699ef94a47b17ee4ea7fe07678b3b48633c89).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-7237] Many user provided closures are n...

Posted by zsxwing <gi...@git.apache.org>.
Github user zsxwing commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5787#discussion_r29396754
  
    --- Diff: core/src/main/scala/org/apache/spark/rdd/AsyncRDDActions.scala ---
    @@ -119,7 +119,8 @@ class AsyncRDDActions[T: ClassTag](self: RDD[T]) extends Serializable with Loggi
        * Applies a function f to each partition of this RDD.
        */
       def foreachPartitionAsync(f: Iterator[T] => Unit): FutureAction[Unit] = {
    -    self.context.submitJob[T, Unit, Unit](self, f, Range(0, self.partitions.length),
    +    val cleanedF = self.context.clean(f)
    +    self.context.submitJob[T, Unit, Unit](self, cleanedF, Range(0, self.partitions.length),
    --- End diff --
    
    No need to clean here. `submitJob` will clean `f`. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-7237] Many user provided closures are n...

Posted by zsxwing <gi...@git.apache.org>.
Github user zsxwing commented on the pull request:

    https://github.com/apache/spark/pull/5787#issuecomment-97623263
  
    Sorry. Just deleted my wrong comments. 
    
    `mapPartitionsWithContext` needs to be updated too.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-7237] Many user provided closures are n...

Posted by andrewor14 <gi...@git.apache.org>.
Github user andrewor14 commented on the pull request:

    https://github.com/apache/spark/pull/5787#issuecomment-97565524
  
    @zsxwing @pwendell Please have a look.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-7237] Many user provided closures are n...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/5787#issuecomment-97595159
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/31314/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-7237] Many user provided closures are n...

Posted by andrewor14 <gi...@git.apache.org>.
Github user andrewor14 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5787#discussion_r29395477
  
    --- Diff: core/src/test/scala/org/apache/spark/util/ClosureCleanerSuite.scala ---
    @@ -180,3 +232,90 @@ class TestClassWithNesting(val y: Int) extends Serializable {
         }
       }
     }
    +
    +/**
    + * Test whether closures passed in through public APIs are actually cleaned.
    + *
    + * We put a return statement in each of these closures as a mechanism to detect whether the
    + * ClosureCleaner actually cleaned our closure. If it did, then it would throw an appropriate
    + * exception explicitly complaining about the return statement. Otherwise, we know the
    + * ClosureCleaner did not actually clean our closure, in which case we should fail the test.
    + */
    +private object TestUserClosuresActuallyCleaned {
    +  def testMap(rdd: RDD[Int]): Unit = { rdd.map { _ => return; 0 }.count() }
    +  def testFlatMap(rdd: RDD[Int]): Unit = { rdd.flatMap { _ => return; Seq() }.count() }
    +  def testFilter(rdd: RDD[Int]): Unit = { rdd.filter { _ => return; true }.count() }
    +  def testSortBy(rdd: RDD[Int]): Unit = { rdd.sortBy { _ => return; 1 }.count() }
    +  def testKeyBy(rdd: RDD[Int]): Unit = { rdd.keyBy { _ => return; 1 }.count() }
    +  def testGroupBy(rdd: RDD[Int]): Unit = { rdd.groupBy { _ => return; 1 }.count() }
    --- End diff --
    
    Oops good catch.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-7237] Many user provided closures are n...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/5787#issuecomment-97595136
  
      [Test build #31314 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/31314/consoleFull) for   PR 5787 at commit [`8ac3074`](https://github.com/apache/spark/commit/8ac3074282dcb222a442e0d649256806bcf1f0c2).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.
     * This patch does not change any dependencies.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-7237] Many user provided closures are n...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/5787#issuecomment-97573108
  
    Merged build started.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-7237] Many user provided closures are n...

Posted by zsxwing <gi...@git.apache.org>.
Github user zsxwing commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5787#discussion_r29396020
  
    --- Diff: core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala ---
    @@ -131,7 +131,10 @@ class PairRDDFunctions[K, V](self: RDD[(K, V)])
         lazy val cachedSerializer = SparkEnv.get.serializer.newInstance()
         val createZero = () => cachedSerializer.deserialize[U](ByteBuffer.wrap(zeroArray))
     
    -    combineByKey[U]((v: V) => seqOp(createZero(), v), seqOp, combOp, partitioner)
    +    val cleanedSeqOp = self.context.clean(seqOp)
    +    val cleanedCombOp = self.context.clean(combOp)
    +    combineByKey[U](
    +      (v: V) => cleanedSeqOp(createZero(), v), cleanedSeqOp, cleanedCombOp, partitioner)
    --- End diff --
    
    No need to clean `combOp`


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-7237] Many user provided closures are n...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/5787#issuecomment-97621350
  
    Merged build started.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-7237] Many user provided closures are n...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/5787#issuecomment-97608698
  
      [Test build #31328 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/31328/consoleFull) for   PR 5787 at commit [`e83699e`](https://github.com/apache/spark/commit/e83699ef94a47b17ee4ea7fe07678b3b48633c89).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.
     * This patch does not change any dependencies.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-7237] Many user provided closures are n...

Posted by zsxwing <gi...@git.apache.org>.
Github user zsxwing commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5787#discussion_r29386473
  
    --- Diff: core/src/test/scala/org/apache/spark/util/ClosureCleanerSuite.scala ---
    @@ -180,3 +232,90 @@ class TestClassWithNesting(val y: Int) extends Serializable {
         }
       }
     }
    +
    +/**
    + * Test whether closures passed in through public APIs are actually cleaned.
    + *
    + * We put a return statement in each of these closures as a mechanism to detect whether the
    + * ClosureCleaner actually cleaned our closure. If it did, then it would throw an appropriate
    + * exception explicitly complaining about the return statement. Otherwise, we know the
    + * ClosureCleaner did not actually clean our closure, in which case we should fail the test.
    + */
    +private object TestUserClosuresActuallyCleaned {
    +  def testMap(rdd: RDD[Int]): Unit = { rdd.map { _ => return; 0 }.count() }
    +  def testFlatMap(rdd: RDD[Int]): Unit = { rdd.flatMap { _ => return; Seq() }.count() }
    +  def testFilter(rdd: RDD[Int]): Unit = { rdd.filter { _ => return; true }.count() }
    +  def testSortBy(rdd: RDD[Int]): Unit = { rdd.sortBy { _ => return; 1 }.count() }
    +  def testKeyBy(rdd: RDD[Int]): Unit = { rdd.keyBy { _ => return; 1 }.count() }
    +  def testGroupBy(rdd: RDD[Int]): Unit = { rdd.groupBy { _ => return; 1 }.count() }
    --- End diff --
    
    No test for `testGroupBy`


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-7237] Many user provided closures are n...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/5787#issuecomment-97585850
  
     Merged build triggered.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-7237] Many user provided closures are n...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/5787#issuecomment-97565401
  
     Merged build triggered.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org