You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by JoshRosen <gi...@git.apache.org> on 2015/04/21 09:53:41 UTC

[GitHub] spark pull request: [SPARK-3386] Share and reuse SerializerInstanc...

GitHub user JoshRosen opened a pull request:

    https://github.com/apache/spark/pull/5606

    [SPARK-3386] Share and reuse SerializerInstances in shuffle code paths

    This patch modifies several shuffle-related code paths to share and re-use SerializerInstances instead of creating new ones.  Some serializers, such as KryoSerializer or SqlSerializer, can be fairly expensive to create or may consume moderate amounts of memory, so it's probably best to avoid unnecessary serializer creation in hot code paths.
    
    The key change in this patch is modifying `getDiskWriter()` / `DiskBlockObjectWriter` to accept `SerializerInstance`s instead of `Serializer`s (which are factories for instances).  This allows the disk writer's creator to decide whether the serializer instance can be shared or re-used.
    
    The rest of the patch modifies several write and read paths to use shared serializers.  One big win is in `ShuffleBlockFetcherIterator`, where we used to create a new serializer per received block.  Similarly, the shuffle write path used to create a new serializer per file even though in many cases only a single thread would be writing to a file at a time.
    
    I made a small serializer reuse optimization in CoarseGrainedExecutorBackend as well, since it seemed like a small and obvious improvement.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/JoshRosen/spark SPARK-3386

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/5606.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #5606
    
----
commit aeb680e986474e9e16bb61b8a3e165d1be1c4c2e
Author: Josh Rosen <jo...@databricks.com>
Date:   2015-04-21T07:22:55Z

    [SPARK-3386] Reuse SerializerInstance in shuffle code paths

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3386] Share and reuse SerializerInstanc...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/5606#issuecomment-94724515
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/30654/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3386] Share and reuse SerializerInstanc...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/5606#issuecomment-94895140
  
      [Test build #30681 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/30681/consoleFull) for   PR 5606 at commit [`f661ce7`](https://github.com/apache/spark/commit/f661ce7f26cf1a6131bebaaa051ed25542b774a2).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.
     * This patch does not change any dependencies.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3386] Share and reuse SerializerInstanc...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/5606#issuecomment-94686460
  
      [Test build #30653 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/30653/consoleFull) for   PR 5606 at commit [`aeb680e`](https://github.com/apache/spark/commit/aeb680e986474e9e16bb61b8a3e165d1be1c4c2e).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3386] Share and reuse SerializerInstanc...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5606#discussion_r28757880
  
    --- Diff: core/src/main/scala/org/apache/spark/executor/CoarseGrainedExecutorBackend.scala ---
    @@ -47,6 +48,11 @@ private[spark] class CoarseGrainedExecutorBackend(
       var executor: Executor = null
       @volatile var driver: Option[RpcEndpointRef] = None
     
    +  // This is a thread-local in case we ever decide to change this to a non-thread-safe RpcEndpoint
    +  private[this] val ser: ThreadLocal[SerializerInstance] = new ThreadLocal[SerializerInstance] {
    --- End diff --
    
    actually now i think about it -- i think if the underlying thread pool keeps creating new threads (which it might), this thread local variable might lead to memory leak. maybe the best way to handle this is with your old one, and add a comment to the beginning of the class saying if we ever change it to support multiple threads, change here as well.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3386] Share and reuse SerializerInstanc...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/5606#issuecomment-94900422
  
      [Test build #30686 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/30686/consoleFull) for   PR 5606 at commit [`f661ce7`](https://github.com/apache/spark/commit/f661ce7f26cf1a6131bebaaa051ed25542b774a2).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3386] Share and reuse SerializerInstanc...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5606#discussion_r28756864
  
    --- Diff: core/src/main/scala/org/apache/spark/executor/CoarseGrainedExecutorBackend.scala ---
    @@ -47,6 +48,8 @@ private[spark] class CoarseGrainedExecutorBackend(
       var executor: Executor = null
       @volatile var driver: Option[RpcEndpointRef] = None
     
    +  private[this] val ser: SerializerInstance = env.closureSerializer.newInstance()
    --- End diff --
    
    should we make a thread local version of this? i'm slightly worried we might change this from ThreadSafeRpcEndpoint to a non-thread-safe version (i.e. multiple threads), and then forget to change this.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3386] Share and reuse SerializerInstanc...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/5606#issuecomment-94878007
  
      [Test build #30681 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/30681/consoleFull) for   PR 5606 at commit [`f661ce7`](https://github.com/apache/spark/commit/f661ce7f26cf1a6131bebaaa051ed25542b774a2).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3386] Share and reuse SerializerInstanc...

Posted by JoshRosen <gi...@git.apache.org>.
Github user JoshRosen commented on the pull request:

    https://github.com/apache/spark/pull/5606#issuecomment-94899383
  
    Jenkins, retest this please.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3386] Share and reuse SerializerInstanc...

Posted by JoshRosen <gi...@git.apache.org>.
Github user JoshRosen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5606#discussion_r28756765
  
    --- Diff: core/src/main/scala/org/apache/spark/shuffle/FileShuffleBlockManager.scala ---
    @@ -133,7 +134,8 @@ class FileShuffleBlockManager(conf: SparkConf)
                   logWarning(s"Failed to remove existing shuffle file $blockFile")
                 }
               }
    -          blockManager.getDiskWriter(blockId, blockFile, serializer, bufferSize, writeMetrics)
    +          blockManager.getDiskWriter(blockId, blockFile, serializerInstance, bufferSize,
    --- End diff --
    
    Note that this line is called one for every bucket (reduce task), since it's enclosed in
    
    ```
    Array.tabulate[BlockObjectWriter](numBuckets) { bucketId =>
    ...
    ```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3386] Share and reuse SerializerInstanc...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/5606


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3386] Share and reuse SerializerInstanc...

Posted by JoshRosen <gi...@git.apache.org>.
Github user JoshRosen commented on the pull request:

    https://github.com/apache/spark/pull/5606#issuecomment-94687302
  
    This made a large performance difference in a `local`-mode SparkSQL test that I was running today, cutting one of my query's times from 30 seconds to 9 seconds.  The benefit may be smaller for non-SQL jobs, since they're less likely to use costly-to-construct serializers.
    
    It would be good to verify that I haven't made any bad thread-safety / single-threadedness assumptions here.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3386] Share and reuse SerializerInstanc...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/5606#issuecomment-94698211
  
      [Test build #30654 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/30654/consoleFull) for   PR 5606 at commit [`64f8398`](https://github.com/apache/spark/commit/64f83982439db4f2ac4e6814dcc6a6ecdea82074).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3386] Share and reuse SerializerInstanc...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/5606#issuecomment-94722521
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/30653/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3386] Share and reuse SerializerInstanc...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/5606#issuecomment-94722505
  
      [Test build #30653 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/30653/consoleFull) for   PR 5606 at commit [`aeb680e`](https://github.com/apache/spark/commit/aeb680e986474e9e16bb61b8a3e165d1be1c4c2e).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.
     * This patch does not change any dependencies.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3386] Share and reuse SerializerInstanc...

Posted by JoshRosen <gi...@git.apache.org>.
Github user JoshRosen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5606#discussion_r28757470
  
    --- Diff: core/src/main/scala/org/apache/spark/executor/CoarseGrainedExecutorBackend.scala ---
    @@ -47,6 +48,8 @@ private[spark] class CoarseGrainedExecutorBackend(
       var executor: Executor = null
       @volatile var driver: Option[RpcEndpointRef] = None
     
    +  private[this] val ser: SerializerInstance = env.closureSerializer.newInstance()
    --- End diff --
    
    Done.  Even if we have to lookup the thread-local, that's probably cheaper than creating a whole new serializer for each task.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3386] Share and reuse SerializerInstanc...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/5606#issuecomment-94724486
  
      [Test build #30654 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/30654/consoleFull) for   PR 5606 at commit [`64f8398`](https://github.com/apache/spark/commit/64f83982439db4f2ac4e6814dcc6a6ecdea82074).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.
     * This patch does not change any dependencies.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3386] Share and reuse SerializerInstanc...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/5606#issuecomment-94930036
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/30686/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3386] Share and reuse SerializerInstanc...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/5606#issuecomment-94930002
  
      [Test build #30686 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/30686/consoleFull) for   PR 5606 at commit [`f661ce7`](https://github.com/apache/spark/commit/f661ce7f26cf1a6131bebaaa051ed25542b774a2).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.
     * This patch does not change any dependencies.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3386] Share and reuse SerializerInstanc...

Posted by JoshRosen <gi...@git.apache.org>.
Github user JoshRosen commented on the pull request:

    https://github.com/apache/spark/pull/5606#issuecomment-94889898
  
    As an example of the speedup that this can give, here's a simple benchmark.
    
    Launch spark-shell with the following options:
    
    ```
    ./bin/spark-shell --conf spark.serializer=org.apache.spark.serializer.KryoSerializer --master=local[8]
    ```
    
    Then, past the following commands into the shell:
    
    ```
    val start = System.currentTimeMillis()
    sc.parallelize(1 to 10000, 100).map(x => (x, x)).reduceByKey(_ + _, 100).count()
    println(System.currentTimeMillis() - start)
    
    ```
    
    Prior to this patch, this takes about 2.5 seconds to run (after a few warmup runs); after this patch, this same query takes around 600ms (after warmup).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3386] Share and reuse SerializerInstanc...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on the pull request:

    https://github.com/apache/spark/pull/5606#issuecomment-94971685
  
    LGTM. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3386] Share and reuse SerializerInstanc...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/5606#issuecomment-94895158
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/30681/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org