Posted to commits@spark.apache.org by tg...@apache.org on 2020/01/29 21:13:46 UTC

[spark] branch branch-2.4 updated: [SPARK-30512] Added a dedicated boss event loop group

This is an automated email from the ASF dual-hosted git repository.

tgraves pushed a commit to branch branch-2.4
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-2.4 by this push:
     new 12f4492  [SPARK-30512] Added a dedicated boss event loop group
12f4492 is described below

commit 12f449220dde0b8476f58806f7dee06fcb54da87
Author: Chandni Singh <ch...@linkedin.com>
AuthorDate: Wed Jan 29 15:02:48 2020 -0600

    [SPARK-30512] Added a dedicated boss event loop group
    
    ### What changes were proposed in this pull request?
    Adding a dedicated boss event loop group to the Netty pipeline in the External Shuffle Service to avoid the delay in channel registration.
    ```
    EventLoopGroup bossGroup = NettyUtils.createEventLoop(ioMode, 1,
      conf.getModuleName() + "-boss");
    EventLoopGroup workerGroup = NettyUtils.createEventLoop(ioMode, conf.serverThreads(),
      conf.getModuleName() + "-server");

    bootstrap = new ServerBootstrap()
      .group(bossGroup, workerGroup)
      .channel(NettyUtils.getServerChannelClass(ioMode))
      .option(ChannelOption.ALLOCATOR, allocator)
    ```
    
    ### Why are the changes needed?
    We have been seeing a large number of SASL authentication RPC requests time out against the external shuffle service.
    ```
    java.lang.RuntimeException: java.util.concurrent.TimeoutException: Timeout waiting for task.
    	at org.spark-project.guava.base.Throwables.propagate(Throwables.java:160)
    	at org.apache.spark.network.client.TransportClient.sendRpcSync(TransportClient.java:278)
    	at org.apache.spark.network.sasl.SaslClientBootstrap.doBootstrap(SaslClientBootstrap.java:80)
    	at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:228)
    	at org.apache.spark.network.client.TransportClientFactory.createUnmanagedClient(TransportClientFactory.java:181)
    	at org.apache.spark.network.shuffle.ExternalShuffleClient.registerWithShuffleServer(ExternalShuffleClient.java:141)
    	at org.apache.spark.storage.BlockManager$$anonfun$registerWithExternalShuffleServer$1.apply$mcVI$sp(BlockManager.scala:218)
    ```
    The investigation that we have done is described here:
    https://github.com/netty/netty/issues/9890
    
    After adding `LoggingHandler` to the Netty pipeline, we saw that channel registration was being delayed because the worker threads were busy with existing channels.
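    
    The effect can be illustrated without Netty. The following is a minimal sketch, not the Spark or Netty implementation: plain `java.util.concurrent` executors stand in for event loop groups, and the class `BossGroupSketch` and method `registersWhileWorkersBusy` are hypothetical names introduced only for this illustration. It checks whether a small "registration" task can complete while every worker thread is occupied, first with a shared pool (analogous to the old setup) and then with a dedicated single-threaded boss.
    
    ```java
    import java.util.concurrent.CountDownLatch;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;
    import java.util.concurrent.TimeUnit;
    import java.util.concurrent.TimeoutException;
    
    public class BossGroupSketch {
    
      /**
       * Returns true if a "registration" task finishes while all worker
       * threads are still busy. Executors are stand-ins for event loop
       * groups; this is an analogy, not Netty code.
       */
      static boolean registersWhileWorkersBusy(boolean dedicatedBoss, int workerThreads)
          throws Exception {
        ExecutorService workers = Executors.newFixedThreadPool(workerThreads);
        // Dedicated boss: registration gets its own thread.
        // Otherwise it shares the worker pool, as before the patch.
        ExecutorService boss =
            dedicatedBoss ? Executors.newSingleThreadExecutor() : workers;
    
        CountDownLatch allBusy = new CountDownLatch(workerThreads);
        CountDownLatch release = new CountDownLatch(1);
        try {
          // Occupy every worker thread, as long-lived channels might.
          for (int i = 0; i < workerThreads; i++) {
            workers.submit(() -> {
              allBusy.countDown();
              try {
                release.await();
              } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
              }
            });
          }
          allBusy.await();
    
          // Submit the registration and give it a short window to run.
          Future<?> registration = boss.submit(() -> { });
          try {
            registration.get(500, TimeUnit.MILLISECONDS);
            return true;
          } catch (TimeoutException e) {
            return false;
          }
        } finally {
          // Unblock the workers and let both executors drain and exit.
          release.countDown();
          workers.shutdown();
          boss.shutdown();
        }
      }
    
      public static void main(String[] args) throws Exception {
        System.out.println("shared group:   " + registersWhileWorkersBusy(false, 2));
        System.out.println("dedicated boss: " + registersWhileWorkersBusy(true, 2));
      }
    }
    ```
    
    In this sketch the shared-pool case should time out, because the registration task sits in the queue behind the busy workers, while the dedicated-boss case should complete right away. Real event loops multiplex many channels per thread, so the actual delay depends on how long each loop spends on existing channels.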
    
    ### Does this PR introduce any user-facing change?
    No
    
    ### How was this patch tested?
    We have tested the patch on our clusters and with a stress testing tool. After this change, we didn't see any SASL requests timing out. Existing unit tests pass.
    
    Closes #27240 from otterc/SPARK-30512.
    
    Authored-by: Chandni Singh <ch...@linkedin.com>
    Signed-off-by: Thomas Graves <tg...@apache.org>
    (cherry picked from commit 6b47ace27d04012bcff47951ea1eea2aa6fb7d60)
    Signed-off-by: Thomas Graves <tg...@apache.org>
---
 .../main/java/org/apache/spark/network/server/TransportServer.java | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/common/network-common/src/main/java/org/apache/spark/network/server/TransportServer.java b/common/network-common/src/main/java/org/apache/spark/network/server/TransportServer.java
index 9c85ab2..da3fe30 100644
--- a/common/network-common/src/main/java/org/apache/spark/network/server/TransportServer.java
+++ b/common/network-common/src/main/java/org/apache/spark/network/server/TransportServer.java
@@ -91,9 +91,10 @@ public class TransportServer implements Closeable {
   private void init(String hostToBind, int portToBind) {
 
     IOMode ioMode = IOMode.valueOf(conf.ioMode());
-    EventLoopGroup bossGroup =
-      NettyUtils.createEventLoop(ioMode, conf.serverThreads(), conf.getModuleName() + "-server");
-    EventLoopGroup workerGroup = bossGroup;
+    EventLoopGroup bossGroup = NettyUtils.createEventLoop(ioMode, 1,
+      conf.getModuleName() + "-boss");
+    EventLoopGroup workerGroup =  NettyUtils.createEventLoop(ioMode, conf.serverThreads(),
+      conf.getModuleName() + "-server");
 
     PooledByteBufAllocator allocator = NettyUtils.createPooledByteBufAllocator(
       conf.preferDirectBufs(), true /* allowCache */, conf.serverThreads());

