Posted to commits@spark.apache.org by tg...@apache.org on 2020/03/30 17:46:12 UTC

[spark] branch branch-3.0 updated: [SPARK-31219][YARN] Enable closeIdleConnections in YarnShuffleService

This is an automated email from the ASF dual-hosted git repository.

tgraves pushed a commit to branch branch-3.0
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.0 by this push:
     new 7329c25  [SPARK-31219][YARN] Enable closeIdleConnections in YarnShuffleService
7329c25 is described below

commit 7329c256c6d02cbc700d367320ef20d215bca8aa
Author: manuzhang <ow...@gmail.com>
AuthorDate: Mon Mar 30 12:44:46 2020 -0500

    [SPARK-31219][YARN] Enable closeIdleConnections in YarnShuffleService
    
    ### What changes were proposed in this pull request?
    Close idle connections on the shuffle server side when an `IdleStateEvent` fires after a connection has been idle for `spark.shuffle.io.connectionTimeout` (or `spark.network.timeout`). The change is based on the following investigation.
    
    1. We found connections on our clusters building up continuously (> 10k on some nodes). Is that normal? We don't think so.
    2. We looked into the connections on one node and found many half-open connections (connections that existed on only one side).
    3. We also confirmed that those connections were very old (> 21 hours). (FYI, https://superuser.com/questions/565991/how-to-determine-the-socket-connection-up-time-on-linux)
    4. Looking at the code, TransportContext registers an IdleStateHandler, which should fire an IdleStateEvent on timeout. We took a heap dump of the YarnShuffleService and checked the attributes of the IdleStateHandler instances. firstAllIdleEvent was already false for many of them, which means their IdleStateEvent had already been fired.
    5. Finally, we realized that the IdleStateEvent was never acted on, because closeIdleConnections is hardcoded to false for YarnShuffleService.
    
    ### Why are the changes needed?
    Idle connections to YarnShuffleService are never closed, so they accumulate and take up memory and file descriptors.
    
    ### Does this PR introduce any user-facing change?
    No.
    
    ### How was this patch tested?
    Existing tests.
    
    Closes #27998 from manuzhang/spark-31219.
    
    Authored-by: manuzhang <ow...@gmail.com>
    Signed-off-by: Thomas Graves <tg...@apache.org>
    (cherry picked from commit 0d997e5156a751c99cd6f8be1528ed088a585d1f)
    Signed-off-by: Thomas Graves <tg...@apache.org>
---
 .../src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java b/common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java
index 815a56d..c41efba 100644
--- a/common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java
+++ b/common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java
@@ -188,7 +188,7 @@ public class YarnShuffleService extends AuxiliaryService {
 
       int port = conf.getInt(
         SPARK_SHUFFLE_SERVICE_PORT_KEY, DEFAULT_SPARK_SHUFFLE_SERVICE_PORT);
-      transportContext = new TransportContext(transportConf, blockHandler);
+      transportContext = new TransportContext(transportConf, blockHandler, true);
       shuffleServer = transportContext.createServer(port, bootstraps);
       // the port should normally be fixed, but for tests its useful to find an open port
       port = shuffleServer.getPort();
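
The one-line change above passes `true` as the third `TransportContext` constructor argument, which the commit message refers to as closeIdleConnections. As a rough illustration of why that flag matters, here is a minimal, self-contained sketch of the decision a server makes when an all-idle event fires. It is a simplified model and not Spark's or Netty's actual code: the class `IdleConnectionSketch`, the `Action` enum, and the method `onAllIdleEvent` are hypothetical names, and the rule that a connection with timed-out in-flight requests is closed regardless of the flag is an assumption of this model.

```java
import java.util.concurrent.TimeUnit;

/**
 * Hypothetical model of the idle-connection decision. All names here are
 * made up for illustration; this is not code from Spark or Netty.
 */
public class IdleConnectionSketch {

  /** What the server does with a connection when an all-idle event fires. */
  enum Action { KEEP_OPEN, CLOSE }

  private final boolean closeIdleConnections;
  private final long timeoutNanos;

  IdleConnectionSketch(boolean closeIdleConnections, long timeoutMillis) {
    this.closeIdleConnections = closeIdleConnections;
    this.timeoutNanos = TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
  }

  /**
   * Decide what to do when the idle timer fires.
   *
   * @param nowNanos            current time
   * @param lastActivityNanos   time of the last request or response on the connection
   * @param outstandingRequests number of requests still waiting for a response
   */
  Action onAllIdleEvent(long nowNanos, long lastActivityNanos, int outstandingRequests) {
    boolean overdue = nowNanos - lastActivityNanos > timeoutNanos;
    if (!overdue) {
      return Action.KEEP_OPEN;
    }
    if (outstandingRequests > 0) {
      // Assumption of this model: in-flight requests that timed out always close the connection.
      return Action.CLOSE;
    }
    // A purely idle connection is closed only when the flag is enabled.
    return closeIdleConnections ? Action.CLOSE : Action.KEEP_OPEN;
  }

  public static void main(String[] args) {
    long timeoutMillis = 120_000L;              // illustrative timeout of 120 seconds
    long now = TimeUnit.HOURS.toNanos(21);      // connection idle for 21 hours, as in the report
    long lastActivity = 0L;

    IdleConnectionSketch before = new IdleConnectionSketch(false, timeoutMillis);
    IdleConnectionSketch after = new IdleConnectionSketch(true, timeoutMillis);

    System.out.println("flag=false: " + before.onAllIdleEvent(now, lastActivity, 0));
    System.out.println("flag=true:  " + after.onAllIdleEvent(now, lastActivity, 0));
  }
}
```

In this model, a connection that has been idle for 21 hours with no outstanding requests stays open while the flag is `false` and is closed once the flag is `true`, which matches the accumulation described in the commit message.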

