
Posted to issues@spark.apache.org by "Chandni Singh (Jira)" <ji...@apache.org> on 2020/01/14 19:58:00 UTC

[jira] [Commented] (SPARK-30512) Use a dedicated boss event group loop in the netty pipeline for external shuffle service

    [ https://issues.apache.org/jira/browse/SPARK-30512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17015362#comment-17015362 ] 

Chandni Singh commented on SPARK-30512:
---------------------------------------

Please assign the issue to me so I can open up a PR.

> Use a dedicated boss event group loop in the netty pipeline for external shuffle service
> ----------------------------------------------------------------------------------------
>
>                 Key: SPARK-30512
>                 URL: https://issues.apache.org/jira/browse/SPARK-30512
>             Project: Spark
>          Issue Type: Bug
>          Components: Shuffle
>    Affects Versions: 3.0.0
>            Reporter: Chandni Singh
>            Priority: Major
>
> We have been seeing a large number of SASL authentication RPC requests time out with the external shuffle service.
>  The issue and all of our analysis are described here:
>  [https://github.com/netty/netty/issues/9890]
> I added a {{LoggingHandler}} to the Netty pipeline and found that even channel registration is delayed by 30 seconds.
>  In the Spark external shuffle service, the boss event loop group and the worker event loop group are the same object, which causes this delay.
> {code:java}
>     EventLoopGroup bossGroup =
>       NettyUtils.createEventLoop(ioMode, conf.serverThreads(), conf.getModuleName() + "-server");
>     EventLoopGroup workerGroup = bossGroup;
>     bootstrap = new ServerBootstrap()
>       .group(bossGroup, workerGroup)
>       .channel(NettyUtils.getServerChannelClass(ioMode))
>       .option(ChannelOption.ALLOCATOR, allocator)
>       .childOption(ChannelOption.ALLOCATOR, allocator);
> {code}
> When the load on the shuffle service increases, the worker threads are busy with existing channels, so registering new channels is delayed.
> The fix is simple. I created a dedicated boss event loop group with 1 thread.
> {code:java}
>     // Dedicated single-threaded group for the server channel that accepts connections.
>     EventLoopGroup bossGroup =
>       NettyUtils.createEventLoop(ioMode, 1, conf.getModuleName() + "-boss");
>     // Separate group that handles I/O on the accepted channels.
>     EventLoopGroup workerGroup =
>       NettyUtils.createEventLoop(ioMode, conf.serverThreads(), conf.getModuleName() + "-server");
>     bootstrap = new ServerBootstrap()
>       .group(bossGroup, workerGroup)
>       .channel(NettyUtils.getServerChannelClass(ioMode))
>       .option(ChannelOption.ALLOCATOR, allocator)
>       .childOption(ChannelOption.ALLOCATOR, allocator);
> {code}
> This fixed the issue.
>  We just need 1 thread in the boss group because there is only a single server bootstrap.
>  
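
For readers who want to see the starvation effect in isolation, here is a minimal sketch that uses only java.util.concurrent rather than Netty. It is an analogy, not Spark or Netty code: Netty event loops are not plain thread pools, and the class name BossGroupSketch, the method registersWhileWorkersBusy, the pool size of 2, and the 500 ms timeout are all assumptions made for illustration. The idea is that a "registration" task submitted to the same pool as the busy workers has to wait in the queue, while a dedicated single-threaded pool can run it right away.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class BossGroupSketch {

  /**
   * Returns true if a new "registration" task completes while every worker
   * thread is still busy. Hypothetical helper, for illustration only.
   */
  public static boolean registersWhileWorkersBusy(boolean dedicatedBoss) throws Exception {
    int workerThreads = 2; // assumed stand-in for conf.serverThreads()
    ExecutorService workers = Executors.newFixedThreadPool(workerThreads);
    // Either a separate single-threaded pool, or the very same pool as the workers.
    ExecutorService boss = dedicatedBoss ? Executors.newSingleThreadExecutor() : workers;
    CountDownLatch release = new CountDownLatch(1);
    try {
      // Occupy every worker thread, as heavy traffic on existing channels would.
      for (int i = 0; i < workerThreads; i++) {
        workers.submit(() -> {
          release.await();
          return null;
        });
      }
      // A new connection arrives and its registration is handed to the boss pool.
      Future<String> registration = boss.submit(() -> "registered");
      try {
        registration.get(500, TimeUnit.MILLISECONDS);
        return true;
      } catch (TimeoutException e) {
        return false; // still queued behind the busy workers
      }
    } finally {
      release.countDown(); // let the blocked worker tasks finish
      workers.shutdown();
      boss.shutdown();     // harmless second call when boss == workers
    }
  }

  public static void main(String[] args) throws Exception {
    System.out.println("shared group:   " + registersWhileWorkersBusy(false));
    System.out.println("dedicated boss: " + registersWhileWorkersBusy(true));
  }
}
```

With the shared pool the registration should time out, and with the dedicated pool it should complete, which mirrors the delay described in the issue and the effect of the fix.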


