Posted to issues@spark.apache.org by "Saisai Shao (JIRA)" <ji...@apache.org> on 2017/08/23 13:13:01 UTC

[jira] [Commented] (SPARK-17321) YARN shuffle service should use good disk from yarn.nodemanager.local-dirs

    [ https://issues.apache.org/jira/browse/SPARK-17321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16138320#comment-16138320 ] 

Saisai Shao commented on SPARK-17321:
-------------------------------------

We're facing the same issue. I think the YARN shuffle service should work like this:

* If NM recovery is not enabled, Spark will not persist data into leveldb. In that case the YARN shuffle service can still serve shuffle data, but it loses the ability to recover state (which is fine, because an NM failure kills the containers and the applications anyway).
* If NM recovery is enabled, the user or YARN should guarantee that the recovery path is on a reliable disk, since that path is also crucial for the NM itself to recover.

What do you think, [~tgraves]?

I'm currently working on the first item, avoiding persisting data into leveldb, to see whether this is a feasible solution.
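
To make that first item concrete, here is a minimal standalone sketch of the idea, gating any leveldb-backed state behind the NM recovery flag. This is not the actual YarnShuffleService code; the class, field, and method names are made up for illustration, and the leveldb write itself is omitted.

{code:java}
// Hypothetical sketch: only touch persistent state when NM recovery is on.
import java.io.File;
import java.util.concurrent.ConcurrentHashMap;

import org.apache.hadoop.conf.Configuration;

public class RecoveryAwareShuffleState {

  // YARN key that signals work-preserving NM recovery is enabled.
  private static final String NM_RECOVERY_ENABLED = "yarn.nodemanager.recovery.enabled";

  // In-memory registry of executors; always populated.
  private final ConcurrentHashMap<String, String> executors = new ConcurrentHashMap<>();

  // Backing file for recovery; null when recovery is disabled.
  private final File registeredExecutorFile;

  public RecoveryAwareShuffleState(Configuration conf, File recoveryDir) {
    boolean nmRecoveryEnabled = conf.getBoolean(NM_RECOVERY_ENABLED, false);
    if (nmRecoveryEnabled) {
      // Persist only under the recovery path that YARN itself relies on,
      // which the user/YARN is expected to keep on a reliable disk.
      this.registeredExecutorFile = new File(recoveryDir, "registeredExecutors.ldb");
    } else {
      // No persistence: the service still serves shuffle data, it just cannot
      // recover state after an NM restart (which kills the containers anyway).
      this.registeredExecutorFile = null;
    }
  }

  public void registerExecutor(String appExecId, String shuffleInfo) {
    executors.put(appExecId, shuffleInfo);
    if (registeredExecutorFile != null) {
      // ... write-through to leveldb here (omitted in this sketch) ...
    }
  }
}
{code}

With this shape, a bad disk under yarn.nodemanager.local-dirs would only matter for persistence when recovery is actually enabled.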

> YARN shuffle service should use good disk from yarn.nodemanager.local-dirs
> --------------------------------------------------------------------------
>
>                 Key: SPARK-17321
>                 URL: https://issues.apache.org/jira/browse/SPARK-17321
>             Project: Spark
>          Issue Type: Bug
>          Components: YARN
>    Affects Versions: 1.6.2, 2.0.0, 2.1.1
>            Reporter: yunjiong zhao
>
> We run Spark on YARN. After enabling Spark dynamic allocation, we noticed that some Spark applications failed randomly due to the YarnShuffleService.
> From the log I found:
> {quote}
> 2016-08-29 11:33:03,450 ERROR org.apache.spark.network.TransportContext: Error while initializing Netty pipeline
> java.lang.NullPointerException
>         at org.apache.spark.network.server.TransportRequestHandler.<init>(TransportRequestHandler.java:77)
>         at org.apache.spark.network.TransportContext.createChannelHandler(TransportContext.java:159)
>         at org.apache.spark.network.TransportContext.initializePipeline(TransportContext.java:135)
>         at org.apache.spark.network.server.TransportServer$1.initChannel(TransportServer.java:123)
>         at org.apache.spark.network.server.TransportServer$1.initChannel(TransportServer.java:116)
>         at io.netty.channel.ChannelInitializer.channelRegistered(ChannelInitializer.java:69)
>         at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRegistered(AbstractChannelHandlerContext.java:133)
>         at io.netty.channel.AbstractChannelHandlerContext.fireChannelRegistered(AbstractChannelHandlerContext.java:119)
>         at io.netty.channel.DefaultChannelPipeline.fireChannelRegistered(DefaultChannelPipeline.java:733)
>         at io.netty.channel.AbstractChannel$AbstractUnsafe.register0(AbstractChannel.java:450)
>         at io.netty.channel.AbstractChannel$AbstractUnsafe.access$100(AbstractChannel.java:378)
>         at io.netty.channel.AbstractChannel$AbstractUnsafe$1.run(AbstractChannel.java:424)
>         at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:357)
>         at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:357)
>         at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
>         at java.lang.Thread.run(Thread.java:745)
> {quote} 
> This was caused by the first disk in yarn.nodemanager.local-dirs being broken.
> If we enabled spark.yarn.shuffle.stopOnFailure (SPARK-16505) we might lose hundreds of nodes, which is unacceptable.
> We have 12 disks in yarn.nodemanager.local-dirs, so why not use another good disk if the first one is broken?
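
For illustration of the fallback the reporter suggests, here is a rough, hypothetical sketch of probing each entry of yarn.nodemanager.local-dirs instead of always taking the first one. The class name and the ad-hoc health check are assumptions, not Spark's or YARN's actual implementation; a real fix would likely reuse YARN's own disk-health checking.

{code:java}
// Hypothetical sketch: pick the first usable dir from yarn.nodemanager.local-dirs.
import java.io.File;
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;

public class LocalDirSelector {

  private static final String NM_LOCAL_DIRS = "yarn.nodemanager.local-dirs";

  /** Returns the first usable local dir, or throws if every disk is bad. */
  public static File pickGoodLocalDir(Configuration conf) throws IOException {
    for (String dir : conf.getTrimmedStrings(NM_LOCAL_DIRS)) {
      File candidate = new File(dir);
      // Very rough health check: the directory exists (or can be created)
      // and is writable; skip it otherwise and try the next configured disk.
      if ((candidate.isDirectory() || candidate.mkdirs()) && candidate.canWrite()) {
        return candidate;
      }
    }
    throw new IOException("No usable directory found in " + NM_LOCAL_DIRS);
  }
}
{code}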


