Posted to issues@spark.apache.org by "Cong Feng (JIRA)" <ji...@apache.org> on 2016/06/22 17:09:03 UTC

[jira] [Created] (SPARK-16146) Spark application failed by Yarn preempting

Cong Feng created SPARK-16146:
---------------------------------

             Summary: Spark application failed by Yarn preempting
                 Key: SPARK-16146
                 URL: https://issues.apache.org/jira/browse/SPARK-16146
             Project: Spark
          Issue Type: Bug
    Affects Versions: 1.6.1
         Environment: Amazon EC2, centos 6.6,
Spark-1.6.1-bin-hadoop-2.6(binary from spark official web), Hadoop 2.7.2, preemption and dynamic allocation enabled.
            Reporter: Cong Feng
            Priority: Critical


Hi,

We are setting up our Spark cluster on Amazon EC2. We are running Spark in YARN client mode, using Spark-1.6.1-bin-hadoop-2.6 (the binary from the official Spark website) and Hadoop 2.7.2. We also have preemption and dynamic allocation enabled.
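
For reference, the Spark-side settings involved are roughly the standard dynamic allocation properties for YARN (the executor counts below are illustrative values, not our exact configuration), set in spark-defaults.conf:

    spark.dynamicAllocation.enabled        true
    spark.dynamicAllocation.minExecutors   1
    spark.dynamicAllocation.maxExecutors   50
    spark.shuffle.service.enabled          true

Dynamic allocation requires the external shuffle service running in each NodeManager, and the application is submitted with spark-submit --master yarn --deploy-mode client.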

During our tests we found that our Spark applications frequently get killed when preemption happens. Mostly it appears that the driver is trying to send an RPC to an executor that has already been preempted; there are also some "connection reset by peer" exceptions that likewise cause the job to fail. Below are the typical exceptions we found:

16/06/22 08:13:30 ERROR spark.ContextCleaner: Error cleaning RDD 49
java.io.IOException: Failed to send RPC 5721681506291542850 to nodexx.xx.xxxx.ddns.xx.com/xx.xx.xx.xx:42857: java.nio.channels.ClosedChannelException
        at org.apache.spark.network.client.TransportClient$3.operationComplete(TransportClient.java:239)
        at org.apache.spark.network.client.TransportClient$3.operationComplete(TransportClient.java:226)
        at io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:680)
        at io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:567)
        at io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:424)
        at io.netty.channel.AbstractChannel$AbstractUnsafe.safeSetFailure(AbstractChannel.java:801)
        at io.netty.channel.AbstractChannel$AbstractUnsafe.write(AbstractChannel.java:699)
        at io.netty.channel.DefaultChannelPipeline$HeadContext.write(DefaultChannelPipeline.java:1122)
        at io.netty.channel.AbstractChannelHandlerContext.invokeWrite(AbstractChannelHandlerContext.java:633)
        at io.netty.channel.AbstractChannelHandlerContext.access$1900(AbstractChannelHandlerContext.java:32)
        at io.netty.channel.AbstractChannelHandlerContext$AbstractWriteTask.write(AbstractChannelHandlerContext.java:908)
        at io.netty.channel.AbstractChannelHandlerContext$WriteAndFlushTask.write(AbstractChannelHandlerContext.java:960)
        at io.netty.channel.AbstractChannelHandlerContext$AbstractWriteTask.run(AbstractChannelHandlerContext.java:893)
        at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:357)
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:357)
        at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
        at java.lang.Thread.run(Thread.java:745)
Caused by: java.nio.channels.ClosedChannelException

And 

16/06/19 22:33:14 INFO storage.BlockManager: Removing RDD 122
16/06/19 22:33:14 WARN server.TransportChannelHandler: Exception in connection from nodexx-xx-xx.xx.ddns.xx.com/xx.xx.xx.xx:56618
java.io.IOException: Connection reset by peer
        at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
        at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
        at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
        at sun.nio.ch.IOUtil.read(IOUtil.java:192)
        at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:379)
        at io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:313)
        at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:881)
        at io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:242)
        at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:119)
        at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
        at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
        at java.lang.Thread.run(Thread.java:745)
16/06/19 22:33:14 ERROR client.TransportResponseHandler: Still have 2 requests outstanding when connection from nodexx-xx-xx.xxxx.ddns.xx.com/xx.xx.xx.xx:56618 is closed.

It happens with both the capacity scheduler and the fair scheduler. The weird thing is that when we rolled back to Spark 1.4.1, this issue disappeared and preemption worked smoothly.
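
For completeness, the preemption setup on the YARN side is just the standard configuration; a rough sketch of the corresponding yarn-site.xml entries, shown here as name = value pairs for brevity (illustrative, not copied from our cluster):

    # fair scheduler
    yarn.resourcemanager.scheduler.class = org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler
    yarn.scheduler.fair.preemption = true

    # capacity scheduler
    yarn.resourcemanager.scheduler.class = org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler
    yarn.resourcemanager.scheduler.monitor.enable = true
    yarn.resourcemanager.scheduler.monitor.policies = org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.ProportionalCapacityPreemptionPolicy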

But we still want to deploy with Spark 1.6.1. Is this a bug, or something we can fix on our side? Any ideas would be greatly appreciated.

Thanks



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
