Posted to issues@spark.apache.org by "Cong Feng (JIRA)" <ji...@apache.org> on 2016/06/22 17:14:58 UTC

[jira] [Updated] (SPARK-16146) Spark application failed by Yarn preempting

     [ https://issues.apache.org/jira/browse/SPARK-16146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Cong Feng updated SPARK-16146:
------------------------------
    Description: 
Hi,

We are setting up our Spark cluster on Amazon EC2. We are running Spark on YARN in client mode, using spark-1.6.1-bin-hadoop2.6 (the binary from the official Spark website) and Hadoop 2.7.2. We have also enabled preemption, dynamic allocation, and spark.shuffle.service.enabled.
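
For reference, this kind of setup (dynamic allocation plus the external shuffle service on YARN) is typically configured along the following lines; the values here are illustrative, not our exact cluster settings:

    # spark-defaults.conf (illustrative values, not the exact cluster settings)
    spark.dynamicAllocation.enabled         true
    spark.shuffle.service.enabled           true
    spark.dynamicAllocation.minExecutors    1
    spark.dynamicAllocation.maxExecutors    50

    <!-- yarn-site.xml: run Spark's external shuffle service inside each NodeManager
         (illustrative; required whenever spark.shuffle.service.enabled is true on YARN) -->
    <property>
      <name>yarn.nodemanager.aux-services</name>
      <value>mapreduce_shuffle,spark_shuffle</value>
    </property>
    <property>
      <name>yarn.nodemanager.aux-services.spark_shuffle.class</name>
      <value>org.apache.spark.network.yarn.YarnShuffleService</value>
    </property>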

During our tests we found that our Spark applications frequently get killed when preemption happens. Mostly it looks like the driver is trying to send an RPC to an executor that has already been preempted; there are also some "Connection reset by peer" exceptions that likewise cause jobs to fail. Below are the typical exceptions we found:

16/06/22 08:13:30 ERROR spark.ContextCleaner: Error cleaning RDD 49
java.io.IOException: Failed to send RPC 5721681506291542850 to nodexx.xx.xxxx.ddns.xx.com/xx.xx.xx.xx:42857: java.nio.channels.ClosedChannelException
        at org.apache.spark.network.client.TransportClient$3.operationComplete(TransportClient.java:239)
        at org.apache.spark.network.client.TransportClient$3.operationComplete(TransportClient.java:226)
        at io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:680)
        at io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:567)
        at io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:424)
        at io.netty.channel.AbstractChannel$AbstractUnsafe.safeSetFailure(AbstractChannel.java:801)
        at io.netty.channel.AbstractChannel$AbstractUnsafe.write(AbstractChannel.java:699)
        at io.netty.channel.DefaultChannelPipeline$HeadContext.write(DefaultChannelPipeline.java:1122)
        at io.netty.channel.AbstractChannelHandlerContext.invokeWrite(AbstractChannelHandlerContext.java:633)
        at io.netty.channel.AbstractChannelHandlerContext.access$1900(AbstractChannelHandlerContext.java:32)
        at io.netty.channel.AbstractChannelHandlerContext$AbstractWriteTask.write(AbstractChannelHandlerContext.java:908)
        at io.netty.channel.AbstractChannelHandlerContext$WriteAndFlushTask.write(AbstractChannelHandlerContext.java:960)
        at io.netty.channel.AbstractChannelHandlerContext$AbstractWriteTask.run(AbstractChannelHandlerContext.java:893)
        at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:357)
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:357)
        at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
        at java.lang.Thread.run(Thread.java:745)
Caused by: java.nio.channels.ClosedChannelException

And 

16/06/19 22:33:14 INFO storage.BlockManager: Removing RDD 122
16/06/19 22:33:14 WARN server.TransportChannelHandler: Exception in connection from nodexx-xx-xx.xx.ddns.xx.com/xx.xx.xx.xx:56618
java.io.IOException: Connection reset by peer
        at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
        at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
        at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
        at sun.nio.ch.IOUtil.read(IOUtil.java:192)
        at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:379)
        at io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:313)
        at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:881)
        at io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:242)
        at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:119)
        at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
        at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
        at java.lang.Thread.run(Thread.java:745)
16/06/19 22:33:14 ERROR client.TransportResponseHandler: Still have 2 requests outstanding when connection from nodexx-xx-xx.xxxx.ddns.xx.com/xx.xx.xx.xx:56618 is closed.

It happens with both the capacity scheduler and the fair scheduler. The weird thing is that when we rolled back to Spark 1.4.1, this issue magically disappeared and preemption worked smoothly.
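
For reference, preemption is typically switched on in yarn-site.xml along these lines (illustrative settings, not our exact configuration; only one scheduler is active at a time):

    <!-- Fair scheduler preemption (illustrative) -->
    <property>
      <name>yarn.scheduler.fair.preemption</name>
      <value>true</value>
    </property>

    <!-- Capacity scheduler preemption (illustrative) -->
    <property>
      <name>yarn.resourcemanager.scheduler.monitor.enable</name>
      <value>true</value>
    </property>
    <property>
      <name>yarn.resourcemanager.scheduler.monitor.policies</name>
      <value>org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.ProportionalCapacityPreemptionPolicy</value>
    </property>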

But we still want to deploy with Spark 1.6.1. Is this a bug, or is there something we can fix on our side? Any ideas would be greatly helpful to us.

Thanks

  was:
Hi,

We are setting up our Spark cluster on Amazon EC2. We are running Spark on YARN in client mode, using spark-1.6.1-bin-hadoop2.6 (the binary from the official Spark website) and Hadoop 2.7.2. We have also enabled preemption and dynamic allocation.

During our tests we found that our Spark applications frequently get killed when preemption happens. Mostly it looks like the driver is trying to send an RPC to an executor that has already been preempted; there are also some "Connection reset by peer" exceptions that likewise cause jobs to fail. Below are the typical exceptions we found:

16/06/22 08:13:30 ERROR spark.ContextCleaner: Error cleaning RDD 49
java.io.IOException: Failed to send RPC 5721681506291542850 to nodexx.xx.xxxx.ddns.xx.com/xx.xx.xx.xx:42857: java.nio.channels.ClosedChannelException
        at org.apache.spark.network.client.TransportClient$3.operationComplete(TransportClient.java:239)
        at org.apache.spark.network.client.TransportClient$3.operationComplete(TransportClient.java:226)
        at io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:680)
        at io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:567)
        at io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:424)
        at io.netty.channel.AbstractChannel$AbstractUnsafe.safeSetFailure(AbstractChannel.java:801)
        at io.netty.channel.AbstractChannel$AbstractUnsafe.write(AbstractChannel.java:699)
        at io.netty.channel.DefaultChannelPipeline$HeadContext.write(DefaultChannelPipeline.java:1122)
        at io.netty.channel.AbstractChannelHandlerContext.invokeWrite(AbstractChannelHandlerContext.java:633)
        at io.netty.channel.AbstractChannelHandlerContext.access$1900(AbstractChannelHandlerContext.java:32)
        at io.netty.channel.AbstractChannelHandlerContext$AbstractWriteTask.write(AbstractChannelHandlerContext.java:908)
        at io.netty.channel.AbstractChannelHandlerContext$WriteAndFlushTask.write(AbstractChannelHandlerContext.java:960)
        at io.netty.channel.AbstractChannelHandlerContext$AbstractWriteTask.run(AbstractChannelHandlerContext.java:893)
        at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:357)
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:357)
        at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
        at java.lang.Thread.run(Thread.java:745)
Caused by: java.nio.channels.ClosedChannelException

And 

16/06/19 22:33:14 INFO storage.BlockManager: Removing RDD 122
16/06/19 22:33:14 WARN server.TransportChannelHandler: Exception in connection from nodexx-xx-xx.xx.ddns.xx.com/xx.xx.xx.xx:56618
java.io.IOException: Connection reset by peer
        at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
        at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
        at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
        at sun.nio.ch.IOUtil.read(IOUtil.java:192)
        at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:379)
        at io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:313)
        at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:881)
        at io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:242)
        at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:119)
        at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
        at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
        at java.lang.Thread.run(Thread.java:745)
16/06/19 22:33:14 ERROR client.TransportResponseHandler: Still have 2 requests outstanding when connection from nodexx-xx-xx.xxxx.ddns.xx.com/xx.xx.xx.xx:56618 is closed.

It happens with both the capacity scheduler and the fair scheduler. The weird thing is that when we rolled back to Spark 1.4.1, this issue magically disappeared and preemption worked smoothly.

But we still want to deploy with Spark 1.6.1. Is this a bug, or is there something we can fix on our side? Any ideas would be greatly helpful to us.

Thanks


> Spark application failed by Yarn preempting
> -------------------------------------------
>
>                 Key: SPARK-16146
>                 URL: https://issues.apache.org/jira/browse/SPARK-16146
>             Project: Spark
>          Issue Type: Bug
>    Affects Versions: 1.6.1
>         Environment: Amazon EC2, CentOS 6.6,
> spark-1.6.1-bin-hadoop2.6 (binary from the official Spark website), Hadoop 2.7.2, preemption and dynamic allocation enabled.
>            Reporter: Cong Feng
>            Priority: Critical
>



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org