Posted to issues@spark.apache.org by "zuotingbing (JIRA)" <ji...@apache.org> on 2019/04/29 09:29:00 UTC

[jira] [Comment Edited] (SPARK-23191) Workers registration fails in case of network drop

    [ https://issues.apache.org/jira/browse/SPARK-23191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16829051#comment-16829051 ] 

zuotingbing edited comment on SPARK-23191 at 4/29/19 9:28 AM:
--------------------------------------------------------------

We faced the same issue in standalone HA mode. Could you please take a look at this issue?
{code:java}
2019-03-15 20:22:10,474 INFO Worker: Master has changed, new master is at spark://vmax17:7077 
2019-03-15 20:22:14,862 INFO Worker: Master with url spark://vmax18:7077 requested this worker to reconnect.
2019-03-15 20:22:14,863 INFO Worker: Connecting to master vmax18:7077... 
2019-03-15 20:22:14,863 INFO Worker: Connecting to master vmax17:7077... 
2019-03-15 20:22:14,865 INFO Worker: Master with url spark://vmax18:7077 requested this worker to reconnect.
2019-03-15 20:22:14,865 INFO Worker: Not spawning another attempt to register with the master, since there is an attempt scheduled already. 
2019-03-15 20:22:14,868 INFO Worker: Master with url spark://vmax18:7077 requested this worker to reconnect. 
2019-03-15 20:22:14,868 INFO Worker: Not spawning another attempt to register with the master, since there is an attempt scheduled already. 
2019-03-15 20:22:14,871 INFO Worker: Master with url spark://vmax18:7077 requested this worker to reconnect. 
2019-03-15 20:22:14,871 INFO Worker: Not spawning another attempt to register with the master, since there is an attempt scheduled already. 
2019-03-15 20:22:14,879 ERROR Worker: Worker registration failed: Duplicate worker ID
2019-03-15 20:22:14,891 INFO ExecutorRunner: Killing process! 
2019-03-15 20:22:14,891 INFO ExecutorRunner: Killing process! 
2019-03-15 20:22:14,893 INFO ExecutorRunner: Killing process! 
2019-03-15 20:22:14,893 INFO ExecutorRunner: Killing process! 
2019-03-15 20:22:14,893 INFO ExecutorRunner: Killing process! 
2019-03-15 20:22:14,894 INFO ExecutorRunner: Killing process! 
2019-03-15 20:22:14,894 INFO ExecutorRunner: Killing process! 
2019-03-15 20:22:14,894 INFO ExecutorRunner: Killing process! 
2019-03-15 20:22:14,894 INFO ExecutorRunner: Killing process! 
2019-03-15 20:22:14,894 INFO ExecutorRunner: Killing process! 
2019-03-15 20:22:14,894 INFO ExecutorRunner: Killing process! 
2019-03-15 20:22:14,894 INFO ExecutorRunner: Killing process! 
2019-03-15 20:22:14,894 INFO ExecutorRunner: Killing process! 
2019-03-15 20:22:14,895 INFO ExecutorRunner: Killing process! 
2019-03-15 20:22:14,895 INFO ExecutorRunner: Killing process! 
2019-03-15 20:22:14,895 INFO ExecutorRunner: Killing process! 
2019-03-15 20:22:14,895 INFO ExecutorRunner: Killing process! 
2019-03-15 20:22:14,895 INFO ExecutorRunner: Killing process! 
2019-03-15 20:22:14,896 INFO ShutdownHookManager: Shutdown hook called 
2019-03-15 20:22:14,898 INFO ShutdownHookManager: Deleting directory /data4/zdh/spark/tmp/spark-c578bf32-6a5e-44a5-843b-c796f44648ee 
2019-03-15 20:22:14,908 INFO ShutdownHookManager: Deleting directory /data3/zdh/spark/tmp/spark-7e57e77d-cbb7-47d3-a6dd-737b57788533 
2019-03-15 20:22:14,920 INFO ShutdownHookManager: Deleting directory /data2/zdh/spark/tmp/spark-0beebf20-abbd-4d99-a401-3ef0e88e0b05{code}
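The log above suggests a race during master failover: the worker has already switched to the new master at vmax17, but the old master at vmax18 keeps asking it to reconnect, and the eventual re-registration is rejected with "Duplicate worker ID", presumably because the master still holds the previous registration for the same worker ID. Below is a minimal, self-contained sketch of that kind of rejection (illustrative only, not Spark's actual Master/Worker code; the class, field, and ID names are made up for the example):
{code:scala}
import scala.collection.mutable

// Illustrative model of why a re-registering worker can be refused:
// the master still considers the old registration alive.
object DuplicateWorkerIdSketch {

  sealed trait RegisterResponse
  case object Registered extends RegisterResponse
  final case class RegisterFailed(reason: String) extends RegisterResponse

  class MasterModel {
    private val registeredWorkerIds = mutable.Set.empty[String]

    def register(workerId: String): RegisterResponse =
      if (registeredWorkerIds.contains(workerId)) {
        // The old entry has not been expired yet, so the same ID is rejected.
        RegisterFailed("Duplicate worker ID")
      } else {
        registeredWorkerIds += workerId
        Registered
      }
  }

  def main(args: Array[String]): Unit = {
    val master = new MasterModel
    val workerId = "worker-20190315202210-vmax18-7078" // made-up ID for the sketch

    println(master.register(workerId)) // Registered
    // After a short network drop, the worker re-registers with the same ID
    // before the master has given up on the old registration:
    println(master.register(workerId)) // RegisterFailed(Duplicate worker ID)
  }
}
{code}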
 

[~andrewor14]  [~cloud_fan] [~vanzin]


was (Author: zuo.tingbing9):
We faced the same issue in standalone HA mode. Could you please take a look at this issue?

[~andrewor14]  [~cloud_fan] [~vanzin]

> Workers registration fails in case of network drop
> ---------------------------------------------------
>
>                 Key: SPARK-23191
>                 URL: https://issues.apache.org/jira/browse/SPARK-23191
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 1.6.3, 2.2.1, 2.3.0
>         Environment: OS: CentOS 6.9 (64-bit)
>  
>            Reporter: Neeraj Gupta
>            Priority: Critical
>
> We have a 3-node cluster. In production we were facing issues with multiple drivers running in some scenarios.
> On further investigation we were able to reproduce the scenario in both 1.6.3 and 2.2.1 with the following steps:
>  # Set up a 3-node cluster. Start the master and slaves.
>  # On any node where the worker process is running, block connections on port 7077 using iptables:
> {code:java}
> iptables -A OUTPUT -p tcp --dport 7077 -j DROP
> {code}
>  # After about 10-15 seconds the node reports an error that it is unable to connect to the master:
> {code:java}
> 2018-01-23 12:08:51,639 [rpc-client-1-1] WARN  org.apache.spark.network.server.TransportChannelHandler - Exception in connection from <servername>
> java.io.IOException: Connection timed out
>         at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
>         at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
>         at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
>         at sun.nio.ch.IOUtil.read(IOUtil.java:192)
>         at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
>         at io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:221)
>         at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:899)
>         at io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:275)
>         at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:119)
>         at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:643)
>         at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:566)
>         at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:480)
>         at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:442)
>         at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:131)
>         at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)
>         at java.lang.Thread.run(Thread.java:745)
> 2018-01-23 12:08:51,647 [dispatcher-event-loop-0] ERROR org.apache.spark.deploy.worker.Worker - Connection to master failed! Waiting for master to reconnect...
> 2018-01-23 12:08:51,647 [dispatcher-event-loop-0] ERROR org.apache.spark.deploy.worker.Worker - Connection to master failed! Waiting for master to reconnect...
> {code}
>  # Once we get this exception, we re-enable connections to port 7077 using:
> {code:java}
> iptables -D OUTPUT -p tcp --dport 7077 -j DROP
> {code}
>  # The worker tries to register with the master again but is unable to do so. It gives the following error:
> {code:java}
> 2018-01-23 12:08:58,657 [worker-register-master-threadpool-2] WARN  org.apache.spark.deploy.worker.Worker - Failed to connect to master <servername>:7077
> org.apache.spark.SparkException: Exception thrown in awaitResult:
>         at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:205)
>         at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
>         at org.apache.spark.rpc.RpcEnv.setupEndpointRefByURI(RpcEnv.scala:100)
>         at org.apache.spark.rpc.RpcEnv.setupEndpointRef(RpcEnv.scala:108)
>         at org.apache.spark.deploy.worker.Worker$$anonfun$org$apache$spark$deploy$worker$Worker$$tryRegisterAllMasters$1$$anon$1.run(Worker.scala:241)
>         at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>         at java.lang.Thread.run(Thread.java:745)
> Caused by: java.io.IOException: Failed to connect to <servername>:7077
>         at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:232)
>         at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:182)
>         at org.apache.spark.rpc.netty.NettyRpcEnv.createClient(NettyRpcEnv.scala:197)
>         at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:194)
>         at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:190)
>         ... 4 more
> Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection timed out: <servername>:7077
>         at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
>         at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
>         at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:257)
>         at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:291)
>         at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:631)
>         at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:566)
>         at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:480)
>         at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:442)
>         at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:131)
>         at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)
>         ... 1 more
> 2018-01-23 12:09:03,705 [dispatcher-event-loop-5] ERROR org.apache.spark.deploy.worker.Worker - Worker registration failed: Duplicate worker ID
> 2018-01-23 12:09:03,705 [dispatcher-event-loop-5] ERROR org.apache.spark.deploy.worker.Worker - Worker registration failed: Duplicate worker ID{code}
>  # The worker state changes to DEAD in the Spark UI, and as a result a duplicate driver is launched (sketched below).
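> A minimal sketch of that consequence (an illustrative model only, not Spark's actual scheduling code; the names are invented for the example): once the master marks the worker DEAD it cannot tell whether the driver process on that node is still alive, so a supervised driver gets relaunched elsewhere while the original may keep running after the network drop clears.
> {code:scala}
> // Illustrative model: marking a worker DEAD while its driver process
> // survives a temporary network drop leads to two copies of the driver.
> object DuplicateDriverSketch {
>   final case class RunningDriver(id: String, worker: String)
> 
>   def main(args: Array[String]): Unit = {
>     var running = List(RunningDriver("driver-001", "vmax18"))
> 
>     // The master stops hearing from vmax18, marks it DEAD, and relaunches the
>     // supervised driver on another worker.
>     running = RunningDriver("driver-001 (relaunched)", "vmax17") :: running
> 
>     // The original process on vmax18 was never actually stopped, so the same
>     // application now has two drivers.
>     running.foreach(d => println(s"${d.id} on ${d.worker}"))
>   }
> }
> {code}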



