Posted to issues@spark.apache.org by "Andrei Stankevich (Jira)" <ji...@apache.org> on 2020/02/04 02:53:00 UTC

[jira] [Created] (SPARK-30720) Spark framework hangs and becomes inactive on Mesos UI if executor can not connect to shuffle external service.

Andrei Stankevich created SPARK-30720:
-----------------------------------------

             Summary: Spark framework hangs and becomes inactive on Mesos UI if executor can not connect to shuffle external service.
                 Key: SPARK-30720
                 URL: https://issues.apache.org/jira/browse/SPARK-30720
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
    Affects Versions: 2.4.3
            Reporter: Andrei Stankevich


We are using Spark 2.4.3 on Mesos with the external shuffle service. The external shuffle service is launched by systemd with the command:

```
exec /*/spark/bin/spark-class org.apache.spark.deploy.mesos.MesosExternalShuffleService
```
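For reference, running the shuffle service under systemd typically looks like the unit sketched below. This is an illustration, not our exact unit file; the install path, memory setting, and restart policy are placeholders:

```
# /etc/systemd/system/spark-shuffle.service (hypothetical paths and values)
[Unit]
Description=Spark Mesos External Shuffle Service
After=network.target

[Service]
Type=simple
# SPARK_DAEMON_MEMORY sizes the JVM heap for the shuffle service.
Environment=SPARK_DAEMON_MEMORY=2g
ExecStart=/opt/spark/bin/spark-class org.apache.spark.deploy.mesos.MesosExternalShuffleService
Restart=on-failure
RestartSec=5

[Install]
WantedBy=multi-user.target
```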

Sometimes a Spark executor hits a connection timeout when it tries to connect to the external shuffle service. When that happens the executor logs:

`ERROR BlockManager: Failed to connect to external shuffle server, will retry 4 more times after waiting 5 seconds...`

If the connection times out on every remaining attempt, the executor fails with:

`ERROR CoarseGrainedExecutorBackend: Executor self-exiting due to : Unable to create executor due to Unable to register with external shuffle server due to : Failed to connect to our-host.com/10.103.*.*:7337`
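The retry count and per-attempt timeout behind these messages are governed by the shuffle-registration settings, so one workaround we considered is raising them on the executors. A hedged sketch (the property names are from the Spark 2.4 configuration docs; the values here are illustrative, not a recommendation):

```
# spark-defaults.conf (illustrative values)
spark.shuffle.service.enabled           true
spark.shuffle.service.port              7337
# Timeout in milliseconds for each registration attempt with the
# external shuffle service (default 5000).
spark.shuffle.registration.timeout      10000
# Number of registration attempts before the executor gives up
# (default 3).
spark.shuffle.registration.maxAttempts  5
```

Raising these only delays the failure, though; it does not address the framework hanging afterwards.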


After this error the Spark application just hangs. On the Mesos UI the framework moves to the inactive frameworks list, and on the Spark driver UI I can see a few failed tasks and no further progress.

The full Spark executor log is:

```

20/01/31 16:27:09 ERROR BlockManager: Failed to connect to external shuffle server, will retry 1 more times after waiting 5 seconds...
java.io.IOException: Failed to connect to our-host.com/10.103.*.*:7337
 at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:245)
 at org.apache.spark.network.client.TransportClientFactory.createUnmanagedClient(TransportClientFactory.java:201)
 at org.apache.spark.network.shuffle.ExternalShuffleClient.registerWithShuffleServer(ExternalShuffleClient.java:142)
 at org.apache.spark.storage.BlockManager.$anonfun$registerWithExternalShuffleServer$3(BlockManager.scala:295)
 at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:158)
 at org.apache.spark.storage.BlockManager.registerWithExternalShuffleServer(BlockManager.scala:291)
 at org.apache.spark.storage.BlockManager.initialize(BlockManager.scala:265)
 at org.apache.spark.executor.Executor.<init>(Executor.scala:118)
 at org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$receive$1.applyOrElse(CoarseGrainedExecutorBackend.scala:83)
 at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:117)
 at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:205)
 at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:102)
 at org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:221)
 at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
 at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
 at java.lang.Thread.run(Thread.java:748)
Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection timed out: our-host.com/10.103.*.*:7337
 at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
 at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
 at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:323)
 at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:340)
 at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:633)
 at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:580)
 at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:497)
 at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:459)
 at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
 at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
 ... 1 more
Caused by: java.net.ConnectException: Connection timed out
 ... 11 more
20/01/31 16:29:25 ERROR CoarseGrainedExecutorBackend: Executor self-exiting due to : Unable to create executor due to Unable to register with external shuffle server due to : Failed to connect to our-host.com/10.103.*.*:7337
org.apache.spark.SparkException: Unable to register with external shuffle server due to : Failed to connect to our-host.com/10.103.*.*:7337
 at org.apache.spark.storage.BlockManager.$anonfun$registerWithExternalShuffleServer$3(BlockManager.scala:304)
 at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:158)
 at org.apache.spark.storage.BlockManager.registerWithExternalShuffleServer(BlockManager.scala:291)
 at org.apache.spark.storage.BlockManager.initialize(BlockManager.scala:265)
 at org.apache.spark.executor.Executor.<init>(Executor.scala:118)
 at org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$receive$1.applyOrElse(CoarseGrainedExecutorBackend.scala:83)
 at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:117)
 at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:205)
 at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:102)
 at org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:221)
 at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
 at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
 at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: Failed to connect to our-host.com/10.103.*.*:7337
 at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:245)
 at org.apache.spark.network.client.TransportClientFactory.createUnmanagedClient(TransportClientFactory.java:201)
 at org.apache.spark.network.shuffle.ExternalShuffleClient.registerWithShuffleServer(ExternalShuffleClient.java:142)
 at org.apache.spark.storage.BlockManager.$anonfun$registerWithExternalShuffleServer$3(BlockManager.scala:295)
 ... 12 more
Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection timed out: our-host.com/10.103.*.*:7337
 at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
 at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
 at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:323)
 at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:340)
 at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:633)
 at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:580)
 at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:497)
 at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:459)
 at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
 at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
 ... 1 more
Caused by: java.net.ConnectException: Connection timed out
 ... 11 more
20/01/31 16:29:25 INFO DiskBlockManager: Shutdown hook called
20/01/31 16:29:25 INFO ShutdownHookManager: Shutdown hook called
I0131 16:29:25.446748 3768 executor.cpp:1039] Command exited with status 1 (pid: 3795)
I0131 16:29:26.447976 3794 process.cpp:935] Stopped the socket accept loop

```

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org