Posted to issues@spark.apache.org by "Igor Kamyshnikov (JIRA)" <ji...@apache.org> on 2019/06/06 17:01:00 UTC

[jira] [Created] (SPARK-27967) Fault tolerance broken: Race conditions: a supervised Driver is not relaunched and completely removed sometimes under Standalone cluster when Worker gracefully shuts down

Igor Kamyshnikov created SPARK-27967:
----------------------------------------

             Summary: Fault tolerance broken: Race conditions: a supervised Driver is not relaunched and completely removed sometimes under Standalone cluster when Worker gracefully shuts down
                 Key: SPARK-27967
                 URL: https://issues.apache.org/jira/browse/SPARK-27967
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
    Affects Versions: 2.4.2, 2.1.2
            Reporter: Igor Kamyshnikov


Synthetic test:
 1) run ZK
 2) run Master
 3) run Worker with a remote debugging agent (required to set a breakpoint that demonstrates the race condition)
 4) submit a long-running Driver with the --supervise flag (a minimal driver example is sketched after this list)
 5) connect to Worker via remote debugger
 6) enable a breakpoint in the method:
 org.apache.spark.deploy.worker.DriverRunner#kill
{code:java}
  /** Terminate this driver (or prevent it from ever starting if not yet started) */
  private[worker] def kill(): Unit = {
    logInfo("Killing driver process!")
    killed = true
    synchronized {
      process.foreach { p =>
        val exitCode = Utils.terminateProcess(p, DRIVER_TERMINATE_TIMEOUT_MS)
        if (exitCode.isEmpty) { //<<<<<<<<<<<<< BREAKPOINT <<<<<<<<<<<<<<
          logWarning("Failed to terminate driver process: " + p +
              ". This process will likely be orphaned.")
        }
      }
    }
  }
{code}
7) send SIGTERM to the Worker (or CTRL+C on Windows)
 8) check the Spark Master Web UI: the Driver appears in the *Completed Drivers* section with the state *KILLED*
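
For step 4, any application that simply never exits will do; a minimal, hypothetical example (class name and sleep interval are placeholders, not from the Spark sources):
{code:java}
// Minimal, hypothetical long-running driver for step 4; any application
// that stays alive long enough to observe the race works just as well.
object LongRunningDriver {
  def main(args: Array[String]): Unit = {
    while (true) {
      Thread.sleep(60000) // keep the driver process alive indefinitely
    }
  }
}
{code}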

Without the breakpoint, it is more likely that a new row with the *RELAUNCHING* state would appear in the *Completed Drivers* section and a row with the *SUBMITTED* state would remain in the *Running Drivers* section.

Explanation:
 1) The Spark Master relaunches a driver in response to the "channelInactive" callback (org.apache.spark.rpc.netty.NettyRpcHandler#channelInactive), which is triggered when the Worker process terminates.
 2) DriverRunner registers a shutdown hook in org.apache.spark.deploy.worker.DriverRunner#start that calls the aforementioned "kill" method. Killing a driver can lead to reaching the following lines in the DriverRunner.start method:
{noformat}
        // notify worker of final driver state, possible exception
        worker.send(DriverStateChanged(driverId, finalState.get, finalException))
{noformat}
If this notification reaches the Master, the Driver is removed from the cluster as KILLED.
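
For context, the two competing paths on the Master side can be sketched roughly as follows (a simplified paraphrase of org.apache.spark.deploy.master.Master from the 2.x sources, not a verbatim excerpt; signatures and surrounding code are omitted). Whichever message reaches the Master first decides the driver's fate:
{code:java}
// Path 1: the Worker's channel goes inactive (channelInactive) ->
// the Master re-launches every supervised driver that ran on that Worker.
for (driver <- worker.drivers.values) {
  if (driver.desc.supervise) {
    relaunchDriver(driver)                            // shows up as RELAUNCHING / SUBMITTED
  } else {
    removeDriver(driver.id, DriverState.ERROR, None)  // unsupervised drivers are dropped
  }
}

// Path 2: the dying Worker's shutdown hook still manages to deliver
// DriverStateChanged(driverId, KILLED, _) -> the driver is removed outright,
// supervised or not, and is never re-launched.
case DriverStateChanged(driverId, state, exception) =>
  state match {
    case DriverState.ERROR | DriverState.FINISHED | DriverState.KILLED | DriverState.FAILED =>
      removeDriver(driverId, state, exception)
    case _ => // unexpected state
  }
{code}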

Real-world scenario (ver. 2.1.2):
 ZK, two Masters: the active one loses its leadership and another becomes the new leader.
 The Workers attempt to re-register with the new master, but they report that they failed to do so; they execute *System.exit(1)* from org.apache.spark.deploy.worker.Worker#registerWithMaster.
 This System.exit results in the shutdown hooks being executed, and somehow the DriverStateChanged message reaches the new master.
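
Note that System.exit always runs the registered JVM shutdown hooks before the process dies, which is why DriverRunner's kill hook (and the subsequent DriverStateChanged notification) can still fire during this forced exit. A minimal, standalone illustration (not Spark code; all names here are made up):
{code:java}
object ShutdownHookDemo {
  def main(args: Array[String]): Unit = {
    Runtime.getRuntime.addShutdownHook(new Thread(new Runnable {
      override def run(): Unit = {
        // In the Worker, this is the point where DriverRunner.kill() runs; the
        // runner thread can then still send DriverStateChanged(KILLED) to the Master.
        println("shutdown hook executed before the JVM exits")
      }
    }))
    System.exit(1) // runs the hook above as part of JVM shutdown
  }
}
{code}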

 

*Worker* logs:
{noformat}
19/06/03 14:05:30 INFO Worker: Retrying connection to master (attempt # 5)
19/06/03 14:05:30 INFO Worker: Connecting to master 10.0.0.16:7077...
19/06/03 14:05:33 INFO Worker: Master has changed, new master is at spark://10.0.0.17:7077
19/06/03 14:05:33 ERROR TransportResponseHandler: Still have 4 requests outstanding when connection from /10.0.0.16:7077 is closed
19/06/03 14:05:33 ERROR Worker: Cannot register with master: 10.0.0.16:7077
java.io.IOException: Connection from /10.0.0.16:7077 closed
	at org.apache.spark.network.client.TransportResponseHandler.channelInactive(TransportResponseHandler.java:128)
	at org.apache.spark.network.server.TransportChannelHandler.channelInactive(TransportChannelHandler.java:108)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:241)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:227)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:220)
	at io.netty.channel.ChannelInboundHandlerAdapter.channelInactive(ChannelInboundHandlerAdapter.java:75)
	at io.netty.handler.timeout.IdleStateHandler.channelInactive(IdleStateHandler.java:278)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:241)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:227)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:220)
	at io.netty.channel.ChannelInboundHandlerAdapter.channelInactive(ChannelInboundHandlerAdapter.java:75)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:241)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:227)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:220)
	at io.netty.channel.ChannelInboundHandlerAdapter.channelInactive(ChannelInboundHandlerAdapter.java:75)
	at org.apache.spark.network.util.TransportFrameDecoder.channelInactive(TransportFrameDecoder.java:182)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:241)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:227)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:220)
	at io.netty.channel.DefaultChannelPipeline$HeadContext.channelInactive(DefaultChannelPipeline.java:1289)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:241)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:227)
	at io.netty.channel.DefaultChannelPipeline.fireChannelInactive(DefaultChannelPipeline.java:893)
	at io.netty.channel.AbstractChannel$AbstractUnsafe$7.run(AbstractChannel.java:691)
	at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:399)
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:446)
	at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:131)
	at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)
	at java.lang.Thread.run(Thread.java:748)
19/06/03 14:05:33 INFO DriverRunner: Worker shutting down, killing driver driver-20190508103758-0002
19/06/03 14:05:33 INFO DriverRunner: Killing driver process!
...
{noformat}
*Master* logs:
{noformat}
19/06/03 14:05:33 INFO Master: I have been elected leader! New state: RECOVERING
19/06/03 14:05:33 INFO Master: Trying to recover app: app-20190524081659-0002
19/06/03 14:05:33 INFO Master: Trying to recover app: app-20190524081656-0001
19/06/03 14:05:33 INFO Master: Trying to recover app: app-20190522170759-0002
19/06/03 14:05:33 INFO Master: Trying to recover worker: worker-20190522164806-10.0.0.16-34107
19/06/03 14:05:33 INFO Master: Trying to recover worker: worker-20190522164804-10.0.0.17-35662
19/06/03 14:05:33 INFO TransportClientFactory: Successfully created connection to /10.4.148.16:37487 after 10 ms (0 ms spent in bootstraps)
19/06/03 14:05:33 INFO TransportClientFactory: Successfully created connection to /10.4.148.17:34591 after 10 ms (0 ms spent in bootstraps)
19/06/03 14:05:33 INFO TransportClientFactory: Successfully created connection to /10.4.148.17:45169 after 11 ms (0 ms spent in bootstraps)
19/06/03 14:05:33 INFO TransportClientFactory: Successfully created connection to /10.4.148.16:34107 after 2 ms (0 ms spent in bootstraps)
19/06/03 14:05:33 INFO TransportClientFactory: Successfully created connection to /10.4.148.17:35662 after 4 ms (0 ms spent in bootstraps)
19/06/03 14:05:33 INFO Master: Application has been re-registered: app-20190524081656-0001
19/06/03 14:05:33 INFO Master: Application has been re-registered: app-20190524081659-0002
19/06/03 14:05:33 INFO Master: Application has been re-registered: app-20190522170759-0002
19/06/03 14:05:33 INFO Master: Worker has been re-registered: worker-20190522164806-10.0.0.16-34107
19/06/03 14:05:33 INFO Master: Worker has been re-registered: worker-20190522164804-10.0.0.17-35662
19/06/03 14:05:33 INFO Master: Recovery complete - resuming operations!
19/06/03 14:05:34 INFO Master: Received unregister request from application app-20190524081659-0002
19/06/03 14:05:34 INFO Master: Removing app app-20190524081659-0002
19/06/03 14:05:35 INFO Master: 10.0.0.16:37487 got disassociated, removing it.
19/06/03 14:05:35 INFO Master: 10.0.0.16:37487 got disassociated, removing it.
19/06/03 14:05:38 INFO Master: Removing driver: driver-20190508103758-0002
19/06/03 14:05:39 WARN Master: Got status update for unknown executor app-20190524081659-0002/0
19/06/03 14:05:39 WARN Master: Got status update for unknown executor app-20190524081659-0002/1
19/06/03 14:05:43 INFO Master: 10.0.0.17:34591 got disassociated, removing it.
19/06/03 14:05:43 INFO Master: Removing app app-20190522170759-0002
19/06/03 14:05:43 INFO Master: 10.0.0.17:34591 got disassociated, removing it.
19/06/03 14:05:45 INFO Master: Removing driver: driver-20190522170754-0002
{noformat}


