Posted to issues@spark.apache.org by "Igor Kamyshnikov (JIRA)" <ji...@apache.org> on 2019/06/06 17:06:00 UTC

[jira] [Updated] (SPARK-27967) Fault tolerance broken: Race conditions: a supervised Driver is not relaunched and completely removed sometimes under Standalone cluster when Worker gracefully shuts down

     [ https://issues.apache.org/jira/browse/SPARK-27967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Igor Kamyshnikov updated SPARK-27967:
-------------------------------------
    Description: 
Synthetic test:
 1) run ZK
 2) run Master
 3) run Worker with a remote debugging agent (required for setting a breakpoint that demonstrates the race condition)
 4) submit a long-running Driver with the --supervise flag
 5) connect to the Worker via a remote debugger
 6) set a breakpoint in the method:
 org.apache.spark.deploy.worker.DriverRunner#kill
{code:java}
  /** Terminate this driver (or prevent it from ever starting if not yet started) */
  private[worker] def kill(): Unit = {
    logInfo("Killing driver process!")
    killed = true
    synchronized {
      process.foreach { p =>
        val exitCode = Utils.terminateProcess(p, DRIVER_TERMINATE_TIMEOUT_MS)
        if (exitCode.isEmpty) { //<<<<<<<<<<<<< BREAKPOINT <<<<<<<<<<<<<<
          logWarning("Failed to terminate driver process: " + p +
              ". This process will likely be orphaned.")
        }
      }
    }
  }
{code}
7) send SIGTERM to the Worker (or Ctrl+C on Windows)
 8) check the Spark Master Web UI: the Driver appears in the *Completed Drivers* section with the *KILLED* state

Without the breakpoint, it is more likely that a new row with the *RELAUNCHING* state would appear in the *Completed Drivers* section and a row with the *SUBMITTED* state would remain in the *Running Drivers* section.

Explanation:
 1) The Spark Master relaunches a driver in response to the "channelInactive" callback (org.apache.spark.rpc.netty.NettyRpcHandler#channelInactive),
 which is triggered when the Worker process terminates.
 2) DriverRunner registers a shutdown hook in org.apache.spark.deploy.worker.DriverRunner#start that calls the aforementioned "kill" method. Killing the driver can lead to reaching the following lines in the DriverRunner.start method:
{noformat}
        // notify worker of final driver state, possible exception
        worker.send(DriverStateChanged(driverId, finalState.get, finalException))
{noformat}
If this notification reaches the Master, the Driver is removed from the cluster as KILLED and is never relaunched.
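To make the race explicit, here is a minimal toy model of the Master-side handling. It is only a sketch: MasterEvent, ToyMaster and the handler logic are invented for illustration and do not correspond to the actual Master.scala code; they only mirror the two competing events described above.
{code:scala}
// Toy model of the race on the Master side (illustrative names only, not Master.scala).
// The Master handles events one at a time; for a supervised driver whose Worker is
// shutting down, the outcome depends on which of the two events it processes first.

sealed trait MasterEvent
case class WorkerConnectionLost(workerId: String) extends MasterEvent  // ~ channelInactive
case class DriverKilledByWorker(driverId: String) extends MasterEvent  // ~ DriverStateChanged(KILLED)

class ToyMaster {
  // driverId -> the workerId it is running on
  private var runningDrivers = Map("driver-20190508103758-0002" -> "worker-A")
  // driverId -> final state shown in the "Completed Drivers" section
  private var completedDrivers = Map.empty[String, String]

  def handle(event: MasterEvent): Unit = event match {
    case WorkerConnectionLost(workerId) =>
      // Fault tolerance working: supervised drivers of the lost Worker are relaunched.
      runningDrivers.collect { case (id, w) if w == workerId => id }.foreach { id =>
        completedDrivers += id -> "RELAUNCHING"
        runningDrivers += id -> "waiting-for-new-worker"
      }
    case DriverKilledByWorker(driverId) =>
      // Fault tolerance broken: a KILLED final state removes the driver for good.
      runningDrivers -= driverId
      completedDrivers += driverId -> "KILLED"
  }
}
{code}
If the DriverKilledByWorker event is processed before WorkerConnectionLost, the supervised driver ends up only in *Completed Drivers* as *KILLED* and is never relaunched, which is exactly what the breakpoint makes reproducible.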

Real-world scenario (ver. 2.1.2):
 ZK, two Masters; the active one loses its leadership and another becomes the new leader.
 The Workers attempt to re-register with the new Master, but they report that they failed to do so and execute *System.exit(1)* from org.apache.spark.deploy.worker.Worker#registerWithMaster.
 This System.exit results in the shutdown hooks being executed, and somehow the DriverStateChanged message reaches the new Master.
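The ordering that lets the message escape during shutdown can be sketched as follows. This is a toy, runnable model of the Worker-side sequence under the assumption that the RPC connection is still open when the hooks run; sendDriverStateChanged merely stands in for worker.send(DriverStateChanged(...)) and is not a real Spark API.
{code:scala}
// Toy model: System.exit(1) triggers the JVM shutdown hooks before the process dies,
// so a hook that kills the driver and reports its final state can still push a
// KILLED notification over the (typically still open) connection to the new Master.

object WorkerShutdownOrdering extends App {

  // Stand-in for worker.send(DriverStateChanged(driverId, KILLED, None)).
  def sendDriverStateChanged(driverId: String, state: String): Unit =
    println(s"-> Master: DriverStateChanged($driverId, $state)")

  // Equivalent of the hook registered in DriverRunner#start.
  sys.addShutdownHook {
    println("Worker shutting down, killing driver driver-20190508103758-0002")
    sendDriverStateChanged("driver-20190508103758-0002", "KILLED")
  }

  // Equivalent of Worker#registerWithMaster giving up on re-registration.
  println("Cannot register with master, exiting")
  sys.exit(1) // the shutdown hook above runs here, before the JVM goes away
}
{code}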

 

*Worker* logs:
{noformat}
19/06/03 14:05:30 INFO Worker: Retrying connection to master (attempt # 5)
19/06/03 14:05:30 INFO Worker: Connecting to master 10.0.0.16:7077...
19/06/03 14:05:33 INFO Worker: Master has changed, new master is at spark://10.0.0.17:7077
19/06/03 14:05:33 ERROR TransportResponseHandler: Still have 4 requests outstanding when connection from /10.0.0.16:7077 is closed
19/06/03 14:05:33 ERROR Worker: Cannot register with master: 10.0.0.16:7077
java.io.IOException: Connection from /10.0.0.16:7077 closed
	at org.apache.spark.network.client.TransportResponseHandler.channelInactive(TransportResponseHandler.java:128)
	at org.apache.spark.network.server.TransportChannelHandler.channelInactive(TransportChannelHandler.java:108)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:241)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:227)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:220)
	at io.netty.channel.ChannelInboundHandlerAdapter.channelInactive(ChannelInboundHandlerAdapter.java:75)
	at io.netty.handler.timeout.IdleStateHandler.channelInactive(IdleStateHandler.java:278)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:241)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:227)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:220)
	at io.netty.channel.ChannelInboundHandlerAdapter.channelInactive(ChannelInboundHandlerAdapter.java:75)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:241)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:227)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:220)
	at io.netty.channel.ChannelInboundHandlerAdapter.channelInactive(ChannelInboundHandlerAdapter.java:75)
	at org.apache.spark.network.util.TransportFrameDecoder.channelInactive(TransportFrameDecoder.java:182)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:241)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:227)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:220)
	at io.netty.channel.DefaultChannelPipeline$HeadContext.channelInactive(DefaultChannelPipeline.java:1289)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:241)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:227)
	at io.netty.channel.DefaultChannelPipeline.fireChannelInactive(DefaultChannelPipeline.java:893)
	at io.netty.channel.AbstractChannel$AbstractUnsafe$7.run(AbstractChannel.java:691)
	at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:399)
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:446)
	at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:131)
	at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)
	at java.lang.Thread.run(Thread.java:748)
19/06/03 14:05:33 INFO DriverRunner: Worker shutting down, killing driver driver-20190508103758-0002
19/06/03 14:05:33 INFO DriverRunner: Killing driver process!
...
{noformat}
*Master* logs:
{noformat}
19/06/03 14:05:33 INFO Master: I have been elected leader! New state: RECOVERING
19/06/03 14:05:33 INFO Master: Trying to recover app: app-20190524081659-0002
19/06/03 14:05:33 INFO Master: Trying to recover app: app-20190524081656-0001
19/06/03 14:05:33 INFO Master: Trying to recover app: app-20190522170759-0002
19/06/03 14:05:33 INFO Master: Trying to recover worker: worker-20190522164806-10.0.0.16-34107
19/06/03 14:05:33 INFO Master: Trying to recover worker: worker-20190522164804-10.0.0.17-35662
19/06/03 14:05:33 INFO TransportClientFactory: Successfully created connection to /10.4.148.16:37487 after 10 ms (0 ms spent in bootstraps)
19/06/03 14:05:33 INFO TransportClientFactory: Successfully created connection to /10.4.148.17:34591 after 10 ms (0 ms spent in bootstraps)
19/06/03 14:05:33 INFO TransportClientFactory: Successfully created connection to /10.4.148.17:45169 after 11 ms (0 ms spent in bootstraps)
19/06/03 14:05:33 INFO TransportClientFactory: Successfully created connection to /10.4.148.16:34107 after 2 ms (0 ms spent in bootstraps)
19/06/03 14:05:33 INFO TransportClientFactory: Successfully created connection to /10.4.148.17:35662 after 4 ms (0 ms spent in bootstraps)
19/06/03 14:05:33 INFO Master: Application has been re-registered: app-20190524081656-0001
19/06/03 14:05:33 INFO Master: Application has been re-registered: app-20190524081659-0002
19/06/03 14:05:33 INFO Master: Application has been re-registered: app-20190522170759-0002
19/06/03 14:05:33 INFO Master: Worker has been re-registered: worker-20190522164806-10.0.0.16-34107
19/06/03 14:05:33 INFO Master: Worker has been re-registered: worker-20190522164804-10.0.0.17-35662
19/06/03 14:05:33 INFO Master: Recovery complete - resuming operations!
19/06/03 14:05:34 INFO Master: Received unregister request from application app-20190524081659-0002
19/06/03 14:05:34 INFO Master: Removing app app-20190524081659-0002
19/06/03 14:05:35 INFO Master: 10.0.0.16:37487 got disassociated, removing it.
19/06/03 14:05:35 INFO Master: 10.0.0.16:37487 got disassociated, removing it.
19/06/03 14:05:38 INFO Master: Removing driver: driver-20190508103758-0002
19/06/03 14:05:39 WARN Master: Got status update for unknown executor app-20190524081659-0002/0
19/06/03 14:05:39 WARN Master: Got status update for unknown executor app-20190524081659-0002/1
19/06/03 14:05:43 INFO Master: 10.0.0.17:34591 got disassociated, removing it.
19/06/03 14:05:43 INFO Master: Removing app app-20190522170759-0002
19/06/03 14:05:43 INFO Master: 10.0.0.17:34591 got disassociated, removing it.
19/06/03 14:05:45 INFO Master: Removing driver: driver-20190522170754-0002
{noformat}

> Fault tolerance broken: Race conditions: a supervised Driver is not relaunched and completely removed sometimes under Standalone cluster when Worker gracefully shuts down
> --------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-27967
>                 URL: https://issues.apache.org/jira/browse/SPARK-27967
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.1.2, 2.4.2
>            Reporter: Igor Kamyshnikov
>            Priority: Major
>



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org