Posted to issues@spark.apache.org by "Huiqiang Liu (JIRA)" <ji...@apache.org> on 2016/04/18 15:46:26 UTC

[jira] [Created] (SPARK-14699) Driver is marked as failed even if it runs successfully

Huiqiang Liu created SPARK-14699:
------------------------------------

             Summary: Driver is marked as failed even if it runs successfully
                 Key: SPARK-14699
                 URL: https://issues.apache.org/jira/browse/SPARK-14699
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
    Affects Versions: 1.6.1, 1.6.0
         Environment: Standalone deployment
            Reporter: Huiqiang Liu


We recently upgraded Spark from 1.5.2 to 1.6.0 and found that all of our batch jobs were being marked as failed.
To narrow this down, we wrote a simple test application that just sums the integers from 1 to 10000; it too is marked as failed, even though its result is correct.
Below is the typical stderr output; note the "ERROR worker.WorkerWatcher: Lost connection to worker rpc" line as the driver exits.
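A minimal test application along these lines might look like the sketch below. The object name and structure are assumptions inferred from the `SparkBatchTest.scala:19` references in the logs; the actual source is not attached to this report:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Minimal batch job: sum the integers 1..10000 and print the result.
// After sc.stop() returns, the driver should exit cleanly with status 0,
// yet in standalone cluster mode the driver is still recorded as FAILED.
object SparkBatchTest {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("SparkBatchTest")
    val sc = new SparkContext(conf)
    try {
      val total = sc.parallelize(1 to 10000).sum() // 10000 * 10001 / 2 = 50005000
      println(s"sum = $total")
    } finally {
      sc.stop()
    }
  }
}
```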

16/04/14 06:20:41 INFO scheduler.DAGScheduler: ResultStage 1 (sum at SparkBatchTest.scala:19) finished in 0.052 s
16/04/14 06:20:41 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool
16/04/14 06:20:41 INFO scheduler.DAGScheduler: Job 1 finished: sum at SparkBatchTest.scala:19, took 0.061177 s
16/04/14 06:20:41 ERROR worker.WorkerWatcher: Lost connection to worker rpc endpoint spark://Worker@spark-worker-ltv-prod-006.prod.vungle.com:7078. Exiting.
16/04/14 06:20:41 INFO spark.SparkContext: Invoking stop() from shutdown hook
16/04/14 06:20:41 INFO storage.BlockManagerInfo: Removed broadcast_1_piece0 on 172.16.33.187:36442 in memory (size: 1452.0 B, free: 511.1 MB)
16/04/14 06:20:41 INFO storage.BlockManagerInfo: Removed broadcast_1_piece0 on ip-172-16-31-86.ec2.internal:29708 in memory (size: 1452.0 B, free: 511.1 MB)
16/04/14 06:20:41 INFO storage.BlockManagerInfo: Removed broadcast_1_piece0 on ip-172-16-32-207.ec2.internal:21259 in memory (size: 1452.0 B, free: 511.1 MB)
16/04/14 06:20:41 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/metrics/json,null}
16/04/14 06:20:41 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/stage/kill,null}
16/04/14 06:20:41 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/api,null}
16/04/14 06:20:41 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/,null}
16/04/14 06:20:41 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/static,null}
16/04/14 06:20:41 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/executors/threadDump/json,null}
16/04/14 06:20:41 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/executors/threadDump,null}
16/04/14 06:20:41 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/executors/json,null}
16/04/14 06:20:41 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/executors,null}
16/04/14 06:20:41 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/environment/json,null}
16/04/14 06:20:41 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/environment,null}
16/04/14 06:20:41 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/storage/rdd/json,null}
16/04/14 06:20:41 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/storage/rdd,null}
16/04/14 06:20:41 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/storage/json,null}
16/04/14 06:20:41 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/storage,null}
16/04/14 06:20:41 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/pool/json,null}
16/04/14 06:20:41 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/pool,null}
16/04/14 06:20:41 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/stage/json,null}
16/04/14 06:20:41 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/stage,null}
16/04/14 06:20:41 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/json,null}
16/04/14 06:20:41 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages,null}
16/04/14 06:20:41 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/jobs/job/json,null}
16/04/14 06:20:41 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/jobs/job,null}
16/04/14 06:20:41 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/jobs/json,null}
16/04/14 06:20:41 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/jobs,null}
16/04/14 06:20:41 INFO spark.ContextCleaner: Cleaned accumulator 2
16/04/14 06:20:41 INFO storage.BlockManagerInfo: Removed broadcast_0_piece0 on 172.16.33.187:36442 in memory (size: 804.0 B, free: 511.1 MB)
16/04/14 06:20:41 INFO storage.BlockManagerInfo: Removed broadcast_0_piece0 on ip-172-16-31-86.ec2.internal:29708 in memory (size: 804.0 B, free: 511.1 MB)
16/04/14 06:20:41 INFO storage.BlockManagerInfo: Removed broadcast_0_piece0 on ip-172-16-32-207.ec2.internal:21259 in memory (size: 804.0 B, free: 511.1 MB)
16/04/14 06:20:41 INFO spark.ContextCleaner: Cleaned accumulator 1
16/04/14 06:20:41 INFO ui.SparkUI: Stopped Spark web UI at http://172.16.31.56:4040
16/04/14 06:20:41 INFO cluster.SparkDeploySchedulerBackend: Shutting down all executors
16/04/14 06:20:41 INFO cluster.SparkDeploySchedulerBackend: Asking each executor to shut down
16/04/14 06:20:41 INFO spark.MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
16/04/14 06:20:41 INFO storage.MemoryStore: MemoryStore cleared
16/04/14 06:20:41 INFO storage.BlockManager: BlockManager stopped
16/04/14 06:20:41 INFO storage.BlockManagerMaster: BlockManagerMaster stopped
16/04/14 06:20:41 INFO scheduler.OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
16/04/14 06:20:41 INFO spark.SparkContext: Successfully stopped SparkContext
16/04/14 06:20:41 INFO util.ShutdownHookManager: Shutdown hook called
16/04/14 06:20:41 INFO util.ShutdownHookManager: Deleting directory /tmp/spark-a4c7f9ac-bf40-4924-b977-ec0be4c5fe93

Since this looked RPC-related, we retested with Akka, which was the default RPC implementation before 1.6.0, by setting "spark.rpc=akka".
The error message then disappears, replaced by a warning complaining about the same disassociation from the worker.
It seems that with Akka the driver waits another 5 seconds, and that lets it exit gracefully.
16/04/12 15:41:28 INFO DAGScheduler: ResultStage 1 (sum at SparkBatchTest.scala:19) finished in 0.053 s
16/04/12 15:41:28 INFO DAGScheduler: Job 1 finished: sum at SparkBatchTest.scala:19, took 0.060660 s
16/04/12 15:41:28 INFO SparkContext: Invoking stop() from shutdown hook
16/04/12 15:41:28 INFO RemoteActorRefProvider$RemotingTerminator: Shutting down remote daemon.
16/04/12 15:41:28 INFO RemoteActorRefProvider$RemotingTerminator: Remote daemon shut down; proceeding with flushing remote transports.
16/04/12 15:41:28 INFO RemoteActorRefProvider$RemotingTerminator: Remoting shut down.
16/04/12 15:41:28 INFO SparkUI: Stopped Spark web UI at http://sparkbox:8830
16/04/12 15:41:28 INFO SparkDeploySchedulerBackend: Shutting down all executors
16/04/12 15:41:28 INFO SparkDeploySchedulerBackend: Asking each executor to shut down
16/04/12 15:41:28 WARN ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkExecutor@sparkbox:58832] has failed, address is now gated for [5000] ms. Reason: [Disassociated]
16/04/12 15:41:28 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
16/04/12 15:41:28 INFO MemoryStore: MemoryStore cleared
16/04/12 15:41:28 INFO BlockManager: BlockManager stopped
16/04/12 15:41:28 INFO BlockManagerMaster: BlockManagerMaster stopped
16/04/12 15:41:28 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
16/04/12 15:41:28 INFO RemoteActorRefProvider$RemotingTerminator: Shutting down remote daemon.
16/04/12 15:41:28 INFO RemoteActorRefProvider$RemotingTerminator: Remote daemon shut down; proceeding with flushing remote transports.
16/04/12 15:41:28 INFO RemoteActorRefProvider$RemotingTerminator: Remoting shut down.
16/04/12 15:41:28 INFO SparkContext: Successfully stopped SparkContext
16/04/12 15:41:28 INFO ShutdownHookManager: Shutdown hook called
16/04/12 15:41:28 INFO ShutdownHookManager: Deleting directory /tmp/spark-138182c2-f1bc-4ea1-b5b8-2adceb27c083
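The Akka fallback used in the run above can be set via `--conf spark.rpc=akka` on submission, or programmatically; a minimal sketch, assuming Spark 1.6.x (`spark.rpc` is an internal setting, and Akka-based RPC was removed entirely in Spark 2.0, so this workaround does not carry forward):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Workaround sketch for Spark 1.6.x only: fall back to the pre-1.6
// Akka RPC implementation, under which the driver exits gracefully.
val conf = new SparkConf()
  .setAppName("SparkBatchTest")
  .set("spark.rpc", "akka") // default in 1.6 is "netty"
val sc = new SparkContext(conf)
```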

This only occurs with standalone deployment, since DriverWrapper is reportedly used only in that cluster mode.
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/worker/DriverWrapper.scala#L43

In particular this breaks the '--supervise' option for batch jobs: every run is considered failed, so the driver is restarted again and again.
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/worker/DriverRunner.scala#L214



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org