Posted to issues@spark.apache.org by "Marcelo Vanzin (JIRA)" <ji...@apache.org> on 2019/07/23 21:37:00 UTC

[jira] [Created] (SPARK-28488) Race in k8s scheduler shutdown can lead to misleading exceptions.

Marcelo Vanzin created SPARK-28488:
--------------------------------------

             Summary: Race in k8s scheduler shutdown can lead to misleading exceptions.
                 Key: SPARK-28488
                 URL: https://issues.apache.org/jira/browse/SPARK-28488
             Project: Spark
          Issue Type: Bug
          Components: Kubernetes
    Affects Versions: 3.0.0
            Reporter: Marcelo Vanzin


There's a race when shutting down the k8s scheduler backend that may cause ugly exceptions to show up in the logs:

{noformat}
19/07/22 14:43:46 ERROR Utils: Uncaught exception in thread kubernetes-executor-snapshots-subscribers-0
org.apache.spark.SparkException: Could not find CoarseGrainedScheduler.
        at org.apache.spark.rpc.netty.Dispatcher.postMessage(Dispatcher.scala:162)
        at org.apache.spark.rpc.netty.Dispatcher.postOneWayMessage(Dispatcher.scala:143)
        at org.apache.spark.rpc.netty.NettyRpcEnv.send(NettyRpcEnv.scala:193)
        at org.apache.spark.rpc.netty.NettyRpcEndpointRef.send(NettyRpcEnv.scala:537)
        at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.removeExecutor(CoarseGrainedSchedulerBackend.scala:509)
        at org.apache.spark.scheduler.cluster.k8s.KubernetesClusterSchedulerBackend.doRemoveExecutor(KubernetesClusterSchedulerBackend.scala:63)
        at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsLifecycleManager.removeExecutorFromSpark(ExecutorPodsLifecycleManager.scala:143)
        at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsLifecycleManager.$anonfun$onNewSnapshots$2(ExecutorPodsLifecycleManager.scala:64)
        at scala.collection.immutable.HashMap$HashMap1.foreach(HashMap.scala:234)
        at scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:465)
        at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsLifecycleManager.$anonfun$onNewSnapshots$1(ExecutorPodsLifecycleManager.scala:59)
        at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsLifecycleManager.$anonfun$onNewSnapshots$1$adapted(ExecutorPodsLifecycleManager.scala:58)
        at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
        at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
        at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsLifecycleManager.onNewSnapshots(ExecutorPodsLifecycleManager.scala:58)
        at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsLifecycleManager.$anonfun$start$1(ExecutorPodsLifecycleManager.scala:50)
        at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsLifecycleManager.$anonfun$start$1$adapted(ExecutorPodsLifecycleManager.scala:50)
        at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsSnapshotsStoreImpl.$anonfun$callSubscriber$1(ExecutorPodsSnapshotsStoreImpl.scala:110)
        at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1330)
        at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsSnapshotsStoreImpl.org$apache$spark$scheduler$cluster$k8s$ExecutorPodsSnapshotsStoreImpl$$callSubscriber(ExecutorPodsSnapshotsStoreImpl.scala:107)
        at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsSnapshotsStoreImpl$$anon$1.run(ExecutorPodsSnapshotsStoreImpl.scala:80)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
{noformat}

Basically, because the scheduler RPC endpoint is shut down before the thread-pool executors used internally by the spark-on-k8s code, tasks still running on those executors may try to send messages to an endpoint that does not exist anymore.
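
A minimal, hypothetical sketch of the ordering that would avoid this race (class and method names below are made up for illustration, not Spark's actual implementation): quiesce the internal subscriber executors first, and only then tear down the scheduler endpoint, so no subscriber task can call removeExecutor() against a missing endpoint.

{code:scala}
import java.util.concurrent.{Executors, ScheduledExecutorService, TimeUnit}

// Stand-in for the snapshot-subscriber thread pool used by the k8s backend.
class SnapshotSubscribers {
  private val pool: ScheduledExecutorService = Executors.newScheduledThreadPool(2)

  def subscribe(task: Runnable): Unit =
    pool.scheduleWithFixedDelay(task, 0L, 1L, TimeUnit.SECONDS)

  // Stop accepting new snapshot notifications and wait for in-flight ones to finish.
  def stop(): Unit = {
    pool.shutdown()
    pool.awaitTermination(10L, TimeUnit.SECONDS)
  }
}

// Stand-in for the scheduler backend; not Spark's real class.
class SchedulerBackendSketch(subscribers: SnapshotSubscribers) {
  @volatile private var endpointAlive = true

  def removeExecutor(id: String): Unit = {
    // With the shutdown ordering below, this can no longer run after the endpoint is gone.
    if (!endpointAlive) {
      throw new IllegalStateException("Could not find CoarseGrainedScheduler.")
    }
    println(s"removing executor $id")
  }

  def stop(): Unit = {
    // 1. Quiesce the internal executors first...
    subscribers.stop()
    // 2. ...then it is safe to shut down the RPC endpoint.
    endpointAlive = false
  }
}
{code}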



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org