You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@spark.apache.org by "James Nguyen (Jira)" <ji...@apache.org> on 2019/10/11 17:34:00 UTC

[jira] [Created] (SPARK-29443) Spark-on-Kubernetes: driver pod keeps spinning/terminating executor pods

James Nguyen created SPARK-29443:
------------------------------------

             Summary: Spark-on-Kubernetes: driver pod keeps spinning/terminating executor pods
                 Key: SPARK-29443
                 URL: https://issues.apache.org/jira/browse/SPARK-29443
             Project: Spark
          Issue Type: Bug
          Components: Kubernetes, Spark Submit
    Affects Versions: 2.4.4
         Environment: EKS version: v1.14.6-eks-5047ed
Spark version: 2.4.4
Kubernetes client version: 4.4.2
            Reporter: James Nguyen


We are running Spark on Kubernetes (EKS), there are some Spark jobs which driver pod keep creating and terminating executor pods. Executor pods only live for few seconds. Some jobs have been stuck for days, and spun up to thousand executor pods.
{code}
➜ kubectl get pods --field-selector=status.phase=Running -n spark
NAME READY STATUS RESTARTS AGE
log-parser-k8s.structured-log.growth-referral-late-1570797184638 1/1 Running 0 111m
log-parser-k8s.structured-log.growth-referral-late-1570798665864 1/1 Running 0 87m
log-parser-k8s.structured-log.growth-referral-late-1570801912533 1/1 Running 0 33m
log-parser-k8s.structured-log.growth-referral-late-1570803823618 1/1 Running 0 77s
log-parser-k8s.structured-log.orbital-latest-struc-1570797051892 1/1 Running 0 114m
log-parser-k8s.structured-log.orbital-latest-struc-1570803823552 1/1 Running 0 77s
logparserstructured-1570797185983-exec-1727 1/1 Running 0 4s
logparserstructured-1570801913787-exec-431 1/1 Running 0 4s
logparserstructured-1570803824926-exec-1 1/1 Running 0 69s
logparserstructured-1570803824926-exec-2 1/1 Running 0 69s
{code}

Run _kubectl get pods_, we can see bunch of executor pods are being terminated.
{code}
logparserstructured-1570786196291-exec-1511 0/1 Terminating 0 7s
logparserstructured-1570786196291-exec-1513 0/1 Terminating 0 6s
logparserstructured-1570786196291-exec-1516 0/1 Terminating 0 5s
logparserstructured-1570786196291-exec-1518 0/1 Terminating 0 4s
logparserstructured-1570786196291-exec-1519 0/1 Terminating 0 3s
logparserstructured-1570786196291-exec-1520 0/1 Terminating 0 3s
logparserstructured-1570786196291-exec-1521 0/1 Terminating 0 2s
logparserstructured-1570786196291-exec-1522 0/1 Terminating 0 2s
logparserstructured-1570786196291-exec-1523 0/1 Terminating 0 1s
logparserstructured-1570786196291-exec-1524 0/1 Terminating 0 1s
logparserstructured-1570786469589-exec-1487 0/1 Terminating 0 7s
logparserstructured-1570786469589-exec-1489 0/1 Terminating 0 6s
logparserstructured-1570786469589-exec-1491 0/1 Terminating 0 5s
logparserstructured-1570786469589-exec-1493 0/1 Terminating 0 4s
logparserstructured-1570786469589-exec-1494 0/1 Terminating 0 3s
logparserstructured-1570786469589-exec-1495 0/1 Terminating 0 2s
logparserstructured-1570786469589-exec-1496 0/1 Terminating 0 2s
logparserstructured-1570786469589-exec-1497 0/1 Terminating 0 1s
logparserstructured-1570786469589-exec-1498 0/1 Terminating 0 1s
logparserstructured-1570786469589-exec-1499 0/1 Terminating 0 0s
logparserstructured-1570786469589-exec-1500 0/1 Terminating 0 0s
logparserstructured-1570787682660-exec-1293 0/1 Terminating 0 8s
logparserstructured-1570787682660-exec-1299 0/1 Terminating 0 5s
logparserstructured-1570787682660-exec-1300 0/1 Terminating 0 5s
logparserstructured-1570787682660-exec-1301 0/1 Terminating 0 4s
logparserstructured-1570787682660-exec-1303 0/1 Terminating 0 3s
logparserstructured-1570787682660-exec-1305 0/1 Terminating 0 2s
logparserstructured-1570787682660-exec-1306 0/1 Terminating 0 2s
logparserstructured-1570787682660-exec-1307 0/1 Terminating 0 1s
logparserstructured-1570787682660-exec-1308 0/1 Terminating 0 1s
{code}

Checked driver pod logs:
{code}
19/10/11 09:54:51 INFO Utils: Successfully started service 'SparkUI' on port 4040.
19/10/11 09:54:51 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at http://logparserstructured-1570787682660-driver-svc.spark.svc:4040
19/10/11 09:54:51 INFO SparkContext: Added JAR s3://datastore/source-code/log-analytics-assembly-2.4.0.jar at s3://datastore/source-code/log-analytics-assembly-2.4.0.jar with timestamp 1570787691192
19/10/11 09:54:52 INFO ExecutorPodsAllocator: Going to request 2 executors from Kubernetes.
19/10/11 09:54:52 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 7079.
19/10/11 09:54:52 INFO NettyBlockTransferService: Server created on logparserstructured-1570787682660-driver-svc.spark.svc:7079
19/10/11 09:54:52 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
19/10/11 09:54:52 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, logparserstructured-1570787682660-driver-svc.spark.svc, 7079, None)
19/10/11 09:54:52 INFO BlockManagerMasterEndpoint: Registering block manager logparserstructured-1570787682660-driver-svc.spark.svc:7079 with 2004.6 MB RAM, BlockManagerId(driver, logparserstructured-1570787682660-driver-svc.spark.svc, 7079, None)
19/10/11 09:54:52 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, logparserstructured-1570787682660-driver-svc.spark.svc, 7079, None)
19/10/11 09:54:52 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, logparserstructured-1570787682660-driver-svc.spark.svc, 7079, None)
19/10/11 09:54:52 ERROR Utils: Uncaught exception in thread kubernetes-executor-snapshots-subscribers-1
io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: POST at: https://kubernetes.default.svc/api/v1/namespaces/spark/pods. Message: pods "logparserstructured-1570787682660-exec-1" already exists. Received status: Status(apiVersion=v1, code=409, details=StatusDetails(causes=[], group=null, kind=pods, name=logparserstructured-1570787682660-exec-1, retryAfterSeconds=null, uid=null, additionalProperties={}), kind=Status, message=pods "logparserstructured-1570787682660-exec-1" already exists, metadata=ListMeta(_continue=null, resourceVersion=null, selfLink=null, additionalProperties={}), reason=AlreadyExists, status=Failure, additionalProperties={}).
 at io.fabric8.kubernetes.client.dsl.base.OperationSupport.requestFailure(OperationSupport.java:503)
 at io.fabric8.kubernetes.client.dsl.base.OperationSupport.assertResponseCode(OperationSupport.java:442)
 at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:406)
 at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:365)
 at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleCreate(OperationSupport.java:234)
 at io.fabric8.kubernetes.client.dsl.base.BaseOperation.handleCreate(BaseOperation.java:796)
 at io.fabric8.kubernetes.client.dsl.base.BaseOperation.create(BaseOperation.java:326)
 at io.fabric8.kubernetes.client.dsl.base.BaseOperation.create(BaseOperation.java:322)
 at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator$$anonfun$org$apache$spark$scheduler$cluster$k8s$ExecutorPodsAllocator$$onNewSnapshots$1.apply$mcVI$sp(ExecutorPodsAllocator.scala:139)
 at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:160)
 at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator.org$apache$spark$scheduler$cluster$k8s$ExecutorPodsAllocator$$onNewSnapshots(ExecutorPodsAllocator.scala:126)
 at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator$$anonfun$start$1.apply(ExecutorPodsAllocator.scala:68)
 at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator$$anonfun$start$1.apply(ExecutorPodsAllocator.scala:68)
 at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsSnapshotsStoreImpl$$anonfun$org$apache$spark$scheduler$cluster$k8s$ExecutorPodsSnapshotsStoreImpl$$callSubscriber$1.apply$mcV$sp(ExecutorPodsSnapshotsStoreImpl.scala:102)
 at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1340)
 at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsSnapshotsStoreImpl.org$apache$spark$scheduler$cluster$k8s$ExecutorPodsSnapshotsStoreImpl$$callSubscriber(ExecutorPodsSnapshotsStoreImpl.scala:99)
 at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsSnapshotsStoreImpl$$anonfun$addSubscriber$1.apply$mcV$sp(ExecutorPodsSnapshotsStoreImpl.scala:71)
 at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsSnapshotsStoreImpl$$anon$1.run(ExecutorPodsSnapshotsStoreImpl.scala:107)
 at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
 at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
 at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
 at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
 at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
 at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
 at java.lang.Thread.run(Thread.java:748)
{code}
{code}
9/10/11 09:55:37 INFO BlockManagerMasterEndpoint: Trying to remove executor 8 from BlockManagerMaster.
19/10/11 09:55:37 INFO BlockManagerMaster: Removal of executor 8 requested
19/10/11 09:55:37 INFO KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint: Asked to remove non-existent executor 8
19/10/11 09:55:43 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
19/10/11 09:55:58 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
19/10/11 09:56:00 INFO ExecutorPodsAllocator: Going to request 1 executors from Kubernetes.
19/10/11 09:56:02 INFO BlockManagerMasterEndpoint: Trying to remove executor 9 from BlockManagerMaster.
19/10/11 09:56:02 INFO BlockManagerMaster: Removal of executor 9 requested
19/10/11 09:56:02 INFO KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint: Asked to remove non-existent executor 9
19/10/11 09:56:02 INFO ExecutorPodsAllocator: Going to request 1 executors from Kubernetes.
19/10/11 09:56:04 INFO ExecutorPodsAllocator: Going to request 1 executors from Kubernetes.
19/10/11 09:56:05 INFO BlockManagerMasterEndpoint: Trying to remove executor 10 from BlockManagerMaster.
19/10/11 09:56:05 INFO BlockManagerMaster: Removal of executor 10 requested
19/10/11 09:56:05 INFO KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint: Asked to remove non-existent executor 10
19/10/11 09:56:06 INFO BlockManagerMaster: Removal of executor 11 requested
19/10/11 09:56:06 INFO BlockManagerMasterEndpoint: Trying to remove executor 11 from BlockManagerMaster.
19/10/11 09:56:06 INFO KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint: Asked to remove non-existent executor 11
19/10/11 09:56:13 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
19/10/11 09:56:22 INFO ExecutorPodsAllocator: Going to request 1 executors from Kubernetes.
19/10/11 09:56:24 INFO ExecutorPodsAllocator: Going to request 1 executors from Kubernetes.
19/10/11 09:56:25 INFO BlockManagerMasterEndpoint: Trying to remove executor 12 from BlockManagerMaster.
19/10/11 09:56:25 INFO BlockManagerMaster: Removal of executor 12 requested
19/10/11 09:56:25 INFO KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint: Asked to remove non-existent executor 12
19/10/11 09:56:26 INFO BlockManagerMasterEndpoint: Trying to remove executor 13 from BlockManagerMaster.
19/10/11 09:56:26 INFO BlockManagerMaster: Removal of executor 13 requested
{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org