Posted to issues@spark.apache.org by "James Nguyen (Jira)" <ji...@apache.org> on 2019/10/11 17:43:00 UTC

[jira] [Updated] (SPARK-29443) Spark-on-Kubernetes: driver pod keeps spinning/terminating executor pods

     [ https://issues.apache.org/jira/browse/SPARK-29443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

James Nguyen updated SPARK-29443:
---------------------------------
    Environment: 
EKS version: v1.14.6-eks-5047ed with 18 x r5a.12xlarge EC2 instances as Worker nodes
Spark version: 2.4.4
Kubernetes client version: 4.4.2

  was:
EKS version: v1.14.6-eks-5047ed
Spark version: 2.4.4
Kubernetes client version: 4.4.2


> Spark-on-Kubernetes: driver pod keeps spinning/terminating executor pods
> ------------------------------------------------------------------------
>
>                 Key: SPARK-29443
>                 URL: https://issues.apache.org/jira/browse/SPARK-29443
>             Project: Spark
>          Issue Type: Bug
>          Components: Kubernetes, Spark Submit
>    Affects Versions: 2.4.4
>         Environment: EKS version: v1.14.6-eks-5047ed with 18 x r5a.12xlarge EC2 instances as Worker nodes
> Spark version: 2.4.4
> Kubernetes client version: 4.4.2
>            Reporter: James Nguyen
>            Priority: Major
>              Labels: Kubernetes
>
> We are running Spark on Kubernetes (EKS), and for some Spark jobs the driver pod keeps creating and terminating executor pods. The executor pods live for only a few seconds. Some jobs have been stuck in this state for days and have spun up thousands of executor pods.
> {code}
> ➜ kubectl get pods --field-selector=status.phase=Running -n spark
> NAME                                                                READY   STATUS    RESTARTS   AGE
> log-parser-k8s.structured-log.growth-referral-late-1570797184638   1/1     Running   0          111m
> log-parser-k8s.structured-log.growth-referral-late-1570798665864   1/1     Running   0          87m
> log-parser-k8s.structured-log.growth-referral-late-1570801912533   1/1     Running   0          33m
> log-parser-k8s.structured-log.growth-referral-late-1570803823618   1/1     Running   0          77s
> log-parser-k8s.structured-log.orbital-latest-struc-1570797051892   1/1     Running   0          114m
> log-parser-k8s.structured-log.orbital-latest-struc-1570803823552   1/1     Running   0          77s
> logparserstructured-1570797185983-exec-1727                        1/1     Running   0          4s
> logparserstructured-1570801913787-exec-431                         1/1     Running   0          4s
> logparserstructured-1570803824926-exec-1                           1/1     Running   0          69s
> logparserstructured-1570803824926-exec-2                           1/1     Running   0          69s
> {code}
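> Note the executor indices above (e.g. _-exec-1727_ after roughly two hours): each stuck driver has already cycled through hundreds or thousands of executor pods. A rough way to quantify the churn per driver, assuming the pod naming convention shown above (_<prefix>-exec-<n>_), is:
> {code}
> # Count live executor pods per driver prefix (pod names follow <prefix>-exec-<n>)
> kubectl get pods -n spark --no-headers \
>   | awk '/-exec-/ {sub(/-exec-.*/, "", $1); counts[$1]++}
>          END {for (p in counts) print p, counts[p]}'
> {code}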
> Running _kubectl get pods_ again, we can see a bunch of executor pods being terminated.
> {code}
> logparserstructured-1570786196291-exec-1511   0/1     Terminating   0          7s
> logparserstructured-1570786196291-exec-1513   0/1     Terminating   0          6s
> logparserstructured-1570786196291-exec-1516   0/1     Terminating   0          5s
> logparserstructured-1570786196291-exec-1518   0/1     Terminating   0          4s
> logparserstructured-1570786196291-exec-1519   0/1     Terminating   0          3s
> logparserstructured-1570786196291-exec-1520   0/1     Terminating   0          3s
> logparserstructured-1570786196291-exec-1521   0/1     Terminating   0          2s
> logparserstructured-1570786196291-exec-1522   0/1     Terminating   0          2s
> logparserstructured-1570786196291-exec-1523   0/1     Terminating   0          1s
> logparserstructured-1570786196291-exec-1524   0/1     Terminating   0          1s
> logparserstructured-1570786469589-exec-1487   0/1     Terminating   0          7s
> logparserstructured-1570786469589-exec-1489   0/1     Terminating   0          6s
> logparserstructured-1570786469589-exec-1491   0/1     Terminating   0          5s
> logparserstructured-1570786469589-exec-1493   0/1     Terminating   0          4s
> logparserstructured-1570786469589-exec-1494   0/1     Terminating   0          3s
> logparserstructured-1570786469589-exec-1495   0/1     Terminating   0          2s
> logparserstructured-1570786469589-exec-1496   0/1     Terminating   0          2s
> logparserstructured-1570786469589-exec-1497   0/1     Terminating   0          1s
> logparserstructured-1570786469589-exec-1498   0/1     Terminating   0          1s
> logparserstructured-1570786469589-exec-1499   0/1     Terminating   0          0s
> logparserstructured-1570786469589-exec-1500   0/1     Terminating   0          0s
> logparserstructured-1570787682660-exec-1293   0/1     Terminating   0          8s
> logparserstructured-1570787682660-exec-1299   0/1     Terminating   0          5s
> logparserstructured-1570787682660-exec-1300   0/1     Terminating   0          5s
> logparserstructured-1570787682660-exec-1301   0/1     Terminating   0          4s
> logparserstructured-1570787682660-exec-1303   0/1     Terminating   0          3s
> logparserstructured-1570787682660-exec-1305   0/1     Terminating   0          2s
> logparserstructured-1570787682660-exec-1306   0/1     Terminating   0          2s
> logparserstructured-1570787682660-exec-1307   0/1     Terminating   0          1s
> logparserstructured-1570787682660-exec-1308   0/1     Terminating   0          1s
> {code}
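> The executor pods are deleted within seconds of being created. The Kubernetes event stream may show what is tearing them down; for example:
> {code}
> # List recent events for executor pods, newest last
> kubectl get events -n spark --sort-by=.lastTimestamp | grep -- '-exec-'
> {code}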
> Checked driver pod logs:
> {code}
> 19/10/11 09:54:51 INFO Utils: Successfully started service 'SparkUI' on port 4040.
> 19/10/11 09:54:51 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at http://logparserstructured-1570787682660-driver-svc.spark.svc:4040
> 19/10/11 09:54:51 INFO SparkContext: Added JAR s3://datastore/source-code/log-analytics-assembly-2.4.0.jar at s3://datastore/source-code/log-analytics-assembly-2.4.0.jar with timestamp 1570787691192
> 19/10/11 09:54:52 INFO ExecutorPodsAllocator: Going to request 2 executors from Kubernetes.
> 19/10/11 09:54:52 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 7079.
> 19/10/11 09:54:52 INFO NettyBlockTransferService: Server created on logparserstructured-1570787682660-driver-svc.spark.svc:7079
> 19/10/11 09:54:52 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
> 19/10/11 09:54:52 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, logparserstructured-1570787682660-driver-svc.spark.svc, 7079, None)
> 19/10/11 09:54:52 INFO BlockManagerMasterEndpoint: Registering block manager logparserstructured-1570787682660-driver-svc.spark.svc:7079 with 2004.6 MB RAM, BlockManagerId(driver, logparserstructured-1570787682660-driver-svc.spark.svc, 7079, None)
> 19/10/11 09:54:52 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, logparserstructured-1570787682660-driver-svc.spark.svc, 7079, None)
> 19/10/11 09:54:52 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, logparserstructured-1570787682660-driver-svc.spark.svc, 7079, None)
> 19/10/11 09:54:52 ERROR Utils: Uncaught exception in thread kubernetes-executor-snapshots-subscribers-1
> io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: POST at: https://kubernetes.default.svc/api/v1/namespaces/spark/pods. Message: pods "logparserstructured-1570787682660-exec-1" already exists. Received status: Status(apiVersion=v1, code=409, details=StatusDetails(causes=[], group=null, kind=pods, name=logparserstructured-1570787682660-exec-1, retryAfterSeconds=null, uid=null, additionalProperties={}), kind=Status, message=pods "logparserstructured-1570787682660-exec-1" already exists, metadata=ListMeta(_continue=null, resourceVersion=null, selfLink=null, additionalProperties={}), reason=AlreadyExists, status=Failure, additionalProperties={}).
>  at io.fabric8.kubernetes.client.dsl.base.OperationSupport.requestFailure(OperationSupport.java:503)
>  at io.fabric8.kubernetes.client.dsl.base.OperationSupport.assertResponseCode(OperationSupport.java:442)
>  at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:406)
>  at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:365)
>  at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleCreate(OperationSupport.java:234)
>  at io.fabric8.kubernetes.client.dsl.base.BaseOperation.handleCreate(BaseOperation.java:796)
>  at io.fabric8.kubernetes.client.dsl.base.BaseOperation.create(BaseOperation.java:326)
>  at io.fabric8.kubernetes.client.dsl.base.BaseOperation.create(BaseOperation.java:322)
>  at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator$$anonfun$org$apache$spark$scheduler$cluster$k8s$ExecutorPodsAllocator$$onNewSnapshots$1.apply$mcVI$sp(ExecutorPodsAllocator.scala:139)
>  at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:160)
>  at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator.org$apache$spark$scheduler$cluster$k8s$ExecutorPodsAllocator$$onNewSnapshots(ExecutorPodsAllocator.scala:126)
>  at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator$$anonfun$start$1.apply(ExecutorPodsAllocator.scala:68)
>  at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator$$anonfun$start$1.apply(ExecutorPodsAllocator.scala:68)
>  at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsSnapshotsStoreImpl$$anonfun$org$apache$spark$scheduler$cluster$k8s$ExecutorPodsSnapshotsStoreImpl$$callSubscriber$1.apply$mcV$sp(ExecutorPodsSnapshotsStoreImpl.scala:102)
>  at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1340)
>  at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsSnapshotsStoreImpl.org$apache$spark$scheduler$cluster$k8s$ExecutorPodsSnapshotsStoreImpl$$callSubscriber(ExecutorPodsSnapshotsStoreImpl.scala:99)
>  at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsSnapshotsStoreImpl$$anonfun$addSubscriber$1.apply$mcV$sp(ExecutorPodsSnapshotsStoreImpl.scala:71)
>  at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsSnapshotsStoreImpl$$anon$1.run(ExecutorPodsSnapshotsStoreImpl.scala:107)
>  at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>  at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
>  at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
>  at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
>  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>  at java.lang.Thread.run(Thread.java:748)
> {code}
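> The 409 _AlreadyExists_ error means the driver tried to create an executor pod whose name was already taken, presumably by a pod left over from an earlier allocation round that was never cleaned up. The pod occupying the name can be inspected directly (a diagnostic sketch, using the pod name from the error above):
> {code}
> # Inspect the pod blocking the name; its age and state show whether it is a
> # leftover from an earlier allocation round
> kubectl describe pod logparserstructured-1570787682660-exec-1 -n spark
> {code}
> After the failed create, the driver settles into a loop of requesting new executors and removing executors it no longer recognizes: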
> {code}
> 19/10/11 09:55:37 INFO BlockManagerMasterEndpoint: Trying to remove executor 8 from BlockManagerMaster.
> 19/10/11 09:55:37 INFO BlockManagerMaster: Removal of executor 8 requested
> 19/10/11 09:55:37 INFO KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint: Asked to remove non-existent executor 8
> 19/10/11 09:55:43 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
> 19/10/11 09:55:58 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
> 19/10/11 09:56:00 INFO ExecutorPodsAllocator: Going to request 1 executors from Kubernetes.
> 19/10/11 09:56:02 INFO BlockManagerMasterEndpoint: Trying to remove executor 9 from BlockManagerMaster.
> 19/10/11 09:56:02 INFO BlockManagerMaster: Removal of executor 9 requested
> 19/10/11 09:56:02 INFO KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint: Asked to remove non-existent executor 9
> 19/10/11 09:56:02 INFO ExecutorPodsAllocator: Going to request 1 executors from Kubernetes.
> 19/10/11 09:56:04 INFO ExecutorPodsAllocator: Going to request 1 executors from Kubernetes.
> 19/10/11 09:56:05 INFO BlockManagerMasterEndpoint: Trying to remove executor 10 from BlockManagerMaster.
> 19/10/11 09:56:05 INFO BlockManagerMaster: Removal of executor 10 requested
> 19/10/11 09:56:05 INFO KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint: Asked to remove non-existent executor 10
> 19/10/11 09:56:06 INFO BlockManagerMaster: Removal of executor 11 requested
> 19/10/11 09:56:06 INFO BlockManagerMasterEndpoint: Trying to remove executor 11 from BlockManagerMaster.
> 19/10/11 09:56:06 INFO KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint: Asked to remove non-existent executor 11
> 19/10/11 09:56:13 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
> 19/10/11 09:56:22 INFO ExecutorPodsAllocator: Going to request 1 executors from Kubernetes.
> 19/10/11 09:56:24 INFO ExecutorPodsAllocator: Going to request 1 executors from Kubernetes.
> 19/10/11 09:56:25 INFO BlockManagerMasterEndpoint: Trying to remove executor 12 from BlockManagerMaster.
> 19/10/11 09:56:25 INFO BlockManagerMaster: Removal of executor 12 requested
> 19/10/11 09:56:25 INFO KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint: Asked to remove non-existent executor 12
> 19/10/11 09:56:26 INFO BlockManagerMasterEndpoint: Trying to remove executor 13 from BlockManagerMaster.
> 19/10/11 09:56:26 INFO BlockManagerMaster: Removal of executor 13 requested
> {code}
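> The executor IDs keep climbing while the job makes no progress. As a stopgap while the root cause is investigated, throttling the allocator at least slows the churn; a sketch using the standard Spark-on-Kubernetes allocation settings (the master URL and values are illustrative, and this is a mitigation, not a fix):
> {code}
> # Throttle executor pod allocation: smaller batches, longer delay between rounds
> spark-submit \
>   --master k8s://https://<api-server-host>:443 \
>   --deploy-mode cluster \
>   --conf spark.kubernetes.allocation.batch.size=1 \
>   --conf spark.kubernetes.allocation.batch.delay=5s \
>   ...
> {code}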


