Posted to issues@spark.apache.org by "Seth Horrigan (Jira)" <ji...@apache.org> on 2022/04/05 22:42:00 UTC

[jira] [Updated] (SPARK-38794) When ConfigMap creation fails, Spark driver starts but fails to start executors

     [ https://issues.apache.org/jira/browse/SPARK-38794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Seth Horrigan updated SPARK-38794:
----------------------------------
    Description: 
When running Spark in Kubernetes client mode, every executor assumes that a ConfigMap whose name exactly matches `KubernetesClientUtils.configMapNameExecutor` will exist (see [https://github.com/apache/spark/blob/02a055a42de5597cd42c1c0d4470f0e769571dc3/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/features/BasicExecutorFeatureStep.scala#L98]).
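
For illustration, the executor pod spec ends up carrying a volume that references that ConfigMap by name, roughly along the lines of the fabric8 sketch below (the builder calls and literal names are illustrative of the mechanism, not the exact code in BasicExecutorFeatureStep):

{code:scala}
import io.fabric8.kubernetes.api.model.{Volume, VolumeBuilder}

// Illustrative sketch only: an executor pod volume that points at the
// driver-created ConfigMap by name. If no ConfigMap with that exact name
// exists, the kubelet cannot mount the volume and the executor never starts.
val confVolume: Volume = new VolumeBuilder()
  .withName("spark-conf-volume-exec")          // volume name reported in the MountVolume.SetUp error
  .withNewConfigMap()
    .withName("spark-exec-<app-id>-conf-map")  // value of KubernetesClientUtils.configMapNameExecutor
  .endConfigMap()
  .build()
{code}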

If creation of that ConfigMap fails in [https://github.com/apache/spark/blob/02a055a42de5597cd42c1c0d4470f0e769571dc3/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/KubernetesClusterSchedulerBackend.scala#L80] (for example because the Kubernetes control plane is temporarily unavailable, or because the service account lacks permission to create ConfigMaps), the driver still starts fully and then waits for executors that will forever fail to start with "MountVolume.SetUp failed for volume \"spark-conf-volume-exec\" : configmap \"spark-exec-...-conf-map\" not found".

 

Either driver start-up should fail with an error, or the driver should retry creating the ConfigMap.
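
As a rough illustration of the retry option, the ConfigMap creation in KubernetesClusterSchedulerBackend.start() could be wrapped in a small retry helper that eventually re-throws, so driver start-up fails loudly instead of hanging. The helper below is only a sketch: the object and method names are hypothetical, and it assumes a fabric8-style client call (the client library Spark's Kubernetes backend is built on).

{code:scala}
import io.fabric8.kubernetes.api.model.ConfigMap
import io.fabric8.kubernetes.client.{KubernetesClient, KubernetesClientException}

// Hypothetical helper: try to create the executor ConfigMap a few times with
// backoff, then give up with an explicit error so the driver does not start
// while executors are doomed to wait on a ConfigMap that will never exist.
object ConfigMapCreation {
  def createWithRetry(client: KubernetesClient,
                      namespace: String,
                      configMap: ConfigMap,
                      attemptsLeft: Int = 3,
                      backoffMs: Long = 1000L): ConfigMap = {
    try {
      // fabric8 5.x style call; newer clients use .resource(configMap).create()
      client.configMaps().inNamespace(namespace).create(configMap)
    } catch {
      case _: KubernetesClientException if attemptsLeft > 1 =>
        Thread.sleep(backoffMs)
        createWithRetry(client, namespace, configMap, attemptsLeft - 1, backoffMs * 2)
      case e: KubernetesClientException =>
        throw new IllegalStateException(
          s"Could not create executor ConfigMap ${configMap.getMetadata.getName}", e)
    }
  }
}
{code}

Setting attemptsLeft to 1 degenerates to the other option: any creation failure immediately becomes a driver start-up error.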

--

To reproduce the problem when the Kubernetes control plane is healthy, start Spark in client mode but do not grant the Kubernetes ServiceAccount permission to create ConfigMaps. The driver pod will start successfully, but the executor pods will terminate as soon as they are created, and the driver will not create new executors.
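
A quick way to confirm the permission gap independently of Spark is to attempt the same kind of ConfigMap creation with a standalone fabric8 client running under the same ServiceAccount. This is a throwaway probe, not part of Spark; the object name, ConfigMap name, and POD_NAMESPACE variable are arbitrary choices for the sketch.

{code:scala}
import io.fabric8.kubernetes.api.model.ConfigMapBuilder
import io.fabric8.kubernetes.client.{DefaultKubernetesClient, KubernetesClientException}

// Minimal permission probe: try to create (and then delete) a throwaway
// ConfigMap in the driver's namespace using the pod's ServiceAccount token.
object ConfigMapPermissionProbe extends App {
  val namespace = sys.env.getOrElse("POD_NAMESPACE", "default")
  val client = new DefaultKubernetesClient()
  val probe = new ConfigMapBuilder()
    .withNewMetadata().withName("spark-configmap-permission-probe").endMetadata()
    .addToData("probe", "true")
    .build()
  try {
    client.configMaps().inNamespace(namespace).create(probe)
    client.configMaps().inNamespace(namespace).withName("spark-configmap-permission-probe").delete()
    println(s"ServiceAccount can create ConfigMaps in namespace $namespace")
  } catch {
    case e: KubernetesClientException =>
      // An HTTP 403 here is exactly the condition that reproduces the bug.
      println(s"ConfigMap creation failed (HTTP ${e.getCode}): ${e.getMessage}")
  } finally {
    client.close()
  }
}
{code}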


> When ConfigMap creation fails, Spark driver starts but fails to start executors
> -------------------------------------------------------------------------------
>
>                 Key: SPARK-38794
>                 URL: https://issues.apache.org/jira/browse/SPARK-38794
>             Project: Spark
>          Issue Type: Bug
>          Components: Kubernetes
>    Affects Versions: 3.1.1, 3.1.2, 3.2.0, 3.2.1
>            Reporter: Seth Horrigan
>            Priority: Major
>



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org