Posted to user@spark.apache.org by Marshall Markham <mm...@precisionlender.com> on 2021/01/22 22:22:25 UTC

K8S spark-submit Loses Successful Driver Completion

Hi,

I am running Airflow + Spark + AKS (Azure K8s). Sporadically, when a Spark job completes, my spark-submit process does not notice that the driver has succeeded and continues to track the job as running. Does anyone know how the spark-submit process monitors driver pods on K8s? My expectation was that it monitors them over HTTP, but since we actually deleted the driver pod and spark-submit continued to show the job as in progress, I am now questioning this assumption. My end goal is to have spark-submit track driver status more accurately.


Marshall



Re: K8S spark-submit Loses Successful Driver Completion

Posted by Attila Zsolt Piros <pi...@gmail.com>.
Hi,

I am not using Airflow, but I assume your application is deployed in cluster
mode, and in that case the class you are looking for is
*org.apache.spark.deploy.k8s.submit.Client* [1].

If we are talking about the first "spark-submit" used to start the
application, and not "spark-submit --status", then it contains a loop where
the application status is logged. This loop stops when the
*LoggingPodStatusWatcher* reports that the app is completed [2] or when
"spark.kubernetes.submission.waitAppCompletion" [3] is false.

And you are right: the monitoring (pod state watching) is done via REST
(HTTPS), and a closed watch connection should be detected by the
"io.fabric8.kubernetes.client.Watcher.onClose()" method, i.e. by the
Kubernetes client itself.
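
For reference, a hedged sketch of such a watcher against the fabric8 API
(this assumes the 4.x client, where onClose() takes a
KubernetesClientException; newer client versions pass a WatcherException
instead, and the class and variable names here are illustrative, not
Spark's):

import io.fabric8.kubernetes.api.model.Pod
import io.fabric8.kubernetes.client.{KubernetesClientException, Watcher}
import io.fabric8.kubernetes.client.Watcher.Action

class DriverPodWatcher extends Watcher[Pod] {
  override def eventReceived(action: Action, pod: Pod): Unit = {
    // "Succeeded" / "Failed" are the terminal pod phases; a completed or
    // even a deleted driver pod arrives here as a normal watch event
    val phase = Option(pod.getStatus).map(_.getPhase).getOrElse("unknown")
    println(s"Driver pod ${action.name}: phase=$phase")
  }

  override def onClose(cause: KubernetesClientException): Unit = {
    // Called when the watch connection itself is closed, e.g. the API
    // server drops the HTTPS connection. If nothing re-subscribes, later
    // pod events are missed -- one way a finished driver could still be
    // shown as running by spark-submit.
    println(s"Watch closed: $cause")
  }
}

You would register it with something like
client.pods().inNamespace(namespace).withName(driverPodName).watch(new
DriverPodWatcher). Note the asymmetry: a deleted pod shows up as a DELETED
event in eventReceived(), while a dropped connection only surfaces through
onClose().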

I hope this helps. A few further questions if you need more help:

1. What is the Spark version you are running?
2. Does it contain SPARK-24266 [4]?
3. If yes, can you reproduce the issue without Airflow, and do you have
the logs from when the issue occurred?

Best regards,
Attila

[1]
https://github.com/apache/spark/blob/master/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/submit/KubernetesClientApplication.scala#L88-L103

[2]
https://github.com/apache/spark/blob/8604db28b87b387bbdb3761df85fae292cd402a1/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/submit/KubernetesClientApplication.scala#L162-L166

[3]
https://github.com/apache/spark/blob/master/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/submit/LoggingPodStatusWatcher.scala#L112-L114

[4] https://issues.apache.org/jira/browse/SPARK-24266




