Posted to issues@spark.apache.org by "Prashant Sharma (Jira)" <ji...@apache.org> on 2020/07/20 13:19:00 UTC

[jira] [Created] (SPARK-32371) Autodetect persistently failing executor pods and fail the application logging the cause.

Prashant Sharma created SPARK-32371:
---------------------------------------

             Summary: Autodetect persistently failing executor pods and fail the application logging the cause.
                 Key: SPARK-32371
                 URL: https://issues.apache.org/jira/browse/SPARK-32371
             Project: Spark
          Issue Type: Improvement
          Components: Kubernetes
    Affects Versions: 3.1.0
            Reporter: Prashant Sharma


{code}
[root@kyok-test-1 ~]# kubectl get po -w

NAME                                   READY   STATUS    RESTARTS   AGE
spark-shell-a3962a736bf9e775-exec-36   1/1     Running   0          5s
spark-shell-a3962a736bf9e775-exec-37   1/1     Running   0          3s
spark-shell-a3962a736bf9e775-exec-36   0/1     Error     0          5s
spark-shell-a3962a736bf9e775-exec-38   0/1     Pending   0          1s
spark-shell-a3962a736bf9e775-exec-38   0/1     Pending   0          1s
spark-shell-a3962a736bf9e775-exec-38   0/1     ContainerCreating   0          1s
spark-shell-a3962a736bf9e775-exec-36   0/1     Terminating         0          6s
spark-shell-a3962a736bf9e775-exec-36   0/1     Terminating         0          6s
spark-shell-a3962a736bf9e775-exec-37   0/1     Error               0          5s
spark-shell-a3962a736bf9e775-exec-38   1/1     Running             0          2s
spark-shell-a3962a736bf9e775-exec-39   0/1     Pending             0          0s
spark-shell-a3962a736bf9e775-exec-39   0/1     Pending             0          0s
spark-shell-a3962a736bf9e775-exec-39   0/1     ContainerCreating   0          0s
spark-shell-a3962a736bf9e775-exec-37   0/1     Terminating         0          6s
spark-shell-a3962a736bf9e775-exec-37   0/1     Terminating         0          6s
spark-shell-a3962a736bf9e775-exec-38   0/1     Error               0          4s
spark-shell-a3962a736bf9e775-exec-39   1/1     Running             0          1s
spark-shell-a3962a736bf9e775-exec-40   0/1     Pending             0          0s
spark-shell-a3962a736bf9e775-exec-40   0/1     Pending             0          0s
spark-shell-a3962a736bf9e775-exec-40   0/1     ContainerCreating   0          0s
spark-shell-a3962a736bf9e775-exec-38   0/1     Terminating         0          5s
spark-shell-a3962a736bf9e775-exec-38   0/1     Terminating         0          5s
spark-shell-a3962a736bf9e775-exec-39   0/1     Error               0          3s
spark-shell-a3962a736bf9e775-exec-40   1/1     Running             0          1s
spark-shell-a3962a736bf9e775-exec-41   0/1     Pending             0          0s
spark-shell-a3962a736bf9e775-exec-41   0/1     Pending             0          0s
spark-shell-a3962a736bf9e775-exec-41   0/1     ContainerCreating   0          0s
spark-shell-a3962a736bf9e775-exec-39   0/1     Terminating         0          4s
spark-shell-a3962a736bf9e775-exec-39   0/1     Terminating         0          4s
spark-shell-a3962a736bf9e775-exec-41   1/1     Running             0          2s
spark-shell-a3962a736bf9e775-exec-40   0/1     Error               0          4s
spark-shell-a3962a736bf9e775-exec-42   0/1     Pending             0          0s
spark-shell-a3962a736bf9e775-exec-42   0/1     Pending             0          0s
spark-shell-a3962a736bf9e775-exec-42   0/1     ContainerCreating   0          0s
spark-shell-a3962a736bf9e775-exec-40   0/1     Terminating         0          4s
spark-shell-a3962a736bf9e775-exec-40   0/1     Terminating         0          4s
{code}
A cascade of pods being created and terminated within 3-4 seconds of each other is set off, and it is difficult to see the logs of these constantly created and terminated pods. Thankfully, there is an option
{code}
spark.kubernetes.executor.deleteOnTermination false{code}
that turns off the automatic deletion of executor pods and gives us an opportunity to diagnose the problem. However, it is off by default, so one may need to guess what caused the previous run to fail, reconstruct the steps to reproduce it, and then re-run the application with exactly the same setup.
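For example, the option can be passed at submit time. A minimal sketch, assuming a reachable API server and a published Spark image (all bracketed values are placeholders):
{code}
spark-submit \
  --master k8s://https://<k8s-apiserver-host>:6443 \
  --deploy-mode cluster \
  --conf spark.kubernetes.container.image=<spark-image> \
  --conf spark.kubernetes.executor.deleteOnTermination=false \
  --class <main-class> \
  local:///<path-to-application-jar>
{code}
With this set, failed executor pods remain visible in the Error state, and their logs stay available via kubectl logs <pod-name>.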

So, it might be good if we could somehow detect this situation of pods failing as soon as they start, or failing on a particular task, capture the error that caused each pod to terminate, and relay it back to the driver to be logged.
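For the capture part, Kubernetes already records the terminated container's status, which could be read back through the fabric8 client that Spark's K8s backend uses. A rough sketch (namespace and pod name are illustrative, taken from the listing above; it assumes the pod has not been deleted yet):
{code:java}
import io.fabric8.kubernetes.client.DefaultKubernetesClient

// Sketch only: pull the termination reason/exit code of a failed
// executor pod so it can be surfaced in the driver log.
val client = new DefaultKubernetesClient()
val pod = client.pods()
  .inNamespace("default")
  .withName("spark-shell-a3962a736bf9e775-exec-36")
  .get()
if (pod != null) {
  pod.getStatus.getContainerStatuses.forEach { cs =>
    val t = cs.getState.getTerminated
    if (t != null) {
      // reason is e.g. "Error" or "OOMKilled"; message may carry details
      println(s"container=${cs.getName} reason=${t.getReason} " +
        s"exitCode=${t.getExitCode} message=${t.getMessage}")
    }
  }
}
client.close()
{code}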

Alternatively, once this situation is auto-detected, we could also stop creating more executor pods and fail the application with an appropriate error, retaining the last failed pod for the user's further investigation.
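As for the detection itself, one conceivable approach (purely a sketch; neither this class nor these thresholds exist in Spark) is to count consecutive executor pods that die shortly after starting, and give up once a threshold is crossed:
{code:java}
// Hypothetical sketch, not an existing Spark API: a counter the K8s
// scheduler backend could consult on every executor pod exit.
class PersistentFailureDetector(
    maxConsecutiveFailures: Int = 5,
    earlyFailureWindowMillis: Long = 10000L) {

  private var consecutiveEarlyFailures = 0

  /** Record a pod exit; returns true when the app should be failed. */
  def onPodTerminated(uptimeMillis: Long, failed: Boolean): Boolean = synchronized {
    if (failed && uptimeMillis <= earlyFailureWindowMillis) {
      consecutiveEarlyFailures += 1  // pod died almost immediately after starting
    } else {
      consecutiveEarlyFailures = 0   // a healthy or long-lived pod resets the streak
    }
    consecutiveEarlyFailures >= maxConsecutiveFailures
  }
}
{code}
When it returns true, the backend would stop requesting replacements, keep the last failed pod around, and fail the application with the captured termination reason.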

So far it has not been evaluated how this can be achieved, but the feature might be useful as Kubernetes grows into a preferred choice for deploying Spark. Logging this issue for further investigation and work.



