Posted to issues@spark.apache.org by "Hyukjin Kwon (Jira)" <ji...@apache.org> on 2019/10/08 05:42:22 UTC

[jira] [Resolved] (SPARK-24221) Retry spark app submission to k8 in KubernetesClientApplication

     [ https://issues.apache.org/jira/browse/SPARK-24221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-24221.
----------------------------------
    Resolution: Incomplete

> Retry spark app submission to k8 in KubernetesClientApplication
> ---------------------------------------------------------------
>
>                 Key: SPARK-24221
>                 URL: https://issues.apache.org/jira/browse/SPARK-24221
>             Project: Spark
>          Issue Type: Improvement
>          Components: Kubernetes
>    Affects Versions: 2.3.0
>            Reporter: Yifei Huang
>            Priority: Major
>              Labels: bulk-closed
>
> Following from https://issues.apache.org/jira/browse/SPARK-24135, drivers, in addition to executors, could suffer from init-container failures in Kubernetes. Currently, we fail the entire application if that's the case, so it's up to the client to detect those errors and retry. However, since both driver and executor initialization have the same failure case, it seems like we're repeating logic in two places. Would it be better to consolidate this retry logic in `KubernetesClientApplication`?
> We could still count executor pod initialization failures in `KubernetesClusterSchedulerBackend` and decide what to do with the application if there are too many failures there, but we'd be guaranteed a set number of retries for each executor before giving up. Or would this be too confusing and obfuscate the true number of retries? We could also configure the number of driver and executor retries separately. It just seems that if we're tackling init-container failure retries for executors, we should provide the same support for drivers, since they suffer from the same problem.
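The retry consolidation proposed above could be sketched as a small generic helper that `KubernetesClientApplication` wraps around driver pod submission. This is an illustrative sketch only, not Spark's actual API: the `withRetries` helper, its parameters, and the backoff policy are all hypothetical.

```scala
import scala.annotation.tailrec
import scala.util.{Failure, Success, Try}

object SubmitRetry {
  // Hypothetical retry helper: run `op` up to `maxAttempts` times,
  // doubling the sleep between attempts, and rethrow the last failure
  // once attempts are exhausted.
  @tailrec
  def withRetries[T](maxAttempts: Int, backoffMs: Long)(op: () => T): T = {
    Try(op()) match {
      case Success(result) => result
      case Failure(_) if maxAttempts > 1 =>
        Thread.sleep(backoffMs)
        withRetries(maxAttempts - 1, backoffMs * 2)(op)
      case Failure(e) => throw e
    }
  }
}
```

With something like this, driver submission could call `SubmitRetry.withRetries(driverSubmitRetries, 1000L)(() => submitDriverPod(...))` (both names hypothetical), while executor-side accounting stays in `KubernetesClusterSchedulerBackend`, keeping the two retry budgets separately configurable as the description suggests.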



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org