Posted to issues@spark.apache.org by "Shashi Kangayam (Jira)" <ji...@apache.org> on 2021/01/14 22:18:00 UTC

[jira] [Commented] (SPARK-33711) Race condition in Spark k8s Pod lifecycle manager that leads to shutdowns

    [ https://issues.apache.org/jira/browse/SPARK-33711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17265426#comment-17265426 ] 

Shashi Kangayam commented on SPARK-33711:
-----------------------------------------

[~attilapiros]

Our jobs on Spark 3.0.1 show the same behavior.

Could you please backport this fix to Spark 3.0?

>  Race condition in Spark k8s Pod lifecycle manager that leads to shutdowns
> --------------------------------------------------------------------------
>
>                 Key: SPARK-33711
>                 URL: https://issues.apache.org/jira/browse/SPARK-33711
>             Project: Spark
>          Issue Type: Bug
>          Components: Kubernetes
>    Affects Versions: 2.3.4, 2.4.7, 3.0.0, 3.1.0, 3.2.0
>            Reporter: Attila Zsolt Piros
>            Assignee: Attila Zsolt Piros
>            Priority: Major
>             Fix For: 3.2.0, 3.1.1
>
>
> Watching pods (ExecutorPodsWatchSnapshotSource) reports changes to a single pod at a time. This can wrongly lead the executor pod lifecycle manager to detect missing pods (pods known to the scheduler backend but absent from the pod snapshot).
> A key indicator of this is the following log message:
> "The executor with ID [some_id] was not found in the cluster but we didn't get a reason why. Marking the executor as failed. The executor may have been deleted but the driver missed the deletion event."
> So one problem is that missing-pod detection runs even when only a single pod has changed, without a full, consistent snapshot of all the pods (see ExecutorPodsPollingSnapshotSource). The other could be a race between the executor pod lifecycle manager and the scheduler backend.
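
The first problem described above can be illustrated with a small sketch. This is not Spark's code and not necessarily how the fix was implemented; `Snapshot`, its `full` flag, and `find_missing_executors` are hypothetical names invented for illustration. The idea shown is that missing-pod detection is skipped unless the snapshot is a complete listing of all pods.

```python
from dataclasses import dataclass
from typing import Dict, Set


@dataclass(frozen=True)
class Snapshot:
    """Hypothetical pod snapshot (illustration only, not a Spark class)."""
    pods: Dict[int, str]  # executor ID -> pod phase, e.g. "Running"
    full: bool            # True only when built from a complete pod listing


def find_missing_executors(known_ids: Set[int], snapshot: Snapshot) -> Set[int]:
    """Return executor IDs the scheduler knows about but the snapshot lacks.

    A snapshot built from a single-pod change says nothing about the other
    pods, so it is never used to declare an executor missing.
    """
    if not snapshot.full:
        return set()
    return set(known_ids) - set(snapshot.pods)


# The scheduler backend knows executors 1, 2 and 3.
known = {1, 2, 3}

# A change to pod 2 alone must not mark executors 1 and 3 as missing.
partial = Snapshot(pods={2: "Running"}, full=False)
missing_from_partial = find_missing_executors(known, partial)

# A complete listing that lacks pod 3 does justify reporting it.
complete = Snapshot(pods={1: "Running", 2: "Running"}, full=True)
missing_from_complete = find_missing_executors(known, complete)
```

This sketch addresses only the partial-snapshot problem. It does nothing about the possible race between the lifecycle manager and the scheduler backend, which would need separate handling.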


