Posted to issues@yunikorn.apache.org by "Weiwei Yang (Jira)" <ji...@apache.org> on 2021/03/20 05:24:00 UTC

[jira] [Commented] (YUNIKORN-588) Placeholder pods are not cleaned up timely when the Spark driver fails

    [ https://issues.apache.org/jira/browse/YUNIKORN-588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17305305#comment-17305305 ] 

Weiwei Yang commented on YUNIKORN-588:
--------------------------------------

hi [~yuchaoran2011], have you tried deleting the spark application CRD to see whether the placeholders get deleted?
You are right, generally the placeholder lifecycle should be decoupled from the operator plugins. This has been implemented in a general way in YUNIKORN-521. I think the clean-up will be triggered when you delete either the driver pod or the sparkApplication CRD. This leverages the owner reference for garbage collection.
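
For reference, here is a minimal Go sketch of the owner-reference mechanism described above, using client-go/apimachinery types. The function name is illustrative and is not the actual YUNIKORN-521 code:

    package placeholders

    import (
        v1 "k8s.io/api/core/v1"
        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    )

    // addOwnerReference records the driver pod as the owner of a placeholder
    // pod. Once the driver is deleted (directly, or via cascade from the
    // SparkApplication CRD), the Kubernetes garbage collector deletes the
    // placeholder as well, with no operator-plugin involvement.
    func addOwnerReference(placeholder, driver *v1.Pod) {
        placeholder.OwnerReferences = append(placeholder.OwnerReferences,
            metav1.OwnerReference{
                APIVersion: "v1",
                Kind:       "Pod",
                Name:       driver.Name,
                UID:        driver.UID,
            })
    }

Note that this only helps once the driver pod object is actually removed: a failed driver left in the Failed state still exists, so it does not trigger the garbage collector, which matches the behaviour reported in the description below.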

> Placeholder pods are not cleaned up timely when the Spark driver fails
> ----------------------------------------------------------------------
>
>                 Key: YUNIKORN-588
>                 URL: https://issues.apache.org/jira/browse/YUNIKORN-588
>             Project: Apache YuniKorn
>          Issue Type: Bug
>          Components: shim - kubernetes
>    Affects Versions: 0.10
>            Reporter: Chaoran Yu
>            Priority: Major
>              Labels: spark
>
> When a Spark job is gang scheduled, if the driver pod fails immediately upon running (e.g. due to an error in the Spark application code), the placeholder pods will still try to reserve resources. They won't be terminated until the configured timeout has passed (see the sketch after this description), even though they should have been cleaned up the moment the driver failed: at that point we already know that none of the executors will have a chance to start.
>  Something probably needs to be done at the Spark operator plugin level to trigger placeholder cleanup and release the resources sooner.
> Edit: Actually, a fix needs to work without the Spark operator plugin, because the user might not be using it. The Spark job could well have been submitted via spark-submit.
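
For context on the timeout mentioned in the description, here is a hedged Go sketch of how a gang-scheduled driver pod declares its task groups and the placeholder timeout. The annotation names follow the current YuniKorn gang-scheduling documentation and may not match what 0.10 shipped; all values are illustrative:

    package placeholders

    import (
        v1 "k8s.io/api/core/v1"
        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    )

    // exampleDriverPod builds a driver pod carrying gang-scheduling hints:
    // the task group the driver itself belongs to, the executor task group
    // to pre-reserve capacity (placeholders) for, and the timeout after
    // which unused placeholders are reaped.
    func exampleDriverPod() *v1.Pod {
        return &v1.Pod{
            ObjectMeta: metav1.ObjectMeta{
                Name: "spark-driver",
                Annotations: map[string]string{
                    "yunikorn.apache.org/task-group-name": "spark-driver",
                    "yunikorn.apache.org/task-groups": `[
                        {"name": "spark-executor", "minMember": 4,
                         "minResource": {"cpu": "1", "memory": "2Gi"}}]`,
                    // Until YUNIKORN-588 is fixed, this timeout is the only
                    // thing that frees placeholders after a driver failure.
                    "yunikorn.apache.org/schedulingPolicyParameters": "placeholderTimeoutInSeconds=60",
                },
            },
        }
    }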



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@yunikorn.apache.org
For additional commands, e-mail: issues-help@yunikorn.apache.org