Posted to yarn-issues@hadoop.apache.org by "Qi Zhu (Jira)" <ji...@apache.org> on 2021/03/16 02:58:00 UTC

[jira] [Updated] (YARN-10616) Nodemanagers cannot detect GPU failures

     [ https://issues.apache.org/jira/browse/YARN-10616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Qi Zhu updated YARN-10616:
--------------------------
        Parent: YARN-10690
    Issue Type: Sub-task  (was: Bug)

> Nodemanagers cannot detect GPU failures
> ---------------------------------------
>
>                 Key: YARN-10616
>                 URL: https://issues.apache.org/jira/browse/YARN-10616
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>            Reporter: Eric Badger
>            Assignee: Eric Badger
>            Priority: Major
>
> As stated above, the bug is that GPUs can fail, but the NM doesn't notice the failure. The NM will continue to schedule tasks onto the failed GPU, but the GPU won't actually work and so the container will likely fail or run very slowly on the CPU. 
> My initial thought on solving this is to add NM resource capabilities to the NM-RM heartbeat and have the RM update its view of the NM's resource capabilities on each heartbeat. This would be a fairly trivial change, but it comes with the unfortunate side effect that it completely undermines {{yarn rmadmin -updateNodeResource}}. When you run {{-updateNodeResource}}, the assumption is that the node will retain these new resource capabilities until either the NM or RM is restarted. But with the heartbeat constantly updating those resource capabilities from the NM's perspective, the explicit changes made via {{-updateNodeResource}} would be lost on the next heartbeat. We could potentially add a flag to ignore the heartbeat updates for any node that has had {{-updateNodeResource}} called on it (until a re-registration). But in that case, the node would no longer get resource capability updates until the NM or RM restarted. If {{-updateNodeResource}} is used a decent amount, that would lead to potentially unexpected behavior, where some nodes auto-detect failures and others silently don't.
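> To make the heartbeat idea concrete, below is a minimal sketch of the RM-side reconcile rule. It uses simplified stand-in types rather than the real ResourceTrackerService/RMNode code, and the {{adminOverridden}} flag is a hypothetical marker for nodes that have had {{-updateNodeResource}} run against them:
> {code:java}
> import java.util.Map;
> import java.util.concurrent.ConcurrentHashMap;
>
> public class HeartbeatCapabilitySketch {
>
>   /** Minimal stand-in for the RM's per-node resource bookkeeping. */
>   static class NodeView {
>     volatile int totalGpus;            // what the RM currently schedules against
>     volatile boolean adminOverridden;  // set when -updateNodeResource was used
>   }
>
>   private final Map<String, NodeView> nodes = new ConcurrentHashMap<>();
>
>   /** Called on every NM heartbeat with the GPU count the NM just detected. */
>   public void onHeartbeat(String nodeId, int reportedGpus) {
>     NodeView node = nodes.computeIfAbsent(nodeId, id -> new NodeView());
>     if (node.adminOverridden) {
>       // An admin pinned this node's resources with -updateNodeResource,
>       // so skip auto-updates until the NM re-registers. This is exactly
>       // the awkward interaction described above.
>       return;
>     }
>     if (node.totalGpus != reportedGpus) {
>       // e.g. a GPU fell off the bus and the NM now reports 7 instead of 8;
>       // shrinking the total keeps the scheduler from handing out the dead GPU.
>       node.totalGpus = reportedGpus;
>     }
>   }
> }
> {code}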
> Another idea is to add a GPU monitor thread on the NM that periodically runs {{nvidia-smi}} and detects changes in the number of healthy GPUs. If that number decreased, the NM would hook into the health check status and mark the node as unhealthy. The downside of this approach is that a single failed GPU would take out an entire node (e.g. all 8 GPUs on it).
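> As a strawman for the monitor-thread idea, here is a rough sketch. The {{markNodeUnhealthy}} hook is a placeholder for however this would plug into the NM's existing health status reporting, and the polling interval is arbitrary:
> {code:java}
> import java.io.BufferedReader;
> import java.io.InputStreamReader;
> import java.util.concurrent.Executors;
> import java.util.concurrent.ScheduledExecutorService;
> import java.util.concurrent.TimeUnit;
>
> public class GpuHealthMonitor {
>
>   private final int expectedGpus;   // discovered once at NM startup
>   private final ScheduledExecutorService scheduler =
>       Executors.newSingleThreadScheduledExecutor();
>
>   public GpuHealthMonitor(int expectedGpus) {
>     this.expectedGpus = expectedGpus;
>   }
>
>   public void start() {
>     scheduler.scheduleAtFixedRate(this::checkGpus, 0, 60, TimeUnit.SECONDS);
>   }
>
>   private void checkGpus() {
>     try {
>       // One line of output per GPU that the driver can still see.
>       Process p = new ProcessBuilder(
>           "nvidia-smi", "--query-gpu=index", "--format=csv,noheader").start();
>       int visible = 0;
>       try (BufferedReader r = new BufferedReader(
>           new InputStreamReader(p.getInputStream()))) {
>         while (r.readLine() != null) {
>           visible++;
>         }
>       }
>       p.waitFor();
>       if (visible < expectedGpus) {
>         markNodeUnhealthy(visible + " of " + expectedGpus
>             + " GPUs visible to nvidia-smi");
>       }
>     } catch (Exception e) {
>       markNodeUnhealthy("nvidia-smi could not be run: " + e.getMessage());
>     }
>   }
>
>   private void markNodeUnhealthy(String reason) {
>     // Placeholder hook: a real patch would feed this into the NM's health
>     // status so the RM stops scheduling onto the node.
>     System.err.println("Marking node unhealthy: " + reason);
>   }
> }
> {code}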
> I would really like to go with the NM-RM heartbeat approach, but the {{-updateNodeResource}} issue bothers me. The second approach is okay, I guess, but I also don't like taking down whole GPU nodes when only a single GPU is bad. I would like to hear others' thoughts on how best to approach this.
> [~jhung], [~leftnoteasy], [~sunilg], [~epayne], [~Jim_Brennan]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: yarn-issues-help@hadoop.apache.org