You are viewing a plain text version of this content. The canonical link for it is here.
Posted to yarn-issues@hadoop.apache.org by "Hadoop QA (Jira)" <ji...@apache.org> on 2020/05/29 15:17:00 UTC
[jira] [Commented] (YARN-10295) CapacityScheduler NPE can cause
apps to get stuck without resources
[ https://issues.apache.org/jira/browse/YARN-10295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17119698#comment-17119698 ]
Hadoop QA commented on YARN-10295:
----------------------------------
| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 0s{color} | {color:blue} Docker mode activated. {color} |
| {color:red}-1{color} | {color:red} docker {color} | {color:red} 16m 1s{color} | {color:red} Docker failed to build yetus/hadoop:a6371bfdb8c. {color} |
\\
\\
|| Subsystem || Report/Notes ||
| JIRA Issue | YARN-10295 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/13004349/YARN-10295.001.branch-3.1.patch |
| Console output | https://builds.apache.org/job/PreCommit-YARN-Build/26084/console |
| versions | git=2.17.1 |
| Powered by | Apache Yetus 0.12.0 https://yetus.apache.org |
This message was automatically generated.
> CapacityScheduler NPE can cause apps to get stuck without resources
> -------------------------------------------------------------------
>
> Key: YARN-10295
> URL: https://issues.apache.org/jira/browse/YARN-10295
> Project: Hadoop YARN
> Issue Type: Bug
> Components: capacityscheduler
> Affects Versions: 3.1.0, 3.2.0
> Reporter: Benjamin Teke
> Assignee: Benjamin Teke
> Priority: Major
> Attachments: YARN-10295.001.branch-3.1.patch, YARN-10295.001.branch-3.2.patch
>
>
> When the CapacityScheduler Asynchronous scheduling is enabled there is an edge-case where a NullPointerException can cause the scheduler thread to exit and the apps to get stuck without allocated resources. Consider the following log:
> {code:java}
> 2020-05-27 10:13:49,106 INFO fica.FiCaSchedulerApp (FiCaSchedulerApp.java:apply(681)) - Reserved container=container_e10_1590502305306_0660_01_000115, on node=host: ctr-e148-1588963324989-31443-01-000002.hwx.site:25454 #containers=14 available=<memory:2048, vCores:11> used=<memory:182272, vCores:14> with resource=<memory:4096, vCores:1>
> 2020-05-27 10:13:49,134 INFO fica.FiCaSchedulerApp (FiCaSchedulerApp.java:internalUnreserve(743)) - Application application_1590502305306_0660 unreserved on node host: ctr-e148-1588963324989-31443-01-000002.hwx.site:25454 #containers=14 available=<memory:2048, vCores:11> used=<memory:182272, vCores:14>, currently has 0 at priority 11; currentReservation <memory:0, vCores:0> on node-label=
> 2020-05-27 10:13:49,134 INFO capacity.CapacityScheduler (CapacityScheduler.java:tryCommit(3042)) - Allocation proposal accepted
> 2020-05-27 10:13:49,163 ERROR yarn.YarnUncaughtExceptionHandler (YarnUncaughtExceptionHandler.java:uncaughtException(68)) - Thread Thread[Thread-4953,5,main] threw an Exception.
> java.lang.NullPointerException
> at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainerOnSingleNode(CapacityScheduler.java:1580)
> at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1767)
> at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1505)
> at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.schedule(CapacityScheduler.java:546)
> at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler$AsyncScheduleThread.run(CapacityScheduler.java:593)
> {code}
> A container gets allocated on a host, but the host doesn't have enough memory, so after a short while it gets unreserved. However because the scheduler thread is running asynchronously it might have entered into the following if block located in [CapacityScheduler.java#L1602|https://github.com/apache/hadoop/blob/7136ebbb7aa197717619c23a841d28f1c46ad40b/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacityScheduler.java#L1602], because at the time _node.getReservedContainer()_ wasn't null. Calling it a second time for getting the ApplicationAttemptId would be an NPE, as the container got unreserved in the meantime.
> {code:java}
> // Do not schedule if there are any reservations to fulfill on the node
> if (node.getReservedContainer() != null) {
> if (LOG.isDebugEnabled()) {
> LOG.debug("Skipping scheduling since node " + node.getNodeID()
> + " is reserved by application " + node.getReservedContainer()
> .getContainerId().getApplicationAttemptId());
> }
> return null;
> }
> {code}
> A fix would be to store the container object before the if block.
> Only branch-3.1/3.2 is affected, because the newer branches have YARN-9664 which indirectly fixed this.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: yarn-issues-help@hadoop.apache.org