You are viewing a plain text version of this content. The canonical link for it is here.
Posted to yarn-issues@hadoop.apache.org by "Robert Kanter (JIRA)" <ji...@apache.org> on 2017/12/12 21:21:00 UTC
[jira] [Commented] (YARN-7645) TestContainerResourceUsage#testUsageAfterAMRestartWithMultipleContainers is flakey with FairScheduler

    [ https://issues.apache.org/jira/browse/YARN-7645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16288290#comment-16288290 ] 

Robert Kanter commented on YARN-7645:
-------------------------------------

When the test passes, we see this sequence of log messages:
{noformat}
2017-12-11 11:21:36,837 INFO  [AsyncDispatcher event handler] attempt.RMAppAttemptImpl (RMAppAttemptImpl.java:handle(919)) - appattempt_1513020094849_0001_000001 State change from SUBMITTED to SCHEDULED on event = ATTEMPT_ADDED
2017-12-11 11:21:36,837 DEBUG [AsyncDispatcher event handler] event.AsyncDispatcher (AsyncDispatcher.java:dispatch(188)) - Dispatching the event org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppNodeUpdateEvent.EventType: NODE_UPDATE
2017-12-11 11:21:36,837 DEBUG [AsyncDispatcher event handler] rmapp.RMAppImpl (RMAppImpl.java:handle(870)) - Processing event for application_1513020094849_0001 of type NODE_UPDATE
2017-12-11 11:21:36,837 DEBUG [AsyncDispatcher event handler] rmapp.RMAppImpl (RMAppImpl.java:processNodeUpdate(986)) - Received node update event:NODE_USABLE for node:127.0.0.1:1234 with state:RUNNING
2017-12-11 11:21:36,837 INFO  [Thread-1] resourcemanager.MockRM (MockRM.java:waitForState(283)) - App State is : ACCEPTED
2017-12-11 11:21:36,838 INFO  [Thread-1] resourcemanager.MockRM (MockRM.java:waitForState(357)) - Attempt State is : SCHEDULED
2017-12-11 11:21:36,838 INFO  [Thread-1] resourcemanager.MockRM (MockRM.java:launchAM(1168)) - Launch AM appattempt_1513020094849_0001_000001
2017-12-11 11:21:36,979 DEBUG [AsyncDispatcher event handler] event.AsyncDispatcher (AsyncDispatcher.java:dispatch(188)) - Dispatching the event org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeStatusEvent.EventType: STATUS_UPDATE
2017-12-11 11:21:36,979 DEBUG [Thread-1] fair.FSLeafQueue (FSLeafQueue.java:updateDemand(322)) - The updated demand for root.default is <memory:0, vCores:0>; the max is <memory:9223372036854775807, vCores:2147483647>
2017-12-11 11:21:36,979 DEBUG [Thread-1] fair.FSLeafQueue (FSLeafQueue.java:updateDemand(324)) - The updated fairshare for root.default is <memory:0, vCores:0>
2017-12-11 11:21:36,979 DEBUG [Thread-1] fair.FSParentQueue (FSParentQueue.java:updateDemand(133)) - Counting resource from root.default <memory:0, vCores:0>; Total resource demand for root now <memory:0, vCores:0>
2017-12-11 11:21:36,979 DEBUG [AsyncDispatcher event handler] rmnode.RMNodeImpl (RMNodeImpl.java:handle(666)) - Processing 127.0.0.1:1234 of type STATUS_UPDATE
2017-12-11 11:21:36,985 DEBUG [AsyncDispatcher event handler] event.AsyncDispatcher (AsyncDispatcher.java:dispatch(188)) - Dispatching the event org.apache.hadoop.yarn.server.resourcemanager.scheduler.event.NodeUpdateSchedulerEvent.EventType: NODE_UPDATE
2017-12-11 11:21:36,986 DEBUG [Thread-1] fair.FSLeafQueue (FSLeafQueue.java:updateDemand(322)) - The updated demand for root.user is <memory:1024, vCores:1>; the max is <memory:9223372036854775807, vCores:2147483647>
2017-12-11 11:21:36,986 DEBUG [Thread-1] fair.FSLeafQueue (FSLeafQueue.java:updateDemand(324)) - The updated fairshare for root.user is <memory:10240, vCores:0>
2017-12-11 11:21:36,986 DEBUG [Thread-1] fair.FSParentQueue (FSParentQueue.java:updateDemand(133)) - Counting resource from root.user <memory:1024, vCores:1>; Total resource demand for root now <memory:1024, vCores:1>
2017-12-11 11:21:36,986 DEBUG [Thread-1] fair.FSParentQueue (FSParentQueue.java:updateDemand(144)) - The updated demand for root is <memory:1024, vCores:1>; the max is <memory:9223372036854775807, vCores:2147483647>
2017-12-11 11:21:36,986 DEBUG [Thread-1] fair.FSQueue (FSQueue.java:setFairShare(293)) - The updated fairShare for root is <memory:10240, vCores:10>
2017-12-11 11:21:36,987 DEBUG [AsyncDispatcher event handler] scheduler.AbstractYarnScheduler (AbstractYarnScheduler.java:nodeUpdate(1083)) - nodeUpdate: 127.0.0.1:1234 cluster capacity: <memory:10240, vCores:10>
2017-12-11 11:21:36,987 DEBUG [AsyncDispatcher event handler] scheduler.AbstractYarnScheduler (AbstractYarnScheduler.java:nodeUpdate(1116)) - Node being looked for scheduling 127.0.0.1:1234 availableResource: <memory:10240, vCores:10>
2017-12-11 11:21:36,988 DEBUG [AsyncDispatcher event handler] fair.FSLeafQueue (FSLeafQueue.java:assignContainer(333)) - Node 127.0.0.1 offered to queue: root.user fairShare: <memory:10240, vCores:0>
2017-12-11 11:21:37,049 DEBUG [AsyncDispatcher event handler] scheduler.AppSchedulingInfo (AppSchedulingInfo.java:updateMetricsForAllocatedContainer(589)) - allocate: applicationId=application_1513020094849_0001 container=container_1513020094849_0001_01_000001 host=127.0.0.1:1234 user=user resource=<memory:1024, vCores:1> type=OFF_SWITCH
{noformat}
When it fails, we see this:
{noformat}
2017-12-08 11:58:46,248 INFO  [AsyncDispatcher event handler] attempt.RMAppAttemptImpl (RMAppAttemptImpl.java:handle(919)) - appattempt_1512763125850_0001_000001 State change from SUBMITTED to SCHEDULED on event = ATTEMPT_ADDED
2017-12-08 11:58:46,249 DEBUG [AsyncDispatcher event handler] event.AsyncDispatcher (AsyncDispatcher.java:dispatch(188)) - Dispatching the event org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppNodeUpdateEvent.EventType: NODE_UPDATE
2017-12-08 11:58:46,249 DEBUG [AsyncDispatcher event handler] rmapp.RMAppImpl (RMAppImpl.java:handle(870)) - Processing event for application_1512763125850_0001 of type NODE_UPDATE
2017-12-08 11:58:46,249 DEBUG [AsyncDispatcher event handler] rmapp.RMAppImpl (RMAppImpl.java:processNodeUpdate(986)) - Received node update event:NODE_USABLE for node:127.0.0.1:1234 with state:RUNNING
2017-12-08 11:58:46,254 INFO  [Thread-2] resourcemanager.MockRM (MockRM.java:waitForState(283)) - App State is : ACCEPTED
2017-12-08 11:58:46,254 INFO  [Thread-2] resourcemanager.MockRM (MockRM.java:waitForState(357)) - Attempt State is : SCHEDULED
2017-12-08 11:58:46,254 INFO  [Thread-2] resourcemanager.MockRM (MockRM.java:launchAM(1168)) - Launch AM appattempt_1512763125850_0001_000001
2017-12-08 11:58:46,280 DEBUG [AsyncDispatcher event handler] event.AsyncDispatcher (AsyncDispatcher.java:dispatch(188)) - Dispatching the event org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeStatusEvent.EventType: STATUS_UPDATE
2017-12-08 11:58:46,280 DEBUG [AsyncDispatcher event handler] rmnode.RMNodeImpl (RMNodeImpl.java:handle(666)) - Processing 127.0.0.1:1234 of type STATUS_UPDATE
2017-12-08 11:58:46,281 DEBUG [AsyncDispatcher event handler] event.AsyncDispatcher (AsyncDispatcher.java:dispatch(188)) - Dispatching the event org.apache.hadoop.yarn.server.resourcemanager.scheduler.event.NodeUpdateSchedulerEvent.EventType: NODE_UPDATE
2017-12-08 11:58:46,281 DEBUG [AsyncDispatcher event handler] scheduler.AbstractYarnScheduler (AbstractYarnScheduler.java:nodeUpdate(1083)) - nodeUpdate: 127.0.0.1:1234 cluster capacity: <memory:10240, vCores:10>
2017-12-08 11:58:46,282 DEBUG [AsyncDispatcher event handler] scheduler.AbstractYarnScheduler (AbstractYarnScheduler.java:nodeUpdate(1116)) - Node being looked for scheduling 127.0.0.1:1234 availableResource: <memory:10240, vCores:10>
2017-12-08 11:58:46,282 DEBUG [AsyncDispatcher event handler] fair.FSLeafQueue (FSLeafQueue.java:assignContainer(333)) - Node 127.0.0.1 offered to queue: root.default fairShare: <memory:0, vCores:0>
2017-12-08 11:58:46,282 DEBUG [AsyncDispatcher event handler] fair.FSLeafQueue (FSLeafQueue.java:assignContainer(333)) - Node 127.0.0.1 offered to queue: root.user fairShare: <memory:10240, vCores:0>
2017-12-08 11:58:46,283 DEBUG [Thread-2] fair.FSLeafQueue (FSLeafQueue.java:updateDemand(322)) - The updated demand for root.default is <memory:0, vCores:0>; the max is <memory:9223372036854775807, vCores:2147483647>
2017-12-08 11:58:46,283 DEBUG [Thread-2] fair.FSLeafQueue (FSLeafQueue.java:updateDemand(324)) - The updated fairshare for root.default is <memory:0, vCores:0>
2017-12-08 11:58:46,283 DEBUG [Thread-2] fair.FSParentQueue (FSParentQueue.java:updateDemand(133)) - Counting resource from root.default <memory:0, vCores:0>; Total resource demand for root now <memory:0, vCores:0>
2017-12-08 11:58:46,284 DEBUG [Thread-2] fair.FSLeafQueue (FSLeafQueue.java:updateDemand(322)) - The updated demand for root.user is <memory:1024, vCores:1>; the max is <memory:9223372036854775807, vCores:2147483647>
2017-12-08 11:58:46,284 DEBUG [Thread-2] fair.FSLeafQueue (FSLeafQueue.java:updateDemand(324)) - The updated fairshare for root.user is <memory:10240, vCores:0>
2017-12-08 11:58:46,284 DEBUG [Thread-2] fair.FSParentQueue (FSParentQueue.java:updateDemand(133)) - Counting resource from root.user <memory:1024, vCores:1>; Total resource demand for root now <memory:1024, vCores:1>
2017-12-08 11:58:46,284 DEBUG [Thread-2] fair.FSParentQueue (FSParentQueue.java:updateDemand(144)) - The updated demand for root is <memory:1024, vCores:1>; the max is <memory:9223372036854775807, vCores:2147483647>
2017-12-08 11:58:46,285 DEBUG [Thread-2] fair.FSQueue (FSQueue.java:setFairShare(293)) - The updated fairShare for root is <memory:10240, vCores:10>
2017-12-08 11:58:46,287 DEBUG [Thread-2] fair.FSLeafQueue (FSLeafQueue.java:updateDemand(322)) - The updated demand for root.default is <memory:0, vCores:0>; the max is <memory:9223372036854775807, vCores:2147483647>
2017-12-08 11:58:46,287 DEBUG [Thread-2] fair.FSLeafQueue (FSLeafQueue.java:updateDemand(324)) - The updated fairshare for root.default is <memory:0, vCores:0>
2017-12-08 11:58:46,287 DEBUG [Thread-2] fair.FSParentQueue (FSParentQueue.java:updateDemand(133)) - Counting resource from root.default <memory:0, vCores:0>; Total resource demand for root now <memory:0, vCores:0>
2017-12-08 11:58:46,287 DEBUG [Thread-2] fair.FSLeafQueue (FSLeafQueue.java:updateDemand(322)) - The updated demand for root.user is <memory:1024, vCores:1>; the max is <memory:9223372036854775807, vCores:2147483647>
2017-12-08 11:58:46,287 DEBUG [Thread-2] fair.FSLeafQueue (FSLeafQueue.java:updateDemand(324)) - The updated fairshare for root.user is <memory:10240, vCores:0>
2017-12-08 11:58:46,288 DEBUG [Thread-2] fair.FSParentQueue (FSParentQueue.java:updateDemand(133)) - Counting resource from root.user <memory:1024, vCores:1>; Total resource demand for root now <memory:1024, vCores:1>
2017-12-08 11:58:46,288 DEBUG [Thread-2] fair.FSParentQueue (FSParentQueue.java:updateDemand(144)) - The updated demand for root is <memory:1024, vCores:1>; the max is <memory:9223372036854775807, vCores:2147483647>
2017-12-08 11:58:46,288 DEBUG [Thread-2] fair.FSQueue (FSQueue.java:setFairShare(293)) - The updated fairShare for root is <memory:10240, vCores:10>
2017-12-08 11:58:46,289 INFO  [Thread-2] resourcemanager.MockRM (MockRM.java:waitForState(351)) - AppAttempt : appattempt_1512763125850_0001_000001 State is : SCHEDULED Waiting for state : ALLOCATED
2017-12-08 11:58:46,299 INFO  [Thread-2] resourcemanager.MockRM (MockRM.java:waitForState(351)) - AppAttempt : appattempt_1512763125850_0001_000001 State is : SCHEDULED Waiting for state : ALLOCATED
...
{noformat}
In the bad case, it times out waiting for ALLOCATED.

There seems to be a race condition in {{MockRM#launchAM}} between the node heartbeat, which is called asynchronously, and the scheduler update, which is not.  We can easily fix this by adding another node heartbeat after the scheduler update to ensure that the scheduling actually happens.  This isn't a problem in a real cluster because nodes will be periodically heartbeating.

> TestContainerResourceUsage#testUsageAfterAMRestartWithMultipleContainers is flakey with FairScheduler
> -----------------------------------------------------------------------------------------------------
>
>                 Key: YARN-7645
>                 URL: https://issues.apache.org/jira/browse/YARN-7645
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: test
>    Affects Versions: 3.0.0
>            Reporter: Robert Kanter
>            Assignee: Robert Kanter
>
> We've noticed some flakiness in {{TestContainerResourceUsage#testUsageAfterAMRestartWithMultipleContainers}} when using {{FairScheduler}}:
> {noformat}
> java.lang.AssertionError: Attempt state is not correct (timeout). expected:<ALLOCATED> but was:<SCHEDULED>
> 	at org.apache.hadoop.yarn.server.resourcemanager.TestContainerResourceUsage.amRestartTests(TestContainerResourceUsage.java:275)
> 	at org.apache.hadoop.yarn.server.resourcemanager.TestContainerResourceUsage.testUsageAfterAMRestartWithMultipleContainers(TestContainerResourceUsage.java:254)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: yarn-issues-help@hadoop.apache.org