Posted to yarn-issues@hadoop.apache.org by "LongGang Chen (JIRA)" <ji...@apache.org> on 2018/05/24 08:55:00 UTC

[jira] [Created] (YARN-8355) container update error because of competition

LongGang Chen created YARN-8355:
-----------------------------------

             Summary: container update error because of competition
                 Key: YARN-8355
                 URL: https://issues.apache.org/jira/browse/YARN-8355
             Project: Hadoop YARN
          Issue Type: Bug
          Components: RM
    Affects Versions: 3.0.x
            Reporter: LongGang Chen


First, a quick walk through the update logic, using an increase as the example:

step 1: normal processing in ApplicationMasterService and DefaultAMSProcessor.
step 2: CapacityScheduler.allocate calls AbstractYarnScheduler.handleContainerUpdates.
step 3: AbstractYarnScheduler.handleContainerUpdates calls handleIncreaseRequests, which calls ContainerUpdateContext.checkAndAddToOutstandingIncreases.
step 4: cancel and add: checkAndAddToOutstandingIncreases checks for an existing increase update for this container. If there is an outstanding increase, it cancels it by calling appSchedulingInfo.allocate(...) to allocate a dummy container. If the update is a fresh one, it calls appSchedulingInfo.updateResourceRequests to add a new request. The capacity of this new request is the gap between the capacity of the existing rmContainer and the capacity in the UpdateRequest. For example, if the original capacity is <memory:10GB> and the target capacity of the UpdateRequest is <memory:20GB>, the gap (the capacity of the new request added to appSchedulingInfo) is <memory:10GB>.

step 5: swap the temp container and the existing container: CapacityScheduler.allocate calls FiCaSchedulerApp.getAllocation(...), getAllocation calls SchedulerApplicationAttempt.pullNewlyIncreasedContainers, which calls ContainerUpdateContext.swapContainer. swapContainer swaps the newly allocated temp increase container with the existing container. For example, with an original capacity of <memory:10GB> and a temp increase container of <memory:10GB>, the updated existing container has capacity <memory:10+10=20GB>, and the increase update is done.
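
The arithmetic in steps 4 and 5 can be sketched as below. This is a simplified model, not the YARN code: the class IncreaseModel and the methods gapMb and swapMb are hypothetical names, and a resource is reduced to a single memory value in MB.

```java
// Simplified model of steps 4 and 5; these are not the actual YARN classes.
final class IncreaseModel {
    private IncreaseModel() {}

    // Step 4: the new request asks only for the difference between the
    // target capacity and what the existing container already holds.
    static long gapMb(long existingMb, long targetMb) {
        if (targetMb <= existingMb) {
            throw new IllegalArgumentException(
                "not an increase: " + existingMb + " -> " + targetMb);
        }
        return targetMb - existingMb;
    }

    // Step 5: the swap folds the temp container's capacity into the
    // existing container.
    static long swapMb(long existingMb, long tempMb) {
        return existingMb + tempMb;
    }

    public static void main(String[] args) {
        long existing = 10 * 1024L;          // <memory:10GB>
        long target = 20 * 1024L;            // <memory:20GB>
        long temp = gapMb(existing, target); // capacity of the temp container
        System.out.println("gap=" + temp + " updated=" + swapMb(existing, temp));
    }
}
```

Note that the swap in this model is unconditional: it never looks at the target that was requested, which is what the scenarios below exploit.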

The problem:
If we send an increase update twice for the same container, for example an increase from <memory:10> to <memory:12> followed by an increase with a new target of <memory:14>, the final updated capacity is nondeterministic.

Scenario one:
1: send an increase update from <memory:10> to <memory:12>.
2: the scheduler approves it and commits it, so app.liveContainers now contains the temp increase container with capacity <memory:2>.
3: send an increase with the new target <memory:14>. A new resourceRequest with capacity <memory:4> is added to appSchedulingInfo, and the first temp container <memory:2> is swapped; after that, the existing container has the new capacity <memory:12>.
4: the scheduler approves the second temp resourceRequest and allocates the second temp container with capacity <memory:4>.
5: the second temp increase container is swapped, so the updated capacity of the existing container is <memory:4+12> = <memory:16>, but the wanted capacity is <memory:14>.
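
The event order of scenario one can be replayed with plain integers. This is only an illustrative sketch under the assumption that a swap simply adds the temp container's capacity to the existing container; ScenarioOneReplay and replay are hypothetical names.

```java
// Replays scenario one with an unconditional swap (hypothetical model).
final class ScenarioOneReplay {
    private ScenarioOneReplay() {}

    static int replay() {
        int capacity = 10;

        // 1-2: first update, target 12; a temp container of 2 is committed.
        int firstTemp = 12 - capacity;

        // 3: second update, target 14. The gap is computed while the
        // existing container still has capacity 10, so the request is for 4.
        int secondTemp = 14 - capacity;

        // 3 (continued): the first temp container is swapped in.
        capacity += firstTemp;   // now 12

        // 4-5: the second temp container is allocated and swapped in.
        capacity += secondTemp;  // now 16, although the wanted target is 14

        return capacity;
    }

    public static void main(String[] args) {
        System.out.println("final capacity = " + replay());
    }
}
```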

Scenario two:
1: send an increase update from <memory:10> to <memory:12>.
2: the scheduler approves it, but the temp container with capacity <memory:2> is queued in commitService, waiting to be committed.
3: send an increase with the new target <memory:14>. This adds a new resourceRequest to appSchedulingInfo, but with the same SchedulerRequestKey.
4: the first temp container is committed. app.apply calls appSchedulingInfo.allocate to reduce the pending count, which in this situation cancels the second increase request.
5: the first temp increase container is swapped. The updated capacity of the existing container is <memory:12>, but the wanted capacity is <memory:14>.
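
Scenario two can be illustrated with a toy pending-request counter. This is a sketch under stated assumptions, not the appSchedulingInfo implementation: the class ScenarioTwoReplay, its methods, and the string key are hypothetical, and the second update is modelled as a cancel followed by a re-add under the same key, as described in step 4 of the update logic.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model: pending increase requests counted per request key.
final class ScenarioTwoReplay {
    private final Map<String, Integer> pending = new HashMap<>();

    void addRequest(String key) {
        pending.merge(key, 1, Integer::sum);
    }

    // Decrements the pending count for the key, never below zero.
    void allocate(String key) {
        int n = pendingFor(key);
        if (n > 0) {
            pending.put(key, n - 1);
        }
    }

    int pendingFor(String key) {
        return pending.getOrDefault(key, 0);
    }

    // Returns {final capacity, pending requests left for the key}.
    static int[] replay() {
        ScenarioTwoReplay app = new ScenarioTwoReplay();
        String key = "increase-key-for-container"; // hypothetical shared key
        int capacity = 10;

        app.addRequest(key);   // 1: first update, target 12
        int firstTemp = 2;     // 2: approved, still queued for commit
        app.allocate(key);     // 3: second update cancels the outstanding request
        app.addRequest(key);   //    and adds a new one (target 14), same key
        app.allocate(key);     // 4: first temp commits and decrements the same key
        capacity += firstTemp; // 5: swap of the first temp container

        return new int[] {capacity, app.pendingFor(key)};
    }

    public static void main(String[] args) {
        int[] r = replay();
        System.out.println("capacity=" + r[0] + " pending=" + r[1]);
    }
}
```

Because both requests share one key, the commit of the first temp container cannot be told apart from an allocation for the second request, so the second request disappears and the capacity stops at 12.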

Two key points:
1: when ContainerUpdateContext.checkAndAddToOutstandingIncreases cancels the previous increase and adds the current increase request, it uses the same SchedulerRequestKey as before. This races with app.apply: as in scenario two, app.apply cancels the request of the second increase update.

2: ContainerUpdateContext.swapContainer does not check whether the update target has changed.

How to fix:
1: after ContainerUpdateContext.checkAndAddToOutstandingIncreases cancels the previous increase update, use a new SchedulerRequestKey for the current increase update. We can add a new field, createTime, to distinguish them; the default value of createTime is 0.
2: change ContainerUpdateContext.swapContainer to checkAndSwapContainer, which checks whether the update target has changed; if it has, it ignores the temp container and releases it. In scenario one, when we are about to swap the first temp increase container, we find that the swap would give an updated capacity of <memory:12> while the new target capacity is <memory:14>, so we skip the swap and release the temp container <memory:2>.
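
Both proposed fixes can be sketched in a few lines. This is a minimal illustration, not a patch: IncreaseFixSketch, IncreaseKey, and checkAndSwap are hypothetical names, the real SchedulerRequestKey has other fields that are not modelled here, and detecting a changed target by comparing the post-swap capacity with the current target is an assumption about how the check could be done.

```java
import java.util.Objects;

// Hypothetical sketch of the two proposed fixes.
final class IncreaseFixSketch {
    private IncreaseFixSketch() {}

    // Fix 1: a request key that also carries a createTime, so a re-issued
    // increase for the same container gets a key distinct from the old one.
    static final class IncreaseKey {
        final String containerId;
        final long createTime; // 0 by default, per the proposal

        IncreaseKey(String containerId, long createTime) {
            this.containerId = containerId;
            this.createTime = createTime;
        }

        @Override
        public boolean equals(Object o) {
            if (this == o) return true;
            if (!(o instanceof IncreaseKey)) return false;
            IncreaseKey other = (IncreaseKey) o;
            return containerId.equals(other.containerId)
                && createTime == other.createTime;
        }

        @Override
        public int hashCode() {
            return Objects.hash(containerId, createTime);
        }
    }

    // Fix 2: swap only if the result matches the current target; otherwise
    // keep the existing capacity (the caller would release the temp container).
    static int checkAndSwap(int existing, int temp, int currentTarget) {
        int swapped = existing + temp;
        return swapped == currentTarget ? swapped : existing;
    }

    public static void main(String[] args) {
        int capacity = 10;
        capacity = checkAndSwap(capacity, 2, 14); // stale temp container, ignored
        capacity = checkAndSwap(capacity, 4, 14); // matches the current target
        System.out.println("final capacity = " + capacity);
    }
}
```

With the check in place, replaying scenario one leaves the capacity at 10 after the stale <memory:2> container is ignored and reaches 14 once the <memory:4> container is swapped.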

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
