Posted to yarn-issues@hadoop.apache.org by "Zhanqi Cai (Jira)" <ji...@apache.org> on 2021/04/16 11:59:00 UTC

[jira] [Updated] (YARN-10739) GenericEventHandler.printEventQueueDetails cause RM recovery cost too much time

     [ https://issues.apache.org/jira/browse/YARN-10739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zhanqi Cai updated YARN-10739:
------------------------------
    Attachment: Queue_Details.patch

> GenericEventHandler.printEventQueueDetails cause RM recovery cost too much time
> -------------------------------------------------------------------------------
>
>                 Key: YARN-10739
>                 URL: https://issues.apache.org/jira/browse/YARN-10739
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 3.4.0, 3.3.1, 3.2.3
>            Reporter: Zhanqi Cai
>            Priority: Critical
>         Attachments: YARN-10739-001.patch
>
>
> YARN-10642 added GenericEventHandler.printEventQueueDetails to AsyncDispatcher. When the event queue is very large, printEventQueueDetails takes too long and the RM becomes slow to process events.
> For example:
> Suppose the cluster has 4K nodes and 4K running apps. After an RM switch, every NodeManager registers with the RM again, and for each registering node the RM calls NodesListManager to send an RMAppNodeUpdateEvent to every app, with code like the following:
> for (RMApp app : rmContext.getRMApps().values()) {
>   if (!app.isAppFinalStateStored()) {
>     // One event per app whose final state is not yet stored.
>     this.rmContext
>         .getDispatcher()
>         .getEventHandler()
>         .handle(
>             new RMAppNodeUpdateEvent(app.getApplicationId(), eventNode,
>                 appNodeUpdateType));
>   }
> }
> So the total number of events is 4K * 4K = 16 million (1600W). During this window, GenericEventHandler.printEventQueueDetails is called frequently to print the event queue details. Once the queue size exceeds 1 million (100W+) events, iterating over the queue in printEventQueueDetails becomes very slow, as shown below:
> private void printEventQueueDetails() {
>   Iterator<Event> iterator = eventQueue.iterator();
>   Map<Enum, Long> counterMap = new HashMap<>();
>   // Walks every queued event on each call.
>   while (iterator.hasNext()) {
>     Enum eventType = iterator.next().getType();
>     ...
> As a result, RM recovery takes far too long.
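
The cost described above comes from scanning the whole queue each time the per-type counts are printed. One possible way to avoid that scan, sketched below, is to keep per-type counters that are updated on enqueue and dequeue, so a report costs time proportional to the number of event types rather than the queue size. This is only an illustration and is not taken from the attached patch: DemoEventType and CountingEventQueue are hypothetical names, and the queue holds bare enum values instead of real YARN Event objects.

```java
import java.util.EnumMap;
import java.util.Map;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.atomic.AtomicLongArray;

// Hypothetical stand-in for YARN event types; not the real enums.
enum DemoEventType { NODE_UPDATE, APP_ATTEMPT, OTHER }

// Hypothetical queue wrapper that tracks per-type counts incrementally.
class CountingEventQueue {
    private final BlockingQueue<DemoEventType> queue = new LinkedBlockingQueue<>();
    private final AtomicLongArray counts =
        new AtomicLongArray(DemoEventType.values().length);

    void put(DemoEventType type) {
        // Count first so a concurrent poll() never decrements before the increment.
        counts.incrementAndGet(type.ordinal());
        queue.add(type);
    }

    DemoEventType poll() {
        DemoEventType type = queue.poll();
        if (type != null) {
            counts.decrementAndGet(type.ordinal());
        }
        return type;
    }

    // Cost is proportional to the number of event types, not the queue size.
    Map<DemoEventType, Long> snapshotCounts() {
        Map<DemoEventType, Long> result = new EnumMap<>(DemoEventType.class);
        for (DemoEventType type : DemoEventType.values()) {
            long n = counts.get(type.ordinal());
            if (n > 0) {
                result.put(type, n);
            }
        }
        return result;
    }

    // Same shape as the loop in the report: visits every queued element.
    Map<DemoEventType, Long> scanCounts() {
        Map<DemoEventType, Long> result = new EnumMap<>(DemoEventType.class);
        for (DemoEventType type : queue) {
            result.merge(type, 1L, Long::sum);
        }
        return result;
    }

    int size() {
        return queue.size();
    }
}
```

With a single thread, snapshotCounts() and scanCounts() return the same map; under concurrent producers and consumers both are only approximate snapshots, which is usually acceptable for a diagnostic log line.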



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: yarn-issues-help@hadoop.apache.org