Posted to yarn-issues@hadoop.apache.org by "Szilard Nemeth (JIRA)" <ji...@apache.org> on 2018/08/02 15:27:00 UTC

[jira] [Comment Edited] (YARN-4946) RM should not consider an application as COMPLETED when log aggregation is not in a terminal state

    [ https://issues.apache.org/jira/browse/YARN-4946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16566890#comment-16566890 ] 

Szilard Nemeth edited comment on YARN-4946 at 8/2/18 3:26 PM:
--------------------------------------------------------------

DEV NOTES: 
An initial implementation could have looked like this:
The very first line of the transition would check whether log aggregation has finished.
If it has not, do nothing and break out of the method.

To make sure apps still become completed once log aggregation has finished, the APP_COMPLETED event needs to be dispatched when log aggregation finishes.
In my understanding, this is the sequence of events:
1. RM receives NM heartbeat in ResourceTrackerService.nodeUpdate
2. An RmNodeEvent is created with type STATUS_UPDATE
3. RmNodeImpl.StatusUpdateWhenHealthyTransition.transition handles the node status update
4. If there are any log aggregation reports then {{RmNode#handleLogAggregationStatus}} is called
5. This ultimately calls rmApp.aggregateLogReport
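
The call chain above can be sketched roughly as follows. This is a minimal illustration only: every class and method here (SketchTracker, SketchNode, SketchApp, SketchLogReport) is a hypothetical stand-in, not the real YARN code, and the event dispatch of steps 1 and 2 is collapsed into a direct method call.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified stand-ins for illustration; not the real YARN classes.
class SketchLogReport {
    final String appId;
    final String status; // e.g. "RUNNING" or "SUCCEEDED"

    SketchLogReport(String appId, String status) {
        this.appId = appId;
        this.status = status;
    }
}

class SketchApp {
    final List<String> received = new ArrayList<>();

    // Step 5: stand-in for rmApp.aggregateLogReport.
    void aggregateLogReport(String nodeId, SketchLogReport report) {
        received.add(nodeId + ":" + report.status);
    }
}

class SketchNode {
    final String nodeId;
    final Map<String, SketchApp> apps;

    SketchNode(String nodeId, Map<String, SketchApp> apps) {
        this.nodeId = nodeId;
        this.apps = apps;
    }

    // Step 3: stand-in for the STATUS_UPDATE transition.
    void onStatusUpdate(List<SketchLogReport> reports) {
        // Step 4: only act when the heartbeat carried log aggregation reports.
        if (reports != null && !reports.isEmpty()) {
            handleLogAggregationStatus(reports);
        }
    }

    void handleLogAggregationStatus(List<SketchLogReport> reports) {
        for (SketchLogReport report : reports) {
            SketchApp app = apps.get(report.appId);
            if (app != null) { // ignore apps this sketch does not know about
                app.aggregateLogReport(nodeId, report);
            }
        }
    }
}

class SketchTracker {
    // Steps 1 and 2: stand-in for the heartbeat entry point, which in the
    // real RM creates and dispatches an event; here it is a direct call.
    static void nodeUpdate(SketchNode node, List<SketchLogReport> reports) {
        node.onStatusUpdate(reports);
    }
}
```

In this sketch a heartbeat without reports never reaches the app, while a heartbeat with reports ends in one aggregateLogReport call per known application.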

In rmApp.aggregateLogReport, I needed to check whether log aggregation had finished and, if so, send the APP_COMPLETED event.

An issue with this approach:
If a {{FinalTransition}} runs because the app was killed, finished or rejected (e.g. RMAppImpl goes from the RUNNING to the FINISHED state on RMAppEventType.ATTEMPT_FINISHED), the app will reach a terminal state (FINISHED in this case) no matter what happens in {{FinalTransition}}.
If I broke out of the transition early as described above, the app would still end up in the FINISHED state, which is not right because the rest of the code in the transition could never run again.
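
To illustrate the problem, here is a minimal sketch of a state machine that moves to the target state after running the transition hook, regardless of what the hook did. EarlyBreakSketch and all of its members are hypothetical and heavily simplified; this is not YARN's actual state machine code, only an assumption about the behaviour described above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical miniature state machine for illustration; not YARN code.
class EarlyBreakSketch {
    enum State { RUNNING, FINISHED }

    State state = State.RUNNING;
    boolean logAggregationFinished = false;
    final List<String> stepsRun = new ArrayList<>();

    // Transition hook using the rejected "check first, break early" idea.
    void finalTransition() {
        if (!logAggregationFinished) {
            return; // skips all of the work below
        }
        stepsRun.add("rest of the transition");
        stepsRun.add("APP_COMPLETED sent");
    }

    // The machine runs the hook, then enters the target state no matter
    // what the hook did. Once FINISHED, the hook is never invoked again.
    void handleAttemptFinished() {
        if (state != State.RUNNING) {
            return; // no transition is registered from the terminal state
        }
        finalTransition();
        state = State.FINISHED;
    }
}
```

With log aggregation still running, handleAttemptFinished() leaves the app in FINISHED with stepsRun empty, and a later call cannot run the skipped steps because the app is already in its terminal state.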
So with my implementation, all the code in {{FinalTransition}} runs as before, and if log aggregation is not finished yet, I don't send the APP_COMPLETED event to the {{RMAppManager}}.
When log aggregation finishes for an app, {{RMAppImpl.aggregateLogReport}} will be called.
In this method, I added a piece of code that sends the APP_COMPLETED event to the {{RMAppManager}} if the application is in a final state.
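
The approach described above can be sketched roughly like this. It is a simplified model and not the patch itself: DeferredCompletionSketch, its reduced LogStatus enum, the managerEvents list standing in for {{RMAppManager}}, and the "send only once" guard are all assumptions made for the illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified model for illustration; not the actual patch.
class DeferredCompletionSketch {
    enum AppState { RUNNING, FINISHED }
    // Reduced set of statuses, assumed for this sketch only.
    enum LogStatus { RUNNING, SUCCEEDED, FAILED, TIME_OUT }

    AppState state = AppState.RUNNING;
    LogStatus logStatus = LogStatus.RUNNING;
    boolean completedSent = false;
    // Stand-in for events delivered to the RMAppManager.
    final List<String> managerEvents = new ArrayList<>();

    static boolean isTerminal(LogStatus status) {
        return status != LogStatus.RUNNING;
    }

    // Stand-in for FinalTransition: everything runs as before, only the
    // APP_COMPLETED event is held back while log aggregation is running.
    void finalTransition() {
        state = AppState.FINISHED;
        if (isTerminal(logStatus)) {
            sendAppCompleted();
        }
    }

    // Stand-in for aggregateLogReport: sends the deferred event once log
    // aggregation is terminal and the app is already in a final state.
    void aggregateLogReport(LogStatus reported) {
        logStatus = reported;
        if (isTerminal(logStatus) && state == AppState.FINISHED) {
            sendAppCompleted();
        }
    }

    // Assumption: guard so the event is sent at most once.
    private void sendAppCompleted() {
        if (!completedSent) {
            completedSent = true;
            managerEvents.add("APP_COMPLETED");
        }
    }
}
```

Whichever of the two happens last, the final transition or the terminal log aggregation report, is the one that emits APP_COMPLETED in this sketch.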



was (Author: snemeth):
DEV NOTES: 
An initial implementation could have looked like this:
The very first line of the transition would check whether log aggregation has finished.
If it has not, do nothing and break out of the method.

To make sure apps still become completed once log aggregation has finished, the APP_COMPLETED event needs to be dispatched when log aggregation finishes.
In my understanding, this is the sequence of events:
1. RM receives NM heartbeat in ResourceTrackerService.nodeUpdate
2. An RmNodeEvent is created with type STATUS_UPDATE
3. RmNodeImpl.StatusUpdateWhenHealthyTransition.transition handles the node status update
4. If there are any log aggregation reports then RmNode.handleLogAggregationStatus is called
5. This ultimately calls rmApp.aggregateLogReport

In rmApp.aggregateLogReport, I needed to check whether log aggregation had finished and, if so, send the APP_COMPLETED event.

An issue with this approach:
If a {{FinalTransition}} runs because the app was killed, finished or rejected (e.g. RMAppImpl goes from the RUNNING to the FINISHED state on RMAppEventType.ATTEMPT_FINISHED), the app will reach a terminal state (FINISHED in this case) no matter what happens in {{FinalTransition}}.
If I broke out of the transition early as described above, the app would still end up in the FINISHED state, which is not right because the rest of the code in the transition could never run again.
So with my implementation, I run all the code in {{FinalTransition}} as before, and if log aggregation is not finished yet, I don't send the APP_COMPLETED event to the {{RMAppManager}}.
When log aggregation finishes for an app, {{RMAppImpl.aggregateLogReport}} will be called.
In this method, I added a piece of code that sends the APP_COMPLETED event to the {{RMAppManager}} if the application is in a final state.


> RM should not consider an application as COMPLETED when log aggregation is not in a terminal state
> --------------------------------------------------------------------------------------------------
>
>                 Key: YARN-4946
>                 URL: https://issues.apache.org/jira/browse/YARN-4946
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: log-aggregation
>    Affects Versions: 2.8.0
>            Reporter: Robert Kanter
>            Assignee: Szilard Nemeth
>            Priority: Major
>         Attachments: YARN-4946.001.patch, YARN-4946.002.patch
>
>
> MAPREDUCE-6415 added a tool that combines the aggregated log files for each Yarn App into a HAR file.  When run, it seeds the list by looking at the aggregated logs directory, and then filters out ineligible apps.  One of the criteria involves checking with the RM that an Application's log aggregation status is not still running and has not failed.  When the RM "forgets" about an older completed Application (e.g. RM failover, enough time has passed, etc), the tool won't find the Application in the RM and will just assume that its log aggregation succeeded, even if it actually failed or is still running.
> We can solve this problem by doing the following:
> The RM should not consider an app to be fully completed (and thus removed from its history) until the aggregation status has reached a terminal state (e.g. SUCCEEDED, FAILED, TIME_OUT).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
