Posted to yarn-issues@hadoop.apache.org by "András Győri (Jira)" <ji...@apache.org> on 2022/06/22 12:41:00 UTC

[jira] [Resolved] (YARN-11188) Only files belonging to the first file controller are removed even if multiple log aggregation file controllers are configured

     [ https://issues.apache.org/jira/browse/YARN-11188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

András Győri resolved YARN-11188.
---------------------------------
    Resolution: Fixed

> Only files belonging to the first file controller are removed even if multiple log aggregation file controllers are configured
> ---------------------------------------------------------------------------------------------------------------------------
>
>                 Key: YARN-11188
>                 URL: https://issues.apache.org/jira/browse/YARN-11188
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: log-aggregation
>            Reporter: Szilard Nemeth
>            Assignee: Szilard Nemeth
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 3.4.0
>
>          Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> Log aggregation can be configured to have a comma-separated list of file controllers.
> The current behaviour only removes files that belong to the first file controller.
> This can be problematic. 
> For example, if a user configures IFile as the file controller and later changes the configuration to specify multiple file controllers (e.g. value = TFile,IFile), then only the first controller will be considered and only the files belonging to that controller will be removed: in this case the files written by the TFile controller will be removed, while the files created with the IFile controller will be kept.
> This behaviour should be changed so that all of the files are removed when multiple file controllers are enabled.
> h2. CODE PATH
> ----
> 1. [AggregatedLogDeletionService$LogDeletionTask#run|https://github.com/apache/hadoop/blob/d336227e5c63a70db06ac26697994c96ed89d230/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/logaggregation/AggregatedLogDeletionService.java#L82-L108]: 
> Let's understand what this method does.
> 1.1 An important bit is to see how the value of the field called 'retentionMillis' is set. In the constructor of LogDeletionTask, there's an incoming parameter called 'retentionSecs' that is just multiplied by 1000 to have a millisecond value.
> Let's see where 'retentionSecs' is coming from.
> 1.2 [AggregatedLogDeletionService#scheduleLogDeletionTask|https://github.com/apache/hadoop/blob/d336227e5c63a70db06ac26697994c96ed89d230/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/logaggregation/AggregatedLogDeletionService.java#L258-L283] that sets the value of retentionSecs.
> The config key for this value is 'yarn.log-aggregation.retain-seconds'.
> The javadoc says: "How long to wait before deleting aggregated logs, -1 disables. Be careful set this too small and you will spam the name node."
> 1.3 Going back to [https://github.com/apache/hadoop/blob/d336227e5c63a70db06ac26697994c96ed89d230/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/logaggregation/AggregatedLogDeletionService.java#L82-L108], the 'cutOffMillis' value is computed as the current time in milliseconds minus retentionMillis (see the sketch below).
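> As a minimal sketch, the retention / cut-off computation from steps 1.1-1.3 boils down to roughly the following (simplified; the YarnConfiguration constants are the ones I believe back 'yarn.log-aggregation.retain-seconds'):
> {code:java}
> // Sketch only: how retentionMillis and cutOffMillis are derived (steps 1.1-1.3).
> long retentionSecs = conf.getLong(
>     YarnConfiguration.LOG_AGGREGATION_RETAIN_SECONDS,
>     YarnConfiguration.DEFAULT_LOG_AGGREGATION_RETAIN_SECONDS);
> long retentionMillis = retentionSecs * 1000;                      // set in the LogDeletionTask constructor
> long cutOffMillis = System.currentTimeMillis() - retentionMillis; // computed in run()
> // Anything whose modification time is older than cutOffMillis is a deletion candidate.
> {code}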
> 1.4 The main point of this method is to iterate over the entries in the remote root log dir (the field called 'remoteRootLogDir') and check whether each entry is a directory. If so, a new Path is created from that particular directory ([code link|https://github.com/apache/hadoop/blob/d336227e5c63a70db06ac26697994c96ed89d230/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/logaggregation/AggregatedLogDeletionService.java#L90-L96]).
> One more important thing to mention: There's a field called 'suffix' that is added to the remote root log dir path.
> Let's check how the 'remoteRootLogDir' and 'suffix' fields get their values, as this is crucial to understanding how the log dirs are deleted.
> 1.5 remoteRootLogDir is set in the constructor of LogDeletionTask, [here|https://github.com/apache/hadoop/blob/d336227e5c63a70db06ac26697994c96ed89d230/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/logaggregation/AggregatedLogDeletionService.java#L77].
> The value is returned by calling fileController.getRemoteRootLogDir().
> The LogAggregationFileControllerFactory creates the instance of LogAggregationFileController.
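> To make steps 1.4-1.5 concrete, here is a rough sketch of the listing loop in LogDeletionTask#run (a sketch, not the exact code; names follow the linked code where I could verify them, error handling omitted):
> {code:java}
> // Rough sketch of the loop in LogDeletionTask#run (steps 1.4-1.5).
> Path remoteRootLogDir = fileController.getRemoteRootLogDir();      // step 1.5
> String suffix = fileController.getRemoteRootLogDirSuffix();        // roughly how the 'suffix' field is obtained
> FileSystem fs = remoteRootLogDir.getFileSystem(conf);
> for (FileStatus userDir : fs.listStatus(remoteRootLogDir)) {
>   if (userDir.isDirectory()) {
>     // the suffix is appended to each user dir, e.g. /tmp/logs/<user>/<suffix>
>     Path userDirPath = new Path(userDir.getPath(), suffix);
>     deleteOldLogDirsFrom(userDirPath, cutOffMillis, fs, rmClient); // see step 1.6
>   }
> }
> {code}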
> ----
> *The process of determining the log aggregation file controller is quite messy; let me describe it in detail.*
> *There are 2 types of file controllers: LogAggregationIndexedFileController and LogAggregationTFileController*
> *There's a testcase called [TestLogAggregationFileControllerFactory#testLogAggregationFileControllerFactory|#testLogAggregationFileControllerFactory] that shows how the LogAggregationFileControllerFactory is configured.*
> 2.1 First, some important configs:
> 2.1.1 Generic config key for the log aggregation file controller class: 
> yarn.log-aggregation.file-controller.<controllerName>.class
> An example real-world config key: 
> yarn.log-aggregation.file-controller.IFile.class
> An example real-world config value: org.apache.hadoop.yarn.logaggregation.filecontroller.ifile.LogAggregationIndexedFileController
> 2.1.2 Generic config key for the log aggregation file controller's remote app log dir: 
> yarn.log-aggregation.<controllerName>.remote-app-log-dir
> An example real-world config key: yarn.log-aggregation.IFile.remote-app-log-dir
> An example real-world config value: /tmp/logs/IFile/
> 2.1.3 Generic config key for the log aggregation file controller's remote app log dir suffix: 
> yarn.log-aggregation.<controllerName>.remote-app-log-dir-suffix
> An example real-world config key: 
> yarn.log-aggregation.IFile.remote-app-log-dir-suffix
> An example real-world config value: IFile
> 2.1.4 There's one more config called 'yarn.log-aggregation.file-formats' that can store a comma-separated list of file controllers.
> An example value: IFile,TFile
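> Putting the keys from 2.1.1-2.1.4 together, a hypothetical two-controller setup could look like this (a sketch only; the values are illustrative, conf is an org.apache.hadoop.conf.Configuration):
> {code:java}
> // 2.1.4: comma-separated list of enabled file controllers
> conf.set("yarn.log-aggregation.file-formats", "IFile,TFile");
> // 2.1.1: per-controller implementation classes
> conf.set("yarn.log-aggregation.file-controller.IFile.class",
>     "org.apache.hadoop.yarn.logaggregation.filecontroller.ifile.LogAggregationIndexedFileController");
> conf.set("yarn.log-aggregation.file-controller.TFile.class",
>     "org.apache.hadoop.yarn.logaggregation.filecontroller.tfile.LogAggregationTFileController");
> // 2.1.2 and 2.1.3: per-controller remote app log dir and suffix
> conf.set("yarn.log-aggregation.IFile.remote-app-log-dir", "/tmp/logs/IFile/");
> conf.set("yarn.log-aggregation.IFile.remote-app-log-dir-suffix", "IFile");
> {code}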
> 2.2 Let's examine how the [LogAggregationFileControllerFactory's constructor|https://github.com/apache/hadoop/blob/c9a174a260577f6c0ff6ef1594eea1cb19d63012/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/logaggregation/filecontroller/LogAggregationFileControllerFactory.java#L63-L80] works.
> 2.2.1 There's [an iteration|https://github.com/apache/hadoop/blob/c9a174a260577f6c0ff6ef1594eea1cb19d63012/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/logaggregation/filecontroller/LogAggregationFileControllerFactory.java#L69] over file controllers.
> 2.2.2 The remote app log dir per file controller is [read from the config|https://github.com/apache/hadoop/blob/c9a174a260577f6c0ff6ef1594eea1cb19d63012/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/logaggregation/filecontroller/LogAggregationFileControllerFactory.java#L196-L216].
> An example for a config key: yarn.log-aggregation.IFile.remote-app-log-dir
> An example real-world value of this config: /tmp/logs/IFile/
> 2.2.3 If the specified remote app log dir is null or empty, the remote dir for the particular file controller falls back to the NM's log dir.
> The log dir is either specified by the config 'yarn.nodemanager.remote-app-log-dir' or falls back to the default path '/tmp/logs'.
> This logic is implemented [here|https://github.com/apache/hadoop/blob/c9a174a260577f6c0ff6ef1594eea1cb19d63012/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/logaggregation/filecontroller/LogAggregationFileControllerFactory.java#L208-L215]
> 2.2.4 Next, the remote app log dir suffix is read [here|https://github.com/apache/hadoop/blob/c9a174a260577f6c0ff6ef1594eea1cb19d63012/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/logaggregation/filecontroller/LogAggregationFileControllerFactory.java#L225-L232].
> Example config key: yarn.log-aggregation.IFile.remote-app-log-dir-suffix
> An example real-world config value: IFile
> If the suffix is null or empty, the suffix is read from the config key 'yarn.nodemanager.remote-app-log-dir-suffix'; if that is not specified either, the default suffix 'logs' is used (see the fallback sketch below).
> 2.2.5 Now we know the remoteDir (/tmp/logs/IFile/) and the suffix (IFile); they are simply concatenated with a hyphen in between, so the final value will be: target/app-logs/IFile/-IFile [TODO]
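> The fallback chain in 2.2.2-2.2.4 boils down to roughly this per-controller resolution (a sketch, not the exact code; the YarnConfiguration constants map to the NM-wide keys mentioned above):
> {code:java}
> // Sketch: per-controller remote dir, falling back to the NM-wide setting (2.2.2-2.2.3)
> String remoteDir = conf.get("yarn.log-aggregation." + controllerName + ".remote-app-log-dir");
> if (remoteDir == null || remoteDir.isEmpty()) {
>   remoteDir = conf.get(YarnConfiguration.NM_REMOTE_APP_LOG_DIR,
>       YarnConfiguration.DEFAULT_NM_REMOTE_APP_LOG_DIR);            // default: /tmp/logs
> }
> // Sketch: per-controller suffix, falling back to the NM-wide suffix (2.2.4)
> String suffix = conf.get("yarn.log-aggregation." + controllerName + ".remote-app-log-dir-suffix");
> if (suffix == null || suffix.isEmpty()) {
>   suffix = conf.get(YarnConfiguration.NM_REMOTE_APP_LOG_DIR_SUFFIX,
>       YarnConfiguration.DEFAULT_NM_REMOTE_APP_LOG_DIR_SUFFIX);     // default: "logs"
> }
> {code}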
> 2.2.6 The rest of the method reads the log aggregation file controller's class name and initializes the controller. This is implemented [here|https://github.com/apache/hadoop/blob/c9a174a260577f6c0ff6ef1594eea1cb19d63012/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/logaggregation/filecontroller/LogAggregationFileControllerFactory.java#L82-L95].
> An example config key for the class: 'yarn.log-aggregation.file-controller.IFile.class'
> An example value of this config: "org.apache.hadoop.yarn.logaggregation.filecontroller.ifile.LogAggregationIndexedFileController"
> 2.2.7 Next, the controller is created by instantiating that class via reflection.
> 2.2.8 An important bit is to [initialize the controller|https://github.com/apache/hadoop/blob/c9a174a260577f6c0ff6ef1594eea1cb19d63012/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/logaggregation/filecontroller/LogAggregationFileControllerFactory.java#L77]
> 2.2.9 The initialize method [is implemented in LogAggregationFileController|https://github.com/apache/hadoop/blob/d336227e5c63a70db06ac26697994c96ed89d230/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/logaggregation/filecontroller/LogAggregationFileController.java#L121-L140], which is an abstract base class for the file controllers.
> 2.2.10 The remote root log dir + the suffix [is read|https://github.com/apache/hadoop/blob/d336227e5c63a70db06ac26697994c96ed89d230/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/logaggregation/filecontroller/LogAggregationFileController.java#L136-L137] by the same config logic as described above.
> 2.2.11 As a final step, the controller instance is [added to the factory's controllers list|https://github.com/apache/hadoop/blob/c9a174a260577f6c0ff6ef1594eea1cb19d63012/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/logaggregation/filecontroller/LogAggregationFileControllerFactory.java#L78]
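> Steps 2.2.6-2.2.11 roughly amount to the following per-controller loop body (a simplified sketch using Hadoop's ReflectionUtils; the real code has additional validation and error handling):
> {code:java}
> // 2.2.6-2.2.7: resolve and instantiate the controller class via reflection
> Class<? extends LogAggregationFileController> clazz = conf.getClass(
>     "yarn.log-aggregation.file-controller." + controllerName + ".class",
>     null, LogAggregationFileController.class);
> LogAggregationFileController controller = ReflectionUtils.newInstance(clazz, conf);
> // 2.2.8-2.2.10: initialize() reads the remote root log dir and suffix for this controller
> controller.initialize(conf, controllerName);
> // 2.2.11: the instance is appended to the factory's controller list
> controllers.add(controller);
> {code}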
> 2.3 Now we know how the LogAggregationFileControllerFactory works and how it reads the config to create and store the file controller instances.
> Let's jump back to the constructor of org.apache.hadoop.yarn.logaggregation.AggregatedLogDeletionService.LogDeletionTask#LogDeletionTask.
> The file controller is determined by calling the 'getFileControllerForWrite' method on the LogAggregationFileControllerFactory instance, [here|https://github.com/apache/hadoop/blob/d336227e5c63a70db06ac26697994c96ed89d230/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/logaggregation/AggregatedLogDeletionService.java#L75].
> 2.4 [The method|https://github.com/apache/hadoop/blob/c9a174a260577f6c0ff6ef1594eea1cb19d63012/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/logaggregation/filecontroller/LogAggregationFileControllerFactory.java#L128] is quite simple: it just returns the first element from the list, so if multiple log aggregation file controllers were instantiated during initialization (as per the config), the first instance will always be returned here.
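> For reference, that write-controller selection is essentially just this (simplified sketch):
> {code:java}
> // Simplified: the factory always hands out the first configured controller.
> public LogAggregationFileController getFileControllerForWrite() {
>   return controllers.getFirst();
> }
> {code}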
> ----
> *We need to jump back to steps 1.4 and 1.5, where the files are being listed with the help of the abstract FileSystem implementation [here|https://github.com/apache/hadoop/blob/d336227e5c63a70db06ac26697994c96ed89d230/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/logaggregation/AggregatedLogDeletionService.java#L93-L97].*
> *So we know how the values for 'remoteRootLogDir' and 'suffix' are set as described in detail above.*
> ----
> 1.6 Let's see what the deleteOldLogDirsFrom method does since this is the main call of the loop that lists the log dirs.
> [The method|https://github.com/apache/hadoop/blob/d336227e5c63a70db06ac26697994c96ed89d230/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/logaggregation/AggregatedLogDeletionService.java#L110-L122] is very simple: it accepts a Path as a parameter (which we know is a directory), lists the dirs under this main directory, and for each of them [calls deleteAppDirLogs|https://github.com/apache/hadoop/blob/d336227e5c63a70db06ac26697994c96ed89d230/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/logaggregation/AggregatedLogDeletionService.java#L120].
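> A rough sketch of that method (the signature is simplified; the rmClient parameter type follows the linked code as far as I can tell):
> {code:java}
> // Sketch of deleteOldLogDirsFrom: list the per-user log dir and hand each app dir to deleteAppDirLogs.
> private static void deleteOldLogDirsFrom(Path dir, long cutoffMillis,
>     FileSystem fs, ApplicationClientProtocol rmClient) throws IOException {
>   for (FileStatus appDir : fs.listStatus(dir)) {
>     if (appDir.isDirectory()) {
>       deleteAppDirLogs(cutoffMillis, fs, rmClient, appDir);
>     }
>   }
> }
> {code}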
> 1.7 The [deleteAppDirLogs method|https://github.com/apache/hadoop/blob/d336227e5c63a70db06ac26697994c96ed89d230/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/logaggregation/AggregatedLogDeletionService.java#L124-L165] is quite messy again.
> 1.7.1 Parameters are: 
> cutOffMillis: The 'cutOffMillis' value is computed by getting the current time in millis minus the retentionMillis that is coming from the configuration.
> If it's set to 2 minutes, the calculated time will be NOW-2 minutes in milliseconds.
> fs: The abstract FileSystem implementation
> rmClient: Not important for us right now
> appDir: The directory to clean up
> 1.7.2 The whole method only does anything useful if the directory's modification time < cutOffMillis. In practice this means that only dirs whose last modification is older than the retention cut-off will be touched / deleted.
> 1.7.3 If the app is not terminated, we list the directory and try to remove the log files. Only log files whose modification time is older than the cut-off will be deleted.
> [This is the logic|https://github.com/apache/hadoop/blob/d336227e5c63a70db06ac26697994c96ed89d230/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/logaggregation/AggregatedLogDeletionService.java#L133-L152] that implements this.
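> Sketched (simplified, omitting logging and the derivation of appId), the running-application branch looks roughly like this:
> {code:java}
> // Sketch: for a still-running app, delete only the individual log files older than the cut-off;
> // the app dir itself is kept.
> if (!isApplicationTerminated(appId, rmClient)) {
>   for (FileStatus node : fs.listStatus(appDir.getPath())) {
>     if (node.getModificationTime() < cutoffMillis) {
>       fs.delete(node.getPath(), true);
>     }
>   }
> }
> {code}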
> 1.7.4 [The other part of the if condition|https://github.com/apache/hadoop/blob/d336227e5c63a70db06ac26697994c96ed89d230/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/logaggregation/AggregatedLogDeletionService.java#L152-L160] tries to delete the log dir, but first checks that 'shouldDeleteLogDir' returns true.
> 1.7.5 Let's check the method [AggregatedLogDeletionService.LogDeletionTask#shouldDeleteLogDir|https://github.com/apache/hadoop/blob/d336227e5c63a70db06ac26697994c96ed89d230/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/logaggregation/AggregatedLogDeletionService.java#L167-L182]: 
> This is basically the same retention-period-based logic that I described above.
> We set shouldDelete to true by default, then set it to false only if the modification date of the dir itself is later than the timestamp defined by the retention period.
> ----
> h2. CONCLUSION
> *We just checked the implementation of how the log aggregation file controllers are instantiated and configured.*
> *Just by reading the code + the logic, I think reading / parsing the configuration is okay.*
> *What really bothers me is how the file controller instance is obtained from the factory (step 2.4 above).*
> *If multiple log aggregation file controllers (TFile + IFile) are configured, the 0th (first) item will always be picked by the factory. This results in the incorrect behaviour that only one controller's files are cleaned up.*
> *As the [AggregatedLogDeletionService#scheduleLogDeletionTask|https://github.com/apache/hadoop/blob/d336227e5c63a70db06ac26697994c96ed89d230/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/logaggregation/AggregatedLogDeletionService.java#L258-L277] method creates the LogDeletionTask instance only once and schedules it at a fixed rate with the help of a Timer, there is no distinction between log aggregation file controllers at this abstraction level, meaning that only the LogAggregationFileControllerFactory could return different file controllers.*
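> One possible direction (purely illustrative, not necessarily the committed fix) would be to stop relying on getFileControllerForWrite() and instead schedule a deletion pass per configured controller; note that the LogDeletionTask constructor taking a controller argument below is hypothetical:
> {code:java}
> // Hypothetical sketch of a fix: one deletion task per configured file controller.
> for (LogAggregationFileController controller :
>     factory.getConfiguredLogAggregationFileControllerList()) {
>   // NOTE: this LogDeletionTask(conf, retentionSecs, rmClient, controller) constructor
>   // does not exist in the current code; it is shown only to illustrate the idea.
>   timer.scheduleAtFixedRate(
>       new LogDeletionTask(conf, retentionSecs, rmClient, controller),
>       0, checkIntervalMsecs);
> }
> {code}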


