Posted to mapreduce-issues@hadoop.apache.org by "Robert Kanter (JIRA)" <ji...@apache.org> on 2015/09/02 10:43:46 UTC

[jira] [Updated] (MAPREDUCE-6415) Create a tool to combine aggregated logs into HAR files

     [ https://issues.apache.org/jira/browse/MAPREDUCE-6415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Kanter updated MAPREDUCE-6415:
-------------------------------------
    Attachment: MAPREDUCE-6415_branch-2.002.patch
                MAPREDUCE-6415.002.patch

Thanks for the review [~jlowe]!

The 002 patch addresses most of the issues Jason brought up:
- fixes dependencies, though I had to keep some of the ones that Maven didn't think it needed
- fixes usage output to use variables for the defaults.  I also changed the units for the max total logs size from bytes to megabytes, which is easier to use.
- now considers both the SUCCEEDED and FAILED log aggregation statuses.
- improves checkFiles to be more efficient
- if maxEligible is 0, it will now print out a message and exit right away.  I think having 0 be equivalent to all might be confusing?  I'm fine either way; let me know if you think it's better to treat it as equivalent to a negative value.
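
As a rough illustration of the maxEligible and megabyte handling described in the list above, here is a minimal sketch. It is not the code from the patch: the class and method names are hypothetical, 1 MB is assumed to mean 1024 * 1024 bytes, and the "print a message and exit" step is modeled as returning an empty list.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not taken from the MAPREDUCE-6415 patch.
final class EligibilitySketch {
    // Assumption: one megabyte is 1024 * 1024 bytes.
    static final long BYTES_PER_MB = 1024L * 1024L;

    // Converts a user-supplied size in megabytes to bytes.
    static long megabytesToBytes(long megabytes) {
        return megabytes * BYTES_PER_MB;
    }

    // Applies the maxEligible limit to an already-filtered list of applications:
    //   negative -> no limit, every eligible application is kept
    //   zero     -> nothing to do (the tool would print a message and exit)
    //   positive -> at most maxEligible applications are kept
    static List<String> selectApps(List<String> eligible, int maxEligible) {
        if (maxEligible == 0) {
            System.out.println("maxEligible is 0, nothing to process");
            return new ArrayList<>();
        }
        if (maxEligible < 0 || eligible.size() <= maxEligible) {
            return new ArrayList<>(eligible);
        }
        return new ArrayList<>(eligible.subList(0, maxEligible));
    }
}
```

Treating 0 as "all" would only require folding the zero case into the negative branch, which is the alternative raised in the last bullet.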

I don't think we should add a unique ID to the working directory.  The tool won't work correctly with simultaneous runs anyway because it doesn't acquire any sort of "lock" that would stop another instance from trying to process the same application's logs.  As it is now, by using a non-unique directory, anything left over will get cleaned up when you run the tool again (presumably, you're running it at some interval).
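
To make the cleanup argument concrete, here is a small sketch of a fixed (non-unique) working directory that is wiped at the start of every run. It uses the local filesystem through java.nio as a stand-in for HDFS, and the class name, method name, and directory name are hypothetical rather than taken from the patch.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical sketch; the real tool works against HDFS, not the local filesystem.
final class WorkingDirSketch {
    // Deletes anything left over from a previous run under parent/name,
    // then recreates the directory empty and returns its path.
    static Path resetWorkingDir(Path parent, String name) {
        Path workingDir = parent.resolve(name);
        try {
            if (Files.exists(workingDir)) {
                List<Path> leftovers;
                try (Stream<Path> walk = Files.walk(workingDir)) {
                    // Reverse order lists children before their parent directories.
                    leftovers = walk.sorted(Comparator.reverseOrder())
                                    .collect(Collectors.toList());
                }
                for (Path p : leftovers) {
                    Files.delete(p);
                }
            }
            return Files.createDirectories(workingDir);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Because the name is the same on every run, a crashed run leaves nothing behind that the next run will not remove. With a unique ID per run, stale directories would accumulate until something else cleaned them up.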

On that last point, it would be good if we could prevent two instances of the tool from running at the same time.  I think the best way to do that (without using a lock) is for the tool to check for a RUNNING job named "ArchiveLogs" in the RM, though this won't protect against all situations and will produce a false positive if the user has another job named "ArchiveLogs".
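
The check proposed above could look roughly like the following sketch. The application list is modeled with a hypothetical AppInfo class and State enum instead of the real RM client types, so only the matching logic is shown, not how the list is fetched.

```java
import java.util.List;

// Hypothetical sketch; AppInfo and State stand in for whatever the RM client returns.
final class RunningJobGuardSketch {
    enum State { ACCEPTED, RUNNING, FINISHED, FAILED, KILLED }

    static final class AppInfo {
        final String name;
        final State state;

        AppInfo(String name, State state) {
            this.name = name;
            this.state = state;
        }
    }

    // Job name the tool is assumed to submit under, per the message above.
    static final String JOB_NAME = "ArchiveLogs";

    // Returns true if any application is RUNNING under the tool's job name.
    static boolean anotherInstanceRunning(List<AppInfo> apps) {
        for (AppInfo app : apps) {
            if (app.state == State.RUNNING && JOB_NAME.equals(app.name)) {
                return true;
            }
        }
        return false;
    }
}
```

The sketch shows both caveats: any unrelated RUNNING job named "ArchiveLogs" makes the check return true, and two instances that both run the check before either job reaches RUNNING would both proceed.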

> Create a tool to combine aggregated logs into HAR files
> -------------------------------------------------------
>
>                 Key: MAPREDUCE-6415
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6415
>             Project: Hadoop Map/Reduce
>          Issue Type: New Feature
>    Affects Versions: 2.8.0
>            Reporter: Robert Kanter
>            Assignee: Robert Kanter
>         Attachments: HAR-ableAggregatedLogs_v1.pdf, MAPREDUCE-6415.001.patch, MAPREDUCE-6415.002.patch, MAPREDUCE-6415_branch-2.001.patch, MAPREDUCE-6415_branch-2.002.patch, MAPREDUCE-6415_branch-2_prelim_001.patch, MAPREDUCE-6415_branch-2_prelim_002.patch, MAPREDUCE-6415_prelim_001.patch, MAPREDUCE-6415_prelim_002.patch
>
>
> While we wait for YARN-2942 to become viable, it would still be great to improve the aggregated logs problem.  We can write a tool that combines aggregated log files into a single HAR file per application, which should solve the too many files and too many blocks problems.  See the design document for details.
> See YARN-2942 for more context.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)