Posted to mapreduce-issues@hadoop.apache.org by "Jason Lowe (JIRA)" <ji...@apache.org> on 2015/07/22 01:08:05 UTC

[jira] [Commented] (MAPREDUCE-6415) Create a tool to combine aggregated logs into HAR files

    [ https://issues.apache.org/jira/browse/MAPREDUCE-6415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14635992#comment-14635992 ] 

Jason Lowe commented on MAPREDUCE-6415:
---------------------------------------

Note that container IDs are not guaranteed to be consecutive, nor is the AM guaranteed to get ID 1.  Due to how reservations are processed and other race conditions, a container ID may not actually correspond to a physically launched container.  For example, on our busy clusters it is not rare for the AM container to have an ID greater than 000001.  So the danger here is that if the RM skips one or more container IDs when handing out containers to the application, then we will skip aggregating one or more applications.  We'll get another crack at it on the next pass, but on a busy cluster we could fairly consistently miss a number of them, and the aggregation of some applications (especially the first few in the list) could be postponed indefinitely.
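
To make the failure mode concrete, here is a minimal sketch in Python.  It assumes, hypothetically, that each worker picks its application by using the numeric suffix of its container ID as an index into the application list (suffix 1 being the AM, workers starting at 2).  The function name, the index rule and the ID values are made up for illustration and are not taken from the patch.

```python
def apps_picked_by_container_id(apps, container_id_suffixes):
    """Map each launched container to an application via its container ID.

    Assumption (hypothetical): suffix 1 is expected to be the AM, so a
    worker with suffix N processes apps[N - 2].
    """
    picked = []
    for suffix in container_id_suffixes:
        index = suffix - 2
        # A suffix that points past the end of the list does no work at all.
        if 0 <= index < len(apps):
            picked.append(apps[index])
    return picked


apps = ["app_0001", "app_0002", "app_0003", "app_0004"]

# Ideal case: the RM hands out worker IDs 2..5 with no gaps.
ideal = apps_picked_by_container_id(apps, [2, 3, 4, 5])

# Busy cluster: IDs 2 and 4 never turned into launched containers, so the
# four workers that did launch ended up with IDs 3, 5, 6 and 7.
busy = apps_picked_by_container_id(apps, [3, 5, 6, 7])
skipped = [app for app in apps if app not in busy]
```

Under these assumptions the gap-free run covers every application, while the run with gaps leaves some applications unprocessed even though the same number of containers were launched.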

A more robust approach would be to have the distributed shell explicitly set something in the container's environment that is a sequence number from the distributed shell's point of view.  In other words, regardless of what container ID is allocated, the distributed shell can set a monotonically increasing number in each new container's env that the script can use to drive instance-specific behavior.  This is akin to the task ID in MapReduce, which again is decoupled from YARN's container ID.
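
A rough sketch of that idea in Python, not the distributed shell's actual code: the launcher keeps its own counter and writes it into each container's environment, and the script reads that value instead of parsing the container ID.  The variable name DSHELL_INSTANCE_SEQ, the Launcher class and the container ID strings are hypothetical and exist only for this illustration.

```python
import itertools

# Hypothetical variable name; not an existing YARN or distributed shell setting.
SEQ_ENV_VAR = "DSHELL_INSTANCE_SEQ"


class Launcher:
    """Stand-in for the launching side (the distributed shell AM)."""

    def __init__(self):
        # Monotonically increasing, owned by the launcher rather than the RM.
        self._next_seq = itertools.count(0)

    def build_env(self, container_id, base_env=None):
        # container_id is deliberately ignored: the sequence number depends
        # only on the order in which containers are actually launched.
        env = dict(base_env or {})
        env[SEQ_ENV_VAR] = str(next(self._next_seq))
        return env


def pick_app(apps, env):
    """What the launched script would do: index by the sequence number."""
    seq = int(env[SEQ_ENV_VAR])
    return apps[seq] if seq < len(apps) else None


apps = ["app_0001", "app_0002", "app_0003", "app_0004"]

# Made-up container IDs with gaps, as might be seen on a busy cluster.
allocated = [
    "container_1437500000000_0001_01_000003",
    "container_1437500000000_0001_01_000005",
    "container_1437500000000_0001_01_000006",
    "container_1437500000000_0001_01_000009",
]

launcher = Launcher()
picked = [pick_app(apps, launcher.build_env(cid)) for cid in allocated]
```

Because the counter advances only when a container is really launched, gaps in the allocated container IDs no longer translate into skipped applications in this sketch.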


> Create a tool to combine aggregated logs into HAR files
> -------------------------------------------------------
>
>                 Key: MAPREDUCE-6415
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6415
>             Project: Hadoop Map/Reduce
>          Issue Type: New Feature
>    Affects Versions: 2.8.0
>            Reporter: Robert Kanter
>            Assignee: Robert Kanter
>         Attachments: HAR-ableAggregatedLogs_v1.pdf, MAPREDUCE-6415_branch-2_prelim_001.patch, MAPREDUCE-6415_prelim_001.patch
>
>
> While we wait for YARN-2942 to become viable, it would still be great to improve the aggregated logs problem.  We can write a tool that combines aggregated log files into a single HAR file per application, which should solve the too many files and too many blocks problems.  See the design document for details.
> See YARN-2942 for more context.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)