Posted to mapreduce-issues@hadoop.apache.org by "Rahul Jain (JIRA)" <ji...@apache.org> on 2012/08/16 01:16:37 UTC
[jira] [Commented] (MAPREDUCE-4559) Job logs not accessible through job history server for AM killed due to am.liveness-monitor expiry

    [ https://issues.apache.org/jira/browse/MAPREDUCE-4559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13435603#comment-13435603 ] 

Rahul Jain commented on MAPREDUCE-4559:
---------------------------------------

In our lab tests, we have a namenode that performs much more slowly than regular namenodes because it is on a slower network.

This caused the application master to take a long time committing the job (the last step after all maps and reduces were done); in some cases the commit took longer than 10 minutes, which exceeds yarn.am.liveness-monitor.expiry-interval-ms (default 10 minutes).

Here is the RM log snippet:
{code}
05_0002    CONTAINERID=container_1344459886205_0002_01_000825
2012-08-08 23:57:34,881 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerNode: Released container container_1344459886205_0002_01_000825 of capacity memory: 4096 on host sjc1-spr-msip-grid08.sjc1.carrieriq.com:26020, which currently has 0 containers, memory: 0 used and memory: 80000 available, release resources=true
2012-08-08 23:57:34,881 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fifo.FifoScheduler: Application appattempt_1344459886205_0002_000001 released container container_1344459886205_0002_01_000825 on node: host: sjc1-spr-msip-grid08.sjc1.carrieriq.com:26020 #containers=0 available=80000 used=0 with event: FINISHED
2012-08-09 00:08:10,256 INFO org.apache.hadoop.yarn.util.AbstractLivelinessMonitor: Expired:appattempt_1344459886205_0002_000001 Timed out after 600 secs
2012-08-09 00:08:10,256 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1344459886205_0002_000001 State change from RUNNING to FAILED
2012-08-09 00:08:10,256 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: Application application_1344459886205_0002 failed 1 times due to . Failing the application.
2012-08-09 00:08:10,257 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: application_1344459886205_0002 State change from RUNNING to FAILED
{code}
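
As a rough check of the timing in the RM log above, the gap between the last container release and the expiry message can be computed directly from the two timestamps. This is only a sketch: the variable and function names are made up for illustration, and the container release is used as a stand-in for the last sign of activity, since the RM log does not show the AM heartbeats themselves.

```python
from datetime import datetime

# Timestamp layout used by the log lines above (milliseconds after the comma).
LOG_TIME_FORMAT = "%Y-%m-%d %H:%M:%S,%f"

# Expiry interval in seconds, taken from "Timed out after 600 secs" in the RM log.
EXPIRY_INTERVAL_SECS = 600


def seconds_between(start_text, end_text):
    """Return the number of seconds between two log timestamps."""
    start = datetime.strptime(start_text, LOG_TIME_FORMAT)
    end = datetime.strptime(end_text, LOG_TIME_FORMAT)
    return (end - start).total_seconds()


# Last container release logged by the scheduler, and the liveness expiry.
gap_secs = seconds_between("2012-08-08 23:57:34,881", "2012-08-09 00:08:10,256")

print("gap between last container release and expiry: %.3f s" % gap_secs)
print("exceeds expiry interval: %s" % (gap_secs > EXPIRY_INTERVAL_SECS))
```

The gap is a little over 10 minutes, which is consistent with the AM being busy in the commit step for longer than the expiry interval.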


On the application master:

{code}
2012-08-08 23:57:33,871 INFO [ContainerLauncher #13] org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl: Processing the event EventType: CONTAINER_REMOTE_CLEANUP for container container_1344459886205_0002_01_000825 taskAttempt attempt_1344459886205_0002_m_000754_0
2012-08-08 23:57:33,871 INFO [ContainerLauncher #13] org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl: KILLING attempt_1344459886205_0002_m_000754_0
2012-08-08 23:57:33,874 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: attempt_1344459886205_0002_m_000754_0 TaskAttempt Transitioned from SUCCESS_CONTAINER_CLEANUP to SUCCEEDED
2012-08-08 23:57:33,874 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskImpl: Task succeeded with attempt attempt_1344459886205_0002_m_000754_0
2012-08-08 23:57:33,875 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskImpl: task_1344459886205_0002_m_000754 Task Transitioned from RUNNING to SUCCEEDED
2012-08-08 23:57:33,875 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl: Num completed Tasks: 862
2012-08-09 00:08:10,263 INFO [Thread-1] org.apache.hadoop.mapreduce.v2.app.MRAppMaster: MRAppMaster received a signal. Signaling RMCommunicator and JobHistoryEventHandler.
2012-08-09 00:08:10,263 INFO [Thread-1] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: RMCommunicator notified that iSignalled was : true
2012-08-09 00:08:10,263 INFO [Thread-1] org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: JobHistoryEventHandler notified that isSignalled was true
2012-08-09 00:08:10,263 INFO [Thread-1] org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: Stopping JobHistoryEventHandler. Size of the outstanding queue size is 0
2012-08-09 00:08:10,263 INFO [Thread-50] org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: EventQueue take interrupted. Returning
{code}

We'd expect failure conditions like this one to be exactly the cases where job history availability matters most.

The job history can still be accessed manually through the aggregated logs on HDFS, but the job history server has no record of the above job after the timeout.

                
> Job logs not accessible through job history server for AM killed due to am.liveness-monitor expiry
> --------------------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-4559
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4559
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>    Affects Versions: 2.0.0-alpha
>            Reporter: Rahul Jain
>

