
Posted to mapreduce-issues@hadoop.apache.org by "Robert Joseph Evans (Commented) (JIRA)" <ji...@apache.org> on 2012/01/26 21:17:41 UTC

[jira] [Commented] (MAPREDUCE-3711) AppMaster recovery for Medium to large jobs take long time

    [ https://issues.apache.org/jira/browse/MAPREDUCE-3711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13194160#comment-13194160 ] 

Robert Joseph Evans commented on MAPREDUCE-3711:
------------------------------------------------

I am going to have to dig into the recovery process a bit and possibly benchmark it myself, but on the surface it looks like it is just taking a very long time to read in the entire jhist file and replay it.

If this is true, then we probably need some way to reduce the amount of data that we read in and replay.  We could perhaps fix it by also writing out a smaller checkpoint file that holds a summary of only what is needed for recovery, plus a pointer to where in the jhist file to start replaying from.
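
To make the checkpoint idea concrete, here is a rough sketch in Python rather than the actual MRAppMaster code. Everything in it is an assumption made for illustration: the JSON-lines event log standing in for the jhist file, the TASK_FINISHED event type, the file names, and the helper functions are all hypothetical. It only shows the shape of the approach: a small checkpoint stores a summary plus a byte offset, and recovery replays only the events written after that offset.

```python
import json
import os
import tempfile


def append_event(log_path, event):
    """Append one event as a JSON line; return the log size afterwards."""
    with open(log_path, "ab") as log:
        log.write((json.dumps(event) + "\n").encode("utf-8"))
    return os.path.getsize(log_path)


def write_checkpoint(ckpt_path, completed, offset):
    """Write the summary and replay offset, then rename into place."""
    tmp_path = ckpt_path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump({"completed": sorted(completed), "offset": offset}, f)
    # Rename so a reader never sees a half-written checkpoint.
    os.replace(tmp_path, ckpt_path)


def recover(log_path, ckpt_path):
    """Rebuild the completed-task set; return it and the replay count."""
    completed, offset = set(), 0
    if os.path.exists(ckpt_path):
        with open(ckpt_path) as f:
            ckpt = json.load(f)
        completed, offset = set(ckpt["completed"]), ckpt["offset"]
    replayed = 0
    with open(log_path, "rb") as log:
        log.seek(offset)  # skip everything the checkpoint already covers
        for line in log:
            event = json.loads(line)
            if event["type"] == "TASK_FINISHED":
                completed.add(event["task"])
            replayed += 1
    return completed, replayed


def demo():
    """Log 100 task completions, checkpoint after the 90th, then recover."""
    workdir = tempfile.mkdtemp()
    log_path = os.path.join(workdir, "job.log")
    ckpt_path = os.path.join(workdir, "job.ckpt")
    done = set()
    for i in range(100):
        task = "m_%06d" % i
        offset = append_event(log_path, {"type": "TASK_FINISHED", "task": task})
        done.add(task)
        if i == 89:
            write_checkpoint(ckpt_path, done, offset)
    with_ckpt = recover(log_path, ckpt_path)
    os.remove(ckpt_path)
    without_ckpt = recover(log_path, ckpt_path)
    return with_ckpt, without_ckpt
```

In this sketch, recovery with the checkpoint replays only the events logged after the checkpoint was taken, while recovery without it falls back to a full replay from offset 0. Whether that actually helps here depends on the benchmark confirming that reading and replaying the file is the bottleneck.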
                
> AppMaster recovery for Medium to large jobs take long time
> ----------------------------------------------------------
>
>                 Key: MAPREDUCE-3711
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3711
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2
>    Affects Versions: 0.23.0
>            Reporter: Siddharth Seth
>            Assignee: Robert Joseph Evans
>            Priority: Blocker
>
> Reported by [~karams]
> yarn.resourcemanager.am.max-retries=2
> Ran test cases with a sort job on 350 scale, with 16800 maps and 680 reduces:
> 1. After 70 secs of job submission, the AM was killed using kill -9. Around 3900 maps were completed and 680 reduces were
> scheduled. A second AM was started. The job completed in 980 secs. The AM took very little time to recover.
> 2. After 150 secs of job submission, the AM was killed using kill -9. Around 90% of maps were completed and 680 reduces were
> scheduled. A second AM was started. The job completed in 1000 secs. The AM recovered.
> 3. After 150 secs of job submission, the AM was killed using kill -9. Almost all maps were completed and only the 680 reduces
> were running. Recovery was too slow: the AM was still recovering after 1 hr 40 mins, when I killed the run.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira