Posted to mapreduce-issues@hadoop.apache.org by "Jason Lowe (JIRA)" <ji...@apache.org> on 2015/03/18 15:06:38 UTC

[jira] [Updated] (MAPREDUCE-6277) Job can post multiple history files if attempt loses connection to the RM

     [ https://issues.apache.org/jira/browse/MAPREDUCE-6277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jason Lowe updated MAPREDUCE-6277:
----------------------------------
    Target Version/s: 2.7.0
             Summary: Job can post multiple history files if attempt loses connection to the RM  (was: Job In Error State Will Lost Jobhistory Of Second and Later Attempts)

Thanks for the patch, Chang.  Looks good overall, just some comments on the test:

The test sets the retry interval to 1 ms, but I notice it doesn't loop and retry.  Theoretically we could race through this code within the same millisecond, and the test would then fail for the wrong reason.  We should either set the retry interval to 0 so the call always fails, even on the first try, or introduce a small sleep (e.g., 10 ms) after initializing the object but before calling schedule.
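
To illustrate the race, here is a minimal standalone sketch.  It is not the actual allocator code: the class and method names are hypothetical, and it assumes the timeout check has the form "elapsed >= retry interval".  A frozen clock models the constructor and schedule() landing in the same millisecond.

```java
import java.util.function.LongSupplier;

// Hypothetical stand-in for the retry-timeout check; NOT the real allocator code.
final class RetryDeadlineSketch {
  private final long retryIntervalMs;
  private final LongSupplier clock;
  private final long startMs;

  RetryDeadlineSketch(long retryIntervalMs, LongSupplier clock) {
    this.retryIntervalMs = retryIntervalMs;
    this.clock = clock;
    // Record the start time when the object is initialized.
    this.startMs = clock.getAsLong();
  }

  // Assumed form of the check: give up once the elapsed time reaches the interval.
  boolean timedOut() {
    return clock.getAsLong() - startMs >= retryIntervalMs;
  }

  public static void main(String[] args) {
    // Frozen clock: initialization and the check happen in the same millisecond.
    LongSupplier frozen = () -> 1000L;
    // Interval of 1 ms: elapsed is 0, so the timeout is not reported yet.
    System.out.println(new RetryDeadlineSketch(1, frozen).timedOut()); // false
    // Interval of 0 ms: elapsed 0 >= 0, so the timeout is reported on the first try.
    System.out.println(new RetryDeadlineSketch(0, frozen).timedOut()); // true
  }
}
```

Under that assumption, an interval of 0 makes the outcome independent of timing, while a short sleep before schedule() gets the same result by making sure the clock has advanced.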

Rather than returning in the middle of the test it would be cleaner to handle it this way:

{code}
    try {
      allocator.schedule();
      Assert.fail("Should Have Exception");
    } catch (YarnRuntimeException e) {
      Assert.assertTrue(e.getMessage().contains("Could not contact RM after"));
    }
    dispatcher.await();
    Assert.assertEquals("Should Have 1 Job Event", 1,
    [...]
{code}

Nit: The lack of indentation on this continued line makes the code harder to read:
{code}
      Assert.assertEquals("Should Have 1 Job Event", 1,
      allocator.jobEvents.size());
{code}

> Job can post multiple history files if attempt loses connection to the RM
> -------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-6277
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6277
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mr-am
>    Affects Versions: 2.7.0
>            Reporter: Chang Li
>            Assignee: Chang Li
>         Attachments: YARN-3335.1.patch, YARN-3335.2.patch
>
>
> Related to the fixed issue MAPREDUCE-6230, which causes a job to get into the error state. In that situation the job's second or a later attempt could succeed, but the history files of those later attempts will all be lost. This happens because the first attempt, in the error state, copies its history file to the intermediate dir while mistakenly treating itself as the last attempt. The JobHistory server will later move the history file of that errored attempt from the intermediate dir to the done dir while ignoring the history files of the job's later attempts in the intermediate dir.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)