Posted to mapreduce-dev@hadoop.apache.org by "Peter Bacsko (JIRA)" <ji...@apache.org> on 2018/02/05 14:30:00 UTC
[jira] [Created] (MAPREDUCE-7048) AM can still crash after MAPREDUCE-7020

Peter Bacsko created MAPREDUCE-7048:
---------------------------------------

             Summary: AM can still crash after MAPREDUCE-7020
                 Key: MAPREDUCE-7048
                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-7048
             Project: Hadoop Map/Reduce
          Issue Type: Improvement
          Components: mr-am
    Affects Versions: 3.1.0, 3.0.1, 2.10.0, 2.9.1, 2.8.4, 2.7.6
            Reporter: Peter Bacsko
            Assignee: Peter Bacsko


The test case TestUberAM#testThreadDumpOnTaskTimeout was supposed to be fixed by MAPREDUCE-7020. However, it still fails, see: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/7325/testReport/junit/org.apache.hadoop.mapreduce.v2/TestMRJobs/testThreadDumpOnTaskTimeout/ (note: other tests failed as well, but those look unrelated).

When I tried to reproduce it locally, it failed again, with an error message slightly different from the one in that build (it is actually the same message as before):

{noformat}
[INFO] -------------------------------------------------------
[INFO]  T E S T S
[INFO] -------------------------------------------------------
[INFO] Running org.apache.hadoop.mapreduce.v2.TestUberAM
[ERROR] Tests run: 1, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 128.192 s <<< FAILURE! - in org.apache.hadoop.mapreduce.v2.TestUberAM
[ERROR] testThreadDumpOnTaskTimeout(org.apache.hadoop.mapreduce.v2.TestUberAM)  Time elapsed: 79.539 s  <<< FAILURE!
java.lang.AssertionError: No AppMaster log found! expected:<1> but was:<2>
	at org.junit.Assert.fail(Assert.java:88)
	at org.junit.Assert.failNotEquals(Assert.java:743)
	at org.junit.Assert.assertEquals(Assert.java:118)
	at org.junit.Assert.assertEquals(Assert.java:555)
	at org.apache.hadoop.mapreduce.v2.TestMRJobs.testThreadDumpOnTaskTimeout(TestMRJobs.java:1228)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
	at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
	at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
	at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
	at org.junit.internal.runners.statements.FailOnTimeout$StatementThread.run(FailOnTimeout.java:74)
{noformat}

*Root cause:* {{System.exit()}} is still invoked in {{Task.statusUpdate()}}:

{noformat}
  public void statusUpdate(TaskUmbilicalProtocol umbilical) 
  throws IOException {
    int retries = MAX_RETRIES;
    while (true) {
      try {
        if (!umbilical.statusUpdate(getTaskID(), taskStatus).getTaskFound()) {
          LOG.warn("Parent died.  Exiting "+taskId);
          System.exit(66);
        }
        taskStatus.clearStatus();
        return;
        ...
{noformat}

At this point the task was not found, so {{getTaskFound()}} on the value returned by {{umbilical.statusUpdate()}} is false and {{System.exit(66)}} is called. Checking whether we are running in uber mode before exiting seems to solve the problem.
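
The idea of such a check can be sketched in isolation. The following is only a minimal, self-contained illustration, not the actual patch: the {{Umbilical}} interface, the {{uberized}} flag, the {{Outcome}} enum and the injected exit handler are all hypothetical names introduced here so the logic can run outside Hadoop. It assumes that in uber mode the task runs inside the AM's JVM, so exiting that JVM would take the AM down with it.

```java
import java.util.function.IntConsumer;

// Hypothetical sketch; none of these names come from Hadoop's Task class.
class UberAwareStatusUpdate {

    /** Stand-in for the umbilical call: reports whether the task is still known. */
    interface Umbilical {
        boolean taskFound();
    }

    /** Result of one status update attempt. */
    enum Outcome { UPDATED, TASK_GONE_UBER, EXIT_REQUESTED }

    /**
     * Performs one status update. The exit handler is injected (instead of
     * calling System.exit directly) so the behaviour can be observed in a test.
     */
    static Outcome statusUpdate(Umbilical umbilical, boolean uberized,
                                IntConsumer exitHandler) {
        if (!umbilical.taskFound()) {
            if (uberized) {
                // Assumption: the task shares the AM's JVM, so exiting the
                // JVM here would crash the AM. Report and return instead.
                return Outcome.TASK_GONE_UBER;
            }
            // Separate task JVM: exiting only terminates this task.
            exitHandler.accept(66);
            return Outcome.EXIT_REQUESTED;
        }
        return Outcome.UPDATED;
    }

    public static void main(String[] args) {
        int[] exitCode = {-1};
        IntConsumer recordExit = code -> exitCode[0] = code;

        // Task not found in uber mode: no exit is requested.
        System.out.println(statusUpdate(() -> false, true, recordExit));
        // Task not found in a separate JVM: exit code 66 is requested.
        System.out.println(statusUpdate(() -> false, false, recordExit)
                + " code=" + exitCode[0]);
    }
}
```

In real code the handler would be {{System::exit}} (or an equivalent utility); the recording handler above exists only to show that the uber-mode branch never reaches it.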


