Posted to dev@hama.apache.org by "MaoYuan Xian (JIRA)" <ji...@apache.org> on 2013/08/12 06:34:47 UTC

[jira] [Created] (HAMA-793) Job fails to recover when more than one task fails at the same time, even when fault tolerance is enabled.

MaoYuan Xian created HAMA-793:
---------------------------------

             Summary: Job fails to recover when more than one task fails at the same time, even when fault tolerance is enabled.
                 Key: HAMA-793
                 URL: https://issues.apache.org/jira/browse/HAMA-793
             Project: Hama
          Issue Type: Bug
          Components: bsp core
    Affects Versions: 0.6.2
            Reporter: MaoYuan Xian
            Priority: Minor


I found that fault tolerance does not work when more than one task fails at the same time while a job is running.

The reason is that, in the schedule method of SimpleTaskScheduler, when jobResult equals false, job.kill is called, which then triggers JobInProgress.garbageCollection. The job directory is cleaned up, which makes the recovery of the job fail.

I made the following modification in SimpleTaskScheduler, which avoids the problem, but I am not sure whether it is a comprehensive solution:
{code}
-      if (Boolean.FALSE.equals(jobResult)) {
+      if ((Boolean.FALSE.equals(jobResult))
+          && (job.getStatus().getRunState() != JobStatus.RECOVERING)) {
{code}
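
To make the effect of the guard concrete, here is a minimal, self-contained sketch of the decision it changes. It is not Hama code: RecoveryGuardSketch, ToyJob, RunState, shouldKill and handleResult are hypothetical stand-ins, the run state is modeled as an enum for readability, and kill() is assumed to clean the job directory directly, as a simplification of the job.kill / garbageCollection chain described above.

```java
/** Hypothetical stand-ins for illustration only; these are not Hama's real classes. */
class RecoveryGuardSketch {

    /** Simplified run states (assumption: the real JobStatus uses its own constants). */
    enum RunState { RUNNING, RECOVERING, KILLED }

    /** Toy job that only records whether its working directory was cleaned. */
    static class ToyJob {
        RunState state;
        boolean directoryCleaned = false;

        ToyJob(RunState state) {
            this.state = state;
        }

        /** Simplified model of the reported chain: killing the job cleans its directory. */
        void kill() {
            state = RunState.KILLED;
            directoryCleaned = true;
        }
    }

    /** The guard from the patch: a failed result kills the job only if it is not recovering. */
    static boolean shouldKill(Boolean jobResult, RunState state) {
        return Boolean.FALSE.equals(jobResult) && state != RunState.RECOVERING;
    }

    /** Applies the guard after a scheduling attempt returned jobResult. */
    static void handleResult(ToyJob job, Boolean jobResult) {
        if (shouldKill(jobResult, job.state)) {
            job.kill();
        }
    }

    public static void main(String[] args) {
        // A job that is already recovering keeps its directory.
        ToyJob recovering = new ToyJob(RunState.RECOVERING);
        handleResult(recovering, Boolean.FALSE);
        System.out.println("recovering job cleaned: " + recovering.directoryCleaned); // false

        // A job that is not recovering is still killed on a failed result.
        ToyJob running = new ToyJob(RunState.RUNNING);
        handleResult(running, Boolean.FALSE);
        System.out.println("running job cleaned: " + running.directoryCleaned); // true
    }
}
```

In this toy model the guard only suppresses the kill while the job is in the recovering state. Whether other paths can still trigger the cleanup during recovery is not covered by the sketch, which is the open question about the patch being comprehensive.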
