Posted to mapreduce-issues@hadoop.apache.org by "Dick King (JIRA)" <ji...@apache.org> on 2010/07/24 02:51:51 UTC

[jira] Created: (MAPREDUCE-1967) When a reducer fails on DFS quota, the job should fail immediately

When a reducer fails on DFS quota, the job should fail immediately
------------------------------------------------------------------

                 Key: MAPREDUCE-1967
                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1967
             Project: Hadoop Map/Reduce
          Issue Type: Improvement
            Reporter: Dick King


Suppose an M/R job has so much output that the user is certain to exceed their DFS quota.  Then some of the reducers will succeed, but the job will get into a state where the remaining reducers squabble over the remaining space.  They will nibble at that space until one reducer finally fails on quota.  Its output file will be erased, and the other reducers will collectively consume the freed space until one of _them_ fails on quota.  Since the incomplete reducer that fails on quota is "chosen" essentially at random, the tasks will accumulate their failures at similar rates, and the system will have made a substantial futile investment before the job finally dies.

I would like to say that if a single reducer fails on DFS quota, the job should fail.  There may be a corner case that induces us to be less stringent than this, but at the very least we shouldn't have to await four failures by one task, the default per-task attempt limit, before shutting the job down.
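
For context, the "four failures" above is the default per-task attempt limit (mapred.reduce.max.attempts, default 4).  A minimal sketch of the blunt workaround available today, using the classic org.apache.hadoop.mapred API: cap reduce attempts at 1 so that the very first quota failure fails the job.  This is a sketch of a stopgap, not the fix proposed here, since it also sacrifices retries for genuinely transient failures.

    import org.apache.hadoop.mapred.JobConf;

    public class NoRetryJobSetup {
      // Sketch: with only one attempt allowed per reduce task, the first
      // failure of any kind (including a quota failure) fails the job.
      public static JobConf configure(JobConf conf) {
        conf.setMaxReduceAttempts(1);  // default is 4
        return conf;
      }
    }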

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (MAPREDUCE-1967) When a reducer fails on DFS quota, the job should fail immediately

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-1967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12892342#action_12892342 ] 

Doug Cutting commented on MAPREDUCE-1967:
-----------------------------------------

Perhaps this could be generalized so that there's a set of exceptions that are considered job-killing?  Quota exceptions might be in the set by default, but others might be added.
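
A hypothetical sketch of that generalization: a policy object holding a set of exception class names treated as job-killing, seeded with the HDFS space-quota exception.  Only DSQuotaExceededException is a real Hadoop class here; the policy class, its methods, and the idea of loading the set from a config key are illustrative, not an existing API.

    import java.util.HashSet;
    import java.util.Set;

    public class FatalExceptionPolicy {
      // Hypothetical set of exception classes whose appearance in any
      // task attempt should fail the whole job instead of scheduling a
      // retry; could be populated from a per-job config key.
      private final Set<String> fatalExceptionClasses = new HashSet<String>();

      public FatalExceptionPolicy() {
        // Quota exceptions are in the set by default, per the suggestion above.
        fatalExceptionClasses.add(
            "org.apache.hadoop.hdfs.protocol.DSQuotaExceededException");
      }

      public void addFatalException(String className) {
        fatalExceptionClasses.add(className);
      }

      // Walk the cause chain so a wrapped quota exception still matches.
      public boolean isJobKilling(Throwable t) {
        for (Throwable cause = t; cause != null; cause = cause.getCause()) {
          if (fatalExceptionClasses.contains(cause.getClass().getName())) {
            return true;
          }
        }
        return false;
      }
    }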



[jira] Commented: (MAPREDUCE-1967) When a reducer fails on DFS quota, the job should fail immediately

Posted by "Dick King (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-1967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12892397#action_12892397 ] 

Dick King commented on MAPREDUCE-1967:
--------------------------------------

I certainly agree with [Doug's comment of 7/26/10 12:12|https://issues.apache.org/jira/browse/MAPREDUCE-1967?focusedCommentId=12892342&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12892342].  I invite him and others to submit proposed exceptions.

Having said that, a DFS quota overflow is worse than most.  Mapreduce prefers to reschedule failed tasks promptly, so if a task fails because of a bug in its code that is triggered by its split, its doomed retries will at least happen sooner rather than later [although non-locally on at least one of the four tries].  In the case of DFS quotas the underlying cause of the failure never goes away, yet a retry is still likely to succeed whenever some other task blows quota and releases its space, so the failures keep shifting from task to task.
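
To make the failure mode concrete, here is a minimal sketch of a quota overflow as a plain HDFS client sees it; the path, quota command, and sizes are illustrative.

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class QuotaWriteDemo {
      public static void main(String[] args) throws Exception {
        // Assumes HDFS is the default filesystem and a space quota is
        // set on the target directory, e.g.:
        //   hadoop dfsadmin -setSpaceQuota 1m /user/someuser/output
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path out = new Path("/user/someuser/output/part-00000");
        byte[] megabyte = new byte[1 << 20];
        try {
          FSDataOutputStream stream = fs.create(out);
          for (int i = 0; i < 16; i++) {
            stream.write(megabyte);  // eventually exceeds the space quota
          }
          stream.close();
        } catch (IOException e) {
          // Depending on version and code path this is a typed
          // DSQuotaExceededException, possibly wrapped in a
          // RemoteException.  Today it only fails the task attempt;
          // this issue proposes failing the whole job.
          System.err.println("Write failed on quota: " + e);
        }
      }
    }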


Do people out there think we should have a one-strike policy, or maybe allow a small number of quota failures, like five?
