Posted to common-dev@hadoop.apache.org by "Christian Kunz (JIRA)" <ji...@apache.org> on 2007/04/30 17:55:15 UTC

[jira] Created: (HADOOP-1304) MAX_TASK_FAILURES should be configurable

MAX_TASK_FAILURES should be configurable
----------------------------------------

                 Key: HADOOP-1304
                 URL: https://issues.apache.org/jira/browse/HADOOP-1304
             Project: Hadoop
          Issue Type: Improvement
          Components: mapred
    Affects Versions: 0.12.3
            Reporter: Christian Kunz


After a couple of weeks of failed attempts I was able to finish a large job only after I changed MAX_TASK_FAILURES to a higher value. In light of HADOOP-1144 (allowing a certain amount of task failures without failing the job) it would be even better if this value could be configured separately for mappers and reducers, because often a success of a job requires the success of all reducers but not of all mappers.



[jira] Commented: (HADOOP-1304) MAX_TASK_FAILURES should be configurable

Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-1304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12492936 ] 

Andrzej Bialecki  commented on HADOOP-1304:
-------------------------------------------

+1 for per-job Expert configuration, with defaults set to 0 failures.



[jira] Commented: (HADOOP-1304) MAX_TASK_FAILURES should be configurable

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-1304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12493347 ] 

Hadoop QA commented on HADOOP-1304:
-----------------------------------

Integrated in Hadoop-Nightly #77 (See http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Nightly/77/)



[jira] Updated: (HADOOP-1304) MAX_TASK_FAILURES should be configurable

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-1304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doug Cutting updated HADOOP-1304:
---------------------------------

    Comment: was deleted



[jira] Commented: (HADOOP-1304) MAX_TASK_FAILURES should be configurable

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-1304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12492739 ] 

Hadoop QA commented on HADOOP-1304:
-----------------------------------

-1, could not apply patch.

The patch command could not apply the latest attachment http://issues.apache.org/jira/secure/attachment/12356525/1304.patch as a patch to trunk revision r533233.

Console output: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/94/console

Please note that this message is automatically generated and may represent a problem with the automation system and not the patch.



[jira] Commented: (HADOOP-1304) MAX_TASK_FAILURES should be configurable

Posted by "Arun C Murthy (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-1304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12492935 ] 

Arun C Murthy commented on HADOOP-1304:
---------------------------------------

I think the right direction is to make them per-job, so that only users who want higher or lower values set them, while everyone else keeps the usual defaults...



[jira] Updated: (HADOOP-1304) MAX_TASK_FAILURES should be configurable

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-1304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doug Cutting updated HADOOP-1304:
---------------------------------

       Resolution: Fixed
    Fix Version/s: 0.13.0
           Status: Resolved  (was: Patch Available)

I just committed this.  Thanks, Devaraj.





[jira] Commented: (HADOOP-1304) MAX_TASK_FAILURES should be configurable

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-1304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12492768 ] 

Doug Cutting commented on HADOOP-1304:
--------------------------------------

Lucene uses "Expert:" in the javadoc to mark features that most users should not need.  For example, search for "Expert" on http://lucene.zones.apache.org:8080/hudson/job/Lucene-Nightly/javadoc/org/apache/lucene/index/IndexWriter.html

In Hadoop we should perhaps adopt this convention, placing "Expert:" at the start of javadoc and at the start of descriptions in hadoop-default.xml.
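As a sketch of what that convention might look like on one of the accessors discussed in this issue (the method name and wording below are illustrative, not committed code):

    /**
     * Expert: Set the maximum number of attempts that will be made to run a
     * map task, i.e. the value of the mapred.map.max.attempts property.
     * Most jobs should not need to change this.
     */
    public void setMaxMapAttempts(int n) {
      setInt("mapred.map.max.attempts", n);
    }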



[jira] Updated: (HADOOP-1304) MAX_TASK_FAILURES should be configurable

Posted by "Devaraj Das (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-1304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Devaraj Das updated HADOOP-1304:
--------------------------------

    Status: Patch Available  (was: Open)



[jira] Assigned: (HADOOP-1304) MAX_TASK_FAILURES should be configurable

Posted by "Devaraj Das (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-1304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Devaraj Das reassigned HADOOP-1304:
-----------------------------------

    Assignee: Devaraj Das



Re: [jira] Commented: (HADOOP-1304) MAX_TASK_FAILURES should be configurable

Posted by Arkady Borkovsky <ar...@yahoo-inc.com>.
Here is a somewhat different but related issue: it would be useful to make the framework distinguish between deterministic and non-deterministic failures and react differently to them.

E.g.
-- in streaming, a Perl script has a syntax error. There is no need to check for this 4*300 times.
-- the same exception (with the same stack) is thrown while processing the same record. (Google's MapReduce is reportedly capable of skipping the offending record on the next attempt, but short of that, why keep trying?)

(Of course this is just an optimization, while 1304 is functionality one cannot do without....)
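A rough sketch of the bookkeeping this would take (purely illustrative; nothing like this exists in Hadoop or in the 1304 patch): fingerprint each failure and stop retrying a task once it has died the same way twice.

    // Hypothetical sketch: detect a repeated, identical failure for a task.
    import java.util.HashMap;
    import java.util.Map;

    class FailureSignatures {
      private final Map<String, Integer> counts = new HashMap<String, Integer>();

      // Returns true if this task has already failed with this exact
      // stack trace, which suggests the failure is deterministic.
      synchronized boolean isRepeated(String taskId, String stackTrace) {
        String key = taskId + "#" + stackTrace;
        Integer seen = counts.get(key);
        counts.put(key, seen == null ? 1 : seen + 1);
        return seen != null;
      }
    }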

-- ab



[jira] Commented: (HADOOP-1304) MAX_TASK_FAILURES should be configurable

Posted by "Arun C Murthy (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-1304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12492766 ] 

Arun C Murthy commented on HADOOP-1304:
---------------------------------------

One concern with this 'feature' is that we want a reasonable cap on what the user can set max attempts to, else we could have a situation where a user unknowingly, not maliciously, sets it to a very large value, leaving the framework vulnerable to one wrongly configured job hogging the cluster...

Also, as per a discussion with Doug, we could follow Lucene's convention of classifying this knob as 'Expert' so as to clearly signal its importance...
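One way such a cap could be enforced, as a sketch (jobConf and trackerConf stand in for the job's and the JobTracker's configurations, and the mapred.max.attempts.limit property is invented here for illustration; no such knob is in any patch on this issue):

    // Hypothetical: clamp the user-supplied value to a cluster-wide limit.
    int requested = jobConf.getInt("mapred.map.max.attempts", 4);
    int cap = trackerConf.getInt("mapred.max.attempts.limit", 20); // invented name
    int effective = Math.min(requested, cap);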



[jira] Updated: (HADOOP-1304) MAX_TASK_FAILURES should be configurable

Posted by "Devaraj Das (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-1304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Devaraj Das updated HADOOP-1304:
--------------------------------

    Status: Patch Available  (was: Open)



[jira] Commented: (HADOOP-1304) MAX_TASK_FAILURES should be configurable

Posted by "Devaraj Das (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-1304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12492838 ] 

Devaraj Das commented on HADOOP-1304:
-------------------------------------

Pasting Nigel's comments:
-----------
Some comments on the JobConf methods:
1) why no setter methods?
2) why are getters getNumMax... instead of getMaxNum...?
3) can these parameters be set per job? per instantiation of the JobTracker?
4) the @return javadoc comment should not start with the return type
5) the javadoc should document the default value (IMO)
6) the javadoc should document the property this method relates to
7) remove the word "we" from the javadoc comments -- such as "Get the configured number of maximum attempts that will be made to run a map task, as specified by the <code>mapred.map.max.attempts</code>
property. If this property is not already set, the default is 4 attempts."
-----------

I think these config items should be per JobTracker instantiation, to limit the potential for abuse. In that case job submitters cannot tweak them, and they need not be part of the JobConf at all. Do others agree with me on this?


[jira] Updated: (HADOOP-1304) MAX_TASK_FAILURES should be configurable

Posted by "Devaraj Das (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-1304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Devaraj Das updated HADOOP-1304:
--------------------------------

    Attachment: 1304.patch

Attached is a straightforward patch. Two new config items have been introduced - mapred.map.max.failures and mapred.reduce.max.failures (both have default values of 4).
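Presumably the framework then reads the values along these lines (a sketch of the lookup this patch implies; note these property names were later renamed to mapred.{map/reduce}.max.attempts, per Doug's suggestion below):

    // Lookup implied by this first patch; both default to 4.
    int maxMapFailures    = conf.getInt("mapred.map.max.failures", 4);
    int maxReduceFailures = conf.getInt("mapred.reduce.max.failures", 4);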



[jira] Commented: (HADOOP-1304) MAX_TASK_FAILURES should be configurable

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-1304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12493086 ] 

Hadoop QA commented on HADOOP-1304:
-----------------------------------

+1

http://issues.apache.org/jira/secure/attachment/12356628/1304.patch applied and successfully tested against trunk revision r534234.

Test results:   http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/101/testReport/
Console output: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/101/console



Re: [jira] Updated: (HADOOP-1304) MAX_TASK_FAILURES should be configurable

Posted by Nigel Daley <nd...@yahoo-inc.com>.
Some comments on the JobConf methods:
1) why no setter methods?
2) why are getters getNumMax... instead of getMaxNum...?
3) can these parameters be set per job? per instantiation of the JobTracker?
4) the @return javadoc comment should not start with the return type
5) the javadoc should document the default value (IMO)
6) the javadoc should document the property this method relates to
7) remove the word "we" from the javadoc comments -- such as "Get the configured number of maximum attempts that will be made to run a map task, as specified by the <code>mapred.map.max.attempts</code> property. If this property is not already set, the default is 4 attempts."



[jira] Updated: (HADOOP-1304) MAX_TASK_FAILURES should be configurable

Posted by "Devaraj Das (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-1304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Devaraj Das updated HADOOP-1304:
--------------------------------

    Attachment: 1304.patch

Attached is another patch with the JobConf accessor methods.
Regarding the potential problem that Arun raised: yes, the problem exists, but the hope is that HADOOP-785 will address these vulnerabilities.



[jira] Commented: (HADOOP-1304) MAX_TASK_FAILURES should be configurable

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-1304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12492738 ] 

Doug Cutting commented on HADOOP-1304:
--------------------------------------

This looks good.  Should we add accessor methods to JobConf?



[jira] Commented: (HADOOP-1304) MAX_TASK_FAILURES should be configurable

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-1304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12492731 ] 

Doug Cutting commented on HADOOP-1304:
--------------------------------------


I think this is a nice complement to HADOOP-1144. Perhaps these issues should be merged, even. The maximum number of attempts per task can be set with mapred.map.max.attempts and mapred.reduce.max.attempts. Once the attempt limit has been exceeded, then the task can be counted as a failure and the relevant failure percentage of HADOOP-1144 would determine whether the job proceeds. 
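Sketched as logic (illustrative only; attempts, failedMaps, totalMaps, and tolerancePercent are placeholders, not code from either issue):

    // A task is counted as failed once its attempts are exhausted...
    boolean taskFailed = attempts >= conf.getInt("mapred.map.max.attempts", 4);
    // ...and the HADOOP-1144 tolerance then decides the job's fate:
    // fail the job only if failed maps exceed the allowed percentage.
    boolean jobFails = failedMaps * 100 > tolerancePercent * totalMaps;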



[jira] Updated: (HADOOP-1304) MAX_TASK_FAILURES should be configurable

Posted by "Devaraj Das (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-1304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Devaraj Das updated HADOOP-1304:
--------------------------------

    Attachment: 1304.patch

Looks like there was a race condition between my uploading the patch and Doug commenting on it *smile*. Attached is a patch with the suggestion Doug made regarding the names of the config items. Doug, do you think we can do without merging this issue with HADOOP-1144, since this is a straightforward patch and independent of the other issue?



[jira] Updated: (HADOOP-1304) MAX_TASK_FAILURES should be configurable

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-1304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doug Cutting updated HADOOP-1304:
---------------------------------

    Status: Open  (was: Patch Available)

If these parameters are not per-job, then, yes, it makes no sense to add JobConf methods.  Long-term it may make sense to make these per-job, since some jobs may be less reliable than others, requiring more retries.  But simply making this configurable is a step in the right direction.

The current patch reads values from the JobTracker's configuration, not from the job's, yet it includes JobConf setters & getters.  So we should either (a) remove the JobConf methods from this patch; or (b) change it so these are per-job, and add "Expert:" methods to JobConf.
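The difference between (a) and (b), sketched (illustrative only; trackerConf and jobConf stand in for the JobTracker's and the job's configurations):

    // (a) cluster-wide: every job gets the JobTracker's setting.
    int clusterWide = trackerConf.getInt("mapred.map.max.attempts", 4);

    // (b) per-job: the value travels with the submitted JobConf,
    //     so each job can declare its own retry budget.
    int perJob = jobConf.getInt("mapred.map.max.attempts", 4);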



[jira] Updated: (HADOOP-1304) MAX_TASK_FAILURES should be configurable

Posted by "Devaraj Das (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-1304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Devaraj Das updated HADOOP-1304:
--------------------------------

    Attachment: 1304.patch

This patch adds the ("Expert:") setters to JobConf for the config items mapred.{map/reduce}.max.attempts. Users can now use the JobConf APIs to set these values when they submit jobs, and the framework reads them from the user's JobConf on a per-job basis.
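From a job submitter's point of view the result looks roughly like this (a minimal usage sketch; MyJob.class and the job name are placeholders, and JobConf/JobClient are from org.apache.hadoop.mapred):

    JobConf job = new JobConf(MyJob.class);
    job.setJobName("flaky-input-job");
    // Expert: retry reduces harder than maps, since all reduces must succeed.
    job.setMaxMapAttempts(4);       // keep the usual default for maps
    job.setMaxReduceAttempts(12);   // allow more attempts per reduce task
    JobClient.runJob(job);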



[jira] Commented: (HADOOP-1304) MAX_TASK_FAILURES should be configurable

Posted by "Devaraj Das (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-1304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12492713 ] 

Devaraj Das commented on HADOOP-1304:
-------------------------------------

I think HADOOP-1144 only allows tolerating a certain percentage of *map* failures; *all* the reduce tasks are supposed to execute successfully for the job to succeed. MAX_TASK_FAILURES signifies the max *attempts* that the framework will make per task (as opposed to max failures across all the tasks of the job). So unless you are saying that we should have fewer attempts for a single map (i.e., less than the hardcoded 4 attempts) and more for reduces, I don't see the need for two different config values. Am I missing something here?



[jira] Commented: (HADOOP-1304) MAX_TASK_FAILURES should be configurable

Posted by "Christian Kunz (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-1304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12492718 ] 

Christian Kunz commented on HADOOP-1304:
----------------------------------------

Assuming that HADOOP-1144 is implemented for map failures only (which is okay with me, although there might be situations where reducer failures are acceptable as well), the hardcoded value of 4 attempts for mappers might be acceptable, but I would still like to be able to increase the number of reducer attempts. To prevent abuse by job submitters, one could add a reasonable cap to that value.
