Posted to mapreduce-issues@hadoop.apache.org by "Tao Jie (JIRA)" <ji...@apache.org> on 2018/06/20 04:15:00 UTC

[jira] [Commented] (MAPREDUCE-7110) Support delayed retry for MR task attempts

    [ https://issues.apache.org/jira/browse/MAPREDUCE-7110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16517779#comment-16517779 ] 

Tao Jie commented on MAPREDUCE-7110:
------------------------------------

Added a new parameter, {mapreduce.job.task.delayed.retry.factor.ms}. If it is {0} (the default), task retries are not delayed, which matches the current logic. When it is set to a positive value (e.g. 5000), the first retry starts immediately, the second retry is delayed by 5000ms, the third by 2 * 5000ms, the next by 4 * 5000ms, and so on.
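
To make the schedule concrete, here is a minimal sketch of the delay calculation described above. It is not taken from the attached patches: the class name DelayedRetrySketch, the method retryDelayMs, and the overflow cap are assumptions made for illustration only. The factorMs argument stands for the value of {mapreduce.job.task.delayed.retry.factor.ms}.

```java
// Illustrative sketch only; not the code in MAPREDUCE-7110.001/002.patch.
// Class and method names are hypothetical.
final class DelayedRetrySketch {
    private DelayedRetrySketch() {}

    /**
     * Returns the delay before a retry, in milliseconds.
     *
     * @param retryNumber 1 for the first retry, 2 for the second, and so on
     * @param factorMs    value of mapreduce.job.task.delayed.retry.factor.ms
     */
    static long retryDelayMs(int retryNumber, long factorMs) {
        // A factor of 0 (the default) disables delays; the first retry
        // is never delayed.
        if (factorMs <= 0 || retryNumber <= 1) {
            return 0L;
        }
        // Second retry: factor * 1, third: factor * 2, fourth: factor * 4, ...
        // The shift is capped (an assumption) so the multiplier stays positive.
        int shift = Math.min(retryNumber - 2, 62);
        long multiplier = 1L << shift;
        // Saturate instead of overflowing for very large inputs.
        if (factorMs > Long.MAX_VALUE / multiplier) {
            return Long.MAX_VALUE;
        }
        return factorMs * multiplier;
    }
}
```

With a factor of 5000, retryDelayMs returns 0, 5000, 10000 and 20000 for retries 1 through 4, which matches the sequence described in the comment. A real implementation would probably also want an upper bound on the delay, but that is not something this comment specifies.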

> Support delayed retry for MR task attempts
> ------------------------------------------
>
>                 Key: MAPREDUCE-7110
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-7110
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>    Affects Versions: 2.8.2, 3.1.0
>            Reporter: Tao Jie
>            Assignee: Tao Jie
>            Priority: Major
>         Attachments: MAPREDUCE-7110.001.patch, MAPREDUCE-7110.002.patch
>
>
> Today, when a map/reduce task fails, it is retried up to 4 times by default until it succeeds.
> In our production cluster, datanodes may be offline for a while. If the 3 datanodes that hold the replicas of a block go offline at the same time, a map attempt that reads that block will fail. However, in the current logic the appmaster launches the retry attempts immediately, and the retries will very likely fail again if those datanodes do not recover very soon. As a result, the job can fail even though it has been running for several hours.
> In such a situation, we could have a delayed retry mechanism. For example, the first retry could start immediately, the second retry could wait for 10s, and the third retry could wait longer.
> It could be an option, especially for jobs that run for a long time, and it would not modify the current logic by default.
> Does it make sense? Any thoughts?


