Posted to common-dev@hadoop.apache.org by "Amar Kamat (JIRA)" <ji...@apache.org> on 2008/05/02 08:18:56 UTC

[jira] Commented: (HADOOP-3327) Shuffling fetchers waited too long between map output fetch re-tries

    [ https://issues.apache.org/jira/browse/HADOOP-3327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12593722#action_12593722 ] 

Amar Kamat commented on HADOOP-3327:
------------------------------------

Runping mentioned that the map takes roughly 7 min. Looking at the logs:
{quote}
2008-04-30 17:32:49,981 INFO org.apache.hadoop.mapred.JobInProgress: Failed fetch notification #1 for task task_200804301615_0003_m_000756_0
2008-04-30 17:45:38,438 INFO org.apache.hadoop.mapred.JobInProgress: Failed fetch notification #2 for task task_200804301615_0003_m_000756_0
2008-04-30 17:56:43,950 INFO org.apache.hadoop.mapred.JobInProgress: Failed fetch notification #3 for task task_200804301615_0003_m_000756_0
{quote}
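The gaps between these notifications can be checked with a short Python sketch. The timestamps are copied from the log lines above; the helper name gaps_in_minutes is only illustrative and not part of Hadoop.

```python
from datetime import datetime

# Timestamps copied from the three "Failed fetch notification" log lines above.
notifications = [
    "2008-04-30 17:32:49,981",
    "2008-04-30 17:45:38,438",
    "2008-04-30 17:56:43,950",
]

def gaps_in_minutes(stamps):
    """Return the gap, in minutes, between each pair of consecutive timestamps."""
    times = [datetime.strptime(s, "%Y-%m-%d %H:%M:%S,%f") for s in stamps]
    return [(b - a).total_seconds() / 60 for a, b in zip(times, times[1:])]

for i, gap in enumerate(gaps_in_minutes(notifications), start=1):
    print(f"notification #{i} -> #{i + 1}: {gap:.1f} min")
```

The two gaps come out at roughly 12.8 min and 11.1 min.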
Consider the following:
1) The read timeout for the shuffler is 3 min.
2) The total time for sending one fetch-failure-notification would be ~7 min (determined by the map runtime).
3) The first time, the reducer will back off exponentially.
||attempt #||backoff||timeout||cumulative time||
|0|0|3 min|3 min|
|1|4 sec|3 min|4 sec + 6 min|
|2|8 sec|3 min|12 sec + 9 min|
|3|16 sec|3 min|28 sec + 12 min|
|4|32 sec|3 min|60 sec + 15 min|
|5|64 sec|3 min|124 sec + 18 min|
|6|128 sec|3 min|252 sec + 21 min|
|7|256 sec|3 min|508 sec + 24 min|
i.e., in total the reducer waits about 32.5 min (24 min + 508 sec) before sending the first failure notification.
4) After (3), the fetch will be attempted twice, each time with a 7/2 min backoff, before sending the fetch-failure-notification.
||attempt #||backoff||timeout||cumulative time||
|1|3.5 min|3 min|6.5 min|
|2|3.5 min|3 min|13 min|
i.e., a total of 13 min between the 2nd and 3rd failure notifications.
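The arithmetic in items 1) to 4) can be reproduced with a small Python sketch. It models only what this comment describes (a 3 min read timeout, a 4 sec backoff that doubles on each retry, then two retries with a backoff of half the map runtime). The constant and function names are hypothetical and are not taken from the Hadoop source.

```python
READ_TIMEOUT_SEC = 3 * 60   # item 1: shuffle read timeout
MAP_RUNTIME_SEC = 7 * 60    # item 2: approximate map runtime

def initial_phase_wait(attempts=8, first_backoff_sec=4):
    """Backoff and timeout totals (seconds) before the first notification.

    Attempt 0 has no backoff; attempt n >= 1 backs off for
    first_backoff_sec * 2**(n - 1). Every attempt then blocks for the
    full read timeout.
    """
    backoff = sum(first_backoff_sec * 2 ** (n - 1) for n in range(1, attempts))
    timeout = attempts * READ_TIMEOUT_SEC
    return backoff, timeout

def later_phase_wait(attempts=2):
    """Backoff and timeout totals (seconds) between later notifications.

    Each attempt backs off for half the map runtime, then blocks for the
    full read timeout.
    """
    backoff = attempts * MAP_RUNTIME_SEC / 2
    timeout = attempts * READ_TIMEOUT_SEC
    return backoff, timeout

for label, (backoff, timeout) in [
    ("before first notification", initial_phase_wait()),
    ("between later notifications", later_phase_wait()),
]:
    total = backoff + timeout
    print(f"{label}: {total / 60:.1f} min, "
          f"read timeout share {timeout / total:.0%}")
```

Under these assumptions the read timeout accounts for about 74% of the wait before the first notification and about 46% of the wait between later ones.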
----
The problem is that, in this case, the read timeout becomes significant compared to the total backoff and the map runtime.

> Shuffling fetchers waited too long between map output fetch re-tries
> --------------------------------------------------------------------
>
>                 Key: HADOOP-3327
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3327
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>            Reporter: Runping Qi
>

