Posted to mapreduce-issues@hadoop.apache.org by "Karthik Kambatla (JIRA)" <ji...@apache.org> on 2014/05/05 22:45:18 UTC

[jira] [Updated] (MAPREDUCE-5877) Inconsistency between JT/TT for tasks taking a long time to launch

     [ https://issues.apache.org/jira/browse/MAPREDUCE-5877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Karthik Kambatla updated MAPREDUCE-5877:
----------------------------------------

    Description: 
For tasks that take too long to launch (for genuine reasons, such as localizing large distributed caches), the JT expires the task attempt. Depending on whether job recovery is enabled and on the JT's restart state, another attempt may or may not be launched, even when the JT has not restarted. The expired attempt's status changes to "Error launching task". Meanwhile, the TT is never informed of the expiry and eventually launches the task anyway. Worse, the "new" attempt might be assigned to the same TT, leading to even more inconsistent behavior.

To work around this, one can bump up mapred.tasktracker.expiry.interval, but that leads to long TT failure-discovery times, since the same interval governs lost-tracker detection.
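
For illustration, a mapred-site.xml entry on the JT along these lines (the 30-minute value is only an example; the default is 600000 ms, i.e. 10 minutes):

    <property>
      <name>mapred.tasktracker.expiry.interval</name>
      <!-- example only: 30 minutes; raising this also delays lost-TT detection -->
      <value>1800000</value>
    </property>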

We should have a per-job timeout for task launch/heartbeat, and the JT and TT should be consistent in what they report.
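
To sketch the idea (the property name below is hypothetical and does not exist today), a job could carry its own launch-expiry override, honored by both the JT and the TT:

    <property>
      <!-- hypothetical per-job override, shown only to illustrate the proposal -->
      <name>mapred.job.task.launch.expiry.interval</name>
      <value>3600000</value>
    </property>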

  was:
For tasks that take too long to launch (for genuine reasons, such as localizing large distributed caches), the JT expires the task attempt. Depending on whether job recovery is enabled and on the JT's restart state, another attempt may or may not be launched, even when the JT has not restarted. The expired attempt's status changes to "Error launching task". Meanwhile, the TT is never informed of the expiry and eventually launches the task anyway.

To avoid this weird behavior, one can bump up mapred.tasktracker.expiry.interval, but that leads to long TT failure-discovery times.

We should have a per-job timeout for task launch/heartbeat, and the JT and TT should be consistent in what they report.


> Inconsistency between JT/TT for tasks taking a long time to launch
> ------------------------------------------------------------------
>
>                 Key: MAPREDUCE-5877
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5877
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: jobtracker, tasktracker
>    Affects Versions: 1.2.1
>            Reporter: Karthik Kambatla
>            Assignee: Karthik Kambatla
>            Priority: Critical
>
> For tasks that take too long to launch (for genuine reasons, such as localizing large distributed caches), the JT expires the task attempt. Depending on whether job recovery is enabled and on the JT's restart state, another attempt may or may not be launched, even when the JT has not restarted. The expired attempt's status changes to "Error launching task". Meanwhile, the TT is never informed of the expiry and eventually launches the task anyway. Worse, the "new" attempt might be assigned to the same TT, leading to even more inconsistent behavior.
> To work around this, one can bump up mapred.tasktracker.expiry.interval, but that leads to long TT failure-discovery times, since the same interval governs lost-tracker detection.
> We should have a per-job timeout for task launch/heartbeat, and the JT and TT should be consistent in what they report.



--
This message was sent by Atlassian JIRA
(v6.2#6252)