Posted to common-dev@hadoop.apache.org by "Andy Konwinski (JIRA)" <ji...@apache.org> on 2009/05/07 12:24:31 UTC

[jira] Commented: (HADOOP-2141) speculative execution start up condition based on completion time

    [ https://issues.apache.org/jira/browse/HADOOP-2141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12706790#action_12706790 ] 

Andy Konwinski commented on HADOOP-2141:
----------------------------------------

The current patch contains the changes discussed (see my responses below).

2. We now use the task dispatch time from the JT as the base time for estimating progress, so that the time estimates are accurate and also account for laggard behavior of a node caused by network problems or latency.
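To illustrate the idea (a minimal sketch with hypothetical names, not the patch's actual code):

{code}
// Estimate a task's finish time using the JobTracker-side dispatch time
// as the base, so network latency between TT heartbeats is folded into
// the observed progress rate rather than hidden from it.
class CompletionEstimate {
  static long estimateEndTime(long dispatchTime, long now, float progress) {
    if (progress <= 0.0f) {
      return Long.MAX_VALUE;             // no progress reported yet
    }
    long elapsed = now - dispatchTime;   // measured from dispatch, not task start
    return dispatchTime + (long) (elapsed / progress);
  }
}
{code}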

5. I've put locality preference back in for speculative maps.

6. I implemented isSlowTracker as I described above; the number of standard deviations that a TT has to fall below the global average is specified in the conf file (with a default of 1 std).
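Roughly like this (illustrative names only; I'm reading "below the global average" as "worse than average", i.e. longer average task duration):

{code}
import java.util.Collection;

class SlowTrackerCheck {
  // slowNodeThreshold comes from the conf file; default is 1 std dev.
  static boolean isSlowTracker(double trackerAvgDuration,
                               Collection<Long> completedDurations,
                               double slowNodeThreshold) {
    double mean = 0.0;
    for (long d : completedDurations) {
      mean += d;
    }
    mean /= completedDurations.size();

    double variance = 0.0;
    for (long d : completedDurations) {
      variance += (d - mean) * (d - mean);
    }
    double std = Math.sqrt(variance / completedDurations.size());

    // Slow = this tracker's average task duration is more than
    // slowNodeThreshold standard deviations longer than the job-wide mean.
    return trackerAvgDuration > mean + slowNodeThreshold * std;
  }
}
{code}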

7. I removed this filter; we now allow speculation whenever a job has more than one task.

* Also, I changed the filter that only allows tasks that have run for more than a minute to be speculated. For now I've set the threshold to 0, so no tasks are filtered out, but this way we can keep an eye on it while testing and easily turn the filter back on if we want it. I think it is just a remnant of the original speculative execution heuristic.
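In other words (the property name here is made up for illustration):

{code}
import org.apache.hadoop.conf.Configuration;

class RuntimeFilter {
  // A threshold of 0 disables the filter entirely; raising it restores
  // the old "must run for a while before speculation" behavior.
  static boolean ranLongEnough(Configuration conf, long taskRunMillis) {
    long minRuntime = conf.getLong("mapred.speculative.minimum.runtime", 0L);
    return taskRunMillis >= minRuntime;
  }
}
{code}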

I have been testing this patch on small sort jobs on a 10-node EC2 cluster for a couple of days now. I've been simulating laggards by running nice -n -20 ruby -e "while true;;end" busy loops as well as dd if=/dev/zero of=/tmp/tmpfile bs=100000. Hopefully large-scale testing will flush out any bugs I've missed.

Other thoughts and some ideas for near term future work:

* As we've already discussed, after this patch gets tested and committed we should update the way we calculate task progress, probably normalizing by each task's input data size. We might also consider using only the first two phases of the reduce tasks to estimate Task Tracker performance, because we know more about the behavior of those phases.
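For instance, a sketch of the normalization, under the assumption that we track bytes of input per task:

{code}
// Normalize the progress rate by input size so a task that simply has
// more data to chew through is not mistaken for a laggard.
class NormalizedProgress {
  static double bytesPerMilli(long inputBytes, float progress, long elapsedMillis) {
    return (inputBytes * (double) progress) / elapsedMillis;
  }
}
{code}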

* We should further improve isSlowTracker() with regard to how we handle Task Trackers that have not reported any successful tasks for this job. Right now, if a TT 1) is really slow, 2) was added to the cluster near the end of a job, or 3) is in a cluster larger than the job (so the job's tasks are spread thinly), then it might not have reported a successful task by the time we start looking to run speculative tasks. In that case we can't tell whether the tracker is a laggard, since we use a TT's history to determine if it is slow; currently we just assume it might be one, so isSlowTracker() returns true. In the near future it would be better to allow assignment of a speculative task to a TT if (see the sketch after this list):
1) the TT has already run at least one successful task for this job and its average task duration is not more than slowNodeThreshold standard deviations worse than the average task duration of all completed tasks for this job; or
2) the TT has not completed any tasks for this job yet (i.e., it has been assigned a task for this job but the task has not finished).
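Something like (illustrative names only):

{code}
// Proposed rule: a TT may receive a speculative task if its history for
// this job is acceptable, or if it has no completed tasks to judge by.
class SpeculativeAssignment {
  static boolean canAssign(int completedTasksOnTracker,
                           double trackerAvgDuration,
                           double jobMeanDuration,
                           double jobStdDuration,
                           double slowNodeThreshold) {
    if (completedTasksOnTracker == 0) {
      return true;  // case 2: no history for this job yet
    }
    // Case 1: the tracker's average duration is within the threshold.
    return trackerAvgDuration
        <= jobMeanDuration + slowNodeThreshold * jobStdDuration;
  }
}
{code}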

* Finally, we might want to think up some unit test cases for speculative execution.

> speculative execution start up condition based on completion time
> -----------------------------------------------------------------
>
>                 Key: HADOOP-2141
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2141
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: mapred
>    Affects Versions: 0.21.0
>            Reporter: Koji Noguchi
>            Assignee: Andy Konwinski
>         Attachments: 2141.patch, HADOOP-2141-v2.patch, HADOOP-2141-v3.patch, HADOOP-2141-v4.patch, HADOOP-2141-v5.patch, HADOOP-2141-v6.patch, HADOOP-2141.patch, HADOOP-2141.v7.patch
>
>
> We had one job with speculative execution hang.
> 4 reduce tasks were stuck with 95% completion because of a bad disk. 
> Devaraj pointed out 
> bq. One of the conditions that must be met for launching a speculative instance of a task is that it must be at least 20% behind the average progress, and this is not true here.
> It would be nice if speculative execution also starts up when tasks stop making progress.
> Devaraj suggested 
> bq. Maybe, we should introduce a condition for average completion time for tasks in the speculative execution check. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.