Posted to mapreduce-issues@hadoop.apache.org by "Robert Joseph Evans (JIRA)" <ji...@apache.org> on 2012/11/01 17:23:12 UTC
[jira] [Commented] (MAPREDUCE-4749) Killing multiple attempts of a task takes longer as more attempts are killed

    [ https://issues.apache.org/jira/browse/MAPREDUCE-4749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13488802#comment-13488802 ] 

Robert Joseph Evans commented on MAPREDUCE-4749:
------------------------------------------------

If there are lots and lots of events for a job that is localizing, then there could be a pause for each of these events, and yes, it would slow the queue down, even to the point prior to MAPREDUCE-4088 when all events would wait for the job to finish localizing. But the common case is much faster than the worst case, not that it is much comfort when you hit the worst case :). We could mitigate this by dropping the wait time to something smaller like 100ms so it would take 50 times as many events to slow it down the same amount.
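
To put rough numbers on the 100ms suggestion, here is a minimal sketch of the arithmetic. It assumes the current wait is 5000ms, which is only inferred from the "50 times" figure above and not checked against the code; the class and method names are made up for illustration.

```java
class WaitTimeSketch {
    // Worst-case extra delay when every queued event belongs to a
    // localizing job and each one costs a full wait.
    static long worstCaseDelayMs(long events, long waitMs) {
        return events * waitMs;
    }

    // Number of events needed to accumulate a given delay at a given wait.
    static long eventsForDelay(long delayMs, long waitMs) {
        return delayMs / waitMs;
    }

    public static void main(String[] args) {
        long assumedCurrentWaitMs = 5000; // assumption: implied by "50 times", not verified
        long proposedWaitMs = 100;        // the smaller wait suggested above

        long delayMs = worstCaseDelayMs(10, assumedCurrentWaitMs);
        System.out.println("delay from 10 events at current wait: " + delayMs + "ms");
        System.out.println("events needed at proposed wait for the same delay: "
                + eventsForDelay(delayMs, proposedWaitMs));
    }
}
```

The shorter wait does not remove the worst case; it only raises the number of events needed to reach it.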

I also agree that the tight loop will only happen when *ALL* the present actions in the queue are tainted. But I don't agree that it should be rare. I think it is quite common to have a single event in the queue, or to have all of the events in the queue be for a single job that is localizing, especially if all of the other jobs on this node are done localizing so their events get processed quickly and removed from the queue. The only time the thread would not be running is when the queue is empty. I have not collected any real-world numbers, so I don't know how often that actually happens in practice, or what percentage of the running time is spent just checking. If you feel that the extra CPU utilization is worth this, then go ahead and remove the wait. I am not opposed to it. I just wanted to point out the consequences of doing so. Also, if you remove the wait, we should look at whether we can remove the notify calls from the job as well. If no one is ever going to wait, the notify calls become dead code.
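
To make the trade-off concrete, here is a rough sketch of the pattern I am describing. It is not the actual TaskTracker code: Job, Action, drain and finishLater are hypothetical stand-ins. With waitMs > 0 the thread blocks on the job until it is notified or the wait times out; with waitMs == 0 it re-checks the same tainted action in a tight loop, and the notifyAll() in the job has no waiter to wake.

```java
import java.util.ArrayDeque;
import java.util.Deque;

class LocalizingWaitSketch {
    /** Hypothetical stand-in for a job; not the real Hadoop class. */
    static final class Job {
        private boolean localizing = true;

        synchronized boolean isLocalizing() { return localizing; }

        synchronized void finishLocalizing() {
            localizing = false;
            notifyAll(); // dead code if no thread ever waits on this job
        }

        /** Blocks until notified or until timeoutMs (> 0) elapses. */
        synchronized void awaitLocalized(long timeoutMs) {
            if (!localizing) return;
            try {
                wait(timeoutMs); // caller re-checks the flag, so spurious wakeups are harmless
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }

    /** Hypothetical queued action that belongs to one job. */
    static final class Action {
        final Job job;
        Action(Job job) { this.job = job; }
    }

    /** Drains the queue; returns how often a tainted action was re-checked. */
    static long drain(Deque<Action> queue, long waitMs) {
        long rechecks = 0;
        while (!queue.isEmpty() && !Thread.currentThread().isInterrupted()) {
            Action action = queue.pollFirst();
            if (action.job.isLocalizing()) {
                rechecks++;
                if (waitMs > 0) {
                    action.job.awaitLocalized(waitMs); // pause instead of spinning
                }
                queue.addLast(action); // requeue and look at it again later
            }
            // A non-localizing action would be processed here and dropped.
        }
        return rechecks;
    }

    /** Simulates localization finishing on another thread after delayMs. */
    static Thread finishLater(Job job, long delayMs) {
        Thread t = new Thread(() -> {
            try { Thread.sleep(delayMs); } catch (InterruptedException e) { return; }
            job.finishLocalizing();
        });
        t.setDaemon(true);
        t.start();
        return t;
    }

    public static void main(String[] args) {
        for (long waitMs : new long[] {5000, 0}) {
            Job job = new Job();
            Deque<Action> queue = new ArrayDeque<>();
            queue.add(new Action(job));
            finishLater(job, 200);
            System.out.println("waitMs=" + waitMs + " rechecks=" + drain(queue, waitMs));
        }
    }
}
```

Running main with a single tainted action in the queue should show a small re-check count when the wait is in place and a much larger one when it is removed, which is the extra CPU cost I mean. The exact counts depend on the machine.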

That being said, I agree with you, Vinod, that having separate queues is a better solution overall, but it is also a much larger change, and I don't know that it would provide enough additional benefit to justify the risk of such a change.
                
> Killing multiple attempts of a task takes longer as more attempts are killed
> ----------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-4749
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4749
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>    Affects Versions: 1.1.0
>            Reporter: Arpit Gupta
>            Assignee: Arpit Gupta
>         Attachments: MAPREDUCE-4749.branch-1.patch
>
>
> The following was noticed on a mr job running on hadoop 1.1.0
> 1. Start an mr job with 1 mapper
> 2. Wait for a min
> 3. Kill the first attempt of the mapper and then subsequently kill the other 3 attempts in order to fail the job
> The time taken to kill the task grew exponentially.
> 1st attempt was killed immediately.
> 2nd attempt took a little over a min
> 3rd attempt took approx. 20 mins
> 4th attempt took around 3 hrs.
> The command used to kill the attempt was "hadoop job -fail-task"
> Note that the command returned as soon as the fail request was accepted, but the attempt was not actually killed until the times stated above.
