Posted to mapreduce-issues@hadoop.apache.org by "Todd Lipcon (Created) (JIRA)" <ji...@apache.org> on 2011/10/19 09:17:10 UTC

[jira] [Created] (MAPREDUCE-3210) Support delay scheduling for node locality in MR2's capacity scheduler

Support delay scheduling for node locality in MR2's capacity scheduler
----------------------------------------------------------------------

                 Key: MAPREDUCE-3210
                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3210
             Project: Hadoop Map/Reduce
          Issue Type: Improvement
          Components: mrv2
    Affects Versions: 0.23.0
            Reporter: Todd Lipcon
            Assignee: Todd Lipcon


The capacity scheduler in MR2 doesn't support delay scheduling for achieving node-level locality. As a result, jobs exhibit poor node-level data locality even when they have good rack locality. This hurts overall job performance, especially on clusters where disk throughput is much better than network capacity. We should optionally support node-level delay scheduling heuristics similar to what the fair scheduler implements in MR1.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (MAPREDUCE-3210) Support delay scheduling for node locality in MR2's capacity scheduler

Posted by "Patrick Wendell (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAPREDUCE-3210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13178627#comment-13178627 ] 

Patrick Wendell commented on MAPREDUCE-3210:
--------------------------------------------

I'm going to be addressing this as part of MAPREDUCE-3601 and can probably just add it to the capacity scheduler as well.

Delay scheduling is going to be less efficient in MR2 due to the resource request model. Right now, when a map task needs to run, the MR AM sends three separate resource requests to the scheduler: one for a node-local container, one for a rack-local container, and one for an *any* container. However, the scheduler can't associate these in any way.

In the MR1 fair scheduler, we basically triage a given request and accept worse levels of locality as time goes on. That won't be possible in MR2, because the scheduler can't tell which requests belong to the same task. I don't see a better way than introducing some type of global delay for "any" requests and rack-local requests (the former exists already). It seems like this could lead to undesirable behaviour depending on the order in which resource requests arrive.
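
As a rough illustration of the request model described above, here is a toy Python model. It is not the YARN API; the names `requests_for_task`, `build_ask_table`, and the host and rack labels are made up for the sketch. It shows how the three asks per task turn into plain counts per location, so the link back to the task is lost.

```python
from collections import Counter

ANY = "*"  # placeholder name for an "any" request (assumption for this sketch)


def requests_for_task(host, rack):
    """Return the three independent asks one map task produces in this model."""
    return [host, rack, ANY]


def build_ask_table(tasks):
    """Aggregate asks by location name; the identity of the task is not kept."""
    table = Counter()
    for host, rack in tasks:
        table.update(requests_for_task(host, rack))
    return table


# Two tasks whose input blocks live on different hosts of the same rack.
tasks = [("host1", "rackA"), ("host2", "rackA")]
asks = build_ask_table(tasks)
total_asks = sum(asks.values())
```

In this model two tasks yield six asks, and nothing in `asks` says that the `host1` ask and one of the `rackA` asks are alternatives for the same task.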
                

[jira] [Assigned] (MAPREDUCE-3210) Support delay scheduling for node locality in MR2's capacity scheduler

Posted by "Todd Lipcon (Assigned) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/MAPREDUCE-3210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Todd Lipcon reassigned MAPREDUCE-3210:
--------------------------------------

    Assignee:     (was: Todd Lipcon)

It turns out that the major locality issues I was seeing were caused by data locality not being respected at all. This was fixed by MAPREDUCE-2693 (see also MAPREDUCE-3234).
                

[jira] [Commented] (MAPREDUCE-3210) Support delay scheduling for node locality in MR2's capacity scheduler

Posted by "Robert Joseph Evans (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAPREDUCE-3210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13178780#comment-13178780 ] 

Robert Joseph Evans commented on MAPREDUCE-3210:
------------------------------------------------

Your concern #1 is already happening. With MRv2 right now, all the requests (global, rack-local, and node-specific) are made at once. As a result, on an underused cluster all of them might be fulfilled and returned to the AM. If the AM can make use of one of the containers it will; otherwise it will release it.
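
A minimal sketch of that use-or-release behaviour, in Python rather than the real AM code. The function name `assign_or_release` and the container labels are hypothetical, and the sketch ignores locality matching entirely: it only shows that surplus grants end up released.

```python
def assign_or_release(pending_tasks, granted_containers):
    """Pair granted containers with pending tasks in order; release the surplus."""
    pending = list(pending_tasks)
    assigned = {}
    released = []
    for container in granted_containers:
        if pending:
            # A task still needs a container, so this grant is used.
            assigned[pending.pop(0)] = container
        else:
            # No task is waiting any more, so the grant is handed back.
            released.append(container)
    return assigned, released


# One task, but all three of its asks were fulfilled on an idle cluster.
assigned, released = assign_or_release(
    ["map_0"], ["c_node", "c_rack", "c_any"]
)
```

Here one container is used and two are released, which is the wasted work the comment refers to.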

Perhaps a better way to do this is to have the AM be responsible for making the requests at different times. For example, on the first heartbeat after a container is needed, only the node-local request is made. If the AM does not get a container within a specific timeout (1 heartbeat by default), then a rack-local request is added, and finally the global request is added after another timeout.
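
The staged escalation proposed above could be sketched roughly as follows. This is Python pseudocode made runnable, not AM code; `active_levels`, `LEVELS`, and the convention that the timeout is counted in heartbeats are assumptions for the example.

```python
LEVELS = ("node", "rack", "any")  # from most to least local


def active_levels(heartbeats_waited, timeout=1):
    """Locality levels the AM would be asking for after waiting this long.

    heartbeats_waited is 0 on the first heartbeat after the container
    is needed; each elapsed timeout unlocks the next, less local level.
    """
    if heartbeats_waited < 0 or timeout < 1:
        raise ValueError("heartbeats_waited must be >= 0 and timeout >= 1")
    unlocked = min(heartbeats_waited // timeout, len(LEVELS) - 1)
    return LEVELS[: unlocked + 1]


first = active_levels(0)       # node-local request only
second = active_levels(1)      # rack-local request added
third = active_levels(2)       # global request added
slower = active_levels(3, timeout=3)
```

With the default timeout of one heartbeat, the AM in this sketch asks for all three levels from the third heartbeat onwards, so the delay costs at most two heartbeats per task.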

It would be nice to make this more generic so that the requests are somehow tied together, but that would require an API change and may not be simple to do in the short term.
                

[jira] [Commented] (MAPREDUCE-3210) Support delay scheduling for node locality in MR2's capacity scheduler

Posted by "Patrick Wendell (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAPREDUCE-3210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13178634#comment-13178634 ] 

Patrick Wendell commented on MAPREDUCE-3210:
--------------------------------------------

Just to be clear what I mean:

The current approach is to only schedule "any" requests once the scheduler has failed to allocate a node-local or rack-local container anywhere for several NM check-ins. The corresponding approach for rack locality is to only schedule rack-local requests once we've had a given number of global failures scheduling node-local requests.

My concerns are:

1) If the scheduler falls back onto rack-locality, it might fulfil a request for a rack-local container which has already been taken care of via a node-local request. That container will be returned to the AM, which will have no use for it and will release it. It might take a number of rounds of offers to the AM for things to shake out correctly.

2) If a single rack is busy, it might take a long time to trigger the global failover to "any" requests.

Anyway, maybe these won't be a big deal. The first step is to just go ahead and do this and see how good an approximation it is of a model where we have associations between resource requests.
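
To make the threshold idea above concrete, here is a toy simulation in Python for a single pending task. It is only a sketch under simplifying assumptions: one miss counter shared by both thresholds, one container wanted, and the names `place`, `rack_delay`, and `any_delay` are invented for the example rather than taken from the scheduler.

```python
def place(checkins, want_host, want_rack, rack_delay, any_delay):
    """Return (check-in index, locality level) for the first allocation, or None.

    checkins is a sequence of (host, rack) pairs, one per NM check-in.
    A rack-local allocation is allowed only after rack_delay missed
    check-ins, and an "any" allocation only after any_delay misses.
    """
    misses = 0
    for index, (host, rack) in enumerate(checkins):
        if host == want_host:
            return index, "node"
        if rack == want_rack and misses >= rack_delay:
            return index, "rack"
        if misses >= any_delay:
            return index, "any"
        misses += 1  # this check-in was skipped for the task
    return None


checkins = [
    ("h3", "rB"),
    ("h2", "rA"),
    ("h4", "rB"),
    ("h2", "rA"),
    ("h1", "rA"),
]
patient = place(checkins, "h1", "rA", rack_delay=10, any_delay=10)
moderate = place(checkins, "h1", "rA", rack_delay=2, any_delay=4)
impatient = place(checkins, "h1", "rA", rack_delay=10, any_delay=2)
```

The three calls differ only in their thresholds: larger delays wait for the node-local check-in, smaller ones settle earlier for a rack-local or off-rack container. The sketch does not model concern 2, where a busy rack postpones the fallback to "any".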
                