Posted to issues@mesos.apache.org by "Michael Gummelt (JIRA)" <ji...@apache.org> on 2016/08/31 17:28:21 UTC

[jira] [Issue Comment Deleted] (MESOS-6112) Frameworks are starved when > 5 are run concurrently

     [ https://issues.apache.org/jira/browse/MESOS-6112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Gummelt updated MESOS-6112:
-----------------------------------
    Comment: was deleted

(was: > When a framework declines an offer for 5s it says "I don't need these particular resources for the next 5s".

Sort of.  My scheduler (e.g. Kafka) is really saying "I don't need these particular resources right now.  I don't know when I may need them in the future.  Here's a timeout that represents some tradeoff I've determined between latency and good citizenship (fairness)."

>  Or, even better, call suppressOffers()? Is it hard to understand / implement?

I can call {{suppressOffers()}}.  You're right, it's not that hard.  But it only partially solves the problem.  There will still exist practically unbounded periods of time when I can't suppress.  For example, when one of my data nodes fails, I'll try to wait until its persistent volume is offered back to me.

But the larger issue is that solutions such as this require all frameworks to be good citizens, which is brittle and unscalable.



)

> Frameworks are starved when > 5 are run concurrently
> ----------------------------------------------------
>
>                 Key: MESOS-6112
>                 URL: https://issues.apache.org/jira/browse/MESOS-6112
>             Project: Mesos
>          Issue Type: Task
>          Components: allocation, master
>    Affects Versions: 1.0.1
>            Reporter: Michael Gummelt
>
> As I understand it, the master sends an offer to the frameworks one at a time, in DRF order, until one of them accepts it.  It waits 1s between successive offers.  Once the decline timeout for the first framework has expired, the master starts over at the beginning of the list rather than continuing to offer to the remaining frameworks, which starves them.
> This means that for Mesos to support > 5 concurrent frameworks, every framework must be a good citizen and either set its decline timeout to something large or suppress offers.  I think this is a fairly undesirable state of affairs.
> I propose that the master instead continue to submit the offer to every registered framework, even if the declineOffer timeout has expired.
> This change could increase task startup latency, but that can be offset in part if we also make the master smarter about how long to wait between successive offers, rather than always waiting a static 1s.
>   
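The behavior described above can be illustrated with a toy simulation. This is only a sketch of the reporter's stated understanding, not the actual Mesos allocator: it assumes a single offer, a fixed DRF order, one allocation cycle per second, and frameworks that always decline with a 5s timeout. The names `simulate`, `restart_from_head`, and `refuse_seconds` are hypothetical and chosen for illustration.

```python
def simulate(num_frameworks, ticks, refuse_seconds=5, restart_from_head=True):
    """Count how many times each framework is shown the single offer.

    Toy model (assumption, not the real allocator): every framework
    declines, and a decline filters that framework for `refuse_seconds`.
    """
    filtered_until = [0] * num_frameworks  # time at which each decline filter expires
    offers_seen = [0] * num_frameworks
    cursor = 0  # next framework to try; only matters for the proposed behavior

    for now in range(ticks):  # one allocation cycle per second
        if restart_from_head:
            # Described behavior: always scan from the head of the DRF order.
            candidates = range(num_frameworks)
        else:
            # Proposed behavior: resume after the last framework offered to.
            candidates = [(cursor + i) % num_frameworks
                          for i in range(num_frameworks)]

        for fw in candidates:
            if now >= filtered_until[fw]:
                offers_seen[fw] += 1
                filtered_until[fw] = now + refuse_seconds  # framework declines
                cursor = (fw + 1) % num_frameworks
                break  # the offer goes to one framework per cycle

    return offers_seen


current = simulate(num_frameworks=8, ticks=40)
proposed = simulate(num_frameworks=8, ticks=40, restart_from_head=False)
print(current)   # [8, 8, 8, 8, 8, 0, 0, 0]
print(proposed)  # [5, 5, 5, 5, 5, 5, 5, 5]
```

Under these assumptions, the first framework's 5s filter expires just as the sixth framework would have been reached, so frameworks beyond the fifth never see the offer. Resuming from where the previous cycle stopped gives every framework a turn, at the cost of each framework seeing the offer less often.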



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)