Posted to user@mesos.apache.org by Stephen Gran <st...@piksel.com> on 2016/04/29 18:26:27 UTC

Detecting resource issues

Hello,

We're running tasks on Mesos, launched with Marathon.  We label all the
agents with AWS availability zone and VPC name, so that tasks can be
scheduled to the right set of hosts.

I've noticed something that feels like, well, maybe not a bug, but
unexpected behavior.

We launch tasks with:

    "constraints": [
        [
            "az",
            "GROUP_BY",
            "3"
        ]
    ],
    "instances": 2,

This is eu-west-1, which has 3 AZs.  We run agents in all 3 AZs.

When we tried to restart an application, no new task was started.  Digging
around, I could see Marathon declining every offer from Mesos, which led us
to look a little closer.  It turned out that the 2 tasks in the
application were running in eu-west-1a and eu-west-1b.  All the agents
in eu-west-1c were fully subscribed and could not pick up any new work.
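To make the situation above concrete, here is a minimal sketch of an even-spread rule that reproduces it.  This is a simplified model written for illustration, not Marathon's actual GROUP_BY implementation; the function name accepts_offer and its parameters are hypothetical.

```python
from collections import Counter


def accepts_offer(offer_zone, running_zones, min_groups):
    """Simplified even-spread rule (an assumption, not Marathon's code).

    An offer is accepted only if its zone currently holds the fewest
    tasks. While fewer than min_groups zones are in use, unused zones
    count as holding zero tasks.
    """
    counts = Counter(running_zones)
    # Until min_groups distinct zones are in use, the emptiest group has 0 tasks.
    fewest = 0 if len(counts) < min_groups else min(counts.values())
    return counts[offer_zone] <= fewest


# The two existing tasks sit in 1a and 1b; the constraint expects 3 groups.
running = ["eu-west-1a", "eu-west-1b"]
for zone in ["eu-west-1a", "eu-west-1b", "eu-west-1c"]:
    print(zone, accepts_offer(zone, running, min_groups=3))
```

Under this model only an offer from eu-west-1c would be accepted, so if no agent in that zone has spare capacity, every offer is declined and the new task never starts.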

Once we figured this out, it was straightforward enough to rebalance
and let things sort themselves out.

So, with that as background:

It would have been nicer if Marathon had realized that the application
would be running in only 2 of 3 AZs at both the start and the end of the
transaction, and had allowed a new task to start in either eu-west-1a or
eu-west-1b.  I can see how that might be slightly harder to account for
than just even stacking.

It would be nice if a metric "a framework keeps asking for resource and
then declining offers" was available - it may already be, but I can't
find it.  This would at least make the issue visible.

I can see the metric for declined offers, but this also increments when
the framework declines offers because it doesn't need any additional
resource, so I'm not sure if it's helpful or not here.  Perhaps I need
to look at a second order derivative to see spikes in declines?  It does
look like the number of declines went way up during this period.
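As a rough sketch of the spike idea, the cumulative decline counter can be turned into a per-second rate and compared against its own median.  The sample format, the function names and the factor of 3 are all assumptions made for illustration; this does not tell a framework that needs nothing apart from one that is stuck, it only highlights unusual bursts of declines.

```python
from statistics import median


def decline_rates(samples):
    """Convert (timestamp_seconds, cumulative_declines) samples to rates."""
    rates = []
    for (t0, c0), (t1, c1) in zip(samples, samples[1:]):
        # Skip counter resets (e.g. a master restart) and bad timestamps.
        if t1 > t0 and c1 >= c0:
            rates.append((t1, (c1 - c0) / (t1 - t0)))
    return rates


def spikes(rates, factor=3.0):
    """Return the (timestamp, rate) pairs above factor times the median rate."""
    if not rates:
        return []
    baseline = median(r for _, r in rates)
    return [(t, r) for t, r in rates if r > factor * baseline]


# Hypothetical samples taken once a minute from a cumulative counter.
samples = [(0, 100), (60, 160), (120, 220), (180, 280), (240, 880), (300, 940)]
print(spikes(decline_rates(samples)))
```

Pairing a check like this with a framework-side signal, such as tasks waiting to be launched, would be one way to separate "nothing to run" declines from "cannot place" declines.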

Like I said, I don't know if this is a bug, precisely, but it was a
poorly visible failure to use resources at a time when plenty of
resources were actually on offer.  I'd like to make these failures more
visible to the team, so any pointers would be helpful.

Cheers,

--
Stephen Gran
Senior Technical Architect

picture the possibilities | piksel.com

This message is private and confidential. If you have received this message in error, please notify the sender or servicedesk@piksel.com and remove it from your system.

Piksel Inc is a company registered in the United States New York City, 1250 Broadway, Suite 1902, New York, NY 10001. F No. = 2931986

Re: Detecting resource issues

Posted by Stephen Gran <st...@piksel.com>.
Hi,

I will raise it over there as well, but it's also a Mesos question - I'd
like to detect this sort of issue, and it seems like it should be
possible.  I'm just looking to see if anyone has already done this and
can point me in the right direction.

Ultimately, we are going to be running several frameworks on Mesos, and
it seems right to detect in one place, rather than several, that
frameworks are not getting offers that they can accept.

Cheers,

On 29/04/16 17:54, Vinod Kone wrote:
> This sounds like a feature request for marathon. Can you redirect this
> to the marathon mailing list?

-- 
Stephen Gran
Senior Technical Architect

picture the possibilities | piksel.com

Re: Detecting resource issues

Posted by Vinod Kone <vi...@apache.org>.
This sounds like a feature request for marathon. Can you redirect this to
the marathon mailing list?
