Posted to user@mesos.apache.org by Thomas Petr <tp...@hubspot.com> on 2016/06/17 21:27:17 UTC

Master slow to process status updates after massive killing of tasks?

Hey folks,

We got our Mesos cluster (0.28.1) into an interesting state during some
chaos monkey testing. I killed off 5 of our 16 agents to simulate an AZ
outage, and then accidentally killed off almost all running tasks (a little
more than 1,000 of our ~1,300 tasks -- not intentional but an interesting
test nonetheless!). During this time, we noticed:

- The time between our framework accepting an offer and the master
considering the task as launched spiked to ~2 minutes (which became doubly
problematic due to our 2 minute offer timeout)
- It would take up to 8 minutes for TASK_KILLED status updates from a slave
to be acknowledged by the master.
- The master logs contained tons of log lines mentioning "Performing
explicit task state reconciliation..."
- The killed agents took ~5 minutes to recover after I booted them back up.
- The whole time, resources were offered to the framework at a normal rate.

I understand that this is an exceptional situation, but does anyone have
any insight into exactly what's going on behind the scenes? It sounds like all
the status updates were backed up in a queue and the master was processing
them one at a time. Is there something we could have done better in our
framework <https://github.com/hubspot/singularity> to do this more
gracefully? Is there any sort of monitoring of the master backlog that we
can take advantage of?

Happy to include master / slave / framework logs if necessary.

Thanks,
Tom

Re: Master slow to process status updates after massive killing of tasks?

Posted by Joseph Wu <jo...@mesosphere.io>.
Looks like the master's event queue is filling up, although it's difficult
to tell what exactly is causing it.  From the numbers in the gist, it's
evident that the master has seconds to minutes of backlog.

In general, there is very little processing cost associated per "accept".
The master does, however, break an "accept" into two chunks, which are placed
into the master's event queue (FIFO).  The first chunk logs "Processing
ACCEPT call for offers..." and queues the second chunk.  The second chunk
logs "Launching task..." (assuming this is what the offer was accepted
for).  The greater the time gap between the two logs, the more backlogged
the master is.

I don't think there's enough info to pinpoint the bottleneck.  If you ran
this test again, here are my recommendations:

   - Set up a monitor (i.e. a script that polls) for
   /master/metrics/snapshot.  Look through this doc
   (http://mesos.apache.org/documentation/latest/monitoring/) to see what
   each value means.  The most interesting metrics would match the patterns
   "master/event_queue_*" and "master/messages_*".  (See the sketch after
   this list.)
   - Try to hit /__processes__ during your test, particularly when the
   master is backlogged.  This should show the state of the various event
   queues inside Mesos.  (Keep in mind that polling this endpoint *may*
   slow down Mesos.)
   - Check if Singularity is DOS-ing the master :)
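
For the first point above, here's a minimal sketch of the kind of poller I
mean -- plain Java, standard library only, with a placeholder master address,
hitting the master's /metrics/snapshot every few seconds and printing the
event-queue and message counters:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.util.Date;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    // Polls the leading master's /metrics/snapshot endpoint and prints the
    // event-queue and message counters so the backlog can be graphed over time.
    public class MasterQueueMonitor {
        // Placeholder address -- point this at your leading master.
        private static final String SNAPSHOT_URL =
            "http://mesos-master.example.com:5050/metrics/snapshot";

        // Matches keys like "master/event_queue_messages" and
        // "master/messages_status_update" in the flat JSON snapshot.
        private static final Pattern METRIC = Pattern.compile(
            "\"(master/(?:event_queue|messages)_[a-z_]+)\"\\s*:\\s*([0-9.eE+-]+)");

        public static void main(String[] args) throws Exception {
            while (true) {
                HttpURLConnection conn =
                    (HttpURLConnection) new URL(SNAPSHOT_URL).openConnection();
                conn.setConnectTimeout(5000);
                conn.setReadTimeout(5000);

                StringBuilder body = new StringBuilder();
                try (BufferedReader reader = new BufferedReader(
                         new InputStreamReader(conn.getInputStream()))) {
                    String line;
                    while ((line = reader.readLine()) != null) {
                        body.append(line);
                    }
                }

                // Some JSON encoders escape '/' as '\/', so strip backslashes
                // before matching.
                Matcher m = METRIC.matcher(body.toString().replace("\\", ""));
                System.out.println("--- " + new Date() + " ---");
                while (m.find()) {
                    System.out.println(m.group(1) + " = " + m.group(2));
                }

                Thread.sleep(5000); // Poll every 5 seconds; tune as needed.
            }
        }
    }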

>    Singularity calls reconcileTasks() every 10 minutes. How often would
>    you expect to see that log line? At the worst point, we saw it printed 637
>    times in one minute in the master logs.
>
   ^ This is a framework-initiated action.  Unfortunately, there are a lot
   of framework calls in the old scheduler driver that *could* be batched
   but are not due to backwards compatibility.  If Singularity tries to
   reconcile 500 tasks in a single reconcileTasks() call, using the old
   scheduler driver, it will make 500 calls to Mesos :(
   We suspect the HTTP API will have much better scaling in situations like
   this.  And it will be worthwhile to start migrating over to the new API.
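
In the meantime, one possible mitigation -- just a sketch, not something
Singularity does today -- is to split explicit reconciliation into smaller
batches and spread them out over time, so the master isn't hit with hundreds
of reconcile requests at once. With the old Java driver it could look roughly
like this (the batch size and delay are arbitrary):

    import java.util.ArrayList;
    import java.util.Collection;
    import java.util.List;

    import org.apache.mesos.Protos.TaskID;
    import org.apache.mesos.Protos.TaskState;
    import org.apache.mesos.Protos.TaskStatus;
    import org.apache.mesos.SchedulerDriver;

    // Sketch: reconcile known tasks in small batches with a pause between
    // batches, instead of one reconcileTasks() call covering everything.
    public class BatchedReconciler {
        private static final int BATCH_SIZE = 100;       // arbitrary
        private static final long BATCH_DELAY_MS = 5000; // arbitrary

        public static void reconcile(SchedulerDriver driver, List<TaskID> knownTaskIds)
                throws InterruptedException {
            for (int i = 0; i < knownTaskIds.size(); i += BATCH_SIZE) {
                int end = Math.min(i + BATCH_SIZE, knownTaskIds.size());
                Collection<TaskStatus> batch = new ArrayList<TaskStatus>();
                for (TaskID taskId : knownTaskIds.subList(i, end)) {
                    // Only the task ID matters here; the state field is
                    // required by the proto, and the master replies with the
                    // latest known state regardless.
                    batch.add(TaskStatus.newBuilder()
                        .setTaskId(taskId)
                        .setState(TaskState.TASK_STAGING)
                        .build());
                }
                driver.reconcileTasks(batch);
                Thread.sleep(BATCH_DELAY_MS); // Give the master room to drain its queue.
            }
        }
    }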


Re: Master slow to process status updates after massive killing of tasks?

Posted by Thomas Petr <tp...@hubspot.com>.
Thanks for the quick response, Joseph! Here are some answers:

The test:
- The agents were gracefully terminated (kill -term) and were offline for
about 10 minutes. We had plans to test other scenarios (e.g. network
partition, kill -9) but didn't get to them yet.
- The 1,000 accidentally killed tasks did not include the tasks from the
killed-off agents, but did include replacement tasks that were started in
response to the agent killings. I'd estimate about 400 tasks were lost from
the killed-off agents.
- We stopped the 5 agents at about 3:43pm, killed off the ~1000 tasks at
3:49pm, and then failed over the master at 4:25pm. Singularity caught wind
of the failover at 4:27pm, reconnected, and then everything started to
clear up after that.
- Singularity currently does not log the Offer ID, so it's not easy for me
to get the exact timing between Singularity accepting an offer and that
master line you mentioned. However, I am able to get the time between
accepting an offer and the "Launching task XXX" master line
<https://github.com/apache/mesos/blob/0.28.1/src/master/master.cpp#L3589> --
you can check out this info here:
https://gist.github.com/tpetr/fe0fecbcfa0a2c8e5889b9e70c0296e7. I have a PR
<https://github.com/HubSpot/Singularity/pull/1099> to log the Offer ID in
Singularity, so I'll be able to give you the exact timing the next time we
run the test.

The setup:
- We unfortunately weren't monitoring those metrics, but we'll keep a close
eye on them when we run this test again.
- CPU usage was nominal -- CloudWatch reports less than 5% CPU utilization
throughout the day, jumping to 10% temporarily when we failed over the
Mesos master.
- We run Singularity and the Mesos master together on 3 c3.2xlarges in AWS,
so there shouldn't be any bottleneck there.
- One interesting thing I just noticed in the master logs is that the last
"Processing ACCEPT call for offers" occurred at 3:50pm, though that could
just be because after that time, things were so lagged that all of our
offers timed out.

Singularity:
- Singularity still uses the scheduler driver's default implicit
acknowledgements. I filed a TODO to look into explicit acks (rough sketch
below), but we do very little in the thread calling statusUpdate(). The only
thing that could really slow things down is communication with ZooKeeper,
which is a possibility.
- Singularity calls reconcileTasks() every 10 minutes. How often would you
expect to see that log line? At the worst point, we saw it printed 637
times in one minute in the master logs.
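
For what it's worth, here's the rough shape of what I had in mind for that
explicit-ack TODO -- just a sketch against the old Java driver, where
persistStatusToZooKeeper() stands in for our real ZooKeeper write; none of
this is in Singularity today:

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    import org.apache.mesos.Protos.TaskStatus;
    import org.apache.mesos.SchedulerDriver;

    // Sketch: with implicit acknowledgements disabled, the ZooKeeper write
    // can happen off the driver's callback thread, and the update is
    // acknowledged only once it is durable. Other Scheduler callbacks omitted.
    public class ExplicitAckScheduler /* implements Scheduler */ {
        private final ExecutorService updateWorkers = Executors.newFixedThreadPool(4);

        public void statusUpdate(final SchedulerDriver driver, final TaskStatus status) {
            // Hand the slow work off so the callback returns quickly.
            updateWorkers.submit(new Runnable() {
                @Override
                public void run() {
                    persistStatusToZooKeeper(status);       // placeholder for the real ZK write
                    driver.acknowledgeStatusUpdate(status); // ack only after persisting
                }
            });
        }

        private void persistStatusToZooKeeper(TaskStatus status) {
            // ... existing ZooKeeper bookkeeping would go here ...
        }

        // The driver would need to be constructed with implicit
        // acknowledgements turned off, e.g.:
        //   new MesosSchedulerDriver(scheduler, frameworkInfo, masterAddress,
        //                            false /* implicitAcknowledgements */);
    }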

Thanks,
Tom


Re: Master slow to process status updates after massive killing of tasks?

Posted by Joseph Wu <jo...@mesosphere.io>.
A couple questions about your test:

- By "killed off", were your agents killed permanently (i.e. powered off)
or temporarily (e.g. a network partition)?  And how long were your agents
killed/down during the test?
- How many of the 1000 accidentally killed tasks were running on your
killed-off agents vs your normal agents?
- What's the timeframe of the test?  Did everything happen in minutes?
Hours?
- Are you only running Singularity?  Any other frameworks in the picture?

Questions about your setup:

- Are you monitoring master metrics while this is happening?  The flood of
TASK_KILLED status updates may be filling up the master's event queue
(master/event_queue_messages).
- Are you monitoring the master's CPU usage?  If maxed out, the bottleneck
is probably the master :(
- Where is your framework running relative to the master?  Is network
bandwidth/latency limited?
- Can you check the time between your framework accepting an offer and
master logging "Processing ACCEPT call for offers..."?

Questions about Singularity:

- Does Singularity handle status update acknowledgements explicitly?  Or
does it leave this up to the old scheduler driver (default)?
- When does Singularity use the reconcileTasks call?  That is the source of
the "Performing explicit task state reconciliation..." master log line.
This might be contributing to the slowness.

