Posted to user@aurora.apache.org by Mohit Jaggi <mo...@uber.com> on 2017/11/10 00:33:39 UTC

Aurora pauses adding offers

Folks,
I have noticed some weird behavior in Aurora (version close to 0.18.0).
Sometimes, it shows no offers in the UI offers page. But if I tail the logs
I can see offers are coming in. I suspect they are getting enqueued for
processing by "executor" but stay there for a long time and are not
processed either due to locking or thread starvation.

I see that a lot of getJobUpdateDetails() and getTasksWithoutConfigs()
calls are being made at that time. Could these calls starve the
OfferManager (e.g. by contending for some lock)? What should I be looking
for to debug this condition further?

Mohit.

Re: Aurora pauses adding offers

Posted by Bill Farner <wf...@apache.org>.
>
> Do you mean there is a shared ZK connection for leadership and log
> replication


Nope, different connections.  The log is all managed through code linked
from libmesos, including its use of ZK.  Here's an example of some logs
from this code:

I1129 15:56:00.560479  9316 group.cpp:341] Group process
> (zookeeper-group(1)@192.168.33.7:8083) connected to ZooKeeper
> I1129 15:56:00.560817  9316 group.cpp:831] Syncing group operations: queue
> size (joins, cancels, datas) = (0, 0, 0)
> I1129 15:56:00.561249  9316 group.cpp:419] Trying to create path
> '/aurora/replicated-log' in ZooKeeper


Notice the group.cpp.  You'll also see relevant logs coming from log.cpp
and replica.cpp.


On Wed, Nov 29, 2017 at 3:25 PM, Mohit Jaggi <mo...@uber.com> wrote:

> Do you mean there is a shared ZK connection for leadership and log
> replication? I don't see "Lost leadership, committing suicide" during
> outage. I do see it at other times.
>
> On Wed, Nov 29, 2017 at 1:52 PM, Bill Farner <wf...@apache.org> wrote:
>
>> - Does log replication "maintain" ZK connections and suffer when a NIC
>>> flaps?
>>
>>
>> Maintain, yes.  Shouldn't be impacted unless the ZK session expires,
>> which would trigger a scheduler failover.
>>
>> - If only 1 of 5 ZK's have this issue, could there still be a problem?
>>
>>
>> Assuming this means 5 quorum members - no, that should not be a problem.
>>
>> If any of the above became an issue for the scheduler, it should
>> certainly manifest in logs.
>>
>>
>> On Wed, Nov 29, 2017 at 1:26 PM, Mohit Jaggi <mo...@uber.com>
>> wrote:
>>
>>> Thanks Bill. It would be the latter as this back-pressure is only needed
>>> for calls that change state. Read only calls should be quite quick to
>>> serve.
>>>
>>> One potential correlated outage we may have had is a NIC flap on a
>>> Zookeeper node. The following questions come to mind:
>>> - Does log replication "maintain" ZK connections and suffer when a NIC
>>> flaps?
>>> - If only 1 of 5 ZK's have this issue, could there still be a problem?
>>>
>>> On Wed, Nov 29, 2017 at 11:08 AM, Bill Farner <wf...@apache.org>
>>> wrote:
>>>
>>>> is there a place I can inject a pre-processor for the API calls
>>>>
>>>>
>>>> There is no off-the-shelf way to do this.  You could intercept API
>>>> calls as HTTP requests similar to the CORS filter
>>>> <https://github.com/apache/aurora/blob/master/src/main/java/org/apache/aurora/scheduler/http/api/ApiModule.java#L73>.
>>>> If you wanted to intercept specific calls and/or introspect arguments, you
>>>> would be better off binding a layer for AuroraAdmin.Iface
>>>> <https://github.com/apache/aurora/blob/master/src/main/java/org/apache/aurora/scheduler/thrift/ThriftModule.java#L30>
>>>> .
>>>>
>>>> On Tue, Nov 28, 2017 at 11:46 AM, Mohit Jaggi <mo...@uber.com>
>>>> wrote:
>>>>
>>>>> I agree with that. I also believe that the scheduler should be
>>>>> resilient in the presence of external faults. Systems that export an API
>>>>> must take defensive steps to protect themselves.
>>>>>
>>>>> If I wanted to experiment with this change without modifying Aurora
>>>>> code "inline", is there a place I can inject a pre-processor for the API
>>>>> calls?
>>>>>
>>>>> On Mon, Nov 27, 2017 at 4:59 PM, Bill Farner <wf...@apache.org>
>>>>> wrote:
>>>>>
>>>>>> I'd also suggest focusing on the source of the congestion.  Aurora
>>>>>> should offer quite high scheduling throughput, and i would rather focus
>>>>>> energy on addressing bottlenecks.
>>>>>>
>>>>>> On Mon, Nov 27, 2017 at 1:05 PM, Mohit Jaggi <mo...@uber.com>
>>>>>> wrote:
>>>>>>
>>>>>>> I think more explicit signaling is better. Increased latency can be
>>>>>>> due to other conditions like network issues etc. Right now our mitigation
>>>>>>> involves load-shedding and we would rather have load-avoidance. Indeed
>>>>>>> proxy is not a good option. Only Aurora "knows" when it wants to
>>>>>>> back-pressure.
>>>>>>>
>>>>>>> On Mon, Nov 27, 2017 at 12:58 PM, David McLaughlin <
>>>>>>> dmclaughlin@apache.org> wrote:
>>>>>>>
>>>>>>>> Any log write latency will be reflected in the overall latency of
>>>>>>>> the request. Increased request latency is one of the main ways any server
>>>>>>>> has of telling a client that it's under load. It's then up to the client to
>>>>>>>> react to this.
>>>>>>>>
>>>>>>>> If you want to throw error codes, you can put a proxy in front of
>>>>>>>> Aurora that has request timeouts - which would send 503s to clients. But
>>>>>>>> the issue with that is the requests are mostly non-idempotent so you'll
>>>>>>>> need to build reconciliation logic into it.
>>>>>>>>
>>>>>>>> On Mon, Nov 27, 2017 at 12:13 PM, Mohit Jaggi <mohit.jaggi@uber.com
>>>>>>>> > wrote:
>>>>>>>>
>>>>>>>>> Imagine something like Spinnaker using Aurora underneath to
>>>>>>>>> schedule services. That layer often "amplifies" human effort and may result
>>>>>>>>> in a lot of load on Aurora. Usually that is fine but if Aurora slowed down
>>>>>>>>> due to transient problems, it can signal that to upstream software in the
>>>>>>>>> same way that busy web servers do during cyber Monday sales :-)
>>>>>>>>>
>>>>>>>>> On Mon, Nov 27, 2017 at 12:06 PM, Bill Farner <wf...@apache.org>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> I want to let upstream software "know" that Aurora is slowing
>>>>>>>>>>> down and that it should back off
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Can you offer more detail about how Aurora is being used in this
>>>>>>>>>> regard?  I haven't seen use cases in the past that would be amenable to
>>>>>>>>>> this behavior, so i would like to understand better.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Mon, Nov 27, 2017 at 11:51 AM, Mohit Jaggi <
>>>>>>>>>> mohit.jaggi@uber.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Thanks Bill. We haven't been able to track down a specific root
>>>>>>>>>>> cause (although ZK node is known to have issues now and then but we don't
>>>>>>>>>>> have logs for the specific outages we had). We will plan to move to 0.19.x
>>>>>>>>>>> soon. In addition I want to let upstream software "know" that Aurora is
>>>>>>>>>>> slowing down and that it should back off. To achieve this I want to send
>>>>>>>>>>> 5xx error codes back when update/rollback/kill etc are called and certain
>>>>>>>>>>> metrics (like log write lock wait time) indicate heavy load. Perhaps, this
>>>>>>>>>>> "defense" already exists?
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Mon, Nov 13, 2017 at 8:38 AM, Bill Farner <wfarner@apache.org
>>>>>>>>>>> > wrote:
>>>>>>>>>>>
>>>>>>>>>>>> The next level is to determine why the storage lock is being
>>>>>>>>>>>> held.  Common causes include:
>>>>>>>>>>>>
>>>>>>>>>>>> 1. storage snapshot slowness, when scheduler state is very
>>>>>>>>>>>> large, O(gb)
>>>>>>>>>>>> 1a. long GC pauses in the scheduler, often induced by (1)
>>>>>>>>>>>> 2. scheduler replicated log on slow disks
>>>>>>>>>>>> 3. network issues between schedulers, schedulers to zookeeper,
>>>>>>>>>>>> or between zookeepers
>>>>>>>>>>>>
>>>>>>>>>>>> As an immediate (partial) remedy, i suggest you upgrade to
>>>>>>>>>>>> eliminate the use of SQL/mybatis in the scheduler.  This helped twitter
>>>>>>>>>>>> improve (1) and (1a).
>>>>>>>>>>>>
>>>>>>>>>>>> commit f2755e1
>>>>>>>>>>>> Author: Bill Farner <wf...@apache.org>
>>>>>>>>>>>> Date:   Tue Oct 24 23:34:09 2017 -0700
>>>>>>>>>>>>
>>>>>>>>>>>>     Exclusively use Map-based in-memory stores for primary
>>>>>>>>>>>> storage
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Fri, Nov 10, 2017 at 10:07 PM, Mohit Jaggi <
>>>>>>>>>>>> mohit.jaggi@uber.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> and in log_storage_write_lock_wait_ns_per_event
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Fri, Nov 10, 2017 at 9:57 PM, Mohit Jaggi <
>>>>>>>>>>>>> mohit.jaggi@uber.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Yes, I do see spikes in log_storage_write_lock_wait_ns_total.
>>>>>>>>>>>>>> Is that cause or effect? :-)
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Fri, Nov 10, 2017 at 9:34 PM, Mohit Jaggi <
>>>>>>>>>>>>>> mohit.jaggi@uber.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks Bill. Please see inline:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Fri, Nov 10, 2017 at 8:06 PM, Bill Farner <
>>>>>>>>>>>>>>> wfarner@apache.org> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I suspect they are getting enqueued
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Just to be sure - the offers do eventually get through
>>>>>>>>>>>>>>>> though?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> In one instance the offers did get through but it took
>>>>>>>>>>>>>>> several minutes. In other instances we reloaded the scheduler to let
>>>>>>>>>>>>>>> another one become the leader.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> The most likely culprit is contention for the storage write
>>>>>>>>>>>>>>>> lock,  observable via spikes in stat
>>>>>>>>>>>>>>>> log_storage_write_lock_wait_ns_total.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks. I will check that one.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I see that a lot of getJobUpdateDetails() and
>>>>>>>>>>>>>>>>> getTasksWithoutConfigs() calls are being made at that time
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> This sounds like API activity.  This shouldn't interfere
>>>>>>>>>>>>>>>> with offer processing directly, but could potentially slow down the
>>>>>>>>>>>>>>>> scheduler as a whole.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> So these won't contend for locks with offer processing and
>>>>>>>>>>>>>>> task assignment threads? Only 8-10 out of 24 cores were being used on the
>>>>>>>>>>>>>>> machine. I also noticed a spike in mybatis active and bad connections.
>>>>>>>>>>>>>>> Can't say if the spike in active is due to many bad connections or vice
>>>>>>>>>>>>>>> versa or there was a 3rd source causing both of these. Are there any
>>>>>>>>>>>>>>> metrics or logs that might help here?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I also notice a lot of "Timeout reached for task..." around
>>>>>>>>>>>>>>>>> the same time. Can this happen if task is in PENDING state and does not
>>>>>>>>>>>>>>>>> reach ASSIGNED due to lack of offers?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> This is unusual.  Pending tasks are not timed out; this
>>>>>>>>>>>>>>>> applies to tasks in states where the scheduler is waiting for something
>>>>>>>>>>>>>>>> else to act and it does not hear back (via a status update).
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Perhaps they were in ASSIGNED or some other state. If
>>>>>>>>>>>>>>> updates from Mesos are being delayed or processed too slowly both these
>>>>>>>>>>>>>>> effects will occur?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I suggest digging into the cause of delayed offer
>>>>>>>>>>>>>>>> processing first, i suspect it might be related to the task timeouts as
>>>>>>>>>>>>>>>> well.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> version close to 0.18.0
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Is the ambiguity due to custom patches?  Can you at
>>>>>>>>>>>>>>>> least indicate the last git SHA off aurora/master?  Digging much deeper in
>>>>>>>>>>>>>>>> diagnosing this may prove tricky without knowing what code is in play.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>    - c85bffd10 is the commit from which we forked.
>>>>>>>>>>>>>>>    - The custom patch is mainly the dynamic reservation
>>>>>>>>>>>>>>>    work done by Dmitri. We also have commits for offer/rescind race issue,
>>>>>>>>>>>>>>>    setrootfs patch (which is not upstreamed yet).
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I have cherrypicked the fix for Aurora-1952 as well.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Thu, Nov 9, 2017 at 9:49 PM, Mohit Jaggi <
>>>>>>>>>>>>>>>> mohit.jaggi@uber.com> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I also notice a lot of "Timeout reached for task..."
>>>>>>>>>>>>>>>>> around the same time. Can this happen if task is in PENDING state and does
>>>>>>>>>>>>>>>>> not reach ASSIGNED due to lack of offers?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Thu, Nov 9, 2017 at 4:33 PM, Mohit Jaggi <
>>>>>>>>>>>>>>>>> mohit.jaggi@uber.com> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Folks,
>>>>>>>>>>>>>>>>>> I have noticed some weird behavior in Aurora (version
>>>>>>>>>>>>>>>>>> close to 0.18.0). Sometimes, it shows no offers in the UI offers page. But
>>>>>>>>>>>>>>>>>> if I tail the logs I can see offers are coming in. I suspect they are
>>>>>>>>>>>>>>>>>> getting enqueued for processing by "executor" but stay there for a long
>>>>>>>>>>>>>>>>>> time and are not processed either due to locking or thread starvation.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I see that a lot of getJobUpdateDetails() and
>>>>>>>>>>>>>>>>>> getTasksWithoutConfigs() calls are being made at that time. Could these
>>>>>>>>>>>>>>>>>> calls starve the OfferManager (e.g. by contending for some lock)? What
>>>>>>>>>>>>>>>>>> should I be looking for to debug this condition further?
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Mohit.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

Re: Aurora pauses adding offers

Posted by Mohit Jaggi <mo...@uber.com>.
Do you mean there is a shared ZK connection for leadership and log
replication? I don't see "Lost leadership, committing suicide" during
outage. I do see it at other times.

On Wed, Nov 29, 2017 at 1:52 PM, Bill Farner <wf...@apache.org> wrote:

> - Does log replication "maintain" ZK connections and suffer when a NIC
>> flaps?
>
>
> Maintain, yes.  Shouldn't be impacted unless the ZK session expires, which
> would trigger a scheduler failover.
>
> - If only 1 of 5 ZK's have this issue, could there still be a problem?
>
>
> Assuming this means 5 quorum members - no, that should not be a problem.
>
> If any of the above became an issue for the scheduler, it should certainly
> manifest in logs.
>
>
> On Wed, Nov 29, 2017 at 1:26 PM, Mohit Jaggi <mo...@uber.com> wrote:
>
>> Thanks Bill. It would be the latter as this back-pressure is only needed
>> for calls that change state. Read only calls should be quite quick to
>> serve.
>>
>> One potential correlated outage we may have had is a NIC flap on a
>> Zookeeper node. The following questions come to mind:
>> - Does log replication "maintain" ZK connections and suffer when a NIC
>> flaps?
>> - If only 1 of 5 ZK's have this issue, could there still be a problem?
>>
>> On Wed, Nov 29, 2017 at 11:08 AM, Bill Farner <wf...@apache.org> wrote:
>>
>>> is there a place I can inject a pre-processor for the API calls
>>>
>>>
>>> There is no off-the-shelf way to do this.  You could intercept API calls
>>> as HTTP requests similar to the CORS filter
>>> <https://github.com/apache/aurora/blob/master/src/main/java/org/apache/aurora/scheduler/http/api/ApiModule.java#L73>.
>>> If you wanted to intercept specific calls and/or introspect arguments, you
>>> would be better off binding a layer for AuroraAdmin.Iface
>>> <https://github.com/apache/aurora/blob/master/src/main/java/org/apache/aurora/scheduler/thrift/ThriftModule.java#L30>
>>> .
>>>
>>> On Tue, Nov 28, 2017 at 11:46 AM, Mohit Jaggi <mo...@uber.com>
>>> wrote:
>>>
>>>> I agree with that. I also believe that the scheduler should be
>>>> resilient in the presence of external faults. Systems that export an API
>>>> must take defensive steps to protect themselves.
>>>>
>>>> If I wanted to experiment with this change without modifying Aurora
>>>> code "inline", is there a place I can inject a pre-processor for the API
>>>> calls?
>>>>
>>>> On Mon, Nov 27, 2017 at 4:59 PM, Bill Farner <wf...@apache.org>
>>>> wrote:
>>>>
>>>>> I'd also suggest focusing on the source of the congestion.  Aurora
>>>>> should offer quite high scheduling throughput, and i would rather focus
>>>>> energy on addressing bottlenecks.
>>>>>
>>>>> On Mon, Nov 27, 2017 at 1:05 PM, Mohit Jaggi <mo...@uber.com>
>>>>> wrote:
>>>>>
>>>>>> I think more explicit signaling is better. Increased latency can be
>>>>>> due to other conditions like network issues etc. Right now our mitigation
>>>>>> involves load-shedding and we would rather have load-avoidance. Indeed
>>>>>> proxy is not a good option. Only Aurora "knows" when it wants to
>>>>>> back-pressure.
>>>>>>
>>>>>> On Mon, Nov 27, 2017 at 12:58 PM, David McLaughlin <
>>>>>> dmclaughlin@apache.org> wrote:
>>>>>>
>>>>>>> Any log write latency will be reflected in the overall latency of
>>>>>>> the request. Increased request latency is one of the main ways any server
>>>>>>> has of telling a client that it's under load. It's then up to the client to
>>>>>>> react to this.
>>>>>>>
>>>>>>> If you want to throw error codes, you can put a proxy in front of
>>>>>>> Aurora that has request timeouts - which would send 503s to clients. But
>>>>>>> the issue with that is the requests are mostly non-idempotent so you'll
>>>>>>> need to build reconciliation logic into it.
>>>>>>>
>>>>>>> On Mon, Nov 27, 2017 at 12:13 PM, Mohit Jaggi <mo...@uber.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Imagine something like Spinnaker using Aurora underneath to
>>>>>>>> schedule services. That layer often "amplifies" human effort and may result
>>>>>>>> in a lot of load on Aurora. Usually that is fine but if Aurora slowed down
>>>>>>>> due to transient problems, it can signal that to upstream software in the
>>>>>>>> same way that busy web servers do during cyber Monday sales :-)
>>>>>>>>
>>>>>>>> On Mon, Nov 27, 2017 at 12:06 PM, Bill Farner <wf...@apache.org>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> I want to let upstream software "know" that Aurora is slowing down
>>>>>>>>>> and that it should back off
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Can you offer more detail about how Aurora is being used in this
>>>>>>>>> regard?  I haven't seen use cases in the past that would be amenable to
>>>>>>>>> this behavior, so i would like to understand better.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Mon, Nov 27, 2017 at 11:51 AM, Mohit Jaggi <
>>>>>>>>> mohit.jaggi@uber.com> wrote:
>>>>>>>>>
>>>>>>>>>> Thanks Bill. We haven't been able to track down a specific root
>>>>>>>>>> cause (although ZK node is known to have issues now and then but we don't
>>>>>>>>>> have logs for the specific outages we had). We will plan to move to 0.19.x
>>>>>>>>>> soon. In addition I want to let upstream software "know" that Aurora is
>>>>>>>>>> slowing down and that it should back off. To achieve this I want to send
>>>>>>>>>> 5xx error codes back when update/rollback/kill etc are called and certain
>>>>>>>>>> metrics (like log write lock wait time) indicate heavy load. Perhaps, this
>>>>>>>>>> "defense" already exists?
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Mon, Nov 13, 2017 at 8:38 AM, Bill Farner <wf...@apache.org>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> The next level is to determine why the storage lock is being
>>>>>>>>>>> held.  Common causes include:
>>>>>>>>>>>
>>>>>>>>>>> 1. storage snapshot slowness, when scheduler state is very
>>>>>>>>>>> large, O(gb)
>>>>>>>>>>> 1a. long GC pauses in the scheduler, often induced by (1)
>>>>>>>>>>> 2. scheduler replicated log on slow disks
>>>>>>>>>>> 3. network issues between schedulers, schedulers to zookeeper,
>>>>>>>>>>> or between zookeepers
>>>>>>>>>>>
>>>>>>>>>>> As an immediate (partial) remedy, i suggest you upgrade to
>>>>>>>>>>> eliminate the use of SQL/mybatis in the scheduler.  This helped twitter
>>>>>>>>>>> improve (1) and (1a).
>>>>>>>>>>>
>>>>>>>>>>> commit f2755e1
>>>>>>>>>>> Author: Bill Farner <wf...@apache.org>
>>>>>>>>>>> Date:   Tue Oct 24 23:34:09 2017 -0700
>>>>>>>>>>>
>>>>>>>>>>>     Exclusively use Map-based in-memory stores for primary
>>>>>>>>>>> storage
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Fri, Nov 10, 2017 at 10:07 PM, Mohit Jaggi <
>>>>>>>>>>> mohit.jaggi@uber.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> and in log_storage_write_lock_wait_ns_per_event
>>>>>>>>>>>>
>>>>>>>>>>>> On Fri, Nov 10, 2017 at 9:57 PM, Mohit Jaggi <
>>>>>>>>>>>> mohit.jaggi@uber.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Yes, I do see spikes in log_storage_write_lock_wait_ns_total.
>>>>>>>>>>>>> Is that cause or effect? :-)
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Fri, Nov 10, 2017 at 9:34 PM, Mohit Jaggi <
>>>>>>>>>>>>> mohit.jaggi@uber.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks Bill. Please see inline:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Fri, Nov 10, 2017 at 8:06 PM, Bill Farner <
>>>>>>>>>>>>>> wfarner@apache.org> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I suspect they are getting enqueued
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Just to be sure - the offers do eventually get through
>>>>>>>>>>>>>>> though?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>> In one instance the offers did get through but it took
>>>>>>>>>>>>>> several minutes. In other instances we reloaded the scheduler to let
>>>>>>>>>>>>>> another one become the leader.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> The most likely culprit is contention for the storage write
>>>>>>>>>>>>>>> lock,  observable via spikes in stat
>>>>>>>>>>>>>>> log_storage_write_lock_wait_ns_total.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks. I will check that one.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I see that a lot of getJobUpdateDetails() and
>>>>>>>>>>>>>>>> getTasksWithoutConfigs() calls are being made at that time
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> This sounds like API activity.  This shouldn't interfere
>>>>>>>>>>>>>>> with offer processing directly, but could potentially slow down the
>>>>>>>>>>>>>>> scheduler as a whole.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>> So these won't contend for locks with offer processing and
>>>>>>>>>>>>>> task assignment threads? Only 8-10 out of 24 cores were being used on the
>>>>>>>>>>>>>> machine. I also noticed a spike in mybatis active and bad connections.
>>>>>>>>>>>>>> Can't say if the spike in active is due to many bad connections or vice
>>>>>>>>>>>>>> versa or there was a 3rd source causing both of these. Are there any
>>>>>>>>>>>>>> metrics or logs that might help here?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I also notice a lot of "Timeout reached for task..." around
>>>>>>>>>>>>>>>> the same time. Can this happen if task is in PENDING state and does not
>>>>>>>>>>>>>>>> reach ASSIGNED due to lack of offers?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> This is unusual.  Pending tasks are not timed out; this
>>>>>>>>>>>>>>> applies to tasks in states where the scheduler is waiting for something
>>>>>>>>>>>>>>> else to act and it does not hear back (via a status update).
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Perhaps they were in ASSIGNED or some other state. If updates
>>>>>>>>>>>>>> from Mesos are being delayed or processed too slowly both these effects
>>>>>>>>>>>>>> will occur?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I suggest digging into the cause of delayed offer processing
>>>>>>>>>>>>>>> first, i suspect it might be related to the task timeouts as well.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> version close to 0.18.0
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Is the ambiguity due to custom patches?  Can you at least
>>>>>>>>>>>>>>> indicate the last git SHA off aurora/master?  Digging much deeper in
>>>>>>>>>>>>>>> diagnosing this may prove tricky without knowing what code is in play.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>    - c85bffd10 is the commit from which we forked.
>>>>>>>>>>>>>>    - The custom patch is mainly the dynamic reservation work
>>>>>>>>>>>>>>    done by Dmitri. We also have commits for offer/rescind race issue,
>>>>>>>>>>>>>>    setrootfs patch (which is not upstreamed yet).
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I have cherrypicked the fix for Aurora-1952 as well.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Thu, Nov 9, 2017 at 9:49 PM, Mohit Jaggi <
>>>>>>>>>>>>>>> mohit.jaggi@uber.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I also notice a lot of "Timeout reached for task..." around
>>>>>>>>>>>>>>>> the same time. Can this happen if task is in PENDING state and does not
>>>>>>>>>>>>>>>> reach ASSIGNED due to lack of offers?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Thu, Nov 9, 2017 at 4:33 PM, Mohit Jaggi <
>>>>>>>>>>>>>>>> mohit.jaggi@uber.com> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Folks,
>>>>>>>>>>>>>>>>> I have noticed some weird behavior in Aurora (version
>>>>>>>>>>>>>>>>> close to 0.18.0). Sometimes, it shows no offers in the UI offers page. But
>>>>>>>>>>>>>>>>> if I tail the logs I can see offers are coming in. I suspect they are
>>>>>>>>>>>>>>>>> getting enqueued for processing by "executor" but stay there for a long
>>>>>>>>>>>>>>>>> time and are not processed either due to locking or thread starvation.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I see that a lot of getJobUpdateDetails() and
>>>>>>>>>>>>>>>>> getTasksWithoutConfigs() calls are being made at that time. Could these
>>>>>>>>>>>>>>>>> calls starve the OfferManager (e.g. by contending for some lock)? What
>>>>>>>>>>>>>>>>> should I be looking for to debug this condition further?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Mohit.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

Re: Aurora pauses adding offers

Posted by Bill Farner <wf...@apache.org>.
>
> - Does log replication "maintain" ZK connections and suffer when a NIC
> flaps?


Maintain, yes.  Shouldn't be impacted unless the ZK session expires, which
would trigger a scheduler failover.

- If only 1 of 5 ZK's have this issue, could there still be a problem?


Assuming this means 5 quorum members - no, that should not be a problem.

If any of the above became an issue for the scheduler, it should certainly
manifest in logs.


On Wed, Nov 29, 2017 at 1:26 PM, Mohit Jaggi <mo...@uber.com> wrote:

> Thanks Bill. It would be the latter as this back-pressure is only needed
> for calls that change state. Read only calls should be quite quick to
> serve.
>
> One potential correlated outage we may have had is a NIC flap on a
> Zookeeper node. The following questions come to mind:
> - Does log replication "maintain" ZK connections and suffer when a NIC
> flaps?
> - If only 1 of 5 ZK's have this issue, could there still be a problem?
>
> On Wed, Nov 29, 2017 at 11:08 AM, Bill Farner <wf...@apache.org> wrote:
>
>> is there a place I can inject a pre-processor for the API calls
>>
>>
>> There is no off-the-shelf way to do this.  You could intercept API calls
>> as HTTP requests similar to the CORS filter
>> <https://github.com/apache/aurora/blob/master/src/main/java/org/apache/aurora/scheduler/http/api/ApiModule.java#L73>.
>> If you wanted to intercept specific calls and/or introspect arguments, you
>> would be better off binding a layer for AuroraAdmin.Iface
>> <https://github.com/apache/aurora/blob/master/src/main/java/org/apache/aurora/scheduler/thrift/ThriftModule.java#L30>
>> .
>>
>> On Tue, Nov 28, 2017 at 11:46 AM, Mohit Jaggi <mo...@uber.com>
>> wrote:
>>
>>> I agree with that. I also believe that the scheduler should be resilient
>>> in the presence of external faults. Systems that export an API must take
>>> defensive steps to protect themselves.
>>>
>>> If I wanted to experiment with this change without modifying Aurora code
>>> "inline", is there a place I can inject a pre-processor for the API calls?
>>>
>>> On Mon, Nov 27, 2017 at 4:59 PM, Bill Farner <wf...@apache.org> wrote:
>>>
>>>> I'd also suggest focusing on the source of the congestion.  Aurora
>>>> should offer quite high scheduling throughput, and i would rather focus
>>>> energy on addressing bottlenecks.
>>>>
>>>> On Mon, Nov 27, 2017 at 1:05 PM, Mohit Jaggi <mo...@uber.com>
>>>> wrote:
>>>>
>>>>> I think more explicit signaling is better. Increased latency can be
>>>>> due to other conditions like network issues etc. Right now our mitigation
>>>>> involves load-shedding and we would rather have load-avoidance. Indeed
>>>>> proxy is not a good option. Only Aurora "knows" when it wants to
>>>>> back-pressure.
>>>>>
>>>>> On Mon, Nov 27, 2017 at 12:58 PM, David McLaughlin <
>>>>> dmclaughlin@apache.org> wrote:
>>>>>
>>>>>> Any log write latency will be reflected in the overall latency of the
>>>>>> request. Increased request latency is one of the main ways any server has
>>>>>> of telling a client that it's under load. It's then up to the client to
>>>>>> react to this.
>>>>>>
>>>>>> If you want to throw error codes, you can put a proxy in front of
>>>>>> Aurora that has request timeouts - which would send 503s to clients. But
>>>>>> the issue with that is the requests are mostly non-idempotent so you'll
>>>>>> need to build reconciliation logic into it.
>>>>>>
>>>>>> On Mon, Nov 27, 2017 at 12:13 PM, Mohit Jaggi <mo...@uber.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Imagine something like Spinnaker using Aurora underneath to schedule
>>>>>>> services. That layer often "amplifies" human effort and may result in a lot
>>>>>>> of load on Aurora. Usually that is fine but if Aurora slowed down due to
>>>>>>> transient problems, it can signal that to upstream software in the same way
>>>>>>> that busy web servers do during cyber Monday sales :-)
>>>>>>>
>>>>>>> On Mon, Nov 27, 2017 at 12:06 PM, Bill Farner <wf...@apache.org>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> I want to let upstream software "know" that Aurora is slowing down
>>>>>>>>> and that it should back off
>>>>>>>>
>>>>>>>>
>>>>>>>> Can you offer more detail about how Aurora is being used in this
>>>>>>>> regard?  I haven't seen use cases in the past that would be amenable to
>>>>>>>> this behavior, so i would like to understand better.
>>>>>>>>
>>>>>>>>
>>>>>>>> On Mon, Nov 27, 2017 at 11:51 AM, Mohit Jaggi <mohit.jaggi@uber.com
>>>>>>>> > wrote:
>>>>>>>>
>>>>>>>>> Thanks Bill. We haven't been able to track down a specific root
>>>>>>>>> cause (although ZK node is known to have issues now and then but we don't
>>>>>>>>> have logs for the specific outages we had). We will plan to move to 0.19.x
>>>>>>>>> soon. In addition I want to let upstream software "know" that Aurora is
>>>>>>>>> slowing down and that it should back off. To achieve this I want to send
>>>>>>>>> 5xx error codes back when update/rollback/kill etc are called and certain
>>>>>>>>> metrics (like log write lock wait time) indicate heavy load. Perhaps, this
>>>>>>>>> "defense" already exists?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Mon, Nov 13, 2017 at 8:38 AM, Bill Farner <wf...@apache.org>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> The next level is to determine why the storage lock is being
>>>>>>>>>> held.  Common causes include:
>>>>>>>>>>
>>>>>>>>>> 1. storage snapshot slowness, when scheduler state is very large,
>>>>>>>>>> O(gb)
>>>>>>>>>> 1a. long GC pauses in the scheduler, often induced by (1)
>>>>>>>>>> 2. scheduler replicated log on slow disks
>>>>>>>>>> 3. network issues between schedulers, schedulers to zookeeper, or
>>>>>>>>>> between zookeepers
>>>>>>>>>>
>>>>>>>>>> As an immediate (partial) remedy, i suggest you upgrade to
>>>>>>>>>> eliminate the use of SQL/mybatis in the scheduler.  This helped twitter
>>>>>>>>>> improve (1) and (1a).
>>>>>>>>>>
>>>>>>>>>> commit f2755e1
>>>>>>>>>> Author: Bill Farner <wf...@apache.org>
>>>>>>>>>> Date:   Tue Oct 24 23:34:09 2017 -0700
>>>>>>>>>>
>>>>>>>>>>     Exclusively use Map-based in-memory stores for primary storage
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Fri, Nov 10, 2017 at 10:07 PM, Mohit Jaggi <
>>>>>>>>>> mohit.jaggi@uber.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> and in log_storage_write_lock_wait_ns_per_event
>>>>>>>>>>>
>>>>>>>>>>> On Fri, Nov 10, 2017 at 9:57 PM, Mohit Jaggi <
>>>>>>>>>>> mohit.jaggi@uber.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Yes, I do see spikes in log_storage_write_lock_wait_ns_total.
>>>>>>>>>>>> Is that cause or effect? :-)
>>>>>>>>>>>>
>>>>>>>>>>>> On Fri, Nov 10, 2017 at 9:34 PM, Mohit Jaggi <
>>>>>>>>>>>> mohit.jaggi@uber.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks Bill. Please see inline:
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Fri, Nov 10, 2017 at 8:06 PM, Bill Farner <
>>>>>>>>>>>>> wfarner@apache.org> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> I suspect they are getting enqueued
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Just to be sure - the offers do eventually get through though?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>> In one instance the offers did get through but it took several
>>>>>>>>>>>>> minutes. In other instances we reloaded the scheduler to let another one
>>>>>>>>>>>>> become the leader.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>> The most likely culprit is contention for the storage write
>>>>>>>>>>>>>> lock,  observable via spikes in stat
>>>>>>>>>>>>>> log_storage_write_lock_wait_ns_total.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks. I will check that one.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I see that a lot of getJobUpdateDetails() and
>>>>>>>>>>>>>>> getTasksWithoutConfigs() calls are being made at that time
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> This sounds like API activity.  This shouldn't interfere with
>>>>>>>>>>>>>> offer processing directly, but could potentially slow down the scheduler as
>>>>>>>>>>>>>> a whole.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>> So these won't contend for locks with offer processing and
>>>>>>>>>>>>> task assignment threads? Only 8-10 out of 24 cores were being used on the
>>>>>>>>>>>>> machine. I also noticed a spike in mybatis active and bad connections.
>>>>>>>>>>>>> Can't say if the spike in active is due to many bad connections or vice
>>>>>>>>>>>>> versa or there was a 3rd source causing both of these. Are there any
>>>>>>>>>>>>> metrics or logs that might help here?
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>> I also notice a lot of "Timeout reached for task..." around
>>>>>>>>>>>>>>> the same time. Can this happen if task is in PENDING state and does not
>>>>>>>>>>>>>>> reach ASSIGNED due to lack of offers?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> This is unusual.  Pending tasks are not timed out; this
>>>>>>>>>>>>>> applies to tasks in states where the scheduler is waiting for something
>>>>>>>>>>>>>> else to act and it does not hear back (via a status update).
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Perhaps they were in ASSIGNED or some other state. If updates
>>>>>>>>>>>>> from Mesos are being delayed or processed too slowly both these effects
>>>>>>>>>>>>> will occur?
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I suggest digging into the cause of delayed offer processing
>>>>>>>>>>>>>> first, i suspect it might be related to the task timeouts as well.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> version close to 0.18.0
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Is the ambiguity due to custom patches?  Can you at least
>>>>>>>>>>>>>> indicate the last git SHA off aurora/master?  Digging much deeper in
>>>>>>>>>>>>>> diagnosing this may prove tricky without knowing what code is in play.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>    - c85bffd10 is the commit from which we forked.
>>>>>>>>>>>>>    - The custom patch is mainly the dynamic reservation work
>>>>>>>>>>>>>    done by Dmitri. We also have commits for offer/rescind race issue,
>>>>>>>>>>>>>    setrootfs patch (which is not upstreamed yet).
>>>>>>>>>>>>>
>>>>>>>>>>>>> I have cherrypicked the fix for Aurora-1952 as well.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Thu, Nov 9, 2017 at 9:49 PM, Mohit Jaggi <
>>>>>>>>>>>>>> mohit.jaggi@uber.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I also notice a lot of "Timeout reached for task..." around
>>>>>>>>>>>>>>> the same time. Can this happen if task is in PENDING state and does not
>>>>>>>>>>>>>>> reach ASSIGNED due to lack of offers?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Thu, Nov 9, 2017 at 4:33 PM, Mohit Jaggi <
>>>>>>>>>>>>>>> mohit.jaggi@uber.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Folks,
>>>>>>>>>>>>>>>> I have noticed some weird behavior in Aurora (version close
>>>>>>>>>>>>>>>> to 0.18.0). Sometimes, it shows no offers in the UI offers page. But if I
>>>>>>>>>>>>>>>> tail the logs I can see offers are coming in. I suspect they are getting
>>>>>>>>>>>>>>>> enqueued for processing by "executor" but stay there for a long time and
>>>>>>>>>>>>>>>> are not processed either due to locking or thread starvation.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I see that a lot of getJobUpdateDetails() and
>>>>>>>>>>>>>>>> getTasksWithoutConfigs() calls are being made at that time. Could these
>>>>>>>>>>>>>>>> calls starve the OfferManager (e.g. by contending for some lock)? What
>>>>>>>>>>>>>>>> should I be looking for to debug this condition further?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Mohit.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

Re: Aurora pauses adding offers

Posted by Mohit Jaggi <mo...@uber.com>.
Thanks Bill. It would be the latter as this back-pressure is only needed
for calls that change state. Read only calls should be quite quick to
serve.
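
Roughly what I have in mind, as a sketch only (this is not existing Aurora
code; the isMutatingCall predicate and the metric supplier are placeholders
I would still have to wire up): an HTTP filter in front of the API that sheds
only state-changing calls (update/rollback/kill etc) with a 503 once a
write-lock wait gauge crosses a threshold, and leaves read-only calls alone.

import java.io.IOException;
import java.util.function.LongSupplier;
import java.util.function.Predicate;

import javax.servlet.Filter;
import javax.servlet.FilterChain;
import javax.servlet.FilterConfig;
import javax.servlet.ServletException;
import javax.servlet.ServletRequest;
import javax.servlet.ServletResponse;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

// Sketch: sheds state-changing API traffic with a 503 while a load gauge
// (e.g. a reading derived from log_storage_write_lock_wait_ns_total) is
// above a threshold. Read-only calls always pass through.
public class BackPressureFilter implements Filter {

  private final Predicate<HttpServletRequest> isMutatingCall; // placeholder: recognize state-changing calls
  private final LongSupplier writeLockWaitNs;                 // placeholder: fed from scheduler stats
  private final long thresholdNs;

  public BackPressureFilter(
      Predicate<HttpServletRequest> isMutatingCall,
      LongSupplier writeLockWaitNs,
      long thresholdNs) {
    this.isMutatingCall = isMutatingCall;
    this.writeLockWaitNs = writeLockWaitNs;
    this.thresholdNs = thresholdNs;
  }

  @Override
  public void doFilter(ServletRequest req, ServletResponse resp, FilterChain chain)
      throws IOException, ServletException {
    HttpServletRequest httpReq = (HttpServletRequest) req;
    HttpServletResponse httpResp = (HttpServletResponse) resp;

    // Only shed calls that change state; reads are cheap to serve.
    if (isMutatingCall.test(httpReq) && writeLockWaitNs.getAsLong() > thresholdNs) {
      httpResp.setHeader("Retry-After", "5");
      httpResp.sendError(HttpServletResponse.SC_SERVICE_UNAVAILABLE,
          "Scheduler under heavy load, back off and retry");
      return;
    }
    chain.doFilter(req, resp);
  }

  @Override public void init(FilterConfig config) {}
  @Override public void destroy() {}
}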

One potential correlated outage we may have had is a NIC flap on a
Zookeeper node. The following questions come to mind:
- Does log replication "maintain" ZK connections and suffer when a NIC
flaps?
- If only 1 of 5 ZK's have this issue, could there still be a problem?

On Wed, Nov 29, 2017 at 11:08 AM, Bill Farner <wf...@apache.org> wrote:

> is there a place I can inject a pre-processor for the API calls
>
>
> There is no off-the-shelf way to do this.  You could intercept API calls
> as HTTP requests similar to the CORS filter
> <https://github.com/apache/aurora/blob/master/src/main/java/org/apache/aurora/scheduler/http/api/ApiModule.java#L73>.
> If you wanted to intercept specific calls and/or introspect arguments, you
> would be better off binding a layer for AuroraAdmin.Iface
> <https://github.com/apache/aurora/blob/master/src/main/java/org/apache/aurora/scheduler/thrift/ThriftModule.java#L30>
> .
>
> On Tue, Nov 28, 2017 at 11:46 AM, Mohit Jaggi <mo...@uber.com>
> wrote:
>
>> I agree with that. I also believe that the scheduler should be resilient
>> in the presence of external faults. Systems that export an API must take
>> defensive steps to protect themselves.
>>
>> If I wanted to experiment with this change without modifying Aurora code
>> "inline", is there a place I can inject a pre-processor for the API calls?
>>
>> On Mon, Nov 27, 2017 at 4:59 PM, Bill Farner <wf...@apache.org> wrote:
>>
>>> I'd also suggest focusing on the source of the congestion.  Aurora
>>> should offer quite high scheduling throughput, and i would rather focus
>>> energy on addressing bottlenecks.
>>>
>>> On Mon, Nov 27, 2017 at 1:05 PM, Mohit Jaggi <mo...@uber.com>
>>> wrote:
>>>
>>>> I think more explicit signaling is better. Increased latency can be due
>>>> to other conditions like network issues etc. Right now our mitigation
>>>> involves load-shedding and we would rather have load-avoidance. Indeed
>>>> proxy is not a good option. Only Aurora "knows" when it wants to
>>>> back-pressure.
>>>>
>>>> On Mon, Nov 27, 2017 at 12:58 PM, David McLaughlin <
>>>> dmclaughlin@apache.org> wrote:
>>>>
>>>>> Any log write latency will be reflected in the overall latency of the
>>>>> request. Increased request latency is one of the main ways any server has
>>>>> of telling a client that it's under load. It's then up to the client to
>>>>> react to this.
>>>>>
>>>>> If you want to throw error codes, you can put a proxy in front of
>>>>> Aurora that has request timeouts - which would send 503s to clients. But
>>>>> the issue with that is the requests are mostly non-idempotent so you'll
>>>>> need to build reconciliation logic into it.
>>>>>
>>>>> On Mon, Nov 27, 2017 at 12:13 PM, Mohit Jaggi <mo...@uber.com>
>>>>> wrote:
>>>>>
>>>>>> Imagine something like Spinnaker using Aurora underneath to schedule
>>>>>> services. That layer often "amplifies" human effort and may result in a lot
>>>>>> of load on Aurora. Usually that is fine but if Aurora slowed down due to
>>>>>> transient problems, it can signal that to upstream software in the same way
>>>>>> that busy web servers do during cyber Monday sales :-)
>>>>>>
>>>>>> On Mon, Nov 27, 2017 at 12:06 PM, Bill Farner <wf...@apache.org>
>>>>>> wrote:
>>>>>>
>>>>>>> I want to let upstream software "know" that Aurora is slowing down
>>>>>>>> and that it should back off
>>>>>>>
>>>>>>>
>>>>>>> Can you offer more detail about how Aurora is being used in this
>>>>>>> regard?  I haven't seen use cases in the past that would be amenable to
>>>>>>> this behavior, so i would like to understand better.
>>>>>>>
>>>>>>>
>>>>>>> On Mon, Nov 27, 2017 at 11:51 AM, Mohit Jaggi <mo...@uber.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Thanks Bill. We haven't been able to track down a specific root
>>>>>>>> cause (although ZK node is known to have issues now and then but we don't
>>>>>>>> have logs for the specific outages we had). We will plan to move to 0.19.x
>>>>>>>> soon. In addition I want to let upstream software "know" that Aurora is
>>>>>>>> slowing down and that it should back off. To achieve this I want to send
>>>>>>>> 5xx error codes back when update/rollback/kill etc are called and certain
>>>>>>>> metrics (like log write lock wait time) indicate heavy load. Perhaps, this
>>>>>>>> "defense" already exists?
>>>>>>>>
>>>>>>>>
>>>>>>>> On Mon, Nov 13, 2017 at 8:38 AM, Bill Farner <wf...@apache.org>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> The next level is to determine why the storage lock is being
>>>>>>>>> held.  Common causes include:
>>>>>>>>>
>>>>>>>>> 1. storage snapshot slowness, when scheduler state is very large,
>>>>>>>>> O(gb)
>>>>>>>>> 1a. long GC pauses in the scheduler, often induced by (1)
>>>>>>>>> 2. scheduler replicated log on slow disks
>>>>>>>>> 3. network issues between schedulers, schedulers to zookeeper, or
>>>>>>>>> between zookeepers
>>>>>>>>>
>>>>>>>>> As an immediate (partial) remedy, i suggest you upgrade to
>>>>>>>>> eliminate the use of SQL/mybatis in the scheduler.  This helped twitter
>>>>>>>>> improve (1) and (1a).
>>>>>>>>>
>>>>>>>>> commit f2755e1
>>>>>>>>> Author: Bill Farner <wf...@apache.org>
>>>>>>>>> Date:   Tue Oct 24 23:34:09 2017 -0700
>>>>>>>>>
>>>>>>>>>     Exclusively use Map-based in-memory stores for primary storage
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Fri, Nov 10, 2017 at 10:07 PM, Mohit Jaggi <
>>>>>>>>> mohit.jaggi@uber.com> wrote:
>>>>>>>>>
>>>>>>>>>> and in log_storage_write_lock_wait_ns_per_event
>>>>>>>>>>
>>>>>>>>>> On Fri, Nov 10, 2017 at 9:57 PM, Mohit Jaggi <
>>>>>>>>>> mohit.jaggi@uber.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Yes, I do see spikes in log_storage_write_lock_wait_ns_total.
>>>>>>>>>>> Is that cause or effect? :-)
>>>>>>>>>>>
>>>>>>>>>>> On Fri, Nov 10, 2017 at 9:34 PM, Mohit Jaggi <
>>>>>>>>>>> mohit.jaggi@uber.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Thanks Bill. Please see inline:
>>>>>>>>>>>>
>>>>>>>>>>>> On Fri, Nov 10, 2017 at 8:06 PM, Bill Farner <
>>>>>>>>>>>> wfarner@apache.org> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> I suspect they are getting enqueued
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Just to be sure - the offers do eventually get through though?
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>> In one instance the offers did get through but it took several
>>>>>>>>>>>> minutes. In other instances we reloaded the scheduler to let another one
>>>>>>>>>>>> become the leader.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>> The most likely culprit is contention for the storage write
>>>>>>>>>>>>> lock,  observable via spikes in stat
>>>>>>>>>>>>> log_storage_write_lock_wait_ns_total.
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks. I will check that one.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> I see that a lot of getJobUpdateDetails() and
>>>>>>>>>>>>>> getTasksWithoutConfigs() calls are being made at that time
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> This sounds like API activity.  This shouldn't interfere with
>>>>>>>>>>>>> offer processing directly, but could potentially slow down the scheduler as
>>>>>>>>>>>>> a whole.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>> So these won't contend for locks with offer processing and task
>>>>>>>>>>>> assignment threads? Only 8-10 out of 24 cores were being used on the
>>>>>>>>>>>> machine. I also noticed a spike in mybatis active and bad connections.
>>>>>>>>>>>> Can't say if the spike in active is due to many bad connections or vice
>>>>>>>>>>>> versa or there was a 3rd source causing both of these. Are there any
>>>>>>>>>>>> metrics or logs that might help here?
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>> I also notice a lot of "Timeout reached for task..." around
>>>>>>>>>>>>>> the same time. Can this happen if task is in PENDING state and does not
>>>>>>>>>>>>>> reach ASSIGNED due to lack of offers?
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> This is unusual.  Pending tasks are not timed out; this
>>>>>>>>>>>>> applies to tasks in states where the scheduler is waiting for something
>>>>>>>>>>>>> else to act and it does not hear back (via a status update).
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Perhaps they were in ASSIGNED or some other state. If updates
>>>>>>>>>>>> from Mesos are being delayed or processed too slowly both these effects
>>>>>>>>>>>> will occur?
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> I suggest digging into the cause of delayed offer processing
>>>>>>>>>>>>> first, i suspect it might be related to the task timeouts as well.
>>>>>>>>>>>>>
>>>>>>>>>>>>> version close to 0.18.0
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Is the ambiguity due to custom patches?  Can you at least
>>>>>>>>>>>>> indicate the last git SHA off aurora/master?  Digging much deeper in
>>>>>>>>>>>>> diagnosing this may prove tricky without knowing what code is in play.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>    - c85bffd10 is the commit from which we forked.
>>>>>>>>>>>>    - The custom patch is mainly the dynamic reservation work
>>>>>>>>>>>>    done by Dmitri. We also have commits for offer/rescind race issue,
>>>>>>>>>>>>    setrootfs patch (which is not upstreamed yet).
>>>>>>>>>>>>
>>>>>>>>>>>> I have cherrypicked the fix for Aurora-1952 as well.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>> On Thu, Nov 9, 2017 at 9:49 PM, Mohit Jaggi <
>>>>>>>>>>>>> mohit.jaggi@uber.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> I also notice a lot of "Timeout reached for task..." around
>>>>>>>>>>>>>> the same time. Can this happen if task is in PENDING state and does not
>>>>>>>>>>>>>> reach ASSIGNED due to lack of offers?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Thu, Nov 9, 2017 at 4:33 PM, Mohit Jaggi <
>>>>>>>>>>>>>> mohit.jaggi@uber.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Folks,
>>>>>>>>>>>>>>> I have noticed some weird behavior in Aurora (version close
>>>>>>>>>>>>>>> to 0.18.0). Sometimes, it shows no offers in the UI offers page. But if I
>>>>>>>>>>>>>>> tail the logs I can see offers are coming in. I suspect they are getting
>>>>>>>>>>>>>>> enqueued for processing by "executor" but stay there for a long time and
>>>>>>>>>>>>>>> are not processed either due to locking or thread starvation.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I see that a lot of getJobUpdateDetails() and
>>>>>>>>>>>>>>> getTasksWithoutConfigs() calls are being made at that time. Could these
>>>>>>>>>>>>>>> calls starve the OfferManager (e.g. by contending for some lock)? What
>>>>>>>>>>>>>>> should I be looking for to debug this condition further?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Mohit.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

Re: Aurora pauses adding offers

Posted by Bill Farner <wf...@apache.org>.
>
> is there a place I can inject a pre-processor for the API calls


There is no off-the-shelf way to do this.  You could intercept API calls as
HTTP requests similar to the CORS filter
<https://github.com/apache/aurora/blob/master/src/main/java/org/apache/aurora/scheduler/http/api/ApiModule.java#L73>.
If you wanted to intercept specific calls and/or introspect arguments, you
would be better off binding a layer for AuroraAdmin.Iface
<https://github.com/apache/aurora/blob/master/src/main/java/org/apache/aurora/scheduler/thrift/ThriftModule.java#L30>
.
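
To make the second option concrete, here is an untested sketch (the hook and
the way you obtain the currently bound implementation are placeholders, not
existing Aurora code): wrap the bound AuroraAdmin.Iface in a JDK dynamic
proxy so a pre-processor runs before every RPC, then bind the wrapper in
place of the original.

import java.lang.reflect.InvocationHandler;
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Method;
import java.lang.reflect.Proxy;
import java.util.function.BiConsumer;

// Sketch: wraps any interface implementation so a hook runs before each
// call, e.g. to reject or delay requests while the scheduler is overloaded.
public final class PreProcessingProxy {

  private PreProcessingProxy() {}

  @SuppressWarnings("unchecked")
  public static <T> T wrap(
      Class<T> iface,                            // e.g. AuroraAdmin.Iface.class
      T delegate,                                // the implementation currently bound
      BiConsumer<Method, Object[]> preProcess) { // hook: sees RPC name and arguments, may throw
    InvocationHandler handler = (proxy, method, args) -> {
      preProcess.accept(method, args);
      try {
        return method.invoke(delegate, args);
      } catch (InvocationTargetException e) {
        throw e.getCause();                      // unwrap so callers see the original exception
      }
    };
    return (T) Proxy.newProxyInstance(iface.getClassLoader(), new Class<?>[] {iface}, handler);
  }
}

In the Guice module you would then bind AuroraAdmin.Iface to the wrapped
instance instead of the original; the exact binding shape depends on how
ThriftModule is wired in the version you are running.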

On Tue, Nov 28, 2017 at 11:46 AM, Mohit Jaggi <mo...@uber.com> wrote:

> I agree with that. I also believe that the scheduler should be resilient
> in the presence of external faults. Systems that export an API must take
> defensive steps to protect themselves.
>
> If I wanted to experiment with this change without modifying Aurora code
> "inline", is there a place I can inject a pre-processor for the API calls?
>
> On Mon, Nov 27, 2017 at 4:59 PM, Bill Farner <wf...@apache.org> wrote:
>
>> I'd also suggest focusing on the source of the congestion.  Aurora should
>> offer quite high scheduling throughput, and i would rather focus energy on
>> addressing bottlenecks.
>>
>> On Mon, Nov 27, 2017 at 1:05 PM, Mohit Jaggi <mo...@uber.com>
>> wrote:
>>
>>> I think more explicit signaling is better. Increased latency can be due
>>> to other conditions like network issues etc. Right now our mitigation
>>> involves load-shedding and we would rather have load-avoidance. Indeed
>>> proxy is not a good option. Only Aurora "knows" when it wants to
>>> back-pressure.
>>>
>>> On Mon, Nov 27, 2017 at 12:58 PM, David McLaughlin <
>>> dmclaughlin@apache.org> wrote:
>>>
>>>> Any log write latency will be reflected in the overall latency of the
>>>> request. Increased request latency is one of the main ways any server has
>>>> of telling a client that it's under load. It's then up to the client to
>>>> react to this.
>>>>
>>>> If you want to throw error codes, you can put a proxy in front of
>>>> Aurora that has request timeouts - which would send 503s to clients. But
>>>> the issue with that is the requests are mostly non-idempotent so you'll
>>>> need to build reconciliation logic into it.
>>>>
>>>> On Mon, Nov 27, 2017 at 12:13 PM, Mohit Jaggi <mo...@uber.com>
>>>> wrote:
>>>>
>>>>> Imagine something like Spinnaker using Aurora underneath to schedule
>>>>> services. That layer often "amplifies" human effort and may result in a lot
>>>>> of load on Aurora. Usually that is fine but if Aurora slowed down due to
>>>>> transient problems, it can signal that to upstream software in the same way
>>>>> that busy web servers do during cyber Monday sales :-)
>>>>>
>>>>> On Mon, Nov 27, 2017 at 12:06 PM, Bill Farner <wf...@apache.org>
>>>>> wrote:
>>>>>
>>>>>> I want to let upstream software "know" that Aurora is slowing down
>>>>>>> and that it should back off
>>>>>>
>>>>>>
>>>>>> Can you offer more detail about how Aurora is being used in this
>>>>>> regard?  I haven't seen use cases in the past that would be amenable to
>>>>>> this behavior, so i would like to understand better.
>>>>>>
>>>>>>
>>>>>> On Mon, Nov 27, 2017 at 11:51 AM, Mohit Jaggi <mo...@uber.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Thanks Bill. We haven't been able to track down a specific root
>>>>>>> cause (although ZK node is known to have issues now and then but we don't
>>>>>>> have logs for the specific outages we had). We will plan to move to 0.19.x
>>>>>>> soon. In addition I want to let upstream software "know" that Aurora is
>>>>>>> slowing down and that it should back off. To achieve this I want to send
>>>>>>> 5xx error codes back when update/rollback/kill etc are called and certain
>>>>>>> metrics (like log write lock wait time) indicate heavy load. Perhaps, this
>>>>>>> "defense" already exists?
>>>>>>>
>>>>>>>
>>>>>>> On Mon, Nov 13, 2017 at 8:38 AM, Bill Farner <wf...@apache.org>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> The next level is to determine why the storage lock is being held.
>>>>>>>> Common causes include:
>>>>>>>>
>>>>>>>> 1. storage snapshot slowness, when scheduler state is very large,
>>>>>>>> O(gb)
>>>>>>>> 1a. long GC pauses in the scheduler, often induced by (1)
>>>>>>>> 2. scheduler replicated log on slow disks
>>>>>>>> 3. network issues between schedulers, schedulers to zookeeper, or
>>>>>>>> between zookeepers
>>>>>>>>
>>>>>>>> As an immediate (partial) remedy, i suggest you upgrade to
>>>>>>>> eliminate the use of SQL/mybatis in the scheduler.  This helped twitter
>>>>>>>> improve (1) and (1a).
>>>>>>>>
>>>>>>>> commit f2755e1
>>>>>>>> Author: Bill Farner <wf...@apache.org>
>>>>>>>> Date:   Tue Oct 24 23:34:09 2017 -0700
>>>>>>>>
>>>>>>>>     Exclusively use Map-based in-memory stores for primary storage
>>>>>>>>
>>>>>>>>
>>>>>>>> On Fri, Nov 10, 2017 at 10:07 PM, Mohit Jaggi <mohit.jaggi@uber.com
>>>>>>>> > wrote:
>>>>>>>>
>>>>>>>>> and in log_storage_write_lock_wait_ns_per_event
>>>>>>>>>
>>>>>>>>> On Fri, Nov 10, 2017 at 9:57 PM, Mohit Jaggi <mohit.jaggi@uber.com
>>>>>>>>> > wrote:
>>>>>>>>>
>>>>>>>>>> Yes, I do see spikes in log_storage_write_lock_wait_ns_total. Is
>>>>>>>>>> that cause or effect? :-)
>>>>>>>>>>
>>>>>>>>>> On Fri, Nov 10, 2017 at 9:34 PM, Mohit Jaggi <
>>>>>>>>>> mohit.jaggi@uber.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Thanks Bill. Please see inline:
>>>>>>>>>>>
>>>>>>>>>>> On Fri, Nov 10, 2017 at 8:06 PM, Bill Farner <wfarner@apache.org
>>>>>>>>>>> > wrote:
>>>>>>>>>>>
>>>>>>>>>>>> I suspect they are getting enqueued
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Just to be sure - the offers do eventually get through though?
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>> In one instance the offers did get through but it took several
>>>>>>>>>>> minutes. In other instances we reloaded the scheduler to let another one
>>>>>>>>>>> become the leader.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>> The most likely culprit is contention for the storage write
>>>>>>>>>>>> lock,  observable via spikes in stat
>>>>>>>>>>>> log_storage_write_lock_wait_ns_total.
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Thanks. I will check that one.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> I see that a lot of getJobUpdateDetails() and
>>>>>>>>>>>>> getTasksWithoutConfigs() calls are being made at that time
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> This sounds like API activity.  This shouldn't interfere with
>>>>>>>>>>>> offer processing directly, but could potentially slow down the scheduler as
>>>>>>>>>>>> a whole.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>> So these won't contend for locks with offer processing and task
>>>>>>>>>>> assignment threads? Only 8-10 out of 24 cores were being used on the
>>>>>>>>>>> machine. I also noticed a spike in mybatis active and bad connections.
>>>>>>>>>>> Can't say if the spike in active is due to many bad connections or vice
>>>>>>>>>>> versa or there was a 3rd source causing both of these. Are there any
>>>>>>>>>>> metrics or logs that might help here?
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>> I also notice a lot of "Timeout reached for task..." around the
>>>>>>>>>>>>> same time. Can this happen if task is in PENDING state and does not reach
>>>>>>>>>>>>> ASSIGNED due to lack of offers?
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> This is unusual.  Pending tasks are not timed out; this applies
>>>>>>>>>>>> to tasks in states where the scheduler is waiting for something else to act
>>>>>>>>>>>> and it does not hear back (via a status update).
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Perhaps they were in ASSIGNED or some other state. If updates
>>>>>>>>>>> from Mesos are being delayed or processed too slowly both these effects
>>>>>>>>>>> will occur?
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> I suggest digging into the cause of delayed offer processing
>>>>>>>>>>>> first, i suspect it might be related to the task timeouts as well.
>>>>>>>>>>>>
>>>>>>>>>>>> version close to 0.18.0
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Is the ambiguity due to custom patches?  Can you at least
>>>>>>>>>>>> indicate the last git SHA off aurora/master?  Digging much deeper in
>>>>>>>>>>>> diagnosing this may prove tricky without knowing what code is in play.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>> c85bffd10 is the commit from which we forked. The custom patch is mainly
>>>>>>>>>>> the dynamic reservation work done by Dmitri. We also have commits for
>>>>>>>>>>> offer/rescind race issue, setrootfs patch (which is not upstreamed yet).
>>>>>>>>>>>
>>>>>>>>>>> I have cherrypicked the fix for Aurora-1952 as well.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>> On Thu, Nov 9, 2017 at 9:49 PM, Mohit Jaggi <
>>>>>>>>>>>> mohit.jaggi@uber.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> I also notice a lot of "Timeout reached for task..." around
>>>>>>>>>>>>> the same time. Can this happen if task is in PENDING state and does not
>>>>>>>>>>>>> reach ASSIGNED due to lack of offers?
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Thu, Nov 9, 2017 at 4:33 PM, Mohit Jaggi <
>>>>>>>>>>>>> mohit.jaggi@uber.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Folks,
>>>>>>>>>>>>>> I have noticed some weird behavior in Aurora (version close
>>>>>>>>>>>>>> to 0.18.0). Sometimes, it shows no offers in the UI offers page. But if I
>>>>>>>>>>>>>> tail the logs I can see offers are coming in. I suspect they are getting
>>>>>>>>>>>>>> enqueued for processing by "executor" but stay there for a long time and
>>>>>>>>>>>>>> are not processed either due to locking or thread starvation.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I see that a lot of getJobUpdateDetails() and
>>>>>>>>>>>>>> getTasksWithoutConfigs() calls are being made at that time. Could these
>>>>>>>>>>>>>> calls starve the OfferManager(e.g. by contending for some lock)? What
>>>>>>>>>>>>>> should I be looking for to debug this condition further?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Mohit.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

Re: Aurora pauses adding offers

Posted by Mohit Jaggi <mo...@uber.com>.
I agree with that. I also believe that the scheduler should be resilient in
the presence of external faults. Systems that export an API must take
defensive steps to protect themselves.

If I wanted to experiment with this change without modifying Aurora code
"inline", is there a place I can inject a pre-processor for the API calls?

On Mon, Nov 27, 2017 at 4:59 PM, Bill Farner <wf...@apache.org> wrote:

> I'd also suggest focusing on the source of the congestion.  Aurora should
> offer quite high scheduling throughput, and i would rather focus energy on
> addressing bottlenecks.
>
> On Mon, Nov 27, 2017 at 1:05 PM, Mohit Jaggi <mo...@uber.com> wrote:
>
>> I think more explicit signaling is better. Increased latency can be due
>> to other conditions like network issues etc. Right now our mitigation
>> involves load-shedding and we would rather have load-avoidance. Indeed
>> proxy is not a good option. Only Aurora "knows" when it wants to
>> back-pressure.
>>
>> On Mon, Nov 27, 2017 at 12:58 PM, David McLaughlin <
>> dmclaughlin@apache.org> wrote:
>>
>>> Any log write latency will be reflected in the overall latency of the
>>> request. Increased request latency is one of the main ways any server has
>>> of telling a client that it's under load. It's then up to the client to
>>> react to this.
>>>
>>> If you want to throw error codes, you can put a proxy in front of Aurora
>>> that has request timeouts - which would send 503s to clients. But the issue
>>> with that is the requests are mostly non-idempotent so you'll need to build
>>> reconciliation logic into it.
>>>
>>> On Mon, Nov 27, 2017 at 12:13 PM, Mohit Jaggi <mo...@uber.com>
>>> wrote:
>>>
>>>> Imagine something like Spinnaker using Aurora underneath to schedule
>>>> services. That layer often "amplifies" human effort and may result in a lot
>>>> of load on Aurora. Usually that is fine but if Aurora slowed down due to
>>>> transient problems, it can signal that to upstream software in the same way
>>>> that busy web servers do during cyber Monday sales :-)
>>>>
>>>> On Mon, Nov 27, 2017 at 12:06 PM, Bill Farner <wf...@apache.org>
>>>> wrote:
>>>>
>>>>> I want to let upstream software "know" that Aurora is slowing down and
>>>>>> that it should back off
>>>>>
>>>>>
>>>>> Can you offer more detail about how Aurora is being used in this
>>>>> regard?  I haven't seen use cases in the past that would be amenable to
>>>>> this behavior, so i would like to understand better.
>>>>>
>>>>>
>>>>> On Mon, Nov 27, 2017 at 11:51 AM, Mohit Jaggi <mo...@uber.com>
>>>>> wrote:
>>>>>
>>>>>> Thanks Bill. We haven't been able to track down a specific root
>>>>>> cause (although the ZK node is known to have issues now and then but we don't
>>>>>> have logs for the specific outages we had). We will plan to move to 0.19.x
>>>>>> soon. In addition I want to let upstream software "know" that Aurora is
>>>>>> slowing down and that it should back off. To achieve this I want to send
>>>>>> 5xx error codes back when update/rollback/kill etc are called and certain
>>>>>> metrics (like log write lock wait time) indicate heavy load. Perhaps, this
>>>>>> "defense" already exists?
>>>>>>
>>>>>>
>>>>>> On Mon, Nov 13, 2017 at 8:38 AM, Bill Farner <wf...@apache.org>
>>>>>> wrote:
>>>>>>
>>>>>>> The next level is to determine why the storage lock is being held.
>>>>>>> Common causes include:
>>>>>>>
>>>>>>> 1. storage snapshot slowness, when scheduler state is very large,
>>>>>>> O(gb)
>>>>>>> 1a. long GC pauses in the scheduler, often induced by (1)
>>>>>>> 2. scheduler replicated log on slow disks
>>>>>>> 3. network issues between schedulers, schedulers to zookeeper, or
>>>>>>> between zookeepers
>>>>>>>
>>>>>>> As an immediate (partial) remedy, i suggest you upgrade to eliminate
>>>>>>> the use of SQL/mybatis in the scheduler.  This helped twitter improve (1)
>>>>>>> and (1a).
>>>>>>>
>>>>>>> commit f2755e1
>>>>>>> Author: Bill Farner <wf...@apache.org>
>>>>>>> Date:   Tue Oct 24 23:34:09 2017 -0700
>>>>>>>
>>>>>>>     Exclusively use Map-based in-memory stores for primary storage
>>>>>>>
>>>>>>>
>>>>>>> On Fri, Nov 10, 2017 at 10:07 PM, Mohit Jaggi <mo...@uber.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> and in log_storage_write_lock_wait_ns_per_event
>>>>>>>>
>>>>>>>> On Fri, Nov 10, 2017 at 9:57 PM, Mohit Jaggi <mo...@uber.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Yes, I do see spikes in log_storage_write_lock_wait_ns_total. Is
>>>>>>>>> that cause or effect? :-)
>>>>>>>>>
>>>>>>>>> On Fri, Nov 10, 2017 at 9:34 PM, Mohit Jaggi <mohit.jaggi@uber.com
>>>>>>>>> > wrote:
>>>>>>>>>
>>>>>>>>>> Thanks Bill. Please see inline:
>>>>>>>>>>
>>>>>>>>>> On Fri, Nov 10, 2017 at 8:06 PM, Bill Farner <wf...@apache.org>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> I suspect they are getting enqueued
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Just to be sure - the offers do eventually get through though?
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>> In one instance the offers did get through but it took several
>>>>>>>>>> minutes. In other instances we reloaded the scheduler to let another one
>>>>>>>>>> become the leader.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> The most likely culprit is contention for the storage write
>>>>>>>>>>> lock, observable via spikes in stat log_storage_write_lock_wait_ns_total.
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Thanks. I will check that one.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> I see that a lot of getJobUpdateDetails() and
>>>>>>>>>>>> getTasksWithoutConfigs() calls are being made at that time
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> This sounds like API activity.  This shouldn't interfere with
>>>>>>>>>>> offer processing directly, but could potentially slow down the scheduler as
>>>>>>>>>>> a whole.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>> So these won't contend for locks with offer processing and task
>>>>>>>>>> assignment threads? Only 8-10 out of 24 cores were being used on the
>>>>>>>>>> machine. I also noticed a spike in mybatis active and bad connections.
>>>>>>>>>> Can't say if the spike in active is due to many bad connections or vice
>>>>>>>>>> versa or there was a 3rd source causing both of these. Are there any
>>>>>>>>>> metrics or logs that might help here?
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> I also notice a lot of "Timeout reached for task..." around the
>>>>>>>>>>>> same time. Can this happen if task is in PENDING state and does not reach
>>>>>>>>>>>> ASSIGNED due to lack of offers?
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> This is unusual.  Pending tasks are not timed out; this applies
>>>>>>>>>>> to tasks in states where the scheduler is waiting for something else to act
>>>>>>>>>>> and it does not hear back (via a status update).
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Perhaps they were in ASSIGNED or some other state. If updates
>>>>>>>>>> from Mesos are being delayed or processed too slowly both these effects
>>>>>>>>>> will occur?
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> I suggest digging into the cause of delayed offer processing
>>>>>>>>>>> first, i suspect it might be related to the task timeouts as well.
>>>>>>>>>>>
>>>>>>>>>>> version close to 0.18.0
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Is the ambiguity due to custom patches?  Can you at least
>>>>>>>>>>> indicate the last git SHA off aurora/master?  Digging much deeper in
>>>>>>>>>>> diagnosing this may prove tricky without knowing what code is in play.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>> c85bffd10 is the commit from which we forked. The custom patch is mainly
>>>>>>>>>> the dynamic reservation work done by Dmitri. We also have commits for
>>>>>>>>>> offer/rescind race issue, setrootfs patch (which is not upstreamed yet).
>>>>>>>>>>
>>>>>>>>>> I have cherrypicked the fix for Aurora-1952 as well.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> On Thu, Nov 9, 2017 at 9:49 PM, Mohit Jaggi <
>>>>>>>>>>> mohit.jaggi@uber.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> I also notice a lot of "Timeout reached for task..." around the
>>>>>>>>>>>> same time. Can this happen if task is in PENDING state and does not reach
>>>>>>>>>>>> ASSIGNED due to lack of offers?
>>>>>>>>>>>>
>>>>>>>>>>>> On Thu, Nov 9, 2017 at 4:33 PM, Mohit Jaggi <
>>>>>>>>>>>> mohit.jaggi@uber.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Folks,
>>>>>>>>>>>>> I have noticed some weird behavior in Aurora (version close to
>>>>>>>>>>>>> 0.18.0). Sometimes, it shows no offers in the UI offers page. But if I tail
>>>>>>>>>>>>> the logs I can see offers are coming in. I suspect they are getting
>>>>>>>>>>>>> enqueued for processing by "executor" but stay there for a long time and
>>>>>>>>>>>>> are not processed either due to locking or thread starvation.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I see that a lot of getJobUpdateDetails() and
>>>>>>>>>>>>> getTasksWithoutConfigs() calls are being made at that time. Could these
>>>>>>>>>>>>> calls starve the OfferManager(e.g. by contending for some lock)? What
>>>>>>>>>>>>> should I be looking for to debug this condition further?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Mohit.
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

Re: Aurora pauses adding offers

Posted by Bill Farner <wf...@apache.org>.
I'd also suggest focusing on the source of the congestion.  Aurora should
offer quite high scheduling throughput, and i would rather focus energy on
addressing bottlenecks.

On Mon, Nov 27, 2017 at 1:05 PM, Mohit Jaggi <mo...@uber.com> wrote:

> I think more explicit signaling is better. Increased latency can be due to
> other conditions like network issues etc. Right now our mitigation involves
> load-shedding and we would rather have load-avoidance. Indeed proxy is not
> a good option. Only Aurora "knows" when it wants to back-pressure.
>
> On Mon, Nov 27, 2017 at 12:58 PM, David McLaughlin <dmclaughlin@apache.org
> > wrote:
>
>> Any log write latency will be reflected in the overall latency of the
>> request. Increased request latency is one of the main ways any server has
>> of telling a client that it's under load. It's then up to the client to
>> react to this.
>>
>> If you want to throw error codes, you can put a proxy in front of Aurora
>> that has request timeouts - which would send 503s to clients. But the issue
>> with that is the requests are mostly non-idempotent so you'll need to build
>> reconciliation logic into it.
>>
>> On Mon, Nov 27, 2017 at 12:13 PM, Mohit Jaggi <mo...@uber.com>
>> wrote:
>>
>>> Imagine something like Spinnaker using Aurora underneath to schedule
>>> services. That layer often "amplifies" human effort and may result in a lot
>>> of load on Aurora. Usually that is fine but if Aurora slowed down due to
>>> transient problems, it can signal that to upstream software in the same way
>>> that busy web servers do during cyber Monday sales :-)
>>>
>>> On Mon, Nov 27, 2017 at 12:06 PM, Bill Farner <wf...@apache.org>
>>> wrote:
>>>
>>>> I want to let upstream software "know" that Aurora is slowing down and
>>>>> that it should back off
>>>>
>>>>
>>>> Can you offer more detail about how Aurora is being used in this
>>>> regard?  I haven't seen use cases in the past that would be amenable to
>>>> this behavior, so i would like to understand better.
>>>>
>>>>
>>>> On Mon, Nov 27, 2017 at 11:51 AM, Mohit Jaggi <mo...@uber.com>
>>>> wrote:
>>>>
>>>>> Thanks Bill. We haven't been able to track down a specific root
>>>>> cause (although the ZK node is known to have issues now and then but we don't
>>>>> have logs for the specific outages we had). We will plan to move to 0.19.x
>>>>> soon. In addition I want to let upstream software "know" that Aurora is
>>>>> slowing down and that it should back off. To achieve this I want to send
>>>>> 5xx error codes back when update/rollback/kill etc are called and certain
>>>>> metrics (like log write lock wait time) indicate heavy load. Perhaps, this
>>>>> "defense" already exists?
>>>>>
>>>>>
>>>>> On Mon, Nov 13, 2017 at 8:38 AM, Bill Farner <wf...@apache.org>
>>>>> wrote:
>>>>>
>>>>>> The next level is to determine why the storage lock is being held.
>>>>>> Common causes include:
>>>>>>
>>>>>> 1. storage snapshot slowness, when scheduler state is very large,
>>>>>> O(gb)
>>>>>> 1a. long GC pauses in the scheduler, often induced by (1)
>>>>>> 2. scheduler replicated log on slow disks
>>>>>> 3. network issues between schedulers, schedulers to zookeeper, or
>>>>>> between zookeepers
>>>>>>
>>>>>> As an immediate (partial) remedy, i suggest you upgrade to eliminate
>>>>>> the use of SQL/mybatis in the scheduler.  This helped twitter improve (1)
>>>>>> and (1a).
>>>>>>
>>>>>> commit f2755e1
>>>>>> Author: Bill Farner <wf...@apache.org>
>>>>>> Date:   Tue Oct 24 23:34:09 2017 -0700
>>>>>>
>>>>>>     Exclusively use Map-based in-memory stores for primary storage
>>>>>>
>>>>>>
>>>>>> On Fri, Nov 10, 2017 at 10:07 PM, Mohit Jaggi <mo...@uber.com>
>>>>>> wrote:
>>>>>>
>>>>>>> and in log_storage_write_lock_wait_ns_per_event
>>>>>>>
>>>>>>> On Fri, Nov 10, 2017 at 9:57 PM, Mohit Jaggi <mo...@uber.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Yes, I do see spikes in log_storage_write_lock_wait_ns_total. Is
>>>>>>>> that cause or effect? :-)
>>>>>>>>
>>>>>>>> On Fri, Nov 10, 2017 at 9:34 PM, Mohit Jaggi <mo...@uber.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Thanks Bill. Please see inline:
>>>>>>>>>
>>>>>>>>> On Fri, Nov 10, 2017 at 8:06 PM, Bill Farner <wf...@apache.org>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> I suspect they are getting enqueued
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Just to be sure - the offers do eventually get through though?
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>> In one instance the offers did get through but it took several
>>>>>>>>> minutes. In other instances we reloaded the scheduler to let another one
>>>>>>>>> become the leader.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> The most likely culprit is contention for the storage write lock,
>>>>>>>>>> observable via spikes in stat log_storage_write_lock_wait_ns_total.
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Thanks. I will check that one.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> I see that a lot of getJobUpdateDetails() and
>>>>>>>>>>> getTasksWithoutConfigs() calls are being made at that time
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> This sounds like API activity.  This shouldn't interfere with
>>>>>>>>>> offer processing directly, but could potentially slow down the scheduler as
>>>>>>>>>> a whole.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>> So these won't contend for locks with offer processing and task
>>>>>>>>> assignment threads? Only 8-10 out of 24 cores were being used on the
>>>>>>>>> machine. I also noticed a spike in mybatis active and bad connections.
>>>>>>>>> Can't say if the spike in active is due to many bad connections or vice
>>>>>>>>> versa or there was a 3rd source causing both of these. Are there any
>>>>>>>>> metrics or logs that might help here?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> I also notice a lot of "Timeout reached for task..." around the
>>>>>>>>>>> same time. Can this happen if task is in PENDING state and does not reach
>>>>>>>>>>> ASSIGNED due to lack of offers?
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> This is unusual.  Pending tasks are not timed out; this applies
>>>>>>>>>> to tasks in states where the scheduler is waiting for something else to act
>>>>>>>>>> and it does not hear back (via a status update).
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Perhaps they were in ASSIGNED or some other state. If updates from
>>>>>>>>> Mesos are being delayed or processed too slowly both these effects will
>>>>>>>>> occur?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> I suggest digging into the cause of delayed offer processing
>>>>>>>>>> first, i suspect it might be related to the task timeouts as well.
>>>>>>>>>>
>>>>>>>>>> version close to 0.18.0
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Is the ambiguity due to custom patches?  Can you at least
>>>>>>>>>> indicate the last git SHA off aurora/master?  Digging much deeper in
>>>>>>>>>> diagnosing this may prove tricky without knowing what code is in play.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>> c85bffd10 is the commit from which we forked. The custom patch is mainly
>>>>>>>>> the dynamic reservation work done by Dmitri. We also have commits for
>>>>>>>>> offer/rescind race issue, setrootfs patch (which is not upstreamed yet).
>>>>>>>>>
>>>>>>>>> I have cherrypicked the fix for Aurora-1952 as well.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> On Thu, Nov 9, 2017 at 9:49 PM, Mohit Jaggi <mohit.jaggi@uber.com
>>>>>>>>>> > wrote:
>>>>>>>>>>
>>>>>>>>>>> I also notice a lot of "Timeout reached for task..." around the
>>>>>>>>>>> same time. Can this happen if task is in PENDING state and does not reach
>>>>>>>>>>> ASSIGNED due to lack of offers?
>>>>>>>>>>>
>>>>>>>>>>> On Thu, Nov 9, 2017 at 4:33 PM, Mohit Jaggi <
>>>>>>>>>>> mohit.jaggi@uber.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Folks,
>>>>>>>>>>>> I have noticed some weird behavior in Aurora (version close to
>>>>>>>>>>>> 0.18.0). Sometimes, it shows no offers in the UI offers page. But if I tail
>>>>>>>>>>>> the logs I can see offers are coming in. I suspect they are getting
>>>>>>>>>>>> enqueued for processing by "executor" but stay there for a long time and
>>>>>>>>>>>> are not processed either due to locking or thread starvation.
>>>>>>>>>>>>
>>>>>>>>>>>> I see that a lot of getJobUpdateDetails() and
>>>>>>>>>>>> getTasksWithoutConfigs() calls are being made at that time. Could these
>>>>>>>>>>>> calls starve the OfferManager(e.g. by contending for some lock)? What
>>>>>>>>>>>> should I be looking for to debug this condition further?
>>>>>>>>>>>>
>>>>>>>>>>>> Mohit.
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

Re: Aurora pauses adding offers

Posted by Mohit Jaggi <mo...@uber.com>.
I think more explicit signaling is better. Increased latency can be due to
other conditions like network issues etc. Right now our mitigation involves
load-shedding and we would rather have load-avoidance. Indeed proxy is not
a good option. Only Aurora "knows" when it wants to back-pressure.

On Mon, Nov 27, 2017 at 12:58 PM, David McLaughlin <dm...@apache.org>
wrote:

> Any log write latency will be reflected in the overall latency of the
> request. Increased request latency is one of the main ways any server has
> of telling a client that it's under load. It's then up to the client to
> react to this.
>
> If you want to throw error codes, you can put a proxy in front of Aurora
> that has request timeouts - which would send 503s to clients. But the issue
> with that is the requests are mostly non-idempotent so you'll need to build
> reconciliation logic into it.
>
> On Mon, Nov 27, 2017 at 12:13 PM, Mohit Jaggi <mo...@uber.com>
> wrote:
>
>> Imagine something like Spinnaker using Aurora underneath to schedule
>> services. That layer often "amplifies" human effort and may result in a lot
>> of load on Aurora. Usually that is fine but if Aurora slowed down due to
>> transient problems, it can signal that to upstream software in the same way
>> that busy web servers do during cyber Monday sales :-)
>>
>> On Mon, Nov 27, 2017 at 12:06 PM, Bill Farner <wf...@apache.org> wrote:
>>
>>> I want to let upstream software "know" that Aurora is slowing down and
>>>> that it should back off
>>>
>>>
>>> Can you offer more detail about how Aurora is being used in this
>>> regard?  I haven't seen use cases in the past that would be amenable to
>>> this behavior, so i would like to understand better.
>>>
>>>
>>> On Mon, Nov 27, 2017 at 11:51 AM, Mohit Jaggi <mo...@uber.com>
>>> wrote:
>>>
>>>> Thanks Bill. We haven't been able to track down a specific root
>>>> cause (although the ZK node is known to have issues now and then but we don't
>>>> have logs for the specific outages we had). We will plan to move to 0.19.x
>>>> soon. In addition I want to let upstream software "know" that Aurora is
>>>> slowing down and that it should back off. To achieve this I want to send
>>>> 5xx error codes back when update/rollback/kill etc are called and certain
>>>> metrics (like log write lock wait time) indicate heavy load. Perhaps, this
>>>> "defense" already exists?
>>>>
>>>>
>>>> On Mon, Nov 13, 2017 at 8:38 AM, Bill Farner <wf...@apache.org>
>>>> wrote:
>>>>
>>>>> The next level is to determine why the storage lock is being held.
>>>>> Common causes include:
>>>>>
>>>>> 1. storage snapshot slowness, when scheduler state is very large, O(gb)
>>>>> 1a. long GC pauses in the scheduler, often induced by (1)
>>>>> 2. scheduler replicated log on slow disks
>>>>> 3. network issues between schedulers, schedulers to zookeeper, or
>>>>> between zookeepers
>>>>>
>>>>> As an immediate (partial) remedy, i suggest you upgrade to eliminate
>>>>> the use of SQL/mybatis in the scheduler.  This helped twitter improve (1)
>>>>> and (1a).
>>>>>
>>>>> commit f2755e1
>>>>> Author: Bill Farner <wf...@apache.org>
>>>>> Date:   Tue Oct 24 23:34:09 2017 -0700
>>>>>
>>>>>     Exclusively use Map-based in-memory stores for primary storage
>>>>>
>>>>>
>>>>> On Fri, Nov 10, 2017 at 10:07 PM, Mohit Jaggi <mo...@uber.com>
>>>>> wrote:
>>>>>
>>>>>> and in log_storage_write_lock_wait_ns_per_event
>>>>>>
>>>>>> On Fri, Nov 10, 2017 at 9:57 PM, Mohit Jaggi <mo...@uber.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Yes, I do see spikes in log_storage_write_lock_wait_ns_total. Is
>>>>>>> that cause or effect? :-)
>>>>>>>
>>>>>>> On Fri, Nov 10, 2017 at 9:34 PM, Mohit Jaggi <mo...@uber.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Thanks Bill. Please see inline:
>>>>>>>>
>>>>>>>> On Fri, Nov 10, 2017 at 8:06 PM, Bill Farner <wf...@apache.org>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> I suspect they are getting enqueued
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Just to be sure - the offers do eventually get through though?
>>>>>>>>>
>>>>>>>>>
>>>>>>>> In one instance the offers did get through but it took several
>>>>>>>> minutes. In other instances we reloaded the scheduler to let another one
>>>>>>>> become the leader.
>>>>>>>>
>>>>>>>>
>>>>>>>>> The most likely culprit is contention for the storage write lock,  observable
>>>>>>>>> via spikes in stat log_storage_write_lock_wait_ns_total.
>>>>>>>>>
>>>>>>>>
>>>>>>>> Thanks. I will check that one.
>>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>> I see that a lot of getJobUpdateDetails() and
>>>>>>>>>> getTasksWithoutConfigs() calls are being made at that time
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> This sounds like API activity.  This shouldn't interfere with
>>>>>>>>> offer processing directly, but could potentially slow down the scheduler as
>>>>>>>>> a whole.
>>>>>>>>>
>>>>>>>>>
>>>>>>>> So these won't contend for locks with offer processing and task
>>>>>>>> assignment threads? Only 8-10 out of 24 cores were being used on the
>>>>>>>> machine. I also noticed a spike in mybatis active and bad connections.
>>>>>>>> Can't say if the spike in active is due to many bad connections or vice
>>>>>>>> versa or there was a 3rd source causing both of these. Are there any
>>>>>>>> metrics or logs that might help here?
>>>>>>>>
>>>>>>>>
>>>>>>>>> I also notice a lot of "Timeout reached for task..." around the
>>>>>>>>>> same time. Can this happen if task is in PENDING state and does not reach
>>>>>>>>>> ASSIGNED due to lack of offers?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> This is unusual.  Pending tasks are not timed out; this applies to
>>>>>>>>> tasks in states where the scheduler is waiting for something else to act
>>>>>>>>> and it does not hear back (via a status update).
>>>>>>>>>
>>>>>>>>
>>>>>>>> Perhaps they were in ASSIGNED or some other state. If updates from
>>>>>>>> Mesos are being delayed or processed too slowly both these effects will
>>>>>>>> occur?
>>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>> I suggest digging into the cause of delayed offer processing
>>>>>>>>> first, i suspect it might be related to the task timeouts as well.
>>>>>>>>>
>>>>>>>>> version close to 0.18.0
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Is the ambiguity due to custom patches?  Can you at least
>>>>>>>>> indicate the last git SHA off aurora/master?  Digging much deeper in
>>>>>>>>> diagnosing this may prove tricky without knowing what code is in play.
>>>>>>>>>
>>>>>>>>>
>>>>>>>> c85bffd10 is the commit from which we forked. The custom patch is mainly
>>>>>>>> the dynamic reservation work done by Dmitri. We also have commits for
>>>>>>>> offer/rescind race issue, setrootfs patch (which is not upstreamed yet).
>>>>>>>>
>>>>>>>> I have cherrypicked the fix for Aurora-1952 as well.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>> On Thu, Nov 9, 2017 at 9:49 PM, Mohit Jaggi <mo...@uber.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> I also notice a lot of "Timeout reached for task..." around the
>>>>>>>>>> same time. Can this happen if task is in PENDING state and does not reach
>>>>>>>>>> ASSIGNED due to lack of offers?
>>>>>>>>>>
>>>>>>>>>> On Thu, Nov 9, 2017 at 4:33 PM, Mohit Jaggi <mohit.jaggi@uber.com
>>>>>>>>>> > wrote:
>>>>>>>>>>
>>>>>>>>>>> Folks,
>>>>>>>>>>> I have noticed some weird behavior in Aurora (version close to
>>>>>>>>>>> 0.18.0). Sometimes, it shows no offers in the UI offers page. But if I tail
>>>>>>>>>>> the logs I can see offers are coming in. I suspect they are getting
>>>>>>>>>>> enqueued for processing by "executor" but stay there for a long time and
>>>>>>>>>>> are not processed either due to locking or thread starvation.
>>>>>>>>>>>
>>>>>>>>>>> I see that a lot of getJobUpdateDetails() and
>>>>>>>>>>> getTasksWithoutConfigs() calls are being made at that time. Could these
>>>>>>>>>>> calls starve the OfferManager(e.g. by contending for some lock)? What
>>>>>>>>>>> should I be looking for to debug this condition further?
>>>>>>>>>>>
>>>>>>>>>>> Mohit.
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

Re: Aurora pauses adding offers

Posted by David McLaughlin <dm...@apache.org>.
Any log write latency will be reflected in the overall latency of the
request. Increased request latency is one of the main ways any server has
of telling a client that it's under load. It's then up to the client to
react to this.

If you want to throw error codes, you can put a proxy in front of Aurora
that has request timeouts - which would send 503s to clients. But the issue
with that is the requests are mostly non-idempotent so you'll need to build
reconciliation logic into it.
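
To sketch the client side of that contract: the snippet below (plain
java.net.http, so Java 11+; the latency threshold and retry counts are
illustrative) treats a 503 as "back off and retry" and an unusually slow but
successful response as a cue to throttle its own request rate.

import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

// Caller-side back-off sketch for a server that signals overload via latency or 503s.
public class BackoffClient {
  private static final Duration SLOW = Duration.ofSeconds(5);

  public static HttpResponse<String> send(HttpClient client, HttpRequest request)
      throws Exception {
    long sleepMs = 500;
    for (int attempt = 0; attempt < 5; attempt++) {
      long start = System.nanoTime();
      HttpResponse<String> response =
          client.send(request, HttpResponse.BodyHandlers.ofString());
      Duration elapsed = Duration.ofNanos(System.nanoTime() - start);

      if (response.statusCode() != 503) {
        if (elapsed.compareTo(SLOW) > 0) {
          // Slow but successful: a signal to lower the request rate, not to retry.
          System.err.println("slow response (" + elapsed.toMillis() + " ms), throttling caller");
        }
        return response;
      }
      // 503: exponential backoff with jitter, then retry. Only safe as-is for
      // idempotent calls; mutations need the reconciliation mentioned above.
      Thread.sleep(sleepMs + (long) (Math.random() * 250));
      sleepMs *= 2;
    }
    throw new IllegalStateException("server still shedding load after retries");
  }
}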

On Mon, Nov 27, 2017 at 12:13 PM, Mohit Jaggi <mo...@uber.com> wrote:

> Imagine something like Spinnaker using Aurora underneath to schedule
> services. That layer often "amplifies" human effort and may result in a lot
> of load on Aurora. Usually that is fine but if Aurora slowed down due to
> transient problems, it can signal that to upstream software in the same way
> that busy web servers do during cyber Monday sales :-)
>
> On Mon, Nov 27, 2017 at 12:06 PM, Bill Farner <wf...@apache.org> wrote:
>
>> I want to let upstream software "know" that Aurora is slowing down and
>>> that it should back off
>>
>>
>> Can you offer more detail about how Aurora is being used in this regard?
>> I haven't seen use cases in the past that would be amenable to this
>> behavior, so i would like to understand better.
>>
>>
>> On Mon, Nov 27, 2017 at 11:51 AM, Mohit Jaggi <mo...@uber.com>
>> wrote:
>>
>>> Thanks Bill. We haven't been able to track down a specific root
>>> cause (although the ZK node is known to have issues now and then but we don't
>>> have logs for the specific outages we had). We will plan to move to 0.19.x
>>> soon. In addition I want to let upstream software "know" that Aurora is
>>> slowing down and that it should back off. To achieve this I want to send
>>> 5xx error codes back when update/rollback/kill etc are called and certain
>>> metrics (like log write lock wait time) indicate heavy load. Perhaps, this
>>> "defense" already exists?
>>>
>>>
>>> On Mon, Nov 13, 2017 at 8:38 AM, Bill Farner <wf...@apache.org> wrote:
>>>
>>>> The next level is to determine why the storage lock is being held.
>>>> Common causes include:
>>>>
>>>> 1. storage snapshot slowness, when scheduler state is very large, O(gb)
>>>> 1a. long GC pauses in the scheduler, often induced by (1)
>>>> 2. scheduler replicated log on slow disks
>>>> 3. network issues between schedulers, schedulers to zookeeper, or
>>>> between zookeepers
>>>>
>>>> As an immediate (partial) remedy, i suggest you upgrade to eliminate
>>>> the use of SQL/mybatis in the scheduler.  This helped twitter improve (1)
>>>> and (1a).
>>>>
>>>> commit f2755e1
>>>> Author: Bill Farner <wf...@apache.org>
>>>> Date:   Tue Oct 24 23:34:09 2017 -0700
>>>>
>>>>     Exclusively use Map-based in-memory stores for primary storage
>>>>
>>>>
>>>> On Fri, Nov 10, 2017 at 10:07 PM, Mohit Jaggi <mo...@uber.com>
>>>> wrote:
>>>>
>>>>> and in log_storage_write_lock_wait_ns_per_event
>>>>>
>>>>> On Fri, Nov 10, 2017 at 9:57 PM, Mohit Jaggi <mo...@uber.com>
>>>>> wrote:
>>>>>
>>>>>> Yes, I do see spikes in log_storage_write_lock_wait_ns_total. Is
>>>>>> that cause or effect? :-)
>>>>>>
>>>>>> On Fri, Nov 10, 2017 at 9:34 PM, Mohit Jaggi <mo...@uber.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Thanks Bill. Please see inline:
>>>>>>>
>>>>>>> On Fri, Nov 10, 2017 at 8:06 PM, Bill Farner <wf...@apache.org>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> I suspect they are getting enqueued
>>>>>>>>
>>>>>>>>
>>>>>>>> Just to be sure - the offers do eventually get through though?
>>>>>>>>
>>>>>>>>
>>>>>>> In one instance the offers did get through but it took several
>>>>>>> minutes. In other instances we reloaded the scheduler to let another one
>>>>>>> become the leader.
>>>>>>>
>>>>>>>
>>>>>>>> The most likely culprit is contention for the storage write lock,  observable
>>>>>>>> via spikes in stat log_storage_write_lock_wait_ns_total.
>>>>>>>>
>>>>>>>
>>>>>>> Thanks. I will check that one.
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>> I see that a lot of getJobUpdateDetails() and
>>>>>>>>> getTasksWithoutConfigs() calls are being made at that time
>>>>>>>>
>>>>>>>>
>>>>>>>> This sounds like API activity.  This shouldn't interfere with offer
>>>>>>>> processing directly, but could potentially slow down the scheduler as a
>>>>>>>> whole.
>>>>>>>>
>>>>>>>>
>>>>>>> So these won't contend for locks with offer processing and task
>>>>>>> assignment threads? Only 8-10 out of 24 cores were being used on the
>>>>>>> machine. I also noticed a spike in mybatis active and bad connections.
>>>>>>> Can't say if the spike in active is due to many bad connections or vice
>>>>>>> versa or there was a 3rd source causing both of these. Are there any
>>>>>>> metrics or logs that might help here?
>>>>>>>
>>>>>>>
>>>>>>>> I also notice a lot of "Timeout reached for task..." around the
>>>>>>>>> same time. Can this happen if task is in PENDING state and does not reach
>>>>>>>>> ASSIGNED due to lack of offers?
>>>>>>>>
>>>>>>>>
>>>>>>>> This is unusual.  Pending tasks are not timed out; this applies to
>>>>>>>> tasks in states where the scheduler is waiting for something else to act
>>>>>>>> and it does not hear back (via a status update).
>>>>>>>>
>>>>>>>
>>>>>>> Perhaps they were in ASSIGNED or some other state. If updates from
>>>>>>> Mesos are being delayed or processed too slowly both these effects will
>>>>>>> occur?
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>> I suggest digging into the cause of delayed offer processing first,
>>>>>>>> i suspect it might be related to the task timeouts as well.
>>>>>>>>
>>>>>>>> version close to 0.18.0
>>>>>>>>
>>>>>>>>
>>>>>>>> Is the ambiguity due to custom patches?  Can you at least
>>>>>>>> indicate the last git SHA off aurora/master?  Digging much deeper in
>>>>>>>> diagnosing this may prove tricky without knowing what code is in play.
>>>>>>>>
>>>>>>>>
>>>>>>> c85bffd10 is the commit from which we forked. The custom patch is mainly
>>>>>>> the dynamic reservation work done by Dmitri. We also have commits for
>>>>>>> offer/rescind race issue, setrootfs patch (which is not upstreamed yet).
>>>>>>>
>>>>>>> I have cherrypicked the fix for Aurora-1952 as well.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>> On Thu, Nov 9, 2017 at 9:49 PM, Mohit Jaggi <mo...@uber.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> I also notice a lot of "Timeout reached for task..." around the
>>>>>>>>> same time. Can this happen if task is in PENDING state and does not reach
>>>>>>>>> ASSIGNED due to lack of offers?
>>>>>>>>>
>>>>>>>>> On Thu, Nov 9, 2017 at 4:33 PM, Mohit Jaggi <mo...@uber.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Folks,
>>>>>>>>>> I have noticed some weird behavior in Aurora (version close to
>>>>>>>>>> 0.18.0). Sometimes, it shows no offers in the UI offers page. But if I tail
>>>>>>>>>> the logs I can see offers are coming in. I suspect they are getting
>>>>>>>>>> enqueued for processing by "executor" but stay there for a long time and
>>>>>>>>>> are not processed either due to locking or thread starvation.
>>>>>>>>>>
>>>>>>>>>> I see that a lot of getJobUpdateDetails() and
>>>>>>>>>> getTasksWithoutConfigs() calls are being made at that time. Could these
>>>>>>>>>> calls starve the OfferManager(e.g. by contending for some lock)? What
>>>>>>>>>> should I be looking for to debug this condition further?
>>>>>>>>>>
>>>>>>>>>> Mohit.
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

Re: Aurora pauses adding offers

Posted by Mohit Jaggi <mo...@uber.com>.
Imagine something like Spinnaker using Aurora underneath to schedule
services. That layer often "amplifies" human effort and may result in a lot
of load on Aurora. Usually that is fine but if Aurora slowed down due to
transient problems, it can signal that to upstream software in the same way
that busy web servers do during cyber Monday sales :-)

On Mon, Nov 27, 2017 at 12:06 PM, Bill Farner <wf...@apache.org> wrote:

> I want to let upstream software "know" that Aurora is slowing down and
>> that it should back off
>
>
> Can you offer more detail about how Aurora is being used in this regard?
> I haven't seen use cases in the past that would be amenable to this
> behavior, so i would like to understand better.
>
>
> On Mon, Nov 27, 2017 at 11:51 AM, Mohit Jaggi <mo...@uber.com>
> wrote:
>
>> Thanks Bill. We haven't been able to track down a specific root
>> cause (although the ZK node is known to have issues now and then but we don't
>> have logs for the specific outages we had). We will plan to move to 0.19.x
>> soon. In addition I want to let upstream software "know" that Aurora is
>> slowing down and that it should back off. To achieve this I want to send
>> 5xx error codes back when update/rollback/kill etc are called and certain
>> metrics (like log write lock wait time) indicate heavy load. Perhaps, this
>> "defense" already exists?
>>
>>
>> On Mon, Nov 13, 2017 at 8:38 AM, Bill Farner <wf...@apache.org> wrote:
>>
>>> The next level is to determine why the storage lock is being held.
>>> Common causes include:
>>>
>>> 1. storage snapshot slowness, when scheduler state is very large, O(gb)
>>> 1a. long GC pauses in the scheduler, often induced by (1)
>>> 2. scheduler replicated log on slow disks
>>> 3. network issues between schedulers, schedulers to zookeeper, or
>>> between zookeepers
>>>
>>> As an immediate (partial) remedy, i suggest you upgrade to eliminate the
>>> use of SQL/mybatis in the scheduler.  This helped twitter improve (1) and
>>> (1a).
>>>
>>> commit f2755e1
>>> Author: Bill Farner <wf...@apache.org>
>>> Date:   Tue Oct 24 23:34:09 2017 -0700
>>>
>>>     Exclusively use Map-based in-memory stores for primary storage
>>>
>>>
>>> On Fri, Nov 10, 2017 at 10:07 PM, Mohit Jaggi <mo...@uber.com>
>>> wrote:
>>>
>>>> and in log_storage_write_lock_wait_ns_per_event
>>>>
>>>> On Fri, Nov 10, 2017 at 9:57 PM, Mohit Jaggi <mo...@uber.com>
>>>> wrote:
>>>>
>>>>> Yes, I do see spikes in log_storage_write_lock_wait_ns_total. Is that
>>>>> cause or effect? :-)
>>>>>
>>>>> On Fri, Nov 10, 2017 at 9:34 PM, Mohit Jaggi <mo...@uber.com>
>>>>> wrote:
>>>>>
>>>>>> Thanks Bill. Please see inline:
>>>>>>
>>>>>> On Fri, Nov 10, 2017 at 8:06 PM, Bill Farner <wf...@apache.org>
>>>>>> wrote:
>>>>>>
>>>>>>> I suspect they are getting enqueued
>>>>>>>
>>>>>>>
>>>>>>> Just to be sure - the offers do eventually get through though?
>>>>>>>
>>>>>>>
>>>>>> In one instance the offers did get through but it took several
>>>>>> minutes. In other instances we reloaded the scheduler to let another one
>>>>>> become the leader.
>>>>>>
>>>>>>
>>>>>>> The most likely culprit is contention for the storage write lock,  observable
>>>>>>> via spikes in stat log_storage_write_lock_wait_ns_total.
>>>>>>>
>>>>>>
>>>>>> Thanks. I will check that one.
>>>>>>
>>>>>>
>>>>>>>
>>>>>>> I see that a lot of getJobUpdateDetails() and
>>>>>>>> getTasksWithoutConfigs() calls are being made at that time
>>>>>>>
>>>>>>>
>>>>>>> This sounds like API activity.  This shouldn't interfere with offer
>>>>>>> processing directly, but could potentially slow down the scheduler as a
>>>>>>> whole.
>>>>>>>
>>>>>>>
>>>>>> So these won't contend for locks with offer processing and task
>>>>>> assignment threads? Only 8-10 out of 24 cores were being used on the
>>>>>> machine. I also noticed a spike in mybatis active and bad connections.
>>>>>> Can't say if the spike in active is due to many bad connections or vice
>>>>>> versa or there was a 3rd source causing both of these. Are there any
>>>>>> metrics or logs that might help here?
>>>>>>
>>>>>>
>>>>>>> I also notice a lot of "Timeout reached for task..." around the same
>>>>>>>> time. Can this happen if task is in PENDING state and does not reach
>>>>>>>> ASSIGNED due to lack of offers?
>>>>>>>
>>>>>>>
>>>>>>> This is unusual.  Pending tasks are not timed out; this applies to
>>>>>>> tasks in states where the scheduler is waiting for something else to act
>>>>>>> and it does not hear back (via a status update).
>>>>>>>
>>>>>>
>>>>>> Perhaps they were in ASSIGNED or some other state. If updates from
>>>>>> Mesos are being delayed or processed too slowly both these effects will
>>>>>> occur?
>>>>>>
>>>>>>
>>>>>>>
>>>>>>> I suggest digging into the cause of delayed offer processing first,
>>>>>>> i suspect it might be related to the task timeouts as well.
>>>>>>>
>>>>>>> version close to 0.18.0
>>>>>>>
>>>>>>>
>>>>>>> Is the ambiguity due to custom patches?  Can you at least
>>>>>>> indicate the last git SHA off aurora/master?  Digging much deeper in
>>>>>>> diagnosing this may prove tricky without knowing what code is in play.
>>>>>>>
>>>>>>>
>>>>>> c85bffd10 is the commit from which we forked. The custom patch is mainly
>>>>>> the dynamic reservation work done by Dmitri. We also have commits for
>>>>>> offer/rescind race issue, setrootfs patch (which is not upstreamed yet).
>>>>>>
>>>>>> I have cherrypicked the fix for Aurora-1952 as well.
>>>>>>
>>>>>>
>>>>>>
>>>>>>> On Thu, Nov 9, 2017 at 9:49 PM, Mohit Jaggi <mo...@uber.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> I also notice a lot of "Timeout reached for task..." around the
>>>>>>>> same time. Can this happen if task is in PENDING state and does not reach
>>>>>>>> ASSIGNED due to lack of offers?
>>>>>>>>
>>>>>>>> On Thu, Nov 9, 2017 at 4:33 PM, Mohit Jaggi <mo...@uber.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Folks,
>>>>>>>>> I have noticed some weird behavior in Aurora (version close to
>>>>>>>>> 0.18.0). Sometimes, it shows no offers in the UI offers page. But if I tail
>>>>>>>>> the logs I can see offers are coming in. I suspect they are getting
>>>>>>>>> enqueued for processing by "executor" but stay there for a long time and
>>>>>>>>> are not processed either due to locking or thread starvation.
>>>>>>>>>
>>>>>>>>> I see that a lot of getJobUpdateDetails() and
>>>>>>>>> getTasksWithoutConfigs() calls are being made at that time. Could these
>>>>>>>>> calls starve the OfferManager(e.g. by contending for some lock)? What
>>>>>>>>> should I be looking for to debug this condition further?
>>>>>>>>>
>>>>>>>>> Mohit.
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

Re: Aurora pauses adding offers

Posted by Bill Farner <wf...@apache.org>.
>
> I want to let upstream software "know" that Aurora is slowing down and
> that it should back off


Can you offer more detail about how Aurora is being used in this regard?  I
haven't seen use cases in the past that would be amenable to this behavior,
so i would like to understand better.


On Mon, Nov 27, 2017 at 11:51 AM, Mohit Jaggi <mo...@uber.com> wrote:

> Thanks Bill. We haven't been able to track down a specific root
> cause (although the ZK node is known to have issues now and then but we don't
> have logs for the specific outages we had). We will plan to move to 0.19.x
> soon. In addition I want to let upstream software "know" that Aurora is
> slowing down and that it should back off. To achieve this I want to send
> 5xx error codes back when update/rollback/kill etc are called and certain
> metrics (like log write lock wait time) indicate heavy load. Perhaps, this
> "defense" already exists?
>
>
> On Mon, Nov 13, 2017 at 8:38 AM, Bill Farner <wf...@apache.org> wrote:
>
>> The next level is to determine why the storage lock is being held.
>> Common causes include:
>>
>> 1. storage snapshot slowness, when scheduler state is very large, O(gb)
>> 1a. long GC pauses in the scheduler, often induced by (1)
>> 2. scheduler replicated log on slow disks
>> 3. network issues between schedulers, schedulers to zookeeper, or between
>> zookeepers
>>
>> As an immediate (partial) remedy, i suggest you upgrade to eliminate the
>> use of SQL/mybatis in the scheduler.  This helped twitter improve (1) and
>> (1a).
>>
>> commit f2755e1
>> Author: Bill Farner <wf...@apache.org>
>> Date:   Tue Oct 24 23:34:09 2017 -0700
>>
>>     Exclusively use Map-based in-memory stores for primary storage
>>
>>
>> On Fri, Nov 10, 2017 at 10:07 PM, Mohit Jaggi <mo...@uber.com>
>> wrote:
>>
>>> and in log_storage_write_lock_wait_ns_per_event
>>>
>>> On Fri, Nov 10, 2017 at 9:57 PM, Mohit Jaggi <mo...@uber.com>
>>> wrote:
>>>
>>>> Yes, I do see spikes in log_storage_write_lock_wait_ns_total. Is that
>>>> cause or effect? :-)
>>>>
>>>> On Fri, Nov 10, 2017 at 9:34 PM, Mohit Jaggi <mo...@uber.com>
>>>> wrote:
>>>>
>>>>> Thanks Bill. Please see inline:
>>>>>
>>>>> On Fri, Nov 10, 2017 at 8:06 PM, Bill Farner <wf...@apache.org>
>>>>> wrote:
>>>>>
>>>>>> I suspect they are getting enqueued
>>>>>>
>>>>>>
>>>>>> Just to be sure - the offers do eventually get through though?
>>>>>>
>>>>>>
>>>>> In one instance the offers did get through but it took several
>>>>> minutes. In other instances we reloaded the scheduler to let another one
>>>>> become the leader.
>>>>>
>>>>>
>>>>>> The most likely culprit is contention for the storage write lock,  observable
>>>>>> via spikes in stat log_storage_write_lock_wait_ns_total.
>>>>>>
>>>>>
>>>>> Thanks. I will check that one.
>>>>>
>>>>>
>>>>>>
>>>>>> I see that a lot of getJobUpdateDetails() and
>>>>>>> getTasksWithoutConfigs() calls are being made at that time
>>>>>>
>>>>>>
>>>>>> This sounds like API activity.  This shouldn't interfere with offer
>>>>>> processing directly, but could potentially slow down the scheduler as a
>>>>>> whole.
>>>>>>
>>>>>>
>>>>> So these won't contend for locks with offer processing and task
>>>>> assignment threads? Only 8-10 out of 24 cores were being used on the
>>>>> machine. I also noticed a spike in mybatis active and bad connections.
>>>>> Can't say if the spike in active is due to many bad connections or vice
>>>>> versa or there was a 3rd source causing both of these. Are there any
>>>>> metrics or logs that might help here?
>>>>>
>>>>>
>>>>>> I also notice a lot of "Timeout reached for task..." around the same
>>>>>>> time. Can this happen if task is in PENDING state and does not reach
>>>>>>> ASSIGNED due to lack of offers?
>>>>>>
>>>>>>
>>>>>> This is unusual.  Pending tasks are not timed out; this applies to
>>>>>> tasks in states where the scheduler is waiting for something else to act
>>>>>> and it does not hear back (via a status update).
>>>>>>
>>>>>
>>>>> Perhaps they were in ASSIGNED or some other state. If updates from
>>>>> Mesos are being delayed or processed too slowly both these effects will
>>>>> occur?
>>>>>
>>>>>
>>>>>>
>>>>>> I suggest digging into the cause of delayed offer processing first, i
>>>>>> suspect it might be related to the task timeouts as well.
>>>>>>
>>>>>> version close to 0.18.0
>>>>>>
>>>>>>
>>>>>> Is the ambiguity due to custom patches?  Can you at least indicate
>>>>>> the last git SHA off aurora/master?  Digging much deeper in diagnosing this
>>>>>> may prove tricky without knowing what code is in play.
>>>>>>
>>>>>>
>>>>> c85bffd10 is the commit from which we forked. The custom patch is mainly
>>>>> the dynamic reservation work done by Dmitri. We also have commits for
>>>>> offer/rescind race issue, setrootfs patch (which is not upstreamed yet).
>>>>>
>>>>> I have cherrypicked the fix for Aurora-1952 as well.
>>>>>
>>>>>
>>>>>
>>>>>> On Thu, Nov 9, 2017 at 9:49 PM, Mohit Jaggi <mo...@uber.com>
>>>>>> wrote:
>>>>>>
>>>>>>> I also notice a lot of "Timeout reached for task..." around the same
>>>>>>> time. Can this happen if task is in PENDING state and does not reach
>>>>>>> ASSIGNED due to lack of offers?
>>>>>>>
>>>>>>> On Thu, Nov 9, 2017 at 4:33 PM, Mohit Jaggi <mo...@uber.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Folks,
>>>>>>>> I have noticed some weird behavior in Aurora (version close to
>>>>>>>> 0.18.0). Sometimes, it shows no offers in the UI offers page. But if I tail
>>>>>>>> the logs I can see offers are coming in. I suspect they are getting
>>>>>>>> enqueued for processing by "executor" but stay there for a long time and
>>>>>>>> are not processed either due to locking or thread starvation.
>>>>>>>>
>>>>>>>> I see that a lot of getJobUpdateDetails() and
>>>>>>>> getTasksWithoutConfigs() calls are being made at that time. Could these
>>>>>>>> calls starve the OfferManager(e.g. by contending for some lock)? What
>>>>>>>> should I be looking for to debug this condition further?
>>>>>>>>
>>>>>>>> Mohit.
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

Re: Aurora pauses adding offers

Posted by Mohit Jaggi <mo...@uber.com>.
Thanks Bill. We haven't been able to track down a specific root
cause (although the ZK node is known to have issues now and then but we don't
have logs for the specific outages we had). We will plan to move to 0.19.x
soon. In addition I want to let upstream software "know" that Aurora is
slowing down and that it should back off. To achieve this I want to send
5xx error codes back when update/rollback/kill etc are called and certain
metrics (like log write lock wait time) indicate heavy load. Perhaps, this
"defense" already exists?


On Mon, Nov 13, 2017 at 8:38 AM, Bill Farner <wf...@apache.org> wrote:

> The next level is to determine why the storage lock is being held.  Common
> causes include:
>
> 1. storage snapshot slowness, when scheduler state is very large, O(gb)
> 1a. long GC pauses in the scheduler, often induced by (1)
> 2. scheduler replicated log on slow disks
> 3. network issues between schedulers, schedulers to zookeeper, or between
> zookeepers
>
> As an immediate (partial) remedy, i suggest you upgrade to eliminate the
> use of SQL/mybatis in the scheduler.  This helped twitter improve (1) and
> (1a).
>
> commit f2755e1
> Author: Bill Farner <wf...@apache.org>
> Date:   Tue Oct 24 23:34:09 2017 -0700
>
>     Exclusively use Map-based in-memory stores for primary storage
>
>
> On Fri, Nov 10, 2017 at 10:07 PM, Mohit Jaggi <mo...@uber.com>
> wrote:
>
>> and in log_storage_write_lock_wait_ns_per_event
>>
>> On Fri, Nov 10, 2017 at 9:57 PM, Mohit Jaggi <mo...@uber.com>
>> wrote:
>>
>>> Yes, I do see spikes in log_storage_write_lock_wait_ns_total. Is that
>>> cause or effect? :-)
>>>
>>> On Fri, Nov 10, 2017 at 9:34 PM, Mohit Jaggi <mo...@uber.com>
>>> wrote:
>>>
>>>> Thanks Bill. Please see inline:
>>>>
>>>> On Fri, Nov 10, 2017 at 8:06 PM, Bill Farner <wf...@apache.org>
>>>> wrote:
>>>>
>>>>> I suspect they are getting enqueued
>>>>>
>>>>>
>>>>> Just to be sure - the offers do eventually get through though?
>>>>>
>>>>>
>>>> In one instance the offers did get through but it took several minutes.
>>>> In other instances we reloaded the scheduler to let another one become the
>>>> leader.
>>>>
>>>>
>>>>> The most likely culprit is contention for the storage write lock,  observable
>>>>> via spikes in stat log_storage_write_lock_wait_ns_total.
>>>>>
>>>>
>>>> Thanks. I will check that one.
>>>>
>>>>
>>>>>
>>>>> I see that a lot of getJobUpdateDetails() and getTasksWithoutConfigs()
>>>>>> calls are being made at that time
>>>>>
>>>>>
>>>>> This sounds like API activity.  This shouldn't interfere with offer
>>>>> processing directly, but could potentially slow down the scheduler as a
>>>>> whole.
>>>>>
>>>>>
>>>> So these won't contend for locks with offer processing and task
>>>> assignment threads? Only 8-10 out of 24 cores were being used on the
>>>> machine. I also noticed a spike in mybatis active and bad connections.
>>>> Can't say if the spike in active is due to many bad connections or vice
>>>> versa or there was a 3rd source causing both of these. Are there any
>>>> metrics or logs that might help here?
>>>>
>>>>
>>>>> I also notice a lot of "Timeout reached for task..." around the same
>>>>>> time. Can this happen if task is in PENDING state and does not reach
>>>>>> ASSIGNED due to lack of offers?
>>>>>
>>>>>
>>>>> This is unusual.  Pending tasks are not timed out; this applies to
>>>>> tasks in states where the scheduler is waiting for something else to act
>>>>> and it does not hear back (via a status update).
>>>>>
>>>>
>>>> Perhaps they were in ASSIGNED or some other state. If updates from
>>>> Mesos are being delayed or processed too slowly both these effects will
>>>> occur?
>>>>
>>>>
>>>>>
>>>>> I suggest digging into the cause of delayed offer processing first, i
>>>>> suspect it might be related to the task timeouts as well.
>>>>>
>>>>> version close to 0.18.0
>>>>>
>>>>>
>>>>> Is the ambiguity due to custom patches?  Can you at least indicate
>>>>> the last git SHA off aurora/master?  Digging much deeper in diagnosing this
>>>>> may prove tricky without knowing what code is in play.
>>>>>
>>>>>
>>>> c85bffd10 is the commit from which we forked. The custom patch is mainly
>>>> the dynamic reservation work done by Dmitri. We also have commits for
>>>> offer/rescind race issue, setrootfs patch (which is not upstreamed yet).
>>>>
>>>> I have cherrypicked the fix for Aurora-1952 as well.
>>>>
>>>>
>>>>
>>>>> On Thu, Nov 9, 2017 at 9:49 PM, Mohit Jaggi <mo...@uber.com>
>>>>> wrote:
>>>>>
>>>>>> I also notice a lot of "Timeout reached for task..." around the same
>>>>>> time. Can this happen if task is in PENDING state and does not reach
>>>>>> ASSIGNED due to lack of offers?
>>>>>>
>>>>>> On Thu, Nov 9, 2017 at 4:33 PM, Mohit Jaggi <mo...@uber.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Folks,
>>>>>>> I have noticed some weird behavior in Aurora (version close to
>>>>>>> 0.18.0). Sometimes, it shows no offers in the UI offers page. But if I tail
>>>>>>> the logs I can see offers are coming in. I suspect they are getting
>>>>>>> enqueued for processing by "executor" but stay there for a long time and
>>>>>>> are not processed either due to locking or thread starvation.
>>>>>>>
>>>>>>> I see that a lot of getJobUpdateDetails() and
>>>>>>> getTasksWithoutConfigs() calls are being made at that time. Could these
>>>>>>> calls starve the OfferManager(e.g. by contending for some lock)? What
>>>>>>> should I be looking for to debug this condition further?
>>>>>>>
>>>>>>> Mohit.
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

Re: Aurora pauses adding offers

Posted by Bill Farner <wf...@apache.org>.
The next level is to determine why the storage lock is being held.  Common
causes include:

1. storage snapshot slowness, when scheduler state is very large, O(gb)
1a. long GC pauses in the scheduler, often induced by (1)
2. scheduler replicated log on slow disks
3. network issues between schedulers, schedulers to zookeeper, or between
zookeepers

As an immediate (partial) remedy, i suggest you upgrade to eliminate the
use of SQL/mybatis in the scheduler.  This helped twitter improve (1) and
(1a).

commit f2755e1
Author: Bill Farner <wf...@apache.org>
Date:   Tue Oct 24 23:34:09 2017 -0700

    Exclusively use Map-based in-memory stores for primary storage
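
On (2) above: a quick way to rule slow disks in or out is to time append+fsync
on the volume that holds the replicated log and compare against the periods
when the lock wait stats spike. A crude probe, with the path and iteration
count as placeholders:

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

// Times small append+fsync operations; pass the replicated log's directory as
// the only argument. Sustained multi-millisecond fsyncs line up with write stalls.
public class FsyncProbe {
  public static void main(String[] args) throws IOException {
    Path probe = Paths.get(args[0], "fsync-probe.tmp");
    ByteBuffer record = ByteBuffer.allocate(4096);
    try (FileChannel channel = FileChannel.open(
        probe, StandardOpenOption.CREATE, StandardOpenOption.WRITE)) {
      for (int i = 0; i < 100; i++) {
        record.clear();
        long start = System.nanoTime();
        channel.write(record);
        channel.force(true);  // fsync: the expensive part of every log append
        System.out.println("fsync " + i + ": " + (System.nanoTime() - start) / 1_000 + " us");
      }
    } finally {
      probe.toFile().delete();  // best-effort cleanup
    }
  }
}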


On Fri, Nov 10, 2017 at 10:07 PM, Mohit Jaggi <mo...@uber.com> wrote:

> and in log_storage_write_lock_wait_ns_per_event
>
> On Fri, Nov 10, 2017 at 9:57 PM, Mohit Jaggi <mo...@uber.com> wrote:
>
>> Yes, I do see spikes in log_storage_write_lock_wait_ns_total. Is that
>> cause or effect? :-)
>>
>> On Fri, Nov 10, 2017 at 9:34 PM, Mohit Jaggi <mo...@uber.com>
>> wrote:
>>
>>> Thanks Bill. Please see inline:
>>>
>>> On Fri, Nov 10, 2017 at 8:06 PM, Bill Farner <wf...@apache.org> wrote:
>>>
>>>> I suspect they are getting enqueued
>>>>
>>>>
>>>> Just to be sure - the offers do eventually get through though?
>>>>
>>>>
>>> In one instance the offers did get through but it took several minutes.
>>> In other instances we reloaded the scheduler to let another one become the
>>> leader.
>>>
>>>
>>>> The most likely culprit is contention for the storage write lock,  observable
>>>> via spikes in stat log_storage_write_lock_wait_ns_total.
>>>>
>>>
>>> Thanks. I will check that one.
>>>
>>>
>>>>
>>>> I see that a lot of getJobUpdateDetails() and getTasksWithoutConfigs()
>>>>> calls are being made at that time
>>>>
>>>>
>>>> This sounds like API activity.  This shouldn't interfere with offer
>>>> processing directly, but could potentially slow down the scheduler as a
>>>> whole.
>>>>
>>>>
>>> So these won't contend for locks with the offer processing and task
>>> assignment threads? Only 8-10 of the 24 cores were being used on the
>>> machine. I also noticed a spike in mybatis active and bad connections.
>>> I can't say whether the spike in active connections was caused by the
>>> bad connections, or vice versa, or whether a third source caused both.
>>> Are there any metrics or logs that might help here?
>>>
>>>
>>>> I also notice a lot of "Timeout reached for task..." around the same
>>>>> time. Can this happen if task is in PENDING state and does not reach
>>>>> ASSIGNED due to lack of offers?
>>>>
>>>>
>>>> This is unusual.  Pending tasks are not timed out; this applies to
>>>> tasks in states where the scheduler is waiting for something else to act
>>>> and it does not hear back (via a status update).
>>>>
>>>
>>> Perhaps they were in ASSIGNED or some other state. If updates from Mesos
>>> are being delayed or processed too slowly, would both of these effects
>>> occur?
>>>
>>>
>>>>
>>>> I suggest digging into the cause of delayed offer processing first; I
>>>> suspect it might be related to the task timeouts as well.
>>>>
>>>> version close to 0.18.0
>>>>
>>>>
>>>> Is the ambiguity due to custom patches?  Can you at least indicate
>>>> the last git SHA off aurora/master?  Digging much deeper in diagnosing this
>>>> may prove tricky without knowing what code is in play.
>>>>
>>>>
>>> c85bffd10 is the commit from which we forked.
>>>
>>> The custom patch is mainly the dynamic reservation work done by
>>> Dmitri. We also have commits for the offer/rescind race issue and the
>>> setrootfs patch (which is not upstreamed yet).
>>>
>>> I have cherrypicked the fix for Aurora-1952 as well.
>>>
>>>
>>>
>>>> On Thu, Nov 9, 2017 at 9:49 PM, Mohit Jaggi <mo...@uber.com>
>>>> wrote:
>>>>
>>>>> I also notice a lot of "Timeout reached for task..." around the same
>>>>> time. Can this happen if task is in PENDING state and does not reach
>>>>> ASSIGNED due to lack of offers?
>>>>>
>>>>> On Thu, Nov 9, 2017 at 4:33 PM, Mohit Jaggi <mo...@uber.com>
>>>>> wrote:
>>>>>
>>>>>> Folks,
>>>>>> I have noticed some weird behavior in Aurora (version close to
>>>>>> 0.18.0). Sometimes, it shows no offers in the UI offers page. But if I tail
>>>>>> the logs I can see offers are coming in. I suspect they are getting
>>>>>> enqueued for processing by "executor" but stay there for a long time and
>>>>>> are not processed either due to locking or thread starvation.
>>>>>>
>>>>>> I see that a lot of getJobUpdateDetails() and
>>>>>> getTasksWithoutConfigs() calls are being made at that time. Could these
>>>>>> calls starve the OfferManager(e.g. by contending for some lock)? What
>>>>>> should I be looking for to debug this condition further?
>>>>>>
>>>>>> Mohit.
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>

Re: Aurora pauses adding offers

Posted by Mohit Jaggi <mo...@uber.com>.
and in log_storage_write_lock_wait_ns_per_event

On Fri, Nov 10, 2017 at 9:57 PM, Mohit Jaggi <mo...@uber.com> wrote:

> Yes, I do see spikes in log_storage_write_lock_wait_ns_total. Is that
> cause or effect? :-)
>
> On Fri, Nov 10, 2017 at 9:34 PM, Mohit Jaggi <mo...@uber.com> wrote:
>
>> Thanks Bill. Please see inline:
>>
>> On Fri, Nov 10, 2017 at 8:06 PM, Bill Farner <wf...@apache.org> wrote:
>>
>>> I suspect they are getting enqueued
>>>
>>>
>>> Just to be sure - the offers do eventually get through though?
>>>
>>>
>> In one instance the offers did get through but it took several minutes.
>> In other instances we reloaded the scheduler to let another one become the
>> leader.
>>
>>
>>> The most likely culprit is contention for the storage write lock,  observable
>>> via spikes in stat log_storage_write_lock_wait_ns_total.
>>>
>>
>> Thanks. I will check that one.
>>
>>
>>>
>>> I see that a lot of getJobUpdateDetails() and getTasksWithoutConfigs()
>>>> calls are being made at that time
>>>
>>>
>>> This sounds like API activity.  This shouldn't interfere with offer
>>> processing directly, but could potentially slow down the scheduler as a
>>> whole.
>>>
>>>
>> So these won't contend for locks with the offer processing and task
>> assignment threads? Only 8-10 of the 24 cores were being used on the
>> machine. I also noticed a spike in mybatis active and bad connections.
>> I can't say whether the spike in active connections was caused by the
>> bad connections, or vice versa, or whether a third source caused both.
>> Are there any metrics or logs that might help here?
>>
>>
>>> I also notice a lot of "Timeout reached for task..." around the same
>>>> time. Can this happen if task is in PENDING state and does not reach
>>>> ASSIGNED due to lack of offers?
>>>
>>>
>>> This is unusual.  Pending tasks are not timed out; this applies to tasks
>>> in states where the scheduler is waiting for something else to act and it
>>> does not hear back (via a status update).
>>>
>>
>> Perhaps they were in ASSIGNED or some other state. If updates from Mesos
>> are being delayed or processed too slowly, would both of these effects
>> occur?
>>
>>
>>>
>>> I suggest digging into the cause of delayed offer processing first; I
>>> suspect it might be related to the task timeouts as well.
>>>
>>> version close to 0.18.0
>>>
>>>
>>> Is the ambiguity due to custom patches?  Can you at least indicate
>>> the last git SHA off aurora/master?  Digging much deeper in diagnosing this
>>> may prove tricky without knowing what code is in play.
>>>
>>>
>> c85bffd10 is the commit from which we forked.
>>
>> The custom patch is mainly the dynamic reservation work done by
>> Dmitri. We also have commits for the offer/rescind race issue and the
>> setrootfs patch (which is not upstreamed yet).
>>
>> I have cherrypicked the fix for Aurora-1952 as well.
>>
>>
>>
>>> On Thu, Nov 9, 2017 at 9:49 PM, Mohit Jaggi <mo...@uber.com>
>>> wrote:
>>>
>>>> I also notice a lot of "Timeout reached for task..." around the same
>>>> time. Can this happen if task is in PENDING state and does not reach
>>>> ASSIGNED due to lack of offers?
>>>>
>>>> On Thu, Nov 9, 2017 at 4:33 PM, Mohit Jaggi <mo...@uber.com>
>>>> wrote:
>>>>
>>>>> Folks,
>>>>> I have noticed some weird behavior in Aurora (version close to
>>>>> 0.18.0). Sometimes, it shows no offers in the UI offers page. But if I tail
>>>>> the logs I can see offers are coming in. I suspect they are getting
>>>>> enqueued for processing by "executor" but stay there for a long time and
>>>>> are not processed either due to locking or thread starvation.
>>>>>
>>>>> I see that a lot of getJobUpdateDetails() and getTasksWithoutConfigs()
>>>>> calls are being made at that time. Could these calls starve the
>>>>> OfferManager(e.g. by contending for some lock)? What should I be looking
>>>>> for to debug this condition further?
>>>>>
>>>>> Mohit.
>>>>>
>>>>
>>>>
>>>
>>
>

Re: Aurora pauses adding offers

Posted by Mohit Jaggi <mo...@uber.com>.
Yes, I do see spikes in log_storage_write_lock_wait_ns_total. Is that cause
or effect? :-)

On Fri, Nov 10, 2017 at 9:34 PM, Mohit Jaggi <mo...@uber.com> wrote:

> Thanks Bill. Please see inline:
>
> On Fri, Nov 10, 2017 at 8:06 PM, Bill Farner <wf...@apache.org> wrote:
>
>> I suspect they are getting enqueued
>>
>>
>> Just to be sure - the offers do eventually get through though?
>>
>>
> In one instance the offers did get through but it took several minutes. In
> other instances we reloaded the scheduler to let another one become the
> leader.
>
>
>> The most likely culprit is contention for the storage write lock,  observable
>> via spikes in stat log_storage_write_lock_wait_ns_total.
>>
>
> Thanks. I will check that one.
>
>
>>
>> I see that a lot of getJobUpdateDetails() and getTasksWithoutConfigs()
>>> calls are being made at that time
>>
>>
>> This sounds like API activity.  This shouldn't interfere with offer
>> processing directly, but could potentially slow down the scheduler as a
>> whole.
>>
>>
> So these won't contend for locks with the offer processing and task
> assignment threads? Only 8-10 of the 24 cores were being used on the
> machine. I also noticed a spike in mybatis active and bad connections.
> I can't say whether the spike in active connections was caused by the
> bad connections, or vice versa, or whether a third source caused both.
> Are there any metrics or logs that might help here?
>
>
>> I also notice a lot of "Timeout reached for task..." around the same
>>> time. Can this happen if task is in PENDING state and does not reach
>>> ASSIGNED due to lack of offers?
>>
>>
>> This is unusual.  Pending tasks are not timed out; this applies to tasks
>> in states where the scheduler is waiting for something else to act and it
>> does not hear back (via a status update).
>>
>
> Perhaps they were in ASSIGNED or some other state. If updates from Mesos
> are being delayed or processed too slowly, would both of these effects
> occur?
>
>
>>
>> I suggest digging into the cause of delayed offer processing first; I
>> suspect it might be related to the task timeouts as well.
>>
>> version close to 0.18.0
>>
>>
>> Is the ambiguity due to custom patches?  Can you at least indicate the
>> last git SHA off aurora/master?  Digging much deeper in diagnosing this may
>> prove tricky without knowing what code is in play.
>>
>>
> c85bffd10 is the commit from which we forked.
>
> The custom patch is mainly the dynamic reservation work done by
> Dmitri. We also have commits for the offer/rescind race issue and the
> setrootfs patch (which is not upstreamed yet).
>
> I have cherrypicked the fix for Aurora-1952 as well.
>
>
>
>> On Thu, Nov 9, 2017 at 9:49 PM, Mohit Jaggi <mo...@uber.com> wrote:
>>
>>> I also notice a lot of "Timeout reached for task..." around the same
>>> time. Can this happen if task is in PENDING state and does not reach
>>> ASSIGNED due to lack of offers?
>>>
>>> On Thu, Nov 9, 2017 at 4:33 PM, Mohit Jaggi <mo...@uber.com>
>>> wrote:
>>>
>>>> Folks,
>>>> I have noticed some weird behavior in Aurora (version close to 0.18.0).
>>>> Sometimes, it shows no offers in the UI offers page. But if I tail the logs
>>>> I can see offers are coming in. I suspect they are getting enqueued for
>>>> processing by "executor" but stay there for a long time and are not
>>>> processed either due to locking or thread starvation.
>>>>
>>>> I see that a lot of getJobUpdateDetails() and getTasksWithoutConfigs()
>>>> calls are being made at that time. Could these calls starve the
>>>> OfferManager(e.g. by contending for some lock)? What should I be looking
>>>> for to debug this condition further?
>>>>
>>>> Mohit.
>>>>
>>>
>>>
>>
>

Re: Aurora pauses adding offers

Posted by Mohit Jaggi <mo...@uber.com>.
Thanks Bill. Please see inline:

On Fri, Nov 10, 2017 at 8:06 PM, Bill Farner <wf...@apache.org> wrote:

> I suspect they are getting enqueued
>
>
> Just to be sure - the offers do eventually get through though?
>
>
In one instance the offers did get through but it took several minutes. In
other instances we reloaded the scheduler to let another one become the
leader.


> The most likely culprit is contention for the storage write lock,  observable
> via spikes in stat log_storage_write_lock_wait_ns_total.
>

Thanks. I will check that one.


>
> I see that a lot of getJobUpdateDetails() and getTasksWithoutConfigs()
>> calls are being made at that time
>
>
> This sounds like API activity.  This shouldn't interfere with offer
> processing directly, but could potentially slow down the scheduler as a
> whole.
>
>
So these won't contend for locks with the offer processing and task
assignment threads? Only 8-10 of the 24 cores were being used on the
machine. I also noticed a spike in mybatis active and bad connections.
I can't say whether the spike in active connections was caused by the
bad connections, or vice versa, or whether a third source caused both.
Are there any metrics or logs that might help here?


> I also notice a lot of "Timeout reached for task..." around the same time.
>> Can this happen if task is in PENDING state and does not reach ASSIGNED due
>> to lack of offers?
>
>
> This is unusual.  Pending tasks are not timed out; this applies to tasks
> in states where the scheduler is waiting for something else to act and it
> does not hear back (via a status update).
>

Perhaps they were in ASSIGNED or some other state. If updates from Mesos
are being delayed or processed too slowly, would both of these effects
occur?


>
> I suggest digging into the cause of delayed offer processing first; I
> suspect it might be related to the task timeouts as well.
>
> version close to 0.18.0
>
>
> Is the ambiguity due to custom patches?  Can you at least indicate the
> last git SHA off aurora/master?  Digging much deeper in diagnosing this may
> prove tricky without knowing what code is in play.
>
>
c85bffd10 is the commit from which we forked.

The custom patch is mainly the dynamic reservation work done by
Dmitri. We also have commits for the offer/rescind race issue and the
setrootfs patch (which is not upstreamed yet).

I have cherrypicked the fix for Aurora-1952 as well.



> On Thu, Nov 9, 2017 at 9:49 PM, Mohit Jaggi <mo...@uber.com> wrote:
>
>> I also notice a lot of "Timeout reached for task..." around the same
>> time. Can this happen if task is in PENDING state and does not reach
>> ASSIGNED due to lack of offers?
>>
>> On Thu, Nov 9, 2017 at 4:33 PM, Mohit Jaggi <mo...@uber.com> wrote:
>>
>>> Folks,
>>> I have noticed some weird behavior in Aurora (version close to 0.18.0).
>>> Sometimes, it shows no offers in the UI offers page. But if I tail the logs
>>> I can see offers are coming in. I suspect they are getting enqueued for
>>> processing by "executor" but stay there for a long time and are not
>>> processed either due to locking or thread starvation.
>>>
>>> I see that a lot of getJobUpdateDetails() and getTasksWithoutConfigs()
>>> calls are being made at that time. Could these calls starve the
>>> OfferManager(e.g. by contending for some lock)? What should I be looking
>>> for to debug this condition further?
>>>
>>> Mohit.
>>>
>>
>>
>

Re: Aurora pauses adding offers

Posted by Bill Farner <wf...@apache.org>.
>
> I suspect they are getting enqueued


Just to be sure - the offers do eventually get through though?

The most likely culprit is contention for the storage write lock,  observable
via spikes in stat log_storage_write_lock_wait_ns_total.

I see that a lot of getJobUpdateDetails() and getTasksWithoutConfigs()
> calls are being made at that time


This sounds like API activity.  This shouldn't interfere with offer
processing directly, but could potentially slow down the scheduler as a
whole.

I also notice a lot of "Timeout reached for task..." around the same time.
> Can this happen if task is in PENDING state and does not reach ASSIGNED due
> to lack of offers?


This is unusual.  Pending tasks are not timed out; this applies to tasks in
states where the scheduler is waiting for something else to act and it does
not hear back (via a status update).
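
For illustration only (this is not Aurora's actual timeout code), the pattern
looks roughly like this: a task entering a transient state such as ASSIGNED
gets a deadline, a status update clears it, and a periodic sweep flags anything
that outlived its deadline. PENDING tasks are never registered, which is why
they never time out. The class name, method names, and five-minute timeout
below are made up for the sketch.

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;

    // Illustrative sketch only -- not the scheduler's implementation.
    public class TransientTaskTimeouts {
      private static final long TIMEOUT_MS = TimeUnit.MINUTES.toMillis(5); // assumed value

      private final Map<String, Long> deadlines = new ConcurrentHashMap<>();
      private final ScheduledExecutorService executor =
          Executors.newSingleThreadScheduledExecutor();

      public TransientTaskTimeouts() {
        // Periodically sweep for tasks whose transient state outlived its deadline.
        executor.scheduleAtFixedRate(this::sweep, 1, 1, TimeUnit.MINUTES);
      }

      // Called when a task enters a transient state such as ASSIGNED.
      public void recordTransition(String taskId) {
        deadlines.put(taskId, System.currentTimeMillis() + TIMEOUT_MS);
      }

      // Called when a status update arrives and the task leaves the transient state.
      public void recordStatusUpdate(String taskId) {
        deadlines.remove(taskId);
      }

      private void sweep() {
        long now = System.currentTimeMillis();
        deadlines.forEach((taskId, deadline) -> {
          if (now > deadline) {
            // The real scheduler would transition the task; here we just log.
            System.out.println("Timeout reached for task " + taskId);
            deadlines.remove(taskId);
          }
        });
      }
    }

So if status updates from Mesos stall, tasks stuck in ASSIGNED would hit this
kind of deadline, which matches the log lines you are seeing.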

I suggest digging into the cause of delayed offer processing first; I
suspect it might be related to the task timeouts as well.

version close to 0.18.0


Is the ambiguity due to custom patches?  Can you at least indicate the
last git SHA off aurora/master?  Digging much deeper in diagnosing this may
prove tricky without knowing what code is in play.


On Thu, Nov 9, 2017 at 9:49 PM, Mohit Jaggi <mo...@uber.com> wrote:

> I also notice a lot of "Timeout reached for task..." around the same time.
> Can this happen if task is in PENDING state and does not reach ASSIGNED due
> to lack of offers?
>
> On Thu, Nov 9, 2017 at 4:33 PM, Mohit Jaggi <mo...@uber.com> wrote:
>
>> Folks,
>> I have noticed some weird behavior in Aurora (version close to 0.18.0).
>> Sometimes, it shows no offers in the UI offers page. But if I tail the logs
>> I can see offers are coming in. I suspect they are getting enqueued for
>> processing by "executor" but stay there for a long time and are not
>> processed either due to locking or thread starvation.
>>
>> I see that a lot of getJobUpdateDetails() and getTasksWithoutConfigs()
>> calls are being made at that time. Could these calls starve the
>> OfferManager(e.g. by contending for some lock)? What should I be looking
>> for to debug this condition further?
>>
>> Mohit.
>>
>
>

Re: Aurora pauses adding offers

Posted by Mohit Jaggi <mo...@uber.com>.
I also notice a lot of "Timeout reached for task..." around the same time.
Can this happen if task is in PENDING state and does not reach ASSIGNED due
to lack of offers?

On Thu, Nov 9, 2017 at 4:33 PM, Mohit Jaggi <mo...@uber.com> wrote:

> Folks,
> I have noticed some weird behavior in Aurora (version close to 0.18.0).
> Sometimes, it shows no offers in the UI offers page. But if I tail the logs
> I can see offers are coming in. I suspect they are getting enqueued for
> processing by "executor" but stay there for a long time and are not
> processed either due to locking or thread starvation.
>
> I see that a lot of getJobUpdateDetails() and getTasksWithoutConfigs()
> calls are being made at that time. Could these calls starve the
> OfferManager(e.g. by contending for some lock)? What should I be looking
> for to debug this condition further?
>
> Mohit.
>