Posted to user@aurora.apache.org by "Erb, Stephan" <St...@blue-yonder.com> on 2017/04/18 16:59:21 UTC

Re: Why doesn't announcer delay until task indicates it's ready?

We have recently received an RB that aims to use `.healthchecksnooze` for the burn-in phase, guarding the state transition to RUNNING.

I am not sure it is a good idea (e.g., a task can remain stuck in STARTING). In any case, it is worth a cross-reference: https://reviews.apache.org/r/58462/

From: David McLaughlin <dm...@apache.org>
Reply-To: "user@aurora.apache.org" <us...@aurora.apache.org>
Date: Tuesday, 21. March 2017 at 17:38
To: "user@aurora.apache.org" <us...@aurora.apache.org>
Subject: Re: Why doesn't announcer delay until task indicates it's ready?

"I'm curious why not? This seems like a fundamental requirement."

This was pretty controversial inside Twitter too. The idea is that the presence of any node in a serverset does not mean it's healthy, which is especially true long after Aurora has finished scheduling the task - so your RPC or routing layer should be able to detect and avoid the node until it recovers. Finagle solves a lot of this for Twitter, and tools like linkerd (which came out of some members of the Twitter traffic team) aim to solve it in a more generic way in OSS - https://linkerd.io/.

There are of course services for which the initial batch of failures needed to let the proxy know it's a bad node is unacceptable, or which don't have the necessary load balancing intelligence in place. So those services tend to manually register to serversets (and avoid AURORA-321) as you've resorted to.

On Tue, Mar 21, 2017 at 8:53 AM, Bill Farner <wf...@apache.org> wrote:
Announcement is done immediately to announce presence of an instance for other services to determine what to do from there. A use case we considered was allowing monitoring of a service via HTTP before the service is ready for traffic. This is useful, for example, if the application has a long burn-in setup phase.

In your case, the expectation is that the load balancer (or other upstream service) handles and routes away from unavailable backends; whether it's because they are not yet ready or otherwise. This could be using independent health checks or retries, depending on what is available.
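A minimal sketch of what "routes away from unavailable backends" could look like on the client side, assuming the serverset has already been resolved to a list of `host:port` strings. The names `http_probe` and `pick_backend`, the `/health` path, and the retry count are illustrative assumptions, not part of Aurora or any particular load balancer.

```python
import http.client
import random
import urllib.request
from typing import Callable, Iterable, Optional


def http_probe(member: str, path: str = "/health", timeout: float = 1.0) -> bool:
    """Return True if the member answers the (assumed) health path with HTTP 200."""
    try:
        with urllib.request.urlopen(f"http://{member}{path}", timeout=timeout) as resp:
            return resp.status == 200
    except (OSError, http.client.HTTPException):
        # Connection refused, timeout, non-2xx response, or a malformed reply:
        # treat the member as unavailable.
        return False


def pick_backend(
    members: Iterable[str],
    is_healthy: Callable[[str], bool] = http_probe,
    attempts: int = 3,
    rng: Optional[random.Random] = None,
) -> Optional[str]:
    """Try up to `attempts` members in random order; return the first healthy one."""
    rng = rng or random.Random()
    candidates = list(members)
    rng.shuffle(candidates)
    for member in candidates[:attempts]:
        if is_healthy(member):
            return member
    # Every member that was tried failed its probe.
    return None
```

Presence in the serverset is treated only as a hint here; the probe decides whether a member actually receives traffic.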


On Mar 21, 2017, 8:28 AM -0700, Richard Klancer <rp...@pobox.com>, wrote:

Hi all,

I'm preparing to launch a public-facing Aurora based HTTP service. As
part of this exercise my team recently attempted to `aurora update`
the service while it was serving high request volume from an external
load generator.

We were surprised to find that our ops team was paged due to bursts of
502's from our frontend server, which routes external traffic to our
service using the serverset published by the Aurora announcer. Upon
investigation, we discovered that the serverset is announced as soon
as the thermos executor runs, even though the app is not ready to
serve requests right away. The 502s, of course, were due to the chosen
server not yet being able to respond to a connection request.

Last night I searched JIRA, the user and dev mailing lists, and the
thermos code, and I didn't see any conversations about delaying
announcement until the configured health check passes (thus indicating
that the server is ready to accept connections).

I'm curious why not? This seems like a fundamental requirement.

A couple notes. First, our frontend server doesn't support explicit
health checking, yet, though this will be implemented soon. Perhaps it
is considered the proper task of load balancers and frontend servers
to validate the health of servers in the serverset before routing
traffic to them?

Also, to work around this problem, we announced the serverset from the
app itself. This means we no longer have an 'announce' section in our
config, and thus no portmap. But http health checking is silently (in
0.12, though not 0.17) disabled if there is no thermos port named
'health'. We had our "admin" and "health" ports aliased, but with no
portmap I had to just rename "admin" to "health" everywhere in our job
definition. It works but it's a little silly. This was previously
noted in https://issues.apache.org/jira/browse/AURORA-321
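The app-side workaround described above amounts to "poll readiness, then announce". A minimal sketch of that ordering, assuming the application supplies its own `is_ready` check and an `announce` callback that performs the actual serverset registration; both names, and the timeout defaults, are hypothetical and not taken from Aurora or thermos.

```python
import time
from typing import Callable


def announce_when_ready(
    is_ready: Callable[[], bool],
    announce: Callable[[], None],
    timeout_secs: float = 60.0,
    interval_secs: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Call `announce` once `is_ready` returns True.

    Returns True if the announcement was made, False if the timeout
    elapsed first (in which case nothing is announced).
    """
    deadline = clock() + timeout_secs
    while True:
        if is_ready():
            # Register only after the app reports it can serve requests.
            announce()
            return True
        if clock() >= deadline:
            return False
        sleep(interval_secs)
```

Here `is_ready` could be a local request against the app's own health endpoint, and `announce` whatever code the app uses to register itself in the serverset.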

Thanks in advance for any comments,

--Richard