Posted to dev@mesos.apache.org by Alex Rukletsov <al...@mesosphere.com> on 2017/11/03 16:44:01 UTC

Re: mesos health checks

+ dev list for visibility and history.

Okay, let's dig into this a little bit : ).

First, it is true that Marathon and Mesos HTTP health checks are not
equivalent. It's not just 1xx status codes; for example, you can't have
multiple Mesos health checks per task. I don't understand why you say that
the operator should know that "failed" is an expected response. It is not!
Health checks do not have a concept of "not ready yet"; the grace period
serves this purpose. The health check failed because the contract was
violated: 111 is considered a failure. If you think that 1xx codes should be
treated as success, let's have this discussion separately, probably on the
dev list (btw, k8s uses the same rule: only codes from 200 up to, but not
including, 400 count as success
<https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-probes/#define-a-liveness-http-request>
).
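
To make the contract concrete, here is a minimal Python sketch of the rule
described above. It is an illustration only, not the Mesos implementation:
the function names are hypothetical, and the grace period is simplified to
"ignore failures until a fixed number of seconds has elapsed".

```python
def is_success(status_code: int) -> bool:
    """Contract from the thread: codes between 200 and 399 are success."""
    return 200 <= status_code <= 399


def evaluate(status_code: int, elapsed_seconds: float,
             grace_period_seconds: float) -> str:
    """Classify one health check result (hypothetical helper).

    Assumption: a failure is ignored while the task is still inside its
    grace period, and counted as a failure afterwards.
    """
    if is_success(status_code):
        return "healthy"
    if elapsed_seconds < grace_period_seconds:
        return "ignored"
    return "failed"


# A 1xx code such as 111 never counts as success; during the grace period
# the failure is merely ignored.
print(evaluate(111, elapsed_seconds=5, grace_period_seconds=10))
print(evaluate(111, elapsed_seconds=15, grace_period_seconds=10))
print(evaluate(200, elapsed_seconds=15, grace_period_seconds=10))
```

Under these assumptions a "not ready yet" signal has to be expressed through
the grace period rather than through a special status code.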

Second, are you sure about the status code in the second case? The error
does not say anything about an empty body; it says "empty reply". From what
I can see
<https://stackoverflow.com/questions/41290792/my-curl-post-gets-empty-reply-from-server>,
(52) means a misbehaving server. If you're convinced that your server
returned a proper HTTP response with some status code but with an empty
body, please file a bug report in the Mesos JIRA.
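
The difference between an "empty reply" and a proper response with an empty
body can be reproduced locally. The sketch below is an assumption-laden
illustration: it uses Python's http.client as a stand-in for curl, and the
helper names (serve_once, probe) are made up for the example. A server that
closes the connection without sending anything is roughly the situation that
curl reports as exit code 52, while a response with a status line and a
zero-length body is a valid reply.

```python
import http.client
import socket
import threading


def serve_once(raw_reply: bytes) -> int:
    """Start a one-shot TCP server on localhost and return its port.

    The server reads the request, writes raw_reply verbatim (nothing at
    all if raw_reply is empty), and closes the connection.
    """
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))
    listener.listen(1)
    port = listener.getsockname()[1]

    def handle() -> None:
        conn, _ = listener.accept()
        with conn:
            conn.recv(65536)  # consume the request before replying
            if raw_reply:
                conn.sendall(raw_reply)
        listener.close()

    threading.Thread(target=handle, daemon=True).start()
    return port


def probe(port: int) -> str:
    """Issue GET /health and describe what came back."""
    client = http.client.HTTPConnection("127.0.0.1", port, timeout=5)
    try:
        client.request("GET", "/health")
        response = client.getresponse()
        body = response.read()
        return f"status={response.status} body_bytes={len(body)}"
    except (http.client.RemoteDisconnected, ConnectionResetError):
        # The peer closed the connection without sending a status line.
        return "empty reply"
    finally:
        client.close()


# Case 1: the server sends nothing at all.
print(probe(serve_once(b"")))

# Case 2: a well-formed response whose body happens to be empty.
valid = b"HTTP/1.1 200 OK\r\nContent-Length: 0\r\nConnection: close\r\n\r\n"
print(probe(serve_once(valid)))
```

Only the first case corresponds to the error in the log above; the second
one carries a status code that a health checker can evaluate.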

On Fri, Nov 3, 2017 at 2:20 PM, Alex Rukletsov <al...@mesosphere.com> wrote:

> Tomas, can I reply to you and cc devlist to have our discussion logged
> publicly?
>
>
> On Fri, Nov 3, 2017 at 10:43 AM, Tomas Barton <ba...@gmail.com>
> wrote:
>
>> Hi Alex,
>>
>> I'm quite ok with the current contract; treating "codes between 200 and
>> 399 as success" seems reasonable to me. We're using codes < 200 for "not
>> ready yet" and >= 500 for error states.
>>
>> But that's not really the problem. While Marathon's implementation only
>> checked the HTTP code, curl tends to be too smart. This means that moving
>> from Marathon health checks to Mesos-based ones might introduce some
>> incompatibility.
>>
>> For example:
>>
>> (2017-11-02 19:31:25) [INFO    ] Request: 127.0.0.1:44172 0x1fcc44f0
>> HTTP/1.1 GET /health
>> (2017-11-02 19:31:25) [INFO    ] Response: 0x1fcc44f0 /health 111 0
>> I1102 19:31:25.548070 23822 checker_process.cpp:959] HTTP health check
>> for task 'reql-dev.3c83761f-c004-11e7-acb9-be622fe0971d' returned: 111
>> W1102 19:31:25.548195 23822 health_checker.cpp:317] HTTP health check for
>> task 'reql-dev.3c83761f-c004-11e7-acb9-be622fe0971d' failed: Unexpected
>> HTTP response code: 111
>>
>> This is sort of ok, the operator should know that "failed: Unexpected
>> HTTP response code: 111" isn't really a failure but an expected response.
>>
>> But in order to get this we had to hack the HTTP server and introduce
>> some "special" HTTP codes.
>>
>> Another component, whose health checks were responding as expected on
>> Marathon, behaves funny with MESOS_HTTP:
>>
>> W1102 10:50:38.637907     6 health_checker.cpp:307] HTTP health check for
>> task 'xxx' failed: curl exited with status 52: curl: (52) Empty reply from
>> server
>> I1102 10:50:38.637949     6 health_checker.cpp:333] Ignoring failure of
>> HTTP health check for task 'xxx': still in grace period
>>
>> In this case the response code was either 100 or 111. It is hard to tell
>> from the logs, as the return code is not logged. The problem is that the
>> component is written in Java, and the library it uses to create a simple
>> web server for the /health endpoint runs a pretty standard Jetty server
>> underneath. Jetty decided that responses with a 1xx code don't have to
>> include a body. curl, on the other hand, thinks that an HTTP response
>> with a 1xx code should have a body, hence the error (52) Empty reply
>> from server. Maybe we should simply respond with HTTP 418 I'm a teapot,
>> meaning that the tea is not ready yet :)
>>
>> So the question is: could curl be configured so that it doesn't check
>> for body content? And if a body is present, could it be included in the
>> logs?
>>
>> Or should I file bug reports against all web servers asking them to send
>> Mesos-compatible HTTP responses? :)
>>
>> Thanks!
>> Tomas
>>
>>
>> On 2 November 2017 at 19:58, Alex Rukletsov <al...@mesosphere.com> wrote:
>>
>>> Hi Tomas!
>>>
>>> I wanted to make health checks as simple as possible. I had looked at
>>> what aws, k8s, and nomad do and decided that I will not support
>>> customization for return codes unless someone shows me a very good reason
>>> to do so. Such customization is not easy: once you start, people will
>>> want more and more. Think about the API: enumerate "good" codes,
>>> enumerate "bad" codes, specify a "good" range, specify a "bad" range,
>>> specify a set of "good" ranges, and so on.
>>>
>>> Regarding the empty reply, why should an empty reply be considered ok?
>>> The contract is very explicit: "Default executors treat return codes
>>> between 200 and 399 as success; custom executors may employ a different
>>> strategy, e.g. leveraging the `statuses` field."
>>>
>>> And it actually should affect app scaling, as the task should be
>>> considered unhealthy.
>>>
>>> So—give me a good reason to change my mind ; )
>>>
>>> —Alex
>>>
>>> On Thu, Nov 2, 2017 at 4:43 PM, Tomas Barton <ba...@gmail.com>
>>> wrote:
>>>
>>>> Hi Alex,
>>>>
>>>> one more question regarding health checks. Marathon health checks have
>>>> an option to ignore 1xx status codes: ignoreHttp1xx.
>>>>
>>>> If I understand correctly, MESOS_HTTP checks have no option for
>>>> similar behaviour. What's the motivation?
>>>>
>>>> When using Mesos health checks I see the following in the logs:
>>>>
>>>> I1102 10:50:49.891046    12 health_checker.cpp:333] Ignoring failure of HTTP health check for task 'rcm_worker.5af95051-bfba-11e7-81db-024220a10091': still in grace period
>>>> W1102 10:51:20.690042    10 health_checker.cpp:307] HTTP health check for task 'rcm_worker.5af95051-bfba-11e7-81db-024220a10091' failed: curl exited with status 52: curl: (52) Empty reply from server
>>>> I1102 10:51:20.690389    10 health_checker.cpp:333] Ignoring failure of HTTP health check for task 'rcm_worker.5af95051-bfba-11e7-81db-024220a10091': still in grace period
>>>> W1102 10:51:51.391033    12 health_checker.cpp:307] HTTP health check for task 'rcm_worker.5af95051-bfba-11e7-81db-024220a10091' failed: curl exited with status 52: curl: (52) Empty reply from server
>>>> W1102 10:51:51.391294    12 health_checker.cpp:339] HTTP health check for task 'rcm_worker.5af95051-bfba-11e7-81db-024220a10091' failed 1 times consecutively
>>>>
>>>> It would be much more useful if the Mesos health check returned the
>>>> corresponding code instead of `curl: (52) Empty reply from server`.
>>>>
>>>> It doesn't affect the app scaling, but it's quite strange to see
>>>> failures that should be tolerated.
>>>>
>>>> Am I missing something?
>>>>
>>>> Regards,
>>>> Tomas
>>>>
>>>
>>>
>>
>