You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@mesos.apache.org by Greg Mann <gr...@mesosphere.io> on 2018/10/17 18:25:45 UTC

Request for Comments - Health Check API Proposal

Hi all,
Some users have recently reported issues with our current implementation of
health checks. See this ticket
<https://issues.apache.org/jira/browse/MESOS-6417> for an introduction to
the issue.

To summarize: we currently use a single 'optional bool healthy' field
within the 'TaskStatus' message to indicate the result of a health check.
This allows us to expose 3 health states to users:
1) 'healthy' field is unset = no health check specified, or health check
failed but grace period has not yet elapsed, or health check has not yet
been attempted
2) 'healthy' field is set to 'false' = a health check is specified and it
returned 'false'
3) 'healthy' field is set to 'true' = a health check is specified and it
returned 'true'

The issue is that some users need to distinguish between the three
scenarios in #1: no health check is specified, OR the task is not yet
healthy but we are in the grace period. An example use case would be a load
balancer which needs to wait for a healthy status to route traffic, but
which immediately routes traffic to tasks which have no health check
defined.

This issue was recognized during the design of Mesos generalized checks;
for those checks, we use the presence of the 'check_status' field to
indicate whether or not a check is defined for the task. While consumers
could make use of generalized checks as a workaround, this does not allow
them to both detect the presence of a check AND achieve the task-killing
behavior that health checks provide.

In order to address this, I would like to propose the following new
message, and an addition to the 'TaskStatus' message:

message HealthCheckStatusInfo {
  enum Status {
    UNKNOWN = 0;
    HEALTHY = 1;
    UNHEALTHY = 2;
  }

  required Status status = 0;
}

message TaskStatus {
  . . .

  optional HealthCheckStatusInfo health_check_status = 17;

  . . .
}

The semantics of these fields would be as follows:

'health_status' field:
- If set, a health check has been set
- If unset, a health check has not been set

'health_status.status' field:
- UNKNOWN: The task has not become healthy but is still within its grace
period (this state is also used if an internal error prevents us from
running the health check successfully)
- HEALTHY: The health check indicates the task is healthy
- UNHEALTHY: The health check indicates the task is not healthy

This change would also involve deprecating the existing 'healthy' field. In
accordance with our deprecation policy, I believe we could not remove the
deprecated field until we have a new major release (2.x).

I'd love to hear feedback on this proposal, thanks in advance! I'll also
add this as an agenda item to our upcoming API working group meeting on
Tuesday, Oct. 16 at 11am PST.

Cheers,
Greg

Re: Request for Comments - Health Check API Proposal

Posted by Greg Mann <gr...@mesosphere.io>.
Discussed with AlexR, and we're going to do a bit of due diligence to
ensure that the changes we make now will fit with the future we envision
for health checks. I'll publish a short document on this thread once we've
done so.

Thanks!
Greg

On Thu, Oct 18, 2018 at 1:40 PM Alex Rukletsov <al...@mesosphere.com> wrote:

> Why do we need to resolve this note now? It is obvious that health
> interpretation must be part of the API. I'm not sure I understand what
> concerns you have, Vinod.
>
> On Thu, Oct 18, 2018 at 8:20 PM Vinod Kone <vi...@apache.org> wrote:
>
> > I understand and am in agreement that `HealthCheckStatusInfo` will have
> > more information than `CheckStatusInfo`.
> >
> > I would like us to put a little more thought into how that would look
> like
> > to be doubly sure that what we are introducing today will be evolvable
> into
> > that envisioned future. We have to live with API changes for a long time,
> > so I would like to see more rigor here (e.g., has the note on top of the
> > `HealthCheckStatusInfo` in the doc
> > <
> >
> https://docs.google.com/document/d/1VLdaH7i7UDT3_38aOlzTOtH7lwH-laB8dCwNzte0DkU/edit#heading=h.lessdcojxc5v
> > >
> > has
> > been discussed/resolved?) to avoid costly changes/deprecations.
> >
> > On Thu, Oct 18, 2018 at 4:04 AM Alex Rukletsov <al...@mesosphere.com>
> > wrote:
> >
> > > Thanks for the thoughts, Vinod! Answers inlined.
> > >
> > > On Wed, Oct 17, 2018 at 8:55 PM Vinod Kone <vi...@apache.org>
> wrote:
> > >
> > > > One of the things we discussed when we added `CheckInfo` and
> > > > `CheckStatusInfo` was to make the older `HealthCheck` and `bool
> > healthy`
> > > > field (inside `TaskStatus`) consistent with the new `Check` format.
> > > >
> > > Correct.
> > >
> > > >
> > > > IIRC, some of the changes we wanted to do were
> > > >
> > > >    - Deprecate `HealthCheck` and introduce a new `HealthCheckInfo`
> > proto
> > > >
> > > Correct.
> > >
> > > >    - The nested messages inside `HealthCheck` (e.g., `HTTPCheckInfo`)
> > >
> > >    should be named differently in `HealthCheckInfo` (e.g., `Http`)
> > > >
> > > Likely, yes.
> > >
> > > >    - Deprecate `bool healthy` in TaskStatusInfo and introduce a new
> > > >    `HealthCheckStatusInfo` which looks similar to `CheckStatusInfo`
> > > >
> > > Correct.
> > >
> > > >
> > > > Right now, the proposal seems to only address the last point without
> > > > addressing the first two, which feels weird to me. I would prefer to
> > see
> > > > them addressed in one shot.
> > > >
> > > Can you please explain why? Is there any problem you foresee if we do
> it
> > > step by step? Introducing `HealthCheckStatusInfo` now solves an
> important
> > > problem and does not seem to introduce new issues.
> > >
> > > >
> > > > Additionally, the proposed `HealthCheckStatusInfo` proto looks
> > completely
> > > > different from `CheckStatusInfo`. Is that intentional? I hope we are
> > not
> > > > thinking of deprecating it again when we come around to fix
> > `HealthCheck`
> > > > proto to be consistent with `CheckInfo` ?
> > > >
> > > How do you think it should look like? Why will we deprecate it?
> > >
> > > Health checks are different from checks in the way the result of a
> check
> > is
> > > interpreted on the agent. In other words health check is an extra step
> on
> > > top of a check. We might include `CheckStatusInfo` or its contents into
> > > `HealthCheckStatusInfo`, but... should we think about this now? It is
> > nice
> > > to have lower level info from the check in the heath status update, but
> > it
> > > also means more data to transfer. But interpretation—health—we
> definitely
> > > need.
> > >
> > > Greg, I'm +1 on your proposal.
> > >
> > > >
> > > > Thanks,
> > > >
> > > > On Wed, Oct 17, 2018 at 1:26 PM Greg Mann <gr...@mesosphere.io>
> wrote:
> > > >
> > > > > Hi all,
> > > > > Some users have recently reported issues with our current
> > > implementation
> > > > > of health checks. See this ticket
> > > > > <https://issues.apache.org/jira/browse/MESOS-6417> for an
> > introduction
> > > > to
> > > > > the issue.
> > > > >
> > > > > To summarize: we currently use a single 'optional bool healthy'
> field
> > > > > within the 'TaskStatus' message to indicate the result of a health
> > > check.
> > > > > This allows us to expose 3 health states to users:
> > > > > 1) 'healthy' field is unset = no health check specified, or health
> > > check
> > > > > failed but grace period has not yet elapsed, or health check has
> not
> > > yet
> > > > > been attempted
> > > > > 2) 'healthy' field is set to 'false' = a health check is specified
> > and
> > > it
> > > > > returned 'false'
> > > > > 3) 'healthy' field is set to 'true' = a health check is specified
> and
> > > it
> > > > > returned 'true'
> > > > >
> > > > > The issue is that some users need to distinguish between the three
> > > > > scenarios in #1: no health check is specified, OR the task is not
> yet
> > > > > healthy but we are in the grace period. An example use case would
> be
> > a
> > > > load
> > > > > balancer which needs to wait for a healthy status to route traffic,
> > but
> > > > > which immediately routes traffic to tasks which have no health
> check
> > > > > defined.
> > > > >
> > > > > This issue was recognized during the design of Mesos generalized
> > > checks;
> > > > > for those checks, we use the presence of the 'check_status' field
> to
> > > > > indicate whether or not a check is defined for the task. While
> > > consumers
> > > > > could make use of generalized checks as a workaround, this does not
> > > allow
> > > > > them to both detect the presence of a check AND achieve the
> > > task-killing
> > > > > behavior that health checks provide.
> > > > >
> > > > > In order to address this, I would like to propose the following new
> > > > > message, and an addition to the 'TaskStatus' message:
> > > > >
> > > > > message HealthCheckStatusInfo {
> > > > >   enum Status {
> > > > >     UNKNOWN = 0;
> > > > >     HEALTHY = 1;
> > > > >     UNHEALTHY = 2;
> > > > >   }
> > > > >
> > > > >   required Status status = 0;
> > > > > }
> > > > >
> > > > > message TaskStatus {
> > > > >   . . .
> > > > >
> > > > >   optional HealthCheckStatusInfo health_check_status = 17;
> > > > >
> > > > >   . . .
> > > > > }
> > > > >
> > > > > The semantics of these fields would be as follows:
> > > > >
> > > > > 'health_status' field:
> > > > > - If set, a health check has been set
> > > > > - If unset, a health check has not been set
> > > > >
> > > > > 'health_status.status' field:
> > > > > - UNKNOWN: The task has not become healthy but is still within its
> > > grace
> > > > > period (this state is also used if an internal error prevents us
> from
> > > > > running the health check successfully)
> > > > > - HEALTHY: The health check indicates the task is healthy
> > > > > - UNHEALTHY: The health check indicates the task is not healthy
> > > > >
> > > > > This change would also involve deprecating the existing 'healthy'
> > > field.
> > > > > In accordance with our deprecation policy, I believe we could not
> > > remove
> > > > > the deprecated field until we have a new major release (2.x).
> > > > >
> > > > > I'd love to hear feedback on this proposal, thanks in advance! I'll
> > > also
> > > > > add this as an agenda item to our upcoming API working group
> meeting
> > on
> > > > > Tuesday, Oct. 16 at 11am PST.
> > > > >
> > > > > Cheers,
> > > > > Greg
> > > > >
> > > >
> > >
> >
>

Re: Request for Comments - Health Check API Proposal

Posted by Alex Rukletsov <al...@mesosphere.com>.
Why do we need to resolve this note now? It is obvious that health
interpretation must be part of the API. I'm not sure I understand what
concerns you have, Vinod.

On Thu, Oct 18, 2018 at 8:20 PM Vinod Kone <vi...@apache.org> wrote:

> I understand and am in agreement that `HealthCheckStatusInfo` will have
> more information than `CheckStatusInfo`.
>
> I would like us to put a little more thought into how that would look like
> to be doubly sure that what we are introducing today will be evolvable into
> that envisioned future. We have to live with API changes for a long time,
> so I would like to see more rigor here (e.g., has the note on top of the
> `HealthCheckStatusInfo` in the doc
> <
> https://docs.google.com/document/d/1VLdaH7i7UDT3_38aOlzTOtH7lwH-laB8dCwNzte0DkU/edit#heading=h.lessdcojxc5v
> >
> has
> been discussed/resolved?) to avoid costly changes/deprecations.
>
> On Thu, Oct 18, 2018 at 4:04 AM Alex Rukletsov <al...@mesosphere.com>
> wrote:
>
> > Thanks for the thoughts, Vinod! Answers inlined.
> >
> > On Wed, Oct 17, 2018 at 8:55 PM Vinod Kone <vi...@apache.org> wrote:
> >
> > > One of the things we discussed when we added `CheckInfo` and
> > > `CheckStatusInfo` was to make the older `HealthCheck` and `bool
> healthy`
> > > field (inside `TaskStatus`) consistent with the new `Check` format.
> > >
> > Correct.
> >
> > >
> > > IIRC, some of the changes we wanted to do were
> > >
> > >    - Deprecate `HealthCheck` and introduce a new `HealthCheckInfo`
> proto
> > >
> > Correct.
> >
> > >    - The nested messages inside `HealthCheck` (e.g., `HTTPCheckInfo`)
> >
> >    should be named differently in `HealthCheckInfo` (e.g., `Http`)
> > >
> > Likely, yes.
> >
> > >    - Deprecate `bool healthy` in TaskStatusInfo and introduce a new
> > >    `HealthCheckStatusInfo` which looks similar to `CheckStatusInfo`
> > >
> > Correct.
> >
> > >
> > > Right now, the proposal seems to only address the last point without
> > > addressing the first two, which feels weird to me. I would prefer to
> see
> > > them addressed in one shot.
> > >
> > Can you please explain why? Is there any problem you foresee if we do it
> > step by step? Introducing `HealthCheckStatusInfo` now solves an important
> > problem and does not seem to introduce new issues.
> >
> > >
> > > Additionally, the proposed `HealthCheckStatusInfo` proto looks
> completely
> > > different from `CheckStatusInfo`. Is that intentional? I hope we are
> not
> > > thinking of deprecating it again when we come around to fix
> `HealthCheck`
> > > proto to be consistent with `CheckInfo` ?
> > >
> > How do you think it should look like? Why will we deprecate it?
> >
> > Health checks are different from checks in the way the result of a check
> is
> > interpreted on the agent. In other words health check is an extra step on
> > top of a check. We might include `CheckStatusInfo` or its contents into
> > `HealthCheckStatusInfo`, but... should we think about this now? It is
> nice
> > to have lower level info from the check in the heath status update, but
> it
> > also means more data to transfer. But interpretation—health—we definitely
> > need.
> >
> > Greg, I'm +1 on your proposal.
> >
> > >
> > > Thanks,
> > >
> > > On Wed, Oct 17, 2018 at 1:26 PM Greg Mann <gr...@mesosphere.io> wrote:
> > >
> > > > Hi all,
> > > > Some users have recently reported issues with our current
> > implementation
> > > > of health checks. See this ticket
> > > > <https://issues.apache.org/jira/browse/MESOS-6417> for an
> introduction
> > > to
> > > > the issue.
> > > >
> > > > To summarize: we currently use a single 'optional bool healthy' field
> > > > within the 'TaskStatus' message to indicate the result of a health
> > check.
> > > > This allows us to expose 3 health states to users:
> > > > 1) 'healthy' field is unset = no health check specified, or health
> > check
> > > > failed but grace period has not yet elapsed, or health check has not
> > yet
> > > > been attempted
> > > > 2) 'healthy' field is set to 'false' = a health check is specified
> and
> > it
> > > > returned 'false'
> > > > 3) 'healthy' field is set to 'true' = a health check is specified and
> > it
> > > > returned 'true'
> > > >
> > > > The issue is that some users need to distinguish between the three
> > > > scenarios in #1: no health check is specified, OR the task is not yet
> > > > healthy but we are in the grace period. An example use case would be
> a
> > > load
> > > > balancer which needs to wait for a healthy status to route traffic,
> but
> > > > which immediately routes traffic to tasks which have no health check
> > > > defined.
> > > >
> > > > This issue was recognized during the design of Mesos generalized
> > checks;
> > > > for those checks, we use the presence of the 'check_status' field to
> > > > indicate whether or not a check is defined for the task. While
> > consumers
> > > > could make use of generalized checks as a workaround, this does not
> > allow
> > > > them to both detect the presence of a check AND achieve the
> > task-killing
> > > > behavior that health checks provide.
> > > >
> > > > In order to address this, I would like to propose the following new
> > > > message, and an addition to the 'TaskStatus' message:
> > > >
> > > > message HealthCheckStatusInfo {
> > > >   enum Status {
> > > >     UNKNOWN = 0;
> > > >     HEALTHY = 1;
> > > >     UNHEALTHY = 2;
> > > >   }
> > > >
> > > >   required Status status = 0;
> > > > }
> > > >
> > > > message TaskStatus {
> > > >   . . .
> > > >
> > > >   optional HealthCheckStatusInfo health_check_status = 17;
> > > >
> > > >   . . .
> > > > }
> > > >
> > > > The semantics of these fields would be as follows:
> > > >
> > > > 'health_status' field:
> > > > - If set, a health check has been set
> > > > - If unset, a health check has not been set
> > > >
> > > > 'health_status.status' field:
> > > > - UNKNOWN: The task has not become healthy but is still within its
> > grace
> > > > period (this state is also used if an internal error prevents us from
> > > > running the health check successfully)
> > > > - HEALTHY: The health check indicates the task is healthy
> > > > - UNHEALTHY: The health check indicates the task is not healthy
> > > >
> > > > This change would also involve deprecating the existing 'healthy'
> > field.
> > > > In accordance with our deprecation policy, I believe we could not
> > remove
> > > > the deprecated field until we have a new major release (2.x).
> > > >
> > > > I'd love to hear feedback on this proposal, thanks in advance! I'll
> > also
> > > > add this as an agenda item to our upcoming API working group meeting
> on
> > > > Tuesday, Oct. 16 at 11am PST.
> > > >
> > > > Cheers,
> > > > Greg
> > > >
> > >
> >
>

Re: Request for Comments - Health Check API Proposal

Posted by Vinod Kone <vi...@apache.org>.
I understand and am in agreement that `HealthCheckStatusInfo` will have
more information than `CheckStatusInfo`.

I would like us to put a little more thought into how that would look like
to be doubly sure that what we are introducing today will be evolvable into
that envisioned future. We have to live with API changes for a long time,
so I would like to see more rigor here (e.g., has the note on top of the
`HealthCheckStatusInfo` in the doc
<https://docs.google.com/document/d/1VLdaH7i7UDT3_38aOlzTOtH7lwH-laB8dCwNzte0DkU/edit#heading=h.lessdcojxc5v>
has
been discussed/resolved?) to avoid costly changes/deprecations.

On Thu, Oct 18, 2018 at 4:04 AM Alex Rukletsov <al...@mesosphere.com> wrote:

> Thanks for the thoughts, Vinod! Answers inlined.
>
> On Wed, Oct 17, 2018 at 8:55 PM Vinod Kone <vi...@apache.org> wrote:
>
> > One of the things we discussed when we added `CheckInfo` and
> > `CheckStatusInfo` was to make the older `HealthCheck` and `bool healthy`
> > field (inside `TaskStatus`) consistent with the new `Check` format.
> >
> Correct.
>
> >
> > IIRC, some of the changes we wanted to do were
> >
> >    - Deprecate `HealthCheck` and introduce a new `HealthCheckInfo` proto
> >
> Correct.
>
> >    - The nested messages inside `HealthCheck` (e.g., `HTTPCheckInfo`)
>
>    should be named differently in `HealthCheckInfo` (e.g., `Http`)
> >
> Likely, yes.
>
> >    - Deprecate `bool healthy` in TaskStatusInfo and introduce a new
> >    `HealthCheckStatusInfo` which looks similar to `CheckStatusInfo`
> >
> Correct.
>
> >
> > Right now, the proposal seems to only address the last point without
> > addressing the first two, which feels weird to me. I would prefer to see
> > them addressed in one shot.
> >
> Can you please explain why? Is there any problem you foresee if we do it
> step by step? Introducing `HealthCheckStatusInfo` now solves an important
> problem and does not seem to introduce new issues.
>
> >
> > Additionally, the proposed `HealthCheckStatusInfo` proto looks completely
> > different from `CheckStatusInfo`. Is that intentional? I hope we are not
> > thinking of deprecating it again when we come around to fix `HealthCheck`
> > proto to be consistent with `CheckInfo` ?
> >
> How do you think it should look like? Why will we deprecate it?
>
> Health checks are different from checks in the way the result of a check is
> interpreted on the agent. In other words health check is an extra step on
> top of a check. We might include `CheckStatusInfo` or its contents into
> `HealthCheckStatusInfo`, but... should we think about this now? It is nice
> to have lower level info from the check in the heath status update, but it
> also means more data to transfer. But interpretation—health—we definitely
> need.
>
> Greg, I'm +1 on your proposal.
>
> >
> > Thanks,
> >
> > On Wed, Oct 17, 2018 at 1:26 PM Greg Mann <gr...@mesosphere.io> wrote:
> >
> > > Hi all,
> > > Some users have recently reported issues with our current
> implementation
> > > of health checks. See this ticket
> > > <https://issues.apache.org/jira/browse/MESOS-6417> for an introduction
> > to
> > > the issue.
> > >
> > > To summarize: we currently use a single 'optional bool healthy' field
> > > within the 'TaskStatus' message to indicate the result of a health
> check.
> > > This allows us to expose 3 health states to users:
> > > 1) 'healthy' field is unset = no health check specified, or health
> check
> > > failed but grace period has not yet elapsed, or health check has not
> yet
> > > been attempted
> > > 2) 'healthy' field is set to 'false' = a health check is specified and
> it
> > > returned 'false'
> > > 3) 'healthy' field is set to 'true' = a health check is specified and
> it
> > > returned 'true'
> > >
> > > The issue is that some users need to distinguish between the three
> > > scenarios in #1: no health check is specified, OR the task is not yet
> > > healthy but we are in the grace period. An example use case would be a
> > load
> > > balancer which needs to wait for a healthy status to route traffic, but
> > > which immediately routes traffic to tasks which have no health check
> > > defined.
> > >
> > > This issue was recognized during the design of Mesos generalized
> checks;
> > > for those checks, we use the presence of the 'check_status' field to
> > > indicate whether or not a check is defined for the task. While
> consumers
> > > could make use of generalized checks as a workaround, this does not
> allow
> > > them to both detect the presence of a check AND achieve the
> task-killing
> > > behavior that health checks provide.
> > >
> > > In order to address this, I would like to propose the following new
> > > message, and an addition to the 'TaskStatus' message:
> > >
> > > message HealthCheckStatusInfo {
> > >   enum Status {
> > >     UNKNOWN = 0;
> > >     HEALTHY = 1;
> > >     UNHEALTHY = 2;
> > >   }
> > >
> > >   required Status status = 0;
> > > }
> > >
> > > message TaskStatus {
> > >   . . .
> > >
> > >   optional HealthCheckStatusInfo health_check_status = 17;
> > >
> > >   . . .
> > > }
> > >
> > > The semantics of these fields would be as follows:
> > >
> > > 'health_status' field:
> > > - If set, a health check has been set
> > > - If unset, a health check has not been set
> > >
> > > 'health_status.status' field:
> > > - UNKNOWN: The task has not become healthy but is still within its
> grace
> > > period (this state is also used if an internal error prevents us from
> > > running the health check successfully)
> > > - HEALTHY: The health check indicates the task is healthy
> > > - UNHEALTHY: The health check indicates the task is not healthy
> > >
> > > This change would also involve deprecating the existing 'healthy'
> field.
> > > In accordance with our deprecation policy, I believe we could not
> remove
> > > the deprecated field until we have a new major release (2.x).
> > >
> > > I'd love to hear feedback on this proposal, thanks in advance! I'll
> also
> > > add this as an agenda item to our upcoming API working group meeting on
> > > Tuesday, Oct. 16 at 11am PST.
> > >
> > > Cheers,
> > > Greg
> > >
> >
>

Re: Request for Comments - Health Check API Proposal

Posted by Alex Rukletsov <al...@mesosphere.com>.
Thanks for the thoughts, Vinod! Answers inlined.

On Wed, Oct 17, 2018 at 8:55 PM Vinod Kone <vi...@apache.org> wrote:

> One of the things we discussed when we added `CheckInfo` and
> `CheckStatusInfo` was to make the older `HealthCheck` and `bool healthy`
> field (inside `TaskStatus`) consistent with the new `Check` format.
>
Correct.

>
> IIRC, some of the changes we wanted to do were
>
>    - Deprecate `HealthCheck` and introduce a new `HealthCheckInfo` proto
>
Correct.

>    - The nested messages inside `HealthCheck` (e.g., `HTTPCheckInfo`)

   should be named differently in `HealthCheckInfo` (e.g., `Http`)
>
Likely, yes.

>    - Deprecate `bool healthy` in TaskStatusInfo and introduce a new
>    `HealthCheckStatusInfo` which looks similar to `CheckStatusInfo`
>
Correct.

>
> Right now, the proposal seems to only address the last point without
> addressing the first two, which feels weird to me. I would prefer to see
> them addressed in one shot.
>
Can you please explain why? Is there any problem you foresee if we do it
step by step? Introducing `HealthCheckStatusInfo` now solves an important
problem and does not seem to introduce new issues.

>
> Additionally, the proposed `HealthCheckStatusInfo` proto looks completely
> different from `CheckStatusInfo`. Is that intentional? I hope we are not
> thinking of deprecating it again when we come around to fix `HealthCheck`
> proto to be consistent with `CheckInfo` ?
>
How do you think it should look like? Why will we deprecate it?

Health checks are different from checks in the way the result of a check is
interpreted on the agent. In other words health check is an extra step on
top of a check. We might include `CheckStatusInfo` or its contents into
`HealthCheckStatusInfo`, but... should we think about this now? It is nice
to have lower level info from the check in the heath status update, but it
also means more data to transfer. But interpretation—health—we definitely
need.

Greg, I'm +1 on your proposal.

>
> Thanks,
>
> On Wed, Oct 17, 2018 at 1:26 PM Greg Mann <gr...@mesosphere.io> wrote:
>
> > Hi all,
> > Some users have recently reported issues with our current implementation
> > of health checks. See this ticket
> > <https://issues.apache.org/jira/browse/MESOS-6417> for an introduction
> to
> > the issue.
> >
> > To summarize: we currently use a single 'optional bool healthy' field
> > within the 'TaskStatus' message to indicate the result of a health check.
> > This allows us to expose 3 health states to users:
> > 1) 'healthy' field is unset = no health check specified, or health check
> > failed but grace period has not yet elapsed, or health check has not yet
> > been attempted
> > 2) 'healthy' field is set to 'false' = a health check is specified and it
> > returned 'false'
> > 3) 'healthy' field is set to 'true' = a health check is specified and it
> > returned 'true'
> >
> > The issue is that some users need to distinguish between the three
> > scenarios in #1: no health check is specified, OR the task is not yet
> > healthy but we are in the grace period. An example use case would be a
> load
> > balancer which needs to wait for a healthy status to route traffic, but
> > which immediately routes traffic to tasks which have no health check
> > defined.
> >
> > This issue was recognized during the design of Mesos generalized checks;
> > for those checks, we use the presence of the 'check_status' field to
> > indicate whether or not a check is defined for the task. While consumers
> > could make use of generalized checks as a workaround, this does not allow
> > them to both detect the presence of a check AND achieve the task-killing
> > behavior that health checks provide.
> >
> > In order to address this, I would like to propose the following new
> > message, and an addition to the 'TaskStatus' message:
> >
> > message HealthCheckStatusInfo {
> >   enum Status {
> >     UNKNOWN = 0;
> >     HEALTHY = 1;
> >     UNHEALTHY = 2;
> >   }
> >
> >   required Status status = 0;
> > }
> >
> > message TaskStatus {
> >   . . .
> >
> >   optional HealthCheckStatusInfo health_check_status = 17;
> >
> >   . . .
> > }
> >
> > The semantics of these fields would be as follows:
> >
> > 'health_status' field:
> > - If set, a health check has been set
> > - If unset, a health check has not been set
> >
> > 'health_status.status' field:
> > - UNKNOWN: The task has not become healthy but is still within its grace
> > period (this state is also used if an internal error prevents us from
> > running the health check successfully)
> > - HEALTHY: The health check indicates the task is healthy
> > - UNHEALTHY: The health check indicates the task is not healthy
> >
> > This change would also involve deprecating the existing 'healthy' field.
> > In accordance with our deprecation policy, I believe we could not remove
> > the deprecated field until we have a new major release (2.x).
> >
> > I'd love to hear feedback on this proposal, thanks in advance! I'll also
> > add this as an agenda item to our upcoming API working group meeting on
> > Tuesday, Oct. 16 at 11am PST.
> >
> > Cheers,
> > Greg
> >
>

Re: Request for Comments - Health Check API Proposal

Posted by Vinod Kone <vi...@apache.org>.
One of the things we discussed when we added `CheckInfo` and
`CheckStatusInfo` was to make the older `HealthCheck` and `bool healthy`
field (inside `TaskStatus`) consistent with the new `Check` format.

IIRC, some of the changes we wanted to do were

   - Deprecate `HealthCheck` and introduce a new `HealthCheckInfo` proto
   - The nested messages inside `HealthCheck` (e.g., `HTTPCheckInfo`)
   should be named differently in `HealthCheckInfo` (e.g., `Http`)
   - Deprecate `bool healthy` in TaskStatusInfo and introduce a new
   `HealthCheckStatusInfo` which looks similar to `CheckStatusInfo`

Right now, the proposal seems to only address the last point without
addressing the first two, which feels weird to me. I would prefer to see
them addressed in one shot.

Additionally, the proposed `HealthCheckStatusInfo` proto looks completely
different from `CheckStatusInfo`. Is that intentional? I hope we are not
thinking of deprecating it again when we come around to fix `HealthCheck`
proto to be consistent with `CheckInfo` ?

Thanks,

On Wed, Oct 17, 2018 at 1:26 PM Greg Mann <gr...@mesosphere.io> wrote:

> Hi all,
> Some users have recently reported issues with our current implementation
> of health checks. See this ticket
> <https://issues.apache.org/jira/browse/MESOS-6417> for an introduction to
> the issue.
>
> To summarize: we currently use a single 'optional bool healthy' field
> within the 'TaskStatus' message to indicate the result of a health check.
> This allows us to expose 3 health states to users:
> 1) 'healthy' field is unset = no health check specified, or health check
> failed but grace period has not yet elapsed, or health check has not yet
> been attempted
> 2) 'healthy' field is set to 'false' = a health check is specified and it
> returned 'false'
> 3) 'healthy' field is set to 'true' = a health check is specified and it
> returned 'true'
>
> The issue is that some users need to distinguish between the three
> scenarios in #1: no health check is specified, OR the task is not yet
> healthy but we are in the grace period. An example use case would be a load
> balancer which needs to wait for a healthy status to route traffic, but
> which immediately routes traffic to tasks which have no health check
> defined.
>
> This issue was recognized during the design of Mesos generalized checks;
> for those checks, we use the presence of the 'check_status' field to
> indicate whether or not a check is defined for the task. While consumers
> could make use of generalized checks as a workaround, this does not allow
> them to both detect the presence of a check AND achieve the task-killing
> behavior that health checks provide.
>
> In order to address this, I would like to propose the following new
> message, and an addition to the 'TaskStatus' message:
>
> message HealthCheckStatusInfo {
>   enum Status {
>     UNKNOWN = 0;
>     HEALTHY = 1;
>     UNHEALTHY = 2;
>   }
>
>   required Status status = 0;
> }
>
> message TaskStatus {
>   . . .
>
>   optional HealthCheckStatusInfo health_check_status = 17;
>
>   . . .
> }
>
> The semantics of these fields would be as follows:
>
> 'health_status' field:
> - If set, a health check has been set
> - If unset, a health check has not been set
>
> 'health_status.status' field:
> - UNKNOWN: The task has not become healthy but is still within its grace
> period (this state is also used if an internal error prevents us from
> running the health check successfully)
> - HEALTHY: The health check indicates the task is healthy
> - UNHEALTHY: The health check indicates the task is not healthy
>
> This change would also involve deprecating the existing 'healthy' field.
> In accordance with our deprecation policy, I believe we could not remove
> the deprecated field until we have a new major release (2.x).
>
> I'd love to hear feedback on this proposal, thanks in advance! I'll also
> add this as an agenda item to our upcoming API working group meeting on
> Tuesday, Oct. 16 at 11am PST.
>
> Cheers,
> Greg
>

Re: Request for Comments - Health Check API Proposal

Posted by Vinod Kone <vi...@apache.org>.
One of the things we discussed when we added `CheckInfo` and
`CheckStatusInfo` was to make the older `HealthCheck` and `bool healthy`
field (inside `TaskStatus`) consistent with the new `Check` format.

IIRC, some of the changes we wanted to do were

   - Deprecate `HealthCheck` and introduce a new `HealthCheckInfo` proto
   - The nested messages inside `HealthCheck` (e.g., `HTTPCheckInfo`)
   should be named differently in `HealthCheckInfo` (e.g., `Http`)
   - Deprecate `bool healthy` in TaskStatusInfo and introduce a new
   `HealthCheckStatusInfo` which looks similar to `CheckStatusInfo`

Right now, the proposal seems to only address the last point without
addressing the first two, which feels weird to me. I would prefer to see
them addressed in one shot.

Additionally, the proposed `HealthCheckStatusInfo` proto looks completely
different from `CheckStatusInfo`. Is that intentional? I hope we are not
thinking of deprecating it again when we come around to fix `HealthCheck`
proto to be consistent with `CheckInfo` ?

Thanks,

On Wed, Oct 17, 2018 at 1:26 PM Greg Mann <gr...@mesosphere.io> wrote:

> Hi all,
> Some users have recently reported issues with our current implementation
> of health checks. See this ticket
> <https://issues.apache.org/jira/browse/MESOS-6417> for an introduction to
> the issue.
>
> To summarize: we currently use a single 'optional bool healthy' field
> within the 'TaskStatus' message to indicate the result of a health check.
> This allows us to expose 3 health states to users:
> 1) 'healthy' field is unset = no health check specified, or health check
> failed but grace period has not yet elapsed, or health check has not yet
> been attempted
> 2) 'healthy' field is set to 'false' = a health check is specified and it
> returned 'false'
> 3) 'healthy' field is set to 'true' = a health check is specified and it
> returned 'true'
>
> The issue is that some users need to distinguish between the three
> scenarios in #1: no health check is specified, OR the task is not yet
> healthy but we are in the grace period. An example use case would be a load
> balancer which needs to wait for a healthy status to route traffic, but
> which immediately routes traffic to tasks which have no health check
> defined.
>
> This issue was recognized during the design of Mesos generalized checks;
> for those checks, we use the presence of the 'check_status' field to
> indicate whether or not a check is defined for the task. While consumers
> could make use of generalized checks as a workaround, this does not allow
> them to both detect the presence of a check AND achieve the task-killing
> behavior that health checks provide.
>
> In order to address this, I would like to propose the following new
> message, and an addition to the 'TaskStatus' message:
>
> message HealthCheckStatusInfo {
>   enum Status {
>     UNKNOWN = 0;
>     HEALTHY = 1;
>     UNHEALTHY = 2;
>   }
>
>   required Status status = 0;
> }
>
> message TaskStatus {
>   . . .
>
>   optional HealthCheckStatusInfo health_check_status = 17;
>
>   . . .
> }
>
> The semantics of these fields would be as follows:
>
> 'health_status' field:
> - If set, a health check has been set
> - If unset, a health check has not been set
>
> 'health_status.status' field:
> - UNKNOWN: The task has not become healthy but is still within its grace
> period (this state is also used if an internal error prevents us from
> running the health check successfully)
> - HEALTHY: The health check indicates the task is healthy
> - UNHEALTHY: The health check indicates the task is not healthy
>
> This change would also involve deprecating the existing 'healthy' field.
> In accordance with our deprecation policy, I believe we could not remove
> the deprecated field until we have a new major release (2.x).
>
> I'd love to hear feedback on this proposal, thanks in advance! I'll also
> add this as an agenda item to our upcoming API working group meeting on
> Tuesday, Oct. 16 at 11am PST.
>
> Cheers,
> Greg
>