Posted to dev@openwhisk.apache.org by Tyson Norris <tn...@adobe.com.INVALID> on 2019/10/27 00:47:54 UTC

Action health checks

Hi Whiskers –
We periodically have an unfortunate problem where a docker container (or worse, many of them) dies off unexpectedly, outside of any HTTP usage from the invoker. In these cases, the Invoker may still hold references to prewarm or warm containers, and if an activation eventually arrives that matches one of those references, the HTTP workflow starts and fails immediately because nothing is listening anymore, resulting in failed activations. An even worse situation is when a container failed earlier and a new container, initialized with a different action, is started on the same host and port (more likely a problem for k8s/mesos cluster usage).

To mitigate these issues, I put together a health check process [1] from invoker to action containers, where we can test

  *   prewarm containers periodically to verify they are still operational, and
  *   warm containers immediately after resuming them (before HTTP requests are sent)
In case of prewarm failure, we should backfill the prewarms to the configured count.
In case of warm failure, the activation is rescheduled to ContainerPool, which typically would either route to a different prewarm, or start a new cold container.

The test ping is a tcp connection only, since anything more would require updating the HTTP protocol implemented by all runtimes. This test is good enough for the worst case of “container has gone missing”, but cannot detect more subtle problems like “the /run endpoint is broken”. Other checks could be added in the future to increase the quality of the test, but most of that, I think, requires expanding the HTTP protocol and the state managed at the container, and I wanted to start with something that works for the basic case.
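For illustration, here is a minimal Scala sketch of the probe-and-backfill idea described above. The names (tcpPing, monitorPrewarms, prewarmAddresses, startPrewarmContainer), the 100 ms timeout and the 10 s interval are assumptions for this sketch, not the PR's actual code:

    import java.net.{InetSocketAddress, Socket}
    import scala.concurrent.ExecutionContext
    import scala.concurrent.duration._
    import scala.util.Try
    import akka.actor.ActorSystem

    // The TCP reachability ping: succeeds only if something accepts a connection on the action port.
    def tcpPing(host: String, port: Int, timeout: FiniteDuration = 100.milliseconds): Boolean =
      Try {
        val socket = new Socket()
        socket.connect(new InetSocketAddress(host, port), timeout.toMillis.toInt)
        socket.close()
        true
      }.getOrElse(false)

    // Periodically verify the prewarm containers and backfill up to the configured count.
    def monitorPrewarms(system: ActorSystem,
                        prewarmAddresses: () => Seq[(String, Int)],
                        configuredPrewarmCount: Int,
                        startPrewarmContainer: () => Unit): Unit = {
      implicit val ec: ExecutionContext = system.dispatcher
      system.scheduler.schedule(10.seconds, 10.seconds) {
        val alive = prewarmAddresses().count { case (host, port) => tcpPing(host, port) }
        (alive until configuredPrewarmCount).foreach(_ => startPrewarmContainer())
      }
    }

Such a probe can only tell that something is listening on the port, which is exactly the "container has gone missing" case described above.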

Let me know if you have opinions about this, and we can discuss here or in the PR.
Thanks
Tyson

[1] https://github.com/apache/openwhisk/pull/4698

Re: Action health checks

Posted by Bertrand Delacretaz <bd...@apache.org>.
On Mon, Oct 28, 2019 at 4:18 PM Tyson Norris <tn...@adobe.com.invalid> wrote:
> One possible change would be to move the checking logic inside either Container or
> ContainerClient, but I would keep it separate from /init and /run, and consider revisiting
> it if we change the HTTP protocol to include some more sophisticated checking
> via HTTP ( add a /health endpoint etc)....

Reading this thread I was thinking about /health. I think it makes
perfect sense to treat the action containers like any other service:
they need to provide a /health endpoint separate from their
operational endpoints.

The health check might try /health first and fall back to something
else if that's not provided, to enable a smooth transition to
health-enabled containers over time.
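As a rough Scala sketch of that transition strategy (the /health path, the 200/404 handling and the timeout are assumptions, since no runtime provides such an endpoint today):

    import java.net.{HttpURLConnection, InetSocketAddress, Socket, URL}
    import scala.util.{Failure, Success, Try}

    // Probe /health if the runtime provides it; otherwise fall back to a bare TCP connect.
    def isHealthy(host: String, port: Int, timeoutMs: Int = 500): Boolean = {
      def tcpFallback: Boolean = Try {
        val socket = new Socket()
        socket.connect(new InetSocketAddress(host, port), timeoutMs)
        socket.close()
        true
      }.getOrElse(false)

      Try {
        val conn = new URL(s"http://$host:$port/health").openConnection().asInstanceOf[HttpURLConnection]
        conn.setConnectTimeout(timeoutMs)
        conn.setReadTimeout(timeoutMs)
        conn.setRequestMethod("GET")
        val code = conn.getResponseCode
        conn.disconnect()
        code
      } match {
        case Success(200) => true        // runtime implements /health and reports healthy
        case Success(404) => tcpFallback // runtime predates /health: fall back to the TCP probe
        case Success(_)   => false       // /health exists but reports a problem
        case Failure(_)   => false       // connection-level failure: treat as unhealthy
      }
    }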

-Bertrand

Re: Action health checks

Posted by Tyson Norris <tn...@adobe.com.INVALID>.
Hi - 
As discussed, I have updated the PR to reflect:
    > - for prewarm, use the tcp connection for monitoring outside of activation
    > workflow
    > - for warm, handle it as a case of retry, where request *connection*
    > failure only for /run, will be handled by way of rescheduling back to
    > ContainerPool (/init should already be handled by retry for a time period).

Please review and provide any feedback.
https://github.com/apache/openwhisk/pull/4698

Thanks!
Tyson

On 10/30/19, 9:03 AM, "Markus Thömmes" <ma...@apache.org> wrote:

    Yes, I used the word "retry" here to mean "reschedule to another
    container", just like you would if the healthiness probe failed.
    
    A word of caution: TCP probes might be behaving strangely in a container
    setting. They sometimes accept connections even though nothing is listening
    and stuff like that.
    
    Am Mi., 30. Okt. 2019 um 16:34 Uhr schrieb Tyson Norris
    <tn...@adobe.com.invalid>:
    
    > I don't think "retry" is the right handling for warm connection failures -
    > if a connection cannot be made due to container crash/removal, it won't
    > suddenly come back. I would instead treat it as a "reschedule", where the
    > failure routes the activation back to ContainerPool, to be scheduled to a
    > different container. I'm not sure how distinct we can be on detecting
    > container failure vs temporary network issue that may or may not resolve
    > on its own, so I would treat them the same, and assume the container is
    > gone.
    >
    > So for this PR, is there any objection to:
    > - for prewarm, use the tcp connection for monitoring outside of activation
    > workflow
    > - for warm, handle it as a case of retry, where request *connection*
    > failure only for /run, will be handled by way of rescheduling back to
    > ContainerPool (/init should already be handled by retry for a time period).
    >
    > Thanks!
    > Tyson
    >
    > On 10/30/19, 7:03 AM, "Markus Thömmes" <ma...@apache.org> wrote:
    >
    >     Increasing latency would be my biggest concern here as well. With a
    > health
    >     ping, we can't even be sure that a container is still healthy for the
    > "real
    >     request". To guarantee that, I'd still propose to have a look at the
    >     possible failure modes and implement a retry mechanism on them. If you
    > get
    >     a "connection refused" error, I'm fairly certain that it can be retried
    >     without harm. In fact, any error where we can guarantee that we haven't
    >     actually reached the container can be safely retried in the described
    > way.
    >
    >     Pre-warmed containers indeed are somewhat of a different story. A
    > health
    >     ping as mentioned here would for sure help there, be it just a TCP
    > probe or
    >     even a full-fledged /health call. I'd be fine with either way in this
    > case
    >     as it doesn't affect the critical path.
    >
    >     Am Di., 29. Okt. 2019 um 18:00 Uhr schrieb Tyson Norris
    >     <tn...@adobe.com.invalid>:
    >
    >     > By "critical path" you mean the path during action invocation?
    >     > The current PR only introduces latency on that path for the case of a
    >     > Paused container changing to Running state (once per transition from
    > Paused
    >     > -> Running).
    >     > In case it isn't clear, this change does not affect any retry (or
    > lack of
    >     > retry) behavior.
    >     >
    >     > Thanks
    >     > Tyson
    >     >
    >     > On 10/29/19, 9:38 AM, "Rodric Rabbah" <ro...@gmail.com> wrote:
    >     >
    >     >     as a longer term point to consider, i think the current model of
    > "best
    >     >     effort at most once" was the wrong design point - if we embraced
    >     > failure
    >     >     and just retried (at least once), then failure at this level
    > would
    >     > lead to
    >     >     retries which is reasonable.
    >     >
    >     >     if we added a third health route or introduced a health check,
    > would we
    >     >     increase the critical path?
    >     >
    >     >     -r
    >     >
    >     >     On Tue, Oct 29, 2019 at 12:29 PM David P Grove <
    > groved@us.ibm.com>
    >     > wrote:
    >     >
    >     >     > Tyson Norris <tn...@adobe.com.INVALID> wrote on 10/28/2019
    >     > 11:17:50 AM:
    >     >     > > I'm curious to know what other
    >     >     > > folks think about "generic active probing from invoker" vs
    > "docker/
    >     >     > > mesos/k8s specific integrations for reacting to container
    >     > failures"?
    >     >     > >
    >     >     >
    >     >     > From a pure maintenance and testing perspective I think a
    > single
    >     > common
    >     >     > mechanism would be best if we can do it with acceptable runtime
    >     > overhead.
    >     >     >
    >     >     > --dave
    >     >     >
    >     >
    >     >
    >     >
    >
    >
    >
    


Re: Action health checks

Posted by Markus Thömmes <ma...@apache.org>.
Yes, I used the word "retry" here to mean "reschedule to another
container", just like you would if the healthiness probe failed.

A word of caution: TCP probes can behave strangely in a container setting.
They sometimes accept connections even though nothing is listening, and
stuff like that.

Am Mi., 30. Okt. 2019 um 16:34 Uhr schrieb Tyson Norris
<tn...@adobe.com.invalid>:

> I don't think "retry" is the right handling for warm connection failures -
> if a connection cannot be made due to container crash/removal, it won't
> suddenly come back. I would instead treat it as a "reschedule", where the
> failure routes the activation back to ContainerPool, to be scheduled to a
> different container. I'm not sure how distinct we can be on detecting
> container failure vs temporary network issue that may or may not resolve
> on its own, so I would treat them the same, and assume the container is
> gone.
>
> So for this PR, is there any objection to:
> - for prewarm, use the tcp connection for monitoring outside of activation
> workflow
> - for warm, handle it as a case of retry, where request *connection*
> failure only for /run, will be handled by way of rescheduling back to
> ContainerPool (/init should already be handled by retry for a time period).
>
> Thanks!
> Tyson
>
> On 10/30/19, 7:03 AM, "Markus Thömmes" <ma...@apache.org> wrote:
>
>     Increasing latency would be my biggest concern here as well. With a
> health
>     ping, we can't even be sure that a container is still healthy for the
> "real
>     request". To guarantee that, I'd still propose to have a look at the
>     possible failure modes and implement a retry mechanism on them. If you
> get
>     a "connection refused" error, I'm fairly certain that it can be retried
>     without harm. In fact, any error where we can guarantee that we haven't
>     actually reached the container can be safely retried in the described
> way.
>
>     Pre-warmed containers indeed are somewhat of a different story. A
> health
>     ping as mentioned here would for sure help there, be it just a TCP
> probe or
>     even a full-fledged /health call. I'd be fine with either way in this
> case
>     as it doesn't affect the critical path.
>
>     Am Di., 29. Okt. 2019 um 18:00 Uhr schrieb Tyson Norris
>     <tn...@adobe.com.invalid>:
>
>     > By "critical path" you mean the path during action invocation?
>     > The current PR only introduces latency on that path for the case of a
>     > Paused container changing to Running state (once per transition from
> Paused
>     > -> Running).
>     > In case it isn't clear, this change does not affect any retry (or
> lack of
>     > retry) behavior.
>     >
>     > Thanks
>     > Tyson
>     >
>     > On 10/29/19, 9:38 AM, "Rodric Rabbah" <ro...@gmail.com> wrote:
>     >
>     >     as a longer term point to consider, i think the current model of
> "best
>     >     effort at most once" was the wrong design point - if we embraced
>     > failure
>     >     and just retried (at least once), then failure at this level
> would
>     > lead to
>     >     retries which is reasonable.
>     >
>     >     if we added a third health route or introduced a health check,
> would we
>     >     increase the critical path?
>     >
>     >     -r
>     >
>     >     On Tue, Oct 29, 2019 at 12:29 PM David P Grove <
> groved@us.ibm.com>
>     > wrote:
>     >
>     >     > Tyson Norris <tn...@adobe.com.INVALID> wrote on 10/28/2019
>     > 11:17:50 AM:
>     >     > > I'm curious to know what other
>     >     > > folks think about "generic active probing from invoker" vs
> "docker/
>     >     > > mesos/k8s specific integrations for reacting to container
>     > failures"?
>     >     > >
>     >     >
>     >     > From a pure maintenance and testing perspective I think a
> single
>     > common
>     >     > mechanism would be best if we can do it with acceptable runtime
>     > overhead.
>     >     >
>     >     > --dave
>     >     >
>     >
>     >
>     >
>
>
>

Re: Action health checks

Posted by Tyson Norris <tn...@adobe.com.INVALID>.
I don't think "retry" is the right handling for warm connection failures - if a connection cannot be made due to a container crash/removal, it won't suddenly come back. I would instead treat it as a "reschedule", where the failure routes the activation back to ContainerPool, to be scheduled to a different container. I'm not sure how reliably we can distinguish a container failure from a temporary network issue that may or may not resolve on its own, so I would treat them the same and assume the container is gone.
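For illustration, a rough Scala sketch of this "assume the container is gone and reschedule" handling. Run, RunResult, runActivation and the pool wiring are hypothetical stand-ins, not the actual ContainerProxy/ContainerPool code:

    import java.net.ConnectException
    import akka.actor.ActorRef
    import scala.concurrent.{ExecutionContext, Future}

    // Hypothetical stand-ins for the real activation message and result types.
    case class Run(activationId: String, payload: String)
    case class RunResult(statusCode: Int, body: String)

    def runOnWarmContainer(job: Run,
                           runActivation: Run => Future[RunResult],
                           containerPool: ActorRef)
                          (implicit ec: ExecutionContext): Future[RunResult] =
      runActivation(job).recoverWith {
        case _: ConnectException =>
          // The request never reached the container: assume it is gone and hand the
          // activation back to the pool, to be scheduled onto a prewarm or a new cold container.
          containerPool ! job
          Future.failed(new IllegalStateException("warm container unreachable, activation rescheduled"))
      }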

So for this PR, is there any objection to:
- for prewarm, use the tcp connection for monitoring outside of the activation workflow
- for warm, handle a request *connection* failure (only for /run) by rescheduling back to ContainerPool (/init should already be handled by retry for a time period).

Thanks!
Tyson

On 10/30/19, 7:03 AM, "Markus Thömmes" <ma...@apache.org> wrote:

    Increasing latency would be my biggest concern here as well. With a health
    ping, we can't even be sure that a container is still healthy for the "real
    request". To guarantee that, I'd still propose to have a look at the
    possible failure modes and implement a retry mechanism on them. If you get
    a "connection refused" error, I'm fairly certain that it can be retried
    without harm. In fact, any error where we can guarantee that we haven't
    actually reached the container can be safely retried in the described way.
    
    Pre-warmed containers indeed are somewhat of a different story. A health
    ping as mentioned here would for sure help there, be it just a TCP probe or
    even a full-fledged /health call. I'd be fine with either way in this case
    as it doesn't affect the critical path.
    
    Am Di., 29. Okt. 2019 um 18:00 Uhr schrieb Tyson Norris
    <tn...@adobe.com.invalid>:
    
    > By "critical path" you mean the path during action invocation?
    > The current PR only introduces latency on that path for the case of a
    > Paused container changing to Running state (once per transition from Paused
    > -> Running).
    > In case it isn't clear, this change does not affect any retry (or lack of
    > retry) behavior.
    >
    > Thanks
    > Tyson
    >
    > On 10/29/19, 9:38 AM, "Rodric Rabbah" <ro...@gmail.com> wrote:
    >
    >     as a longer term point to consider, i think the current model of "best
    >     effort at most once" was the wrong design point - if we embraced
    > failure
    >     and just retried (at least once), then failure at this level would
    > lead to
    >     retries which is reasonable.
    >
    >     if we added a third health route or introduced a health check, would we
    >     increase the critical path?
    >
    >     -r
    >
    >     On Tue, Oct 29, 2019 at 12:29 PM David P Grove <gr...@us.ibm.com>
    > wrote:
    >
    >     > Tyson Norris <tn...@adobe.com.INVALID> wrote on 10/28/2019
    > 11:17:50 AM:
    >     > > I'm curious to know what other
    >     > > folks think about "generic active probing from invoker" vs "docker/
    >     > > mesos/k8s specific integrations for reacting to container
    > failures"?
    >     > >
    >     >
    >     > From a pure maintenance and testing perspective I think a single
    > common
    >     > mechanism would be best if we can do it with acceptable runtime
    > overhead.
    >     >
    >     > --dave
    >     >
    >
    >
    >
    


Re: Action health checks

Posted by Markus Thömmes <ma...@apache.org>.
Increasing latency would be my biggest concern here as well. With a health
ping, we can't even be sure that a container is still healthy for the "real
request". To guarantee that, I'd still propose looking at the possible
failure modes and implementing a retry mechanism for them. If you get a
"connection refused" error, I'm fairly certain that it can be retried
without harm. In fact, any error where we can guarantee that we haven't
actually reached the container can be safely retried in the described way.

Pre-warmed containers are indeed somewhat of a different story. A health
ping as mentioned here would certainly help there, be it just a TCP probe
or even a full-fledged /health call. I'd be fine either way in this case,
as it doesn't affect the critical path.

Am Di., 29. Okt. 2019 um 18:00 Uhr schrieb Tyson Norris
<tn...@adobe.com.invalid>:

> By "critical path" you mean the path during action invocation?
> The current PR only introduces latency on that path for the case of a
> Paused container changing to Running state (once per transition from Paused
> -> Running).
> In case it isn't clear, this change does not affect any retry (or lack of
> retry) behavior.
>
> Thanks
> Tyson
>
> On 10/29/19, 9:38 AM, "Rodric Rabbah" <ro...@gmail.com> wrote:
>
>     as a longer term point to consider, i think the current model of "best
>     effort at most once" was the wrong design point - if we embraced
> failure
>     and just retried (at least once), then failure at this level would
> lead to
>     retries which is reasonable.
>
>     if we added a third health route or introduced a health check, would we
>     increase the critical path?
>
>     -r
>
>     On Tue, Oct 29, 2019 at 12:29 PM David P Grove <gr...@us.ibm.com>
> wrote:
>
>     > Tyson Norris <tn...@adobe.com.INVALID> wrote on 10/28/2019
> 11:17:50 AM:
>     > > I'm curious to know what other
>     > > folks think about "generic active probing from invoker" vs "docker/
>     > > mesos/k8s specific integrations for reacting to container
> failures"?
>     > >
>     >
>     > From a pure maintenance and testing perspective I think a single
> common
>     > mechanism would be best if we can do it with acceptable runtime
> overhead.
>     >
>     > --dave
>     >
>
>
>

Re: Action health checks

Posted by Tyson Norris <tn...@adobe.com.INVALID>.
By "critical path" you mean the path during action invocation? 
The current PR only introduces latency on that path for the case of a Paused container changing to Running state (once per transition from Paused -> Running). 
In case it isn't clear, this change does not affect any retry (or lack of retry) behavior.

Thanks
Tyson

On 10/29/19, 9:38 AM, "Rodric Rabbah" <ro...@gmail.com> wrote:

    as a longer term point to consider, i think the current model of "best
    effort at most once" was the wrong design point - if we embraced failure
    and just retried (at least once), then failure at this level would lead to
    retries which is reasonable.
    
    if we added a third health route or introduced a health check, would we
    increase the critical path?
    
    -r
    
    On Tue, Oct 29, 2019 at 12:29 PM David P Grove <gr...@us.ibm.com> wrote:
    
    > Tyson Norris <tn...@adobe.com.INVALID> wrote on 10/28/2019 11:17:50 AM:
    > > I'm curious to know what other
    > > folks think about "generic active probing from invoker" vs "docker/
    > > mesos/k8s specific integrations for reacting to container failures"?
    > >
    >
    > From a pure maintenance and testing perspective I think a single common
    > mechanism would be best if we can do it with acceptable runtime overhead.
    >
    > --dave
    >
    


Re: Action health checks

Posted by Rodric Rabbah <ro...@gmail.com>.
As a longer-term point to consider, I think the current model of "best
effort, at most once" was the wrong design point - if we embraced failure
and just retried (at least once), then failure at this level would lead to
retries, which is reasonable.

If we added a third health route or introduced a health check, would we
increase the critical path?

-r

On Tue, Oct 29, 2019 at 12:29 PM David P Grove <gr...@us.ibm.com> wrote:

> Tyson Norris <tn...@adobe.com.INVALID> wrote on 10/28/2019 11:17:50 AM:
> > I'm curious to know what other
> > folks think about "generic active probing from invoker" vs "docker/
> > mesos/k8s specific integrations for reacting to container failures"?
> >
>
> From a pure maintenance and testing perspective I think a single common
> mechanism would be best if we can do it with acceptable runtime overhead.
>
> --dave
>

RE: Action health checks

Posted by David P Grove <gr...@us.ibm.com>.
Tyson Norris <tn...@adobe.com.INVALID> wrote on 10/28/2019 11:17:50 AM:
> I'm curious to know what other
> folks think about "generic active probing from invoker" vs "docker/
> mesos/k8s specific integrations for reacting to container failures"?
>

From a pure maintenance and testing perspective I think a single common
mechanism would be best if we can do it with acceptable runtime overhead.

--dave

Re: Action health checks

Posted by Tyson Norris <tn...@adobe.com.INVALID>.
Hi Markus - 
The failures are generic and we haven't seen a real cause yet; on Mesos we get an error of "Container exited with status 125". We continue to investigate that, of course, but containers may die for any number of reasons, so we should just plan on them dying. We do get an event from Mesos already on these failures, and I'm sure we can integrate with Kubernetes to react as well, but I thought it might be better to make this probing simpler and consistent, e.g. so that DockerContainerFactory can be treated the same way. If nothing else, it is certainly easier to test. I'm curious to know what other folks think about "generic active probing from invoker" vs "docker/mesos/k8s specific integrations for reacting to container failures"?

RE HTTP requests - For prewarm, we cannot fold this check into the request, since e.g. if 20 prewarms fail for this invoker, a single activation might try each of those twenty before getting a working container, which seems like bad behavior compared to preemptively validating the containers and replacing them outside the HTTP workflow. For warm containers it would be more feasible, but we would need to distinguish "/run after resume" from "/run before pause", and provide a special error case for connection failure after resume, since we cannot treat all warm container failures as retriable - only the one immediately after resume. This seemed more complicated than explicitly checking once after resume inside ContainerProxy. One possible change would be to move the checking logic inside either Container or ContainerClient, but I would keep it separate from /init and /run, and consider revisiting it if we change the HTTP protocol to include some more sophisticated checking via HTTP (add a /health endpoint, etc.).
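For what it's worth, a compressed Scala sketch of the "check once per resume" flow described here. resumeContainer, tcpPing, run and reschedule are hypothetical stand-ins; in the PR this logic lives in ContainerProxy's state handling rather than a free function:

    import scala.concurrent.{ExecutionContext, Future}

    def resumeAndRun[T](resumeContainer: () => Future[Unit],
                        tcpPing: () => Boolean,
                        run: () => Future[T],
                        reschedule: () => Future[T])
                       (implicit ec: ExecutionContext): Future[T] =
      resumeContainer().flatMap { _ =>
        if (tcpPing()) run()   // container answered the one-time probe: proceed with /run as usual
        else reschedule()      // container is gone: hand the activation back to ContainerPool
      }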

Thanks
Tyson


On 10/28/19, 2:21 AM, "Markus Thömmes" <ma...@apache.org> wrote:

    Heya,
    
    thanks for the elaborate proposal.
    
    Do you have any more information on why these containers are dying off in
    the first place? In the case of Kubernetes/Mesos I could imagine we might
    want to keep the Invoker's state consistent by checking it against the
    respective API repeatedly. On Kubernetes for instance, you could setup an
    informer that'd inform you about any state changes on the pods that this
    Invoker has spawned. If a prewarm container dies this way, we can simply
    remove it from the Invoker's bookkeeping and trigger a backfill.
    
    Secondly, could we potentially fold this check into the HTTP requests
    themselves? If we get a "connection refused" on an action that we knew
    worked before, we can safely retry. There should be a set of exceptions
    that our HTTP clients should surface that should be safe for us to retry in
    the invoker anyway. The only addition you'd need in this case is an
    enhancement on the ContainerProxy's state machine I believe, that allows
    for such a retrying use-case. The "connection refused" use-case I mentioned
    should be equivalent to the TCP probe you're doing now.
    
    WDYT?
    
    Cheers,
    Markus
    
    Am So., 27. Okt. 2019 um 02:56 Uhr schrieb Tyson Norris
    <tn...@adobe.com.invalid>:
    
    > Hi Whiskers –
    > We periodically have an unfortunate problem where a docker container (or
    > worse, many of them) dies off unexpectedly, outside of HTTP usage from
    > invoker. In these cases, prewarm or warm containers may still have
    > references at the Invoker, and eventually if an activation arrives that
    > matches those container references, the HTTP workflow starts and fails
    > immediately since the node is not listening anymore, resulting in failed
    > activations. Or, any even worse situation, can be when a container failed
    > earlier, and a new container, initialized with a different action is
    > initialized on the same host and port (more likely a problem for k8s/mesos
    > cluster usage).
    >
    > To mitigate these issues, I put together a health check process [1] from
    > invoker to action containers, where we can test
    >
    >   *   prewarm containers periodically to verify they are still
    > operational, and
    >   *   warm containers immediately after resuming them (before HTTP
    > requests are sent)
    > In case of prewarm failure, we should backfill the prewarms to the
    > specified config count.
    > In case of warm failure, the activation is rescheduled to ContainerPool,
    > which typically would either route to a different prewarm, or start a new
    > cold container.
    >
    > The test ping is in the form of tcp connection only, since we otherwise
    > need to update the HTTP protocol implemented by all runtimes. This test is
    > good enough for the worst case of “container has gone missing”, but cannot
    > test for more subtle problems like “/run endpoint is broken”. There could
    > be other checks to increase the quality of test we add in the future, but
    > most of this I think requires expanding the HTTP protocol and state managed
    > at the container, and I wanted to get something working for basic
    > functionality to start with.
    >
    > Let me know if you have opinions about this, and we can discuss  here or
    > in the PR.
    > Thanks
    > Tyson
    >
    > [1] https://github.com/apache/openwhisk/pull/4698
    >
    


Re: Action health checks

Posted by Markus Thömmes <ma...@apache.org>.
Heya,

thanks for the elaborate proposal.

Do you have any more information on why these containers are dying off in
the first place? In the case of Kubernetes/Mesos I could imagine we might
want to keep the Invoker's state consistent by checking it against the
respective API repeatedly. On Kubernetes for instance, you could set up an
informer that'd inform you about any state changes on the pods that this
Invoker has spawned. If a prewarm container dies this way, we can simply
remove it from the Invoker's bookkeeping and trigger a backfill.
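For reference, a rough sketch of what such a watch could look like with the fabric8 kubernetes-client (the Watcher signature varies by client version; the namespace, label selector and bookkeeping hooks are assumptions for illustration):

    import io.fabric8.kubernetes.api.model.Pod
    import io.fabric8.kubernetes.client.{DefaultKubernetesClient, KubernetesClientException, Watcher}
    import io.fabric8.kubernetes.client.Watcher.Action

    object PodWatchSketch {
      // Hypothetical bookkeeping hooks; in a real invoker these would talk to the ContainerPool.
      def removeFromBookkeeping(pod: Pod): Unit = ()
      def backfillPrewarmIfNeeded(): Unit = ()

      def main(args: Array[String]): Unit = {
        val client = new DefaultKubernetesClient()
        client.pods()
          .inNamespace("openwhisk")           // assumed namespace
          .withLabel("invoker", "invoker0")   // assumed label identifying this invoker's pods
          .watch(new Watcher[Pod] {
            override def eventReceived(action: Action, pod: Pod): Unit = action match {
              case Action.DELETED | Action.ERROR =>
                removeFromBookkeeping(pod)    // drop the dead container reference
                backfillPrewarmIfNeeded()     // start a replacement prewarm if below the configured count
              case _ => ()
            }
            override def onClose(cause: KubernetesClientException): Unit = () // re-establish the watch here
          })
      }
    }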

Secondly, could we potentially fold this check into the HTTP requests
themselves? If we get a "connection refused" on an action that we knew
worked before, we can safely retry. Our HTTP clients should surface a set
of exceptions that are safe for us to retry in the invoker anyway. The
only addition you'd need in this case is an
enhancement on the ContainerProxy's state machine I believe, that allows
for such a retrying use-case. The "connection refused" use-case I mentioned
should be equivalent to the TCP probe you're doing now.
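A minimal Scala sketch of that retry idea, assuming we only ever retry errors raised before the request reached the container; attempt is a hypothetical function issuing the /init or /run request:

    import java.net.ConnectException
    import scala.concurrent.{ExecutionContext, Future}

    def withRetryOnConnectionFailure[T](attempt: () => Future[T], retriesLeft: Int = 1)
                                       (implicit ec: ExecutionContext): Future[T] =
      attempt().recoverWith {
        // A refused connection means the request never reached the container,
        // so re-issuing it is safe.
        case _: ConnectException if retriesLeft > 0 =>
          withRetryOnConnectionFailure(attempt, retriesLeft - 1)
      }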

WDYT?

Cheers,
Markus

Am So., 27. Okt. 2019 um 02:56 Uhr schrieb Tyson Norris
<tn...@adobe.com.invalid>:

> Hi Whiskers –
> We periodically have an unfortunate problem where a docker container (or
> worse, many of them) dies off unexpectedly, outside of HTTP usage from
> invoker. In these cases, prewarm or warm containers may still have
> references at the Invoker, and eventually if an activation arrives that
> matches those container references, the HTTP workflow starts and fails
> immediately since the node is not listening anymore, resulting in failed
> activations. Or, any even worse situation, can be when a container failed
> earlier, and a new container, initialized with a different action is
> initialized on the same host and port (more likely a problem for k8s/mesos
> cluster usage).
>
> To mitigate these issues, I put together a health check process [1] from
> invoker to action containers, where we can test
>
>   *   prewarm containers periodically to verify they are still
> operational, and
>   *   warm containers immediately after resuming them (before HTTP
> requests are sent)
> In case of prewarm failure, we should backfill the prewarms to the
> specified config count.
> In case of warm failure, the activation is rescheduled to ContainerPool,
> which typically would either route to a different prewarm, or start a new
> cold container.
>
> The test ping is in the form of tcp connection only, since we otherwise
> need to update the HTTP protocol implemented by all runtimes. This test is
> good enough for the worst case of “container has gone missing”, but cannot
> test for more subtle problems like “/run endpoint is broken”. There could
> be other checks to increase the quality of test we add in the future, but
> most of this I think requires expanding the HTTP protocol and state managed
> at the container, and I wanted to get something working for basic
> functionality to start with.
>
> Let me know if you have opinions about this, and we can discuss  here or
> in the PR.
> Thanks
> Tyson
>
> [1] https://github.com/apache/openwhisk/pull/4698
>