Posted to dev@sling.apache.org by Carsten Ziegeler <cz...@apache.org> on 2013/10/22 13:06:25 UTC

Long running health checks, jmx registration and concurrent invocation

While starting to use the new health check stuff we came across several
things which I would like to discuss.

According to the API, health checks are expected to execute quickly -
which is fine. However, nothing prevents a slow one. I'm not sure if
we should do this, but e.g. the EventAdmin blacklists long running event
handlers after their first invocation.
This gets even more tricky as health checks are registered as mbeans with
only attributes and no methods. The assumption here is that whenever the
mbean is triggered (an attribute value is fetched), the health check is
executed. This is fine as long as the health check execution is fast and
the client is aware of this. If the client fetches all available
attributes in one call, the hc is executed only once. If the client fetches
the attributes one after the other, the hc is executed on each attribute
fetch. Now combine this with a long running health check.
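
To make the difference between the two fetch styles concrete, here is a
minimal sketch. It is not the actual Sling mbean code: the class name
CountingHealthCheckMBean, the attribute names "status" and "log" and the
execute() stand-in are all made up for illustration. Only the
javax.management types are real.

```java
import java.util.concurrent.atomic.AtomicInteger;
import javax.management.Attribute;
import javax.management.AttributeList;
import javax.management.DynamicMBean;
import javax.management.MBeanAttributeInfo;
import javax.management.MBeanInfo;

// Hypothetical stand-in for a health check mbean: reading attributes runs the check.
class CountingHealthCheckMBean implements DynamicMBean {
    final AtomicInteger executions = new AtomicInteger();

    // Stand-in for HealthCheck.execute(); a real check could take seconds here.
    private String[] execute() {
        executions.incrementAndGet();
        return new String[] {"OK", "all good"};
    }

    @Override
    public Object getAttribute(String name) {
        String[] result = execute();          // one execution per single fetch
        return "status".equals(name) ? result[0] : result[1];
    }

    @Override
    public AttributeList getAttributes(String[] names) {
        String[] result = execute();          // one execution for the whole batch
        AttributeList list = new AttributeList();
        for (String name : names) {
            list.add(new Attribute(name, "status".equals(name) ? result[0] : result[1]));
        }
        return list;
    }

    // Attributes only, no methods: writes and operations are not supported.
    @Override
    public void setAttribute(Attribute attribute) {
        throw new UnsupportedOperationException("read-only");
    }

    @Override
    public AttributeList setAttributes(AttributeList attributes) {
        return new AttributeList();
    }

    @Override
    public Object invoke(String action, Object[] params, String[] signature) {
        throw new UnsupportedOperationException("no operations");
    }

    @Override
    public MBeanInfo getMBeanInfo() {
        MBeanAttributeInfo[] attrs = {
            new MBeanAttributeInfo("status", "java.lang.String", "result status", true, false, false),
            new MBeanAttributeInfo("log", "java.lang.String", "result log", true, false, false)
        };
        return new MBeanInfo(getClass().getName(), "health check sketch", attrs, null, null, null);
    }
}
```

With this sketch, calling getAttribute("status") and then
getAttribute("log") runs the check twice, while a single
getAttributes(new String[] {"status", "log"}) runs it once.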

This brings me to the topic of concurrent invocations. Assuming a health
check execution is fast, this shouldn't be a problem - if it's not,
concurrent invocation might lead to problems. Imagine N users checking the
health of the system at the same time - or monitoring agents regularly
fetching the status. Maybe the execution should rather be synchronized?
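
As a rough idea of what "synchronized" could mean here, the following
sketch wraps a check so that only one caller runs it at a time. The
wrapper class SynchronizedHealthCheck and the use of a plain Supplier as
the check are assumptions for illustration, not existing API.

```java
import java.util.function.Supplier;

// Hypothetical wrapper: serializes all invocations of one health check.
class SynchronizedHealthCheck<T> {
    private final Supplier<T> check;
    private final Object lock = new Object();

    SynchronizedHealthCheck(Supplier<T> check) {
        this.check = check;
    }

    T execute() {
        // Concurrent callers queue up here; each one still runs the check itself.
        synchronized (lock) {
            return check.get();
        }
    }
}
```

Note that this only prevents overlapping executions. With N callers and a
slow check, the last caller still waits for N executions in a row, so it
does not replace keeping checks fast.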

And finally, for long running health checks, whether they are run sync or
async, users would like to see a progress bar while the hc runs.

All of this can be solved easily if we stick to "health check execution
should be fast and not expensive". In that case we might add blacklisting.
Things like a progress bar etc. would then have to be done through whatever
mechanism is used to execute the hc asynchronously.

WDYT?

Carsten
-- 
Carsten Ziegeler
cziegeler@apache.org

Re: Long running health checks, jmx registration and concurrent invocation

Posted by Carsten Ziegeler <cz...@apache.org>.

Yes, I guess this is a good thing to add - it would require such an
executor service (or whatever name we pick), with the slight downside that
a health check should never be invoked directly but only through the
executor. But if we clearly state this, I don't see an issue with that.

Carsten


2013/10/22 Bertrand Delacretaz <bd...@apache.org>

> Hi,
>
> On Tue, Oct 22, 2013 at 1:06 PM, Carsten Ziegeler <cz...@apache.org> wrote:
> > ...According to the API health checks are considered to execute quickly -
> > which is fine. However there is no prevention against it. I'm not sure if
> > we should do this, but e.g. the EventAdmin blacklists long running health
> > checks after their first invocation...
>
> I agree that preventing slow HealthCheck.execute() methods is a good idea.
>
> > ...This gets even more tricky as health checks are registered as mbeans with
> > only attributes and no methods...
>
> Right, enforcing fast execution looks like the right thing to do.
>
> > ...All of this can be solved easily, if we stick to "health check execution
> > should be fast and not expensive". In that case we might add black listing.
> > Things like a progress bar etc. have to be done through whatever mechanism
> > is used to execute the hc asynchronously....
>
> Instead of permanent blacklisting I'd suggest returning a normal
> Result but with a TIMEOUT status.
>
> For example, a health check that causes lots of initializations (maybe
> because it's called right after Sling startup) might be quite slow on
> the first call, and then fast, so TIMEOUT on first call and actual
> results (maybe computed asynchronously) later makes sense, but
> permanent blacklisting would get in the way.
>
> I suggest implementing a timeout on the HealthCheck.execute() method
> (not sure how - HealthCheckExecutor service maybe) which returns a
> Result with a TIMEOUT state, that indicates how long the timeout was,
> and maybe a short term blacklisting of the HealthCheck, during which
> it returns a BLACKLISTED state result.
>
> WDYT?
>
> -Bertrand
>



-- 
Carsten Ziegeler
cziegeler@apache.org

Re: Long running health checks, jmx registration and concurrent invocation

Posted by Bertrand Delacretaz <bd...@apache.org>.

Hi,

On Tue, Oct 22, 2013 at 1:06 PM, Carsten Ziegeler <cz...@apache.org> wrote:
> ...According to the API health checks are considered to execute quickly -
> which is fine. However there is no prevention against it. I'm not sure if
> we should do this, but e.g. the EventAdmin blacklists long running health
> checks after their first invocation...

I agree that preventing slow HealthCheck.execute() methods is a good idea.

> ...This gets even more tricky as health checks are registered as mbeans with
> only attributes and no methods...

Right, enforcing fast execution looks like the right thing to do.

> ...All of this can be solved easily, if we stick to "health check execution
> should be fast and not expensive". In that case we might add black listing.
> Things like a progress bar etc. have to be done through whatever mechanism
> is used to execute the hc asynchronously....

Instead of permanent blacklisting I'd suggest returning a normal
Result but with a TIMEOUT status.

For example, a health check that causes lots of initializations (maybe
because it's called right after Sling startup) might be quite slow on
the first call, and then fast, so TIMEOUT on first call and actual
results (maybe computed asynchronously) later makes sense, but
permanent blacklisting would get in the way.

I suggest implementing a timeout on the HealthCheck.execute() method
(not sure how - HealthCheckExecutor service maybe) which returns a
Result with a TIMEOUT state that indicates how long the timeout was,
and maybe a short term blacklisting of the HealthCheck, during which
it returns a BLACKLISTED state result.
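
A rough sketch of that idea, under stated assumptions: HcStatus, HcResult
and TimeoutExecutorSketch are hypothetical names, not the existing Sling
Result API, and the check is modelled as a plain Callable. Only the
java.util.concurrent calls are real.

```java
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical status values; TIMEOUT and BLACKLISTED are the proposed additions.
enum HcStatus { OK, CRITICAL, TIMEOUT, BLACKLISTED }

// Hypothetical, simplified result type.
final class HcResult {
    final HcStatus status;
    final String message;

    HcResult(HcStatus status, String message) {
        this.status = status;
        this.message = message;
    }
}

class TimeoutExecutorSketch {
    // Daemon threads so a stuck check cannot keep the JVM alive.
    private final ExecutorService pool = Executors.newCachedThreadPool(r -> {
        Thread t = new Thread(r, "hc-sketch");
        t.setDaemon(true);
        return t;
    });
    private final Map<String, Long> blacklistedUntil = new ConcurrentHashMap<>();
    private final long timeoutMs;
    private final long blacklistMs;

    TimeoutExecutorSketch(long timeoutMs, long blacklistMs) {
        this.timeoutMs = timeoutMs;
        this.blacklistMs = blacklistMs;
    }

    HcResult execute(String name, Callable<HcResult> check) {
        Long until = blacklistedUntil.get(name);
        if (until != null && System.currentTimeMillis() < until) {
            // Short term blacklisting: skip the check without running it.
            return new HcResult(HcStatus.BLACKLISTED, name + " is skipped after a timeout");
        }
        Future<HcResult> future = pool.submit(check);
        try {
            HcResult result = future.get(timeoutMs, TimeUnit.MILLISECONDS);
            blacklistedUntil.remove(name);
            return result;
        } catch (TimeoutException e) {
            future.cancel(true);  // simplification: interrupt instead of finishing async
            blacklistedUntil.put(name, System.currentTimeMillis() + blacklistMs);
            return new HcResult(HcStatus.TIMEOUT, "no result within " + timeoutMs + " ms");
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            future.cancel(true);
            return new HcResult(HcStatus.TIMEOUT, "interrupted while waiting for " + name);
        } catch (ExecutionException e) {
            return new HcResult(HcStatus.CRITICAL, "check failed: " + e.getCause());
        }
    }
}
```

The sketch cancels a timed-out check. Letting it finish in the background
and serving its result on a later call would be closer to the "computed
asynchronously" variant, at the cost of keeping track of running checks.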

WDYT?

-Bertrand