Posted to dev@sling.apache.org by Georg Henzler <sl...@cq-eclipse-plugin.net> on 2013/12/05 07:51:23 UTC

Health Check Improvements

Hi all,

A year ago I implemented a health check infrastructure plus a whole set
of checks (in production now). This year I found out that there is an
effort underway on the Sling side [1]. I just had a closer look at the
Sling code and I like some of the concepts, but I believe some other
things could be improved. I'd be happy to contribute some parts from my
side if the Sling community is interested. Here is a rough overview of
the two approaches:

****** "My HC Infrastructure"
* Health Checks are OSGi Components that implement an interface, almost 
exactly in line with org.apache.sling.hc.api.HealthCheck
* There is an emphasis on getting the overall status of the system:
There is a Web Console Plugin and a whiteboard servlet (not dependent on
Sling) to retrieve an aggregated result of all health checks registered
as services
* The result of an individual health check can be RED, AMBER or GREEN - 
the overall result is the worst result found
* The servlet returns the results of all checks in HTML, JSON and JSONP
(containing the overall result plus the result of each check in a
structured, machine-readable format)
* There are custom checks for the project to make sure a few SOAP and
REST services are available - if they fail they return AMBER. AMBER
means someone should pay attention, but the system itself is still
stable.
* All individual checks are executed in parallel by a class
HealthCheckRunner (using Futures/ExecutorService under the hood). The
advantage is that the overall result can always be calculated quickly
(the latency of the SOAP/REST checks in particular required this!). The
HealthCheckRunner makes sure that the number of threads used for this is
limited to the number of registered health checks, and if one check
hangs it handles it correctly using timeout settings in the OSGi console
(it took a while to get rid of all the problems of parallel execution,
but now it's rock-solid, the only downside being some extra
threads/memory required).
* The health check is used by a monitoring system on customer side 
(similar to Nagios)
* The health check servlet has a parameter to return HTTP 500 if the 
overall result is RED, this is used by the load balancer of the 
publisher servers to automatically take out failing instances (this is 
only possible because of the parallel execution)
* I had a Jenkins plugin in place to show an overview page of 10+ CQ
instances using the JSON results (DEV/TEST/INT AUTHOR/PUBLISH etc.)
* No JMX integration
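
A minimal sketch of the parallel execution and worst-result aggregation described in the list above. The class name ParallelRunnerSketch, the Status enum and the method names are hypothetical stand-ins for illustration, not the actual implementation and not the Sling API. It assumes one thread per check, a single shared deadline, and that a check which fails or times out counts as RED.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical sketch: all names are illustrative, not the Sling API.
class ParallelRunnerSketch {

    // Declared from best to worst so that ordinals can be compared.
    enum Status { GREEN, AMBER, RED }

    // Runs all checks in parallel. Assumption: a check that throws or
    // misses the shared deadline is reported as RED.
    static List<Status> runAll(List<Callable<Status>> checks, long timeoutMs) {
        List<Status> results = new ArrayList<>();
        if (checks.isEmpty()) {
            return results;
        }
        // One thread per check, so a slow check never delays the others.
        ExecutorService pool = Executors.newFixedThreadPool(checks.size());
        try {
            List<Future<Status>> futures = new ArrayList<>();
            for (Callable<Status> check : checks) {
                futures.add(pool.submit(check));
            }
            long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMs);
            for (Future<Status> future : futures) {
                long remaining = Math.max(0L, deadline - System.nanoTime());
                try {
                    results.add(future.get(remaining, TimeUnit.NANOSECONDS));
                } catch (TimeoutException | ExecutionException e) {
                    future.cancel(true);
                    results.add(Status.RED);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt(); // keep the interrupt flag
                    future.cancel(true);
                    results.add(Status.RED);
                }
            }
        } finally {
            pool.shutdownNow(); // interrupts checks that are still hanging
        }
        return results;
    }

    // The overall result is the worst individual result found.
    static Status overall(List<Status> results) {
        Status worst = Status.GREEN;
        for (Status status : results) {
            if (status.ordinal() > worst.ordinal()) {
                worst = status;
            }
        }
        return worst;
    }

    public static void main(String[] args) {
        List<Callable<Status>> checks = new ArrayList<>();
        checks.add(() -> Status.GREEN);
        checks.add(() -> Status.AMBER);
        checks.add(() -> { Thread.sleep(5000); return Status.GREEN; }); // simulated hang
        System.out.println(overall(runAll(checks, 300)));
    }
}
```

Because the deadline is shared, the total waiting time stays bounded by the timeout regardless of how many checks hang, which is what would make such a result usable for a load balancer probe.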

****** SLING Health Check (as of today)
* Core defines API and some utility classes. The result contains log 
entries for each check.
* The core itself is not able to run checks (if I got that right)
* Tags are used to be able to run a group of checks (I quite like 
this!)
* The Web Console Plugin gives humans a nice interface to run all
checks for a given tag (or all checks if the tags are omitted).
Execution is sequential and can potentially take a long time (it really
depends on the checks)
* JMX allows getting the status of a certain health check, but it is not
possible to retrieve an overall status via JMX (if I got that right)
* There is no way to retrieve an overall result in JSON (if I got that 
right)
* There is an example for async execution (AsyncHealthCheckSample) -
however, this aspect needs to be implemented again for every check that
needs async execution

As a first step, I would like to propose the following:
* Introduce HealthCheckRunner to hc-core with the following signature:
        // the list is sorted to put failed ones always on top
        List<Result> HealthCheckRunner.runAllForTags(String... tags)
* The HealthCheckRunner would use the existing class HealthCheckFilter 
to retrieve the service references
* The Web Console would be adjusted to use HealthCheckRunner
* I would add getExecutionTimeInMs() to org.apache.sling.hc.api.Result
* Add parameter format=json to /system/console/healthcheck to provide
the result in JSON format (to avoid an extra servlet; I think it is
possible for console URLs to return JSON, but I would have to check)
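
The proposed runner can be sketched as follows. This is only an illustration under stated assumptions: Check and Result are local stand-ins rather than the hc-core types, checks are looked up in a plain list instead of via HealthCheckFilter, calling the method without tags runs every check, and a check is selected if it shares at least one tag with the request. Execution is sequential here to keep the example short.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.concurrent.TimeUnit;
import java.util.function.BooleanSupplier;

// Hypothetical sketch: Check and Result are stand-ins, not the hc-core types.
class HealthCheckRunnerSketch {

    static final class Check {
        final String name;
        final Set<String> tags;
        final BooleanSupplier body; // returns true when the check passes

        Check(String name, BooleanSupplier body, String... tags) {
            this.name = name;
            this.body = body;
            this.tags = new HashSet<>(Arrays.asList(tags));
        }
    }

    static final class Result {
        final String name;
        final boolean ok;
        private final long executionTimeInMs;

        Result(String name, boolean ok, long executionTimeInMs) {
            this.name = name;
            this.ok = ok;
            this.executionTimeInMs = executionTimeInMs;
        }

        long getExecutionTimeInMs() {
            return executionTimeInMs;
        }
    }

    private final List<Check> registered = new ArrayList<>();

    void register(Check check) {
        registered.add(check);
    }

    List<Result> runAllForTags(String... tags) {
        Set<String> wanted = new HashSet<>(Arrays.asList(tags));
        List<Result> results = new ArrayList<>();
        for (Check check : registered) {
            // Assumption: no tags selects everything, otherwise one shared tag is enough.
            if (!wanted.isEmpty() && Collections.disjoint(wanted, check.tags)) {
                continue;
            }
            long start = System.nanoTime();
            boolean ok = check.body.getAsBoolean();
            long elapsedMs = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
            results.add(new Result(check.name, ok, elapsedMs));
        }
        // Stable sort: failed results first, registration order kept within each group.
        results.sort(Comparator.comparing((Result r) -> r.ok));
        return results;
    }

    public static void main(String[] args) {
        HealthCheckRunnerSketch runner = new HealthCheckRunnerSketch();
        runner.register(new Check("repository", () -> true, "core"));
        runner.register(new Check("soap-endpoint", () -> false, "remote"));
        for (Result r : runner.runAllForTags()) {
            System.out.println(r.name + " ok=" + r.ok + " ms=" + r.getExecutionTimeInMs());
        }
    }
}
```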

Let me know what you think - as everything is there already I could
fairly quickly provide a patch for this (but I will only make the effort
to create one if you think it's valuable).

Regards
Georg

[1]
http://www.slideshare.net/bdelacretaz/slinghc-bdelacretazadaptto2013
http://sling.apache.org/documentation/bundles/sling-health-check-tool.html
https://issues.apache.org/jira/browse/SLING/component/12320832

Re: Health Check Improvements

Posted by Georg Henzler <sl...@cq-eclipse-plugin.net>.
Hi Bertrand,

I created https://issues.apache.org/jira/browse/SLING-3278 for the 
prototype health check executor service. You can assign the task to me 
if you like (I couldn't do that myself as I don't seem to have the 
permissions for assigning...)

Georg

On 09.12.2013 16:39, Bertrand Delacretaz wrote:
> Hi Georg,
>
> On Thu, Dec 5, 2013 at 7:51 AM, Georg Henzler
> <sl...@cq-eclipse-plugin.net> wrote:
>> ...I just had a closer look at the Sling code
>> and I like some of the concepts but believe some other things could
>> maybe be improved...
>
> Thanks for your review - I agree that we need better control on the
> execution time and asynchronous execution of our health checks.
>
> We discussed this recently [1] and what's suggested there is fairly
> similar to what you suggest in terms of health checks execution, with
> timeouts and caching of previously computed values.
>
>> ...There is an emphasis on getting the overall status of the system:
>> There is a Web Console Plugin and whiteboard servlet (not being
>> dependent on sling) to retrieve an aggregated result of all
>> health checks registered as services...
>
> You can aggregate Sling health checks with the CompositeHealthCheck
> that's briefly described at [3] and used in the health check samples,
> would that cover your use cases?
>
>> As a first step, I would like to propose the following:
>> * Introduce HealthCheckRunner to hc-core  with the following signature:
>>        List<Result> HealthCheckRunner.runAllForTags(String... tags) // the
>> list is sorted to put failed ones always on top...
>
> I don't think I would sort here, that's a presentation concern - I
> prefer having a stable order in the output of the execution service
> itself.
>
>> * The HealthCheckRunner would use the existing class HealthCheckFilter
>> to retrieve the service references
>
> Sounds good
>
>> * The Web Console would be adjusted to use HealthCheckRunner
>
> Ok
>
>> * I would add getExecutionTimeInMs() to org.apache.sling.hc.api.Result
>
> If we're caching the Results I'd add creation timestamp, an expiration
> time that can be set when creating the Result and the execution
> duration as you suggest.
>
>> ...* Add parameter format=json to /system/console/healthcheck to
>> provide the result in JSON format (to avoid an extra servlet, I think
>> it is possible for console urls to return JSON but I would have to
>> check)...
>
> Maybe we don't need that as we have the SLING-2999 JMX resource
> provider, but in general this makes sense.
>
> If you want to provide a prototype health check executor service that
> would be cool. Note that we have a Sling thread pools service [2]
> that's probably useful for that.
>
> -Bertrand
>
> [1] http://markmail.org/message/ioatdxdogexacu2b
>
> [2] http://sling.apache.org/documentation/bundles/apache-sling-commons-thread-pool.html
>
> [3] http://sling.apache.org/documentation/bundles/sling-health-check-tool.html


Re: Health Check Improvements

Posted by Bertrand Delacretaz <bd...@apache.org>.
Hi Georg,

On Thu, Dec 5, 2013 at 7:51 AM, Georg Henzler
<sl...@cq-eclipse-plugin.net> wrote:
> ...I just had a closer look at the Sling code
> and I like some of the concepts but believe some other things could maybe be
> improved...

Thanks for your review - I agree that we need better control on the
execution time and asynchronous execution of our health checks.

We discussed this recently [1] and what's suggested there is fairly
similar to what you suggest in terms of health checks execution, with
timeouts and caching of previously computed values.

> ...There is an emphasis on getting the overall status of the system: There is a Web Console Plugin
> and whiteboard servlet (not being dependent on sling) to retrieve an aggregated result of all
> health checks registered as services...

You can aggregate Sling health checks with the CompositeHealthCheck
that's briefly described at [3] and used in the health check samples,
would that cover your use cases?

> As a first step, I would like to propose the following:
> * Introduce HealthCheckRunner to hc-core  with the following signature:
>        List<Result> HealthCheckRunner.runAllForTags(String... tags) // the
> list is sorted to put failed ones always on top...

I don't think I would sort here, that's a presentation concern - I
prefer having a stable order in the output of the execution service
itself.
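
To illustrate the separation suggested here, a presentation layer could sort a copy of the results and leave the execution output untouched. This is a hedged sketch only: Result is a hypothetical stand-in for the real result type, and failedFirst is an invented helper name.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch: Result is a stand-in, not org.apache.sling.hc.api.Result.
class PresentationSortSketch {

    static final class Result {
        final String name;
        final boolean ok;

        Result(String name, boolean ok) {
            this.name = name;
            this.ok = ok;
        }
    }

    // Returns a sorted copy for display. The list produced by the
    // execution service keeps its stable order.
    static List<Result> failedFirst(List<Result> executionOrder) {
        List<Result> copy = new ArrayList<>(executionOrder);
        // false sorts before true, and List.sort is stable.
        copy.sort(Comparator.comparing((Result r) -> r.ok));
        return copy;
    }

    public static void main(String[] args) {
        List<Result> fromExecutor = Arrays.asList(
                new Result("repository", true),
                new Result("soap-endpoint", false));
        for (Result r : failedFirst(fromExecutor)) {
            System.out.println(r.name + " ok=" + r.ok);
        }
    }
}
```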

> * The HealthCheckRunner would use the existing class HealthCheckFilter to
> retrieve the service references

Sounds good

> * The Web Console would be adjusted to use HealthCheckRunner

Ok

> * I would add getExecutionTimeInMs() to org.apache.sling.hc.api.Result

If we're caching the Results I'd add creation timestamp, an expiration
time that can be set when creating the Result and the execution
duration as you suggest.
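
A cached result carrying those three values might look roughly like this. It is a sketch under assumptions: CachedResult and CachingCheckSketch are hypothetical names, the check is modelled as a BooleanSupplier, and the current time is passed in explicitly so the expiry logic is easy to follow.

```java
import java.util.concurrent.TimeUnit;
import java.util.function.BooleanSupplier;

// Hypothetical sketch: these types are illustrative, not part of the Sling API.
class CachingCheckSketch {

    static final class CachedResult {
        final boolean ok;
        final long createdAtMs;       // creation timestamp
        final long expiresAtMs;       // expiration time, set on creation
        final long executionTimeInMs; // how long the check took

        CachedResult(boolean ok, long createdAtMs, long expiresAtMs, long executionTimeInMs) {
            this.ok = ok;
            this.createdAtMs = createdAtMs;
            this.expiresAtMs = expiresAtMs;
            this.executionTimeInMs = executionTimeInMs;
        }

        boolean isExpired(long nowMs) {
            return nowMs >= expiresAtMs;
        }
    }

    private final BooleanSupplier check;
    private final long timeToLiveMs;
    private CachedResult cached;

    CachingCheckSketch(BooleanSupplier check, long timeToLiveMs) {
        this.check = check;
        this.timeToLiveMs = timeToLiveMs;
    }

    // Re-runs the check only when there is no result yet or it has expired.
    synchronized CachedResult get(long nowMs) {
        if (cached == null || cached.isExpired(nowMs)) {
            long start = System.nanoTime();
            boolean ok = check.getAsBoolean();
            long elapsedMs = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
            cached = new CachedResult(ok, nowMs, nowMs + timeToLiveMs, elapsedMs);
        }
        return cached;
    }

    public static void main(String[] args) {
        CachingCheckSketch caching = new CachingCheckSketch(() -> true, 1000);
        CachedResult result = caching.get(System.currentTimeMillis());
        System.out.println("ok=" + result.ok + " expiresAt=" + result.expiresAtMs);
    }
}
```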

> ...* Add parameter format=json to /system/console/healthcheck to provide the
> result in JSON format (to avoid an extra servlet, I think it is possible for
> console urls to return JSON but I would have to check)...

Maybe we don't need that as we have the SLING-2999 JMX resource
provider, but in general this makes sense.

If you want to provide a prototype health check executor service that
would be cool. Note that we have a Sling thread pools service [2]
that's probably useful for that.

-Bertrand

[1] http://markmail.org/message/ioatdxdogexacu2b

[2] http://sling.apache.org/documentation/bundles/apache-sling-commons-thread-pool.html

[3] http://sling.apache.org/documentation/bundles/sling-health-check-tool.html

Re: Health Check Improvements

Posted by Felix Meschberger <fm...@adobe.com>.
Hi Georg

This sounds great. I have just a single comment in addition to Bertrand's:


On 05.12.2013 at 07:51, Georg Henzler <sl...@cq-eclipse-plugin.net> wrote:

> * Add parameter format=json to /system/console/healthcheck to provide 
> the result in JSON format (to avoid an extra servlet, I think it is 
> possible for console urls to return JSON but I would have to check)

I suggest just using a request extension to ask for JSON format, as we do in other plugins, so that we would get /system/console/healthcheck.json
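
As a rough illustration of extension-based format selection, a plugin could derive the format from the last path segment. The helper below is a hypothetical sketch that works on a plain path string rather than the servlet API, and falling back to "html" when no extension is present is an assumption.

```java
import java.util.Locale;

// Hypothetical sketch: formatOf is an invented helper, not a Web Console API.
class FormatFromExtensionSketch {

    // Returns the extension of the last path segment in lower case,
    // or "html" (assumed default) when there is none.
    static String formatOf(String requestPath) {
        int slash = requestPath.lastIndexOf('/');
        int dot = requestPath.lastIndexOf('.');
        if (dot > slash && dot < requestPath.length() - 1) {
            return requestPath.substring(dot + 1).toLowerCase(Locale.ROOT);
        }
        return "html";
    }

    public static void main(String[] args) {
        System.out.println(formatOf("/system/console/healthcheck.json"));
        System.out.println(formatOf("/system/console/healthcheck"));
    }
}
```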

Regards
Felix