Posted to dev@sling.apache.org by Georg Henzler <sl...@cq-eclipse-plugin.net> on 2013/12/05 07:51:23 UTC
Health Check Improvements
Hi all,
I implemented a health check infrastructure + a whole set of checks a
year ago (in production now). Now this year I found out that there is
effort being made on the Sling side [1]. I just had a closer look at the
Sling code and I like some of the concepts but believe some other things
could maybe be improved. I'd be happy to contribute some parts from my
side if the Sling community is interested. Here is a rough overview
of the two approaches:
****** "My HC Infrastructure"
* Health Checks are OSGi Components that implement an interface, almost
exactly in line with org.apache.sling.hc.api.HealthCheck
* There is an emphasis on getting the overall status of the system:
There is a Web Console Plugin and whiteboard servlet (not being
dependent on sling) to retrieve an aggregated result of all health
checks registered as services
* The result of an individual health check can be RED, AMBER or GREEN -
the overall result is the worst result found
* The servlet returns the results of all checks as HTML, JSON or JSONP
(containing the overall result plus the result of each check in a
structured, machine-readable format)
* There are custom checks for the project to make sure a few SOAP and
REST services are available - if they fail they return AMBER. AMBER
means someone should pay attention, but the system itself is still
stable.
* All individual checks are executed in parallel by a class
HealthCheckRunner (using Futures/ExecutorService under the hood). The
advantage is that the overall result can always be calculated quickly
(the latency of the SOAP/REST checks in particular required this!). The
HealthCheckRunner makes sure that the number of threads used for this
is limited to the number of registered health checks, and if one check
hangs it handles it correctly using timeout settings from the OSGi
console (it took a while to get rid of all problems of parallel
execution, but now it's rock-solid, the only downside being some extra
threads/memory required).
* The health check is used by a monitoring system on customer side
(similar to Nagios)
* The health check servlet has a parameter to return HTTP 500 if the
overall result is RED, this is used by the load balancer of the
publisher servers to automatically take out failing instances (this is
only possible because of the parallel execution)
* I had a Jenkins plugin in place to show an overview page of 10+ CQ
instances using the JSON results (DEV/TEST/INT AUTHOR/PUBLISH etc.)
* No JMX integration
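
To make the parallel execution described above more concrete, here is a
minimal Java sketch of the idea. It is not the actual HealthCheckRunner
code. It assumes that a check that times out or throws is reported as
RED; the names ParallelCheckSketch, Status, runAll and overall are made
up for this illustration.

```java
import java.util.Collection;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Illustrative sketch only; all names are hypothetical, not Sling API.
class ParallelCheckSketch {

    // Declared from best to worst so that ordinal() can be compared.
    enum Status { GREEN, AMBER, RED }

    // Runs all checks in parallel and waits at most timeoutMs in total.
    // Assumption: a check that times out, throws or returns null is RED.
    static Map<String, Status> runAll(Map<String, Callable<Status>> checks, long timeoutMs) {
        Map<String, Status> results = new LinkedHashMap<>();
        if (checks.isEmpty()) {
            return results;
        }
        // One thread per check, so the pool never exceeds the number of checks.
        ExecutorService pool = Executors.newFixedThreadPool(checks.size());
        try {
            Map<String, Future<Status>> futures = new LinkedHashMap<>();
            for (Map.Entry<String, Callable<Status>> e : checks.entrySet()) {
                futures.put(e.getKey(), pool.submit(e.getValue()));
            }
            long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMs);
            for (Map.Entry<String, Future<Status>> e : futures.entrySet()) {
                long remaining = Math.max(0L, deadline - System.nanoTime());
                try {
                    Status s = e.getValue().get(remaining, TimeUnit.NANOSECONDS);
                    results.put(e.getKey(), s == null ? Status.RED : s);
                } catch (TimeoutException | ExecutionException ex) {
                    e.getValue().cancel(true); // interrupt a hanging check
                    results.put(e.getKey(), Status.RED);
                } catch (InterruptedException ex) {
                    Thread.currentThread().interrupt(); // preserve the interrupt flag
                    results.put(e.getKey(), Status.RED);
                }
            }
        } finally {
            pool.shutdownNow(); // interrupts whatever is still running
        }
        return results;
    }

    // The overall result is the worst individual result.
    static Status overall(Collection<Status> statuses) {
        Status worst = Status.GREEN;
        for (Status s : statuses) {
            if (s.ordinal() > worst.ordinal()) {
                worst = s;
            }
        }
        return worst;
    }
}
```

Because the waiting time is bounded by a single deadline, a caller such
as a load balancer probe gets an answer after roughly timeoutMs at the
latest, even if one of the checks hangs.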
****** SLING Health Check (as of today)
* Core defines API and some utility classes. The result contains log
entries for each check.
* The core itself is not able to run checks (if I got that right)
* Tags are used to be able to run a group of checks (I quite like
this!)
* The Web Console Plugin gives humans a nice interface to run all
checks for a given tag (or all checks if the tags are omitted).
Execution is sequential and can potentially take a long time (it
really depends on the checks)
* JMX allows retrieving the status of a certain health check, but it is
not possible to retrieve an overall status via JMX (if I got that right)
* There is no way to retrieve an overall result in JSON (if I got that
right)
* There is an example for async execution (AsyncHealthCheckSample) -
however, this aspect has to be implemented again for every check that
needs async execution
As a first step, I would like to propose the following:
* Introduce HealthCheckRunner to hc-core with the following signature:
List<Result> HealthCheckRunner.runAllForTags(String... tags) //
the list is sorted to put failed ones always on top
* The HealthCheckRunner would use the existing class HealthCheckFilter
to retrieve the service references
* The Web Console would be adjusted to use HealthCheckRunner
* I would add getExecutionTimeInMs() to org.apache.sling.hc.api.Result
* Add parameter format=json to /system/console/healthcheck to provide
the result in JSON format (to avoid an extra servlet, I think it is
possible for console urls to return JSON but I would have to check)
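
As a rough illustration of the proposed runAllForTags signature and the
execution time on the result, a sketch could look like the following.
Check and TimedResult are hypothetical stand-ins, not the
org.apache.sling.hc.api types, and the tag matching rule (a check runs
if it carries any of the given tags, or always if no tags are given) is
an assumption. For brevity the checks run sequentially here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

// Illustrative sketch only; Check and TimedResult are hypothetical types.
class TaggedRunnerSketch {

    static final class Check {
        final String name;
        final Supplier<Boolean> body; // returns true if the check passed
        final Set<String> tags;

        Check(String name, Supplier<Boolean> body, String... tags) {
            this.name = name;
            this.body = body;
            this.tags = new HashSet<>(Arrays.asList(tags));
        }
    }

    static final class TimedResult {
        final String name;
        final boolean ok;
        private final long executionTimeInMs;

        TimedResult(String name, boolean ok, long executionTimeInMs) {
            this.name = name;
            this.ok = ok;
            this.executionTimeInMs = executionTimeInMs;
        }

        long getExecutionTimeInMs() {
            return executionTimeInMs;
        }
    }

    private final List<Check> registered = new ArrayList<>();

    void register(Check check) {
        registered.add(check);
    }

    List<TimedResult> runAllForTags(String... tags) {
        Set<String> wanted = new HashSet<>(Arrays.asList(tags));
        List<TimedResult> results = new ArrayList<>();
        for (Check c : registered) {
            // Assumption: no tags means "run everything".
            if (!wanted.isEmpty() && Collections.disjoint(wanted, c.tags)) {
                continue;
            }
            long start = System.nanoTime();
            boolean ok;
            try {
                ok = c.body.get();
            } catch (RuntimeException e) {
                ok = false; // a throwing check counts as failed
            }
            long ms = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
            results.add(new TimedResult(c.name, ok, ms));
        }
        // Failed checks first; List.sort is stable, so registration
        // order is kept within the failed and the passed group.
        results.sort((a, b) -> Boolean.compare(a.ok, b.ok));
        return results;
    }
}
```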
Let me know what you think - as everything is there already I could
fairly quickly provide a patch for this (but I will only make the
effort to create one if you think it's valuable).
Regards
Georg
[1]
http://www.slideshare.net/bdelacretaz/slinghc-bdelacretazadaptto2013
http://sling.apache.org/documentation/bundles/sling-health-check-tool.html
https://issues.apache.org/jira/browse/SLING/component/12320832
Re: Health Check Improvements
Posted by Georg Henzler <sl...@cq-eclipse-plugin.net>.
Hi Bertrand,
I created https://issues.apache.org/jira/browse/SLING-3278 for the
prototype health check executor service. You can assign the task to me
if you like (I couldn't do that myself as I don't seem to have the
permissions for assigning...)
Georg
On 09.12.2013 16:39, Bertrand Delacretaz wrote:
> Hi Georg,
>
> On Thu, Dec 5, 2013 at 7:51 AM, Georg Henzler
> <sl...@cq-eclipse-plugin.net> wrote:
>> ...I just had a closer look at the Sling code
>> and I like some of the concepts but believe some other things could
>> maybe be
>> improved...
>
> Thanks for your review - I agree that we need better control on the
> execution time and asynchronous execution of our health checks.
>
> We discussed this recently [1] and what's suggested there is fairly
> similar to what you suggest in terms of health checks execution, with
> timeouts and caching of previously computed values.
>
>> ...There is an emphasis on getting the overall status of the system:
>> There is a Web Console Plugin
>> and whiteboard servlet (not being dependent on sling) to retrieve an
>> aggregated result of all
>> health checks registered as services...
>
> You can aggregate Sling health checks with the CompositeHealthCheck
> that's briefly described at [3] and used in the health check samples,
> would that cover your use cases?
>
>> As a first step, I would like to propose the following:
>> * Introduce HealthCheckRunner to hc-core with the following
>> signature:
>> List<Result> HealthCheckRunner.runAllForTags(String... tags)
>> // the
>> list is sorted to put failed ones always on top...
>
> I don't think I would sort here, that's a presentation concern - I
> prefer having a stable order in the output of the execution service
> itself.
>
>> * The HealthCheckRunner would use the existing class
>> HealthCheckFilter to
>> retrieve the service references
>
> Sounds good
>
>> * The Web Console would be adjusted to use HealthCheckRunner
>
> Ok
>
>> * I would add getExecutionTimeInMs() to
>> org.apache.sling.hc.api.Result
>
> If we're caching the Results I'd add creation timestamp, an
> expiration
> time that can be set when creating the Result and the execution
> duration as you suggest.
>
>> ...* Add parameter format=json to /system/console/healthcheck to
>> provide the
>> result in JSON format (to avoid an extra servlet, I think it is
>> possible for
>> console urls to return JSON but I would have to check)...
>
> Maybe we don't need that as we have the SLING-2999 JMX resource
> provider, but in general this makes sense.
>
> If you want to provide a prototype health check executor service that
> would be cool. Note that we have a Sling thread pools service [2]
> that's probably useful for that.
>
> -Bertrand
>
> [1] http://markmail.org/message/ioatdxdogexacu2b
>
> [2]
>
> http://sling.apache.org/documentation/bundles/apache-sling-commons-thread-pool.html
>
> [3]
>
> http://sling.apache.org/documentation/bundles/sling-health-check-tool.html
Re: Health Check Improvements
Posted by Bertrand Delacretaz <bd...@apache.org>.
Hi Georg,
On Thu, Dec 5, 2013 at 7:51 AM, Georg Henzler
<sl...@cq-eclipse-plugin.net> wrote:
> ...I just had a closer look at the Sling code
> and I like some of the concepts but believe some other things could maybe be
> improved...
Thanks for your review - I agree that we need better control on the
execution time and asynchronous execution of our health checks.
We discussed this recently [1] and what's suggested there is fairly
similar to what you suggest in terms of health checks execution, with
timeouts and caching of previously computed values.
> ...There is an emphasis on getting the overall status of the system: There is a Web Console Plugin
> and whiteboard servlet (not being dependent on sling) to retrieve an aggregated result of all
> health checks registered as services...
You can aggregate Sling health checks with the CompositeHealthCheck
that's briefly described at [3] and used in the health check samples,
would that cover your use cases?
> As a first step, I would like to propose the following:
> * Introduce HealthCheckRunner to hc-core with the following signature:
> List<Result> HealthCheckRunner.runAllForTags(String... tags) // the
> list is sorted to put failed ones always on top...
I don't think I would sort here, that's a presentation concern - I
prefer having a stable order in the output of the execution service
itself.
> * The HealthCheckRunner would use the existing class HealthCheckFilter to
> retrieve the service references
Sounds good
> * The Web Console would be adjusted to use HealthCheckRunner
Ok
> * I would add getExecutionTimeInMs() to org.apache.sling.hc.api.Result
If we're caching the Results I'd add creation timestamp, an expiration
time that can be set when creating the Result and the execution
duration as you suggest.
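
A minimal sketch of what such a cached result could carry, assuming a
simple "re-run when expired" policy. All names here (CachedResultSketch,
TimedResult, validForMs) are hypothetical and not part of the existing
Result API; the clock is injected only to make the behaviour easy to
test.

```java
import java.util.function.LongSupplier;
import java.util.function.Supplier;

// Illustrative sketch only; not the org.apache.sling.hc.api.Result class.
class CachedResultSketch {

    static final class TimedResult {
        final boolean ok;
        final long createdAtMs;       // creation timestamp
        final long validForMs;        // expiration time, set at creation
        final long executionTimeInMs; // how long the check took

        TimedResult(boolean ok, long createdAtMs, long validForMs, long executionTimeInMs) {
            this.ok = ok;
            this.createdAtMs = createdAtMs;
            this.validForMs = validForMs;
            this.executionTimeInMs = executionTimeInMs;
        }

        boolean isExpired(long nowMs) {
            return nowMs - createdAtMs >= validForMs;
        }
    }

    private final Supplier<Boolean> check;
    private final long validForMs;
    private final LongSupplier clock; // e.g. System::currentTimeMillis
    private TimedResult cached;

    CachedResultSketch(Supplier<Boolean> check, long validForMs, LongSupplier clock) {
        this.check = check;
        this.validForMs = validForMs;
        this.clock = clock;
    }

    // Returns the cached result, re-running the check only after expiry.
    synchronized TimedResult get() {
        long now = clock.getAsLong();
        if (cached == null || cached.isExpired(now)) {
            long start = System.nanoTime();
            boolean ok = check.get();
            long durationMs = (System.nanoTime() - start) / 1_000_000L;
            cached = new TimedResult(ok, now, validForMs, durationMs);
        }
        return cached;
    }
}
```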
> ...* Add parameter format=json to /system/console/healthcheck to provide the
> result in JSON format (to avoid an extra servlet, I think it is possible for
> console urls to return JSON but I would have to check)...
Maybe we don't need that as we have the SLING-2999 JMX resource
provider, but in general this makes sense.
If you want to provide a prototype health check executor service that
would be cool. Note that we have a Sling thread pools service [2]
that's probably useful for that.
-Bertrand
[1] http://markmail.org/message/ioatdxdogexacu2b
[2] http://sling.apache.org/documentation/bundles/apache-sling-commons-thread-pool.html
[3] http://sling.apache.org/documentation/bundles/sling-health-check-tool.html
Re: Health Check Improvements
Posted by Felix Meschberger <fm...@adobe.com>.
Hi Georg
This sounds great. I have just a single comment in addition to Bertrand's:
On 05.12.2013 at 07:51, Georg Henzler <sl...@cq-eclipse-plugin.net> wrote:
> * Add parameter format=json to /system/console/healthcheck to provide
> the result in JSON format (to avoid an extra servlet, I think it is
> possible for console urls to return JSON but I would have to check)
I suggest just using a request extension to ask for the JSON format, as we do in other plugins, so that we would get /system/console/healthcheck.json
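
Purely as an illustration of the extension approach, the output format
could be derived from the request path along these lines. This is plain
string handling, not Web Console API; the class and method names are
made up, and falling back to "html" is an assumption.

```java
import java.util.Locale;

// Illustrative sketch only; not part of the Web Console API.
class OutputFormatSketch {

    // Returns the lower-cased extension of the last path segment,
    // or "html" (assumed default) if there is none.
    static String formatOf(String requestPath) {
        String lastSegment = requestPath.substring(requestPath.lastIndexOf('/') + 1);
        int dot = lastSegment.lastIndexOf('.');
        if (dot < 0 || dot == lastSegment.length() - 1) {
            return "html";
        }
        return lastSegment.substring(dot + 1).toLowerCase(Locale.ROOT);
    }
}
```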
Regards
Felix