Posted to dev@sling.apache.org by "Chetan Mehrotra (JIRA)" <ji...@apache.org> on 2016/08/16 10:22:22 UTC
[jira] [Commented] (SLING-5965) Metrics and a Health-Check for Scheduler to detect long-running Quartz-Jobs

    [ https://issues.apache.org/jira/browse/SLING-5965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15422553#comment-15422553 ] 

Chetan Mehrotra commented on SLING-5965:
----------------------------------------

Looks useful! A couple of points:
{noformat}
+        final Counter runningJobsCounter = metricsService == null ? null : metricsService.counter(QuartzScheduler.METRICS_NAME_RUNNING_QUARTZJOBS);
+        final Timer jobDurationTimer = metricsService == null ? null : metricsService.timer(QuartzScheduler.METRICS_NAME_QUARTZJOBS_DURATION);
{noformat}
Instead of all those null checks, you can just fall back to {{MetricsService#NOOP}}. This would make the code cleaner.
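
A minimal sketch of that fallback, for illustration only. The {{Counter}} and {{MetricsService}} interfaces below are simplified stand-ins declared locally (the real ones live in Sling Commons Metrics and have more methods), and {{JobRunner}} and the metric name are hypothetical:

```java
// Simplified stand-ins for illustration; not the actual Sling Commons Metrics types.
interface Counter {
    void increment();
    void decrement();
    long getCount();
}

interface MetricsService {
    Counter counter(String name);

    // Null-object instance: hands out counters that ignore every update.
    MetricsService NOOP = name -> new Counter() {
        @Override public void increment() { }
        @Override public void decrement() { }
        @Override public long getCount() { return 0L; }
    };
}

// Hypothetical caller: resolves the service once, then never checks for null again.
final class JobRunner {
    private final Counter runningJobs;

    JobRunner(MetricsService metricsService) {
        // One fallback here replaces a null check at every call site.
        MetricsService metrics = metricsService != null ? metricsService : MetricsService.NOOP;
        this.runningJobs = metrics.counter("running-quartz-jobs"); // hypothetical metric name
    }

    void run(Runnable job) {
        runningJobs.increment();
        try {
            job.run();
        } finally {
            runningJobs.decrement();
        }
    }
}
```

With this shape, the counter and timer fields are never null, so the code around job execution does not need to branch on whether metrics are available.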

* For collecting job run times, it would be better to use a [JobListener|http://www.quartz-scheduler.org/documentation/quartz-2.1.x/cookbook/JobListeners.html], where you can get the execution time of any fired job via {{JobExecutionContext#getJobRunTime}}
* We can look into exposing [QuartzSchedulerMBean|http://www.quartz-scheduler.org/api/2.1.7/org/quartz/core/jmx/QuartzSchedulerMBean.html]. Some methods, such as those for adding jobs, would probably need to be disabled (though leaving them enabled might also be fine)
* A direct dependency on MetricRegistry should be avoided. If gauge support is required, we can add an abstraction for it in Commons Metrics
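
To illustrate the listener idea from the first point, here is a rough sketch that does not depend on Quartz. {{FiredJobContext}} is a hypothetical stand-in for the single accessor of {{JobExecutionContext}} used here, and {{RunTimeRecorder}} is a hypothetical class whose method is shaped like the {{jobWasExecuted}} callback of a Quartz {{JobListener}}. A real listener would implement the Quartz interface and feed the value into a timer from Commons Metrics instead:

```java
import java.util.LongSummaryStatistics;

// Hypothetical stand-in for the one accessor of Quartz's JobExecutionContext used here.
interface FiredJobContext {
    long getJobRunTime(); // run time of the finished job, in milliseconds
}

// Hypothetical recorder; aggregates run times reported after each job completes.
final class RunTimeRecorder {
    private final LongSummaryStatistics stats = new LongSummaryStatistics();

    // Called once per completed job; the scheduler supplies the measured run time,
    // so the job wrapper itself does not need to take timestamps.
    synchronized void jobWasExecuted(FiredJobContext context) {
        stats.accept(context.getJobRunTime());
    }

    synchronized long count() {
        return stats.getCount();
    }

    synchronized long maxMillis() {
        // getMax() returns Long.MIN_VALUE when nothing was recorded; report 0 instead.
        return stats.getCount() == 0 ? 0L : stats.getMax();
    }

    synchronized double averageMillis() {
        return stats.getAverage(); // 0.0 when nothing was recorded
    }
}
```

The point of the listener approach is that the measurement happens in one place for every fired job, rather than inside each job wrapper.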

> Metrics and a Health-Check for Scheduler to detect long-running Quartz-Jobs
> ---------------------------------------------------------------------------
>
>                 Key: SLING-5965
>                 URL: https://issues.apache.org/jira/browse/SLING-5965
>             Project: Sling
>          Issue Type: New Feature
>          Components: Commons
>    Affects Versions: Commons Scheduler 2.5.0
>            Reporter: Stefan Egli
>            Assignee: Stefan Egli
>             Fix For: Commons Scheduler 2.5.2
>
>         Attachments: SLING-5965.patch
>
>
> Sling Scheduler jobs (aka Quartz-Jobs) should typically be fast running jobs. They are served from a thread-pool and should occupy that thread only for a short amount of time.
> If there are 'misbehaving' quartz-jobs that run for a very long time, they start to occupy threads from that thread-pool, and thus affect the performance of other scheduled/quartz-jobs.
> We should have metrics (using [sling.commons.metrics|https://sling.apache.org/documentation/bundles/metrics.html]) that provide information about the internals of the Sling Scheduler, such as the average and maximum duration of scheduled jobs, how many jobs are currently running, and how long the oldest job has been running.
> Based on this, a Health-Check can monitor the 'oldest job running' metric and flag {{critical}} when, e.g., the oldest job has been running for more than {{60'000ms}} (the default, configurable).
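
The threshold rule in the description above could look roughly like the following sketch. {{HealthStatus}} and {{OldestJobCheck}} are hypothetical names, not part of the attached patch or of the Sling Health Check API, and the start timestamps are assumed to be tracked elsewhere by the scheduler:

```java
import java.util.Collection;

// Hypothetical result type for illustration.
enum HealthStatus { OK, CRITICAL }

// Hypothetical check: flags CRITICAL when the oldest running job exceeds a configured age.
final class OldestJobCheck {
    private final long maxAgeMillis;

    OldestJobCheck(long maxAgeMillis) {
        this.maxAgeMillis = maxAgeMillis;
    }

    // startTimes: start timestamps (ms) of the currently running jobs; now: current time (ms).
    HealthStatus evaluate(Collection<Long> startTimes, long now) {
        long oldestAge = 0L;
        for (long start : startTimes) {
            oldestAge = Math.max(oldestAge, now - start);
        }
        return oldestAge > maxAgeMillis ? HealthStatus.CRITICAL : HealthStatus.OK;
    }
}
```

For example, {{new OldestJobCheck(60_000L)}} reports {{OK}} while every running job is younger than one minute and {{CRITICAL}} as soon as one is older.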



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)