Posted to oak-issues@jackrabbit.apache.org by "Stefan Eissing (JIRA)" <ji...@apache.org> on 2017/01/10 13:23:58 UTC

[jira] [Commented] (OAK-5433) System Pacing Service

    [ https://issues.apache.org/jira/browse/OAK-5433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15814957#comment-15814957 ] 

Stefan Eissing commented on OAK-5433:
-------------------------------------

Some more explanation of the patch:
# It also experimented with different queue compaction algorithms. Some perform better than others, but avoiding compaction altogether works best. These parts can be ignored.
# There are levels assigned to threads that can be paced. These were experiments to pace servlet threads first and workflow threads later; ultimately, treating both the same worked best.
# The code assumes that servlet, workflow and other threads come from separate pools, which may or may not be true in a given application. For the tests, it held true. The decision could also be based on other criteria (a sketch follows below).
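
To illustrate point 3, here is a minimal sketch of how paceable threads could be detected by their pool; the enum, class name and pool name prefixes are assumptions for illustration, not code from the attached patch:

{code:java}
// Hypothetical sketch: decide whether the current thread may be paced,
// assuming servlet and workflow threads come from distinctly named pools.
enum PacingLevel { SERVLET, WORKFLOW, OTHER }

final class ThreadClassifier {

    // Illustrative prefixes; real pool/thread names depend on the application.
    private static final String SERVLET_POOL_PREFIX  = "qtp";        // e.g. Jetty worker threads
    private static final String WORKFLOW_POOL_PREFIX = "JobHandler"; // e.g. workflow job threads

    static PacingLevel classify(Thread t) {
        String name = t.getName();
        if (name.startsWith(SERVLET_POOL_PREFIX)) {
            return PacingLevel.SERVLET;
        }
        if (name.startsWith(WORKFLOW_POOL_PREFIX)) {
            return PacingLevel.WORKFLOW;
        }
        return PacingLevel.OTHER;
    }
}
{code}

As noted above, treating SERVLET and WORKFLOW the same ended up working best, so in practice the distinction only matters for excluding OTHER threads from pacing.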

> System Pacing Service
> ---------------------
>
>                 Key: OAK-5433
>                 URL: https://issues.apache.org/jira/browse/OAK-5433
>             Project: Jackrabbit Oak
>          Issue Type: New Feature
>          Components: core
>            Reporter: Stefan Eissing
>         Attachments: obs-pacing.diff
>
>
> h3. tl;dr
> By adding Pacing, suited to the application {{oak}} is running in, a system will dynamically adapt the load to its own capabilities. In tests, this effectively keeps the system stable and responsive under stress.
> h3. The Situation
> During experimental lab tests on large clusters, it became clear that a web system based on oak is challenged by fluctuating load in relation to its own capabilities.
> When the load increases "too much" it shows the following symptoms:
> * event observation queues grow
> * maintenance tasks (on master) take too long
> * async tasks, triggered by requests, (e.g. workflows) accumulate
> and eventually
> * login sessions complain about freshness
> * revision diffs are old and no longer in caches
> and sometimes
> * database lease times out and oak-core shuts down
> This problem can arise when outside requests increase, when local maintenance tasks occupy resources, or when available CPU diminishes due to other processes, page faults, and so on.
> Unfortunately, whenever the system becomes overburdened, the secondary effects make the system even slower and, thus, more overburdened. This can end in a vicious circle, making the system totally unresponsive. Eventual recovery is an option, not a guarantee.
> h3. Pacing
> By _Pacing_ I mean a system behaviour that tries to balance load in relation to capabilities. If the latter drops, the load must be reduced until the system recovers. This is related to what the {{CommitRateLimiter}} wanted to achieve
> by monitoring observation queues.
> The design of the {{CommitRateLimiter}} could be very efficient, if it only knew _which_ commits to delay. But it does not know the application that oak runs in. I propose replacing the limiter with a {{PacingService}} that can be provided by the application using oak. The service will get data about the current commit, the queue length and the limits. Whatever else it does remains opaque. It may raise a proper exception to indicate that the commit shall fail. But mostly, it is expected to delay those commits that would negatively affect system stability.
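> A sketch of how such an SPI could look; the issue names the {{PacingService}} but not its exact signature, so the method name and parameters below are assumptions:
> {code:java}
> import org.apache.jackrabbit.oak.api.CommitFailedException;
> import org.apache.jackrabbit.oak.spi.commit.CommitInfo;
>
> // Hypothetical shape of the proposed SPI; only the service name comes from
> // this issue, the rest is an illustrative assumption.
> public interface PacingService {
>
>     /**
>      * Called by oak-core before a commit is applied. An implementation may
>      * return immediately, sleep to delay the commit, or throw to reject it.
>      *
>      * @param info        information about the current commit
>      * @param queueLength current observation queue length
>      * @param queueLimit  configured maximum queue length
>      */
>     void beforeCommit(CommitInfo info, int queueLength, int queueLimit)
>             throws CommitFailedException;
> }
> {code}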
> h3. An Example
> In a proof of concept, an AEM system was blasted with endless uploads on multiple connections in order to eventually overwhelm its queues. Pacing was then patched into oak-core that delayed commits from servlet requests and from certain workflows for some milliseconds until the queue length shrank again. The pacing had a maximum wait time after which the commit would fail.
> The pacer was configured to trigger at 75% of maximum queue length and the system was blasted with uploads again. In the tests:
> # the queue length stayed under 80% of the maximum
> # no upload reached the maximum wait time; all succeeded
> The system adapted the external load to its capabilities successfully. 
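> A minimal sketch of the pacing loop described above; the 75% trigger and the existence of a maximum wait come from the experiment, while class and method names, the step size and the concrete cap are assumptions:
> {code:java}
> import java.util.function.DoubleSupplier;
> import org.apache.jackrabbit.oak.api.CommitFailedException;
>
> // Illustrative pacing loop, not code from the attached patch.
> final class QueuePacer {
>
>     private static final double TRIGGER_RATIO   = 0.75;   // start pacing at 75% queue fill
>     private static final long   STEP_MILLIS     = 10;     // delay in small increments
>     private static final long   MAX_WAIT_MILLIS = 60_000; // assumed cap; then fail the commit
>
>     private final DoubleSupplier queueFillRatio; // current length / limit, 0.0 .. 1.0
>
>     QueuePacer(DoubleSupplier queueFillRatio) {
>         this.queueFillRatio = queueFillRatio;
>     }
>
>     void paceCommit() throws CommitFailedException, InterruptedException {
>         long waited = 0;
>         while (queueFillRatio.getAsDouble() > TRIGGER_RATIO) {
>             if (waited >= MAX_WAIT_MILLIS) {
>                 throw new CommitFailedException(CommitFailedException.OAK, 1,
>                         "Commit rejected: observation queue congested for too long");
>             }
>             Thread.sleep(STEP_MILLIS);
>             waited += STEP_MILLIS;
>         }
>     }
> }
> {code}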
>   



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)