Posted to oak-issues@jackrabbit.apache.org by "Valentin Olteanu (JIRA)" <ji...@apache.org> on 2016/11/23 14:54:58 UTC

[jira] [Commented] (OAK-5058) Improve GC estimation strategy based on both absolute size and relative percentage

    [ https://issues.apache.org/jira/browse/OAK-5058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15690305#comment-15690305 ] 

Valentin Olteanu commented on OAK-5058:
---------------------------------------

I think having two settings for the same thing is overkill: it adds unnecessary confusion and makes it harder to tell when compaction will run.

The point of the estimation is to save CPU cycles by using a *simple heuristic* to decide whether compaction is worth running. The check should be easy to understand, and the default setting should work in most cases. It cannot meet every need, which is why the threshold stays configurable by the user.

To summarize the three options, here's a list of advantages and disadvantages that I see:

1. Only absolute delta threshold
* PRO: ensures compaction will not run when the growth is very small, regardless of the repo size: e.g. an increase of 1GB means you can save at most 1GB, which is easy to relate to actual disk usage
* PRO: easy to understand and foresee 
* CON: when the delta is set too high, the repo might grow a lot before compaction runs, so compaction takes longer

2. Only relative delta threshold
* PRO: ensures compaction will also run on smaller repos which don't grow too much
* PRO: easy to understand and foresee 
* CON: might trigger a compaction that saves only a few MB on small repos - in that case compaction should be fast anyway, so it's not that bad

3. Both relative and absolute delta threshold
* PRO: (potentially) finer control of when to trigger compaction
* CON: confusing and hard to foresee
* CON: hard to keep the two values consistent with each other, and needs exhaustive documentation
* CON: can lead to weird cases, e.g. when one wants to rely on only one of the thresholds

Personally I would go for the relative check, but more input on the PROs / CONs would be welcome.
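
To make the comparison concrete, the three options could be sketched roughly as follows. This is only an illustration under my own assumptions: the class and method names are made up, it is not the code in {{SizeDeltaGcEstimation}}, and treating a missing baseline as "always compact" is my choice for the sketch.

```java
/**
 * Hypothetical sketch of the three estimation options discussed above.
 * Not Oak code: all names here are invented for illustration.
 */
final class GcEstimationSketch {

    private GcEstimationSketch() {}

    /** Option 1: compact only if the repo grew by at least deltaBytes. */
    static boolean absolute(long previousSize, long currentSize, long deltaBytes) {
        return currentSize - previousSize >= deltaBytes;
    }

    /** Option 2: compact only if the repo grew by at least deltaPercent of its previous size. */
    static boolean relative(long previousSize, long currentSize, int deltaPercent) {
        if (previousSize <= 0) {
            // Assumption: without a baseline there is nothing to compare against, so compact.
            return true;
        }
        double growthPercent = 100.0 * (currentSize - previousSize) / previousSize;
        return growthPercent >= deltaPercent;
    }

    /** Option 3: compact if at least one of the two criteria is met. */
    static boolean either(long previousSize, long currentSize, long deltaBytes, int deltaPercent) {
        return absolute(previousSize, currentSize, deltaBytes)
                || relative(previousSize, currentSize, deltaPercent);
    }
}
```

With thresholds of 1GB and 10%, a 100GB repo that grew by 1GB passes the absolute check but not the relative one, so under option 3 it is the absolute value alone that triggers compaction. That is the kind of interaction between the two values that I find hard to foresee.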

As a side note, I think we should remove {{compaction.disableEstimation}} and keep it simple, as is the case for {{compaction.memoryThreshold}}:
{code}
Value represents a percentage so an input between 0 and 100 is expected. Setting this to 0 will disable the check. 
{code}
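
A single setting following that convention could be read along these lines. Again a sketch with made-up names, not an existing Oak API; rejecting out-of-range values with an exception is my assumption, not documented behaviour.

```java
import java.util.OptionalInt;

/**
 * Hypothetical sketch: one percentage setting where 0 means "check disabled",
 * so no separate boolean flag is needed. Not Oak code.
 */
final class PercentSettingSketch {

    private PercentSettingSketch() {}

    /**
     * Parses a percentage setting.
     * Returns an empty OptionalInt when the value is 0 (check disabled),
     * otherwise the threshold in percent.
     */
    static OptionalInt parse(String raw) {
        int value = Integer.parseInt(raw.trim());
        if (value < 0 || value > 100) {
            // Assumption: out-of-range input is rejected rather than clamped.
            throw new IllegalArgumentException(
                    "Expected a percentage between 0 and 100, got " + value);
        }
        return value == 0 ? OptionalInt.empty() : OptionalInt.of(value);
    }
}
```

The caller then only has to test whether a threshold is present, instead of combining a flag and a value.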

> Improve GC estimation strategy based on both absolute size and relative percentage
> ----------------------------------------------------------------------------------
>
>                 Key: OAK-5058
>                 URL: https://issues.apache.org/jira/browse/OAK-5058
>             Project: Jackrabbit Oak
>          Issue Type: Improvement
>          Components: segment-tar
>    Affects Versions: 1.5.12
>            Reporter: Andrei Dulceanu
>            Assignee: Andrei Dulceanu
>            Priority: Minor
>             Fix For: 1.6, 1.5.15
>
>         Attachments: OAK-5058-01.patch
>
>
> A better way of deciding whether GC should run or not might be by looking at the numbers computed in {{SizeDeltaGcEstimation}} from both an absolute size and relative percentage point of view. For example it would make sense to run compaction only if at least one criterion is met: "run if there is > 50% increase or more than 10GB".
> Since the absolute threshold is already implemented (see {{SegmentGCOptions.SIZE_DELTA_ESTIMATION_DEFAULT}}), it would be nice to add also something like {{SegmentGCOptions.SIZE_PERCENTAGE_ESTIMATION_DEFAULT}} and use it in making the decision.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)