Posted to oak-issues@jackrabbit.apache.org by "Michael Dürig (JIRA)" <ji...@apache.org> on 2017/06/29 12:51:00 UTC

[jira] [Comment Edited] (OAK-3349) Partial compaction

    [ https://issues.apache.org/jira/browse/OAK-3349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16068288#comment-16068288 ] 

Michael Dürig edited comment on OAK-3349 at 6/29/17 12:50 PM:
--------------------------------------------------------------

h6. Implementation note on tail compaction 

In contrast to the existing compaction approach (full compaction), tail compaction rebases all changes since the last compaction on top of the result of that last compaction. Cleanup subsequently removes the uncompacted changes. Each tail compaction cycle creates a new generation, incrementing the generation number. Cleanup removes all non-compacted segments whose generation is no bigger than the current generation minus the number of retained generations (2 by default).
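
The cleanup rule above can be sketched as a predicate. This is only an illustration: {{TailCleanupRule}} and {{isReclaimable}} are hypothetical names, not Oak API, and the sketch assumes the generation and the "written by the compactor" flag are already known for each segment. It covers only the rule stated here and says nothing about when compacted segments themselves become reclaimable.

```java
/** Hypothetical sketch of the tail compaction cleanup rule; not Oak's actual API. */
final class TailCleanupRule {

    /** Default number of retained generations, as stated in the note. */
    static final int DEFAULT_RETAINED_GENERATIONS = 2;

    private TailCleanupRule() {}

    /**
     * Returns true if the segment is uncompacted and its generation is no bigger
     * than currentGeneration - retainedGenerations.
     */
    static boolean isReclaimable(int segmentGeneration, boolean writtenByCompactor,
                                 int currentGeneration, int retainedGenerations) {
        if (writtenByCompactor) {
            // Assumption: this rule never removes segments written by the compactor.
            return false;
        }
        return segmentGeneration <= currentGeneration - retainedGenerations;
    }
}
```

For example, with a current generation of 5 and the default of 2 retained generations, an uncompacted segment of generation 3 would be removed, while one of generation 4 would be kept.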

To make this work we need to be able to determine the age of a segment (in number of generations) and whether a segment has been written by the compactor or by a regular writer (and is thus uncompacted). The [POC|https://github.com/mduerig/jackrabbit-oak/commits/OAK-3349-POC] implemented this by assigning even generation numbers to regular segments and odd ones to segments written by tail compaction, while at the same time completely removing support for full compaction.
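
The even/odd convention of the POC amounts to a parity check on the generation number. The following is a minimal sketch of that idea only; {{PocGeneration}} is a hypothetical name and the code is not taken from the POC branch.

```java
/** Hypothetical sketch of the POC's even/odd generation convention. */
final class PocGeneration {

    private PocGeneration() {}

    /** Odd generation numbers mark segments written by tail compaction. */
    static boolean isWrittenByCompactor(int generation) {
        return (generation & 1) == 1;
    }

    /** Even generation numbers mark segments written by a regular writer. */
    static boolean isRegular(int generation) {
        return !isWrittenByCompactor(generation);
    }
}
```

Because the parity of the single generation field is used up by this encoding, the field carries no separate information for full compaction.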

To combine tail compaction with full compaction I suggest introducing a young generation field in the segment header, which is used by tail compaction as described. The existing generation field will thus continue to be used for full compaction without changing its semantics.
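
A rough sketch of the two fields side by side, assuming each compaction type advances only its own field. {{SegmentGenerations}}, {{nextFull}} and {{nextTail}} are hypothetical names for illustration, and whether the young generation should be reset by a full compaction is not specified here; the sketch leaves it untouched.

```java
/** Hypothetical value class pairing the existing and the proposed header field. */
final class SegmentGenerations {

    /** Existing generation field, advanced by full compaction only. */
    final int generation;

    /** Proposed young generation field, advanced by tail compaction only. */
    final int youngGeneration;

    SegmentGenerations(int generation, int youngGeneration) {
        this.generation = generation;
        this.youngGeneration = youngGeneration;
    }

    /** Generations after a full compaction cycle (assumption: young generation unchanged). */
    SegmentGenerations nextFull() {
        return new SegmentGenerations(generation + 1, youngGeneration);
    }

    /** Generations after a tail compaction cycle; the existing field keeps its value. */
    SegmentGenerations nextTail() {
        return new SegmentGenerations(generation, youngGeneration + 1);
    }
}
```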

The proposed approach has the advantage that tail and full compaction are completely orthogonal. You can run either one or both without one affecting or influencing the other. 
Both compaction and cleanup methods rely solely on the information in the segment headers. A predicate for determining which segments to retain can be inferred from the segment containing the head revision. There is no need to rely on auxiliary information, with the small exception of tail compaction using the {{gc.log}} file to determine the base revision to compact onto. This is not problematic with respect to resilience, though, as we can always fall back to full compaction should the base revision be invalid. (A base revision can be invalid in two ways: either it is not found or it was not written by the compactor. Both cases can only occur after manual tampering with the {{journal.log}}.)
Finally, the approach plays well with upgrading: while the additional young generation field requires us to bump the segment version, we can easily maintain backwards compatibility and do a rolling upgrade segment by segment. Segments of the previous version will simply not be eligible for cleanup under tail compaction. 
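
The fallback and the upgrade rule described above could look roughly like this. It is a sketch under assumptions: {{TailCompactionGuards}} and its methods are hypothetical, the two validity checks are passed in as plain booleans rather than derived from {{gc.log}}, and the segment version that first carries the young generation field is a parameter because no concrete version number is given here.

```java
/** Hypothetical guards around tail compaction; none of these names exist in Oak. */
final class TailCompactionGuards {

    enum Mode { TAIL, FULL }

    private TailCompactionGuards() {}

    /**
     * Tail compaction needs a valid base revision: one that is found and that
     * was written by the compactor. Otherwise fall back to full compaction.
     */
    static Mode chooseMode(boolean baseRevisionFound, boolean baseWrittenByCompactor) {
        return (baseRevisionFound && baseWrittenByCompactor) ? Mode.TAIL : Mode.FULL;
    }

    /**
     * Segments older than the version that introduced the young generation field
     * are never eligible for cleanup under tail compaction.
     */
    static boolean eligibleForTailCleanup(int segmentVersion,
                                          int firstVersionWithYoungGeneration) {
        return segmentVersion >= firstVersionWithYoungGeneration;
    }
}
```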





> Partial compaction
> ------------------
>
>                 Key: OAK-3349
>                 URL: https://issues.apache.org/jira/browse/OAK-3349
>             Project: Jackrabbit Oak
>          Issue Type: New Feature
>          Components: segment-tar
>            Reporter: Michael Dürig
>            Assignee: Michael Dürig
>              Labels: compaction, gc, scalability
>             Fix For: 1.8, 1.7.4
>
>         Attachments: compaction-time.png, cycle-count.png, post-gc-size.png
>
>
> On big repositories compaction can take quite a while to run as it needs to create a full deep copy of the current root node state. For such cases it could be beneficial if we could partially compact the repository thus splitting full compaction over multiple cycles. 
> Partial compaction would run compaction on a sub-tree just like we now run it on the full tree. Afterwards it would create a new root node state by referencing the previous root node state replacing said sub-tree with the compacted one. 
> Todo: Assess feasibility and impact, implement prototype.


