Posted to oak-dev@jackrabbit.apache.org by Jukka Zitting <ju...@gmail.com> on 2014/06/30 12:23:48 UTC

TarMK compaction status update

Hi,

We recently added a new "compaction" feature to the TarMK (see
OAK-1804). This feature traverses the content tree and copies all
non-bulk* content to new data segments. We do this for two main
reasons:

a) The cleanup operation is unable to collect data segments with a mix
of both reachable and unreachable content or indeed any segments
referenced by such mixed segments, regardless of whether those
segments have any reachable content. By copying all reachable content
to new data segments, the compaction makes the previously mixed data
segments and their references collectable by the cleanup operation.

b) Commits over time can end up splintering related parts of the
content tree over many small segments, which reduces locality of
reference and makes caching less efficient. Compaction reverses this
process and ensures that related content (same subtree) gets packed
together in the compacted segments.
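The collectability constraint in (a) can be sketched as a tiny mark phase over a segment graph. The Segment model below is invented for illustration and is not the Oak API: one boolean stands in for "this segment contains at least one reachable record", and the point is that such a segment pins every segment it references, reachable content or not.

```java
import java.util.*;

// Toy model: a segment is retained if any of its records is reachable,
// and retaining a segment transitively retains every segment it references.
class SegmentGraphDemo {

    static Set<String> retained(Map<String, List<String>> refs,
                                Set<String> withReachableRecords) {
        Set<String> keep = new HashSet<>();
        Deque<String> queue = new ArrayDeque<>(withReachableRecords);
        while (!queue.isEmpty()) {
            String id = queue.pop();
            if (keep.add(id)) {
                // Keeping a segment keeps everything it references.
                queue.addAll(refs.getOrDefault(id, List.of()));
            }
        }
        return keep;
    }

    public static void main(String[] args) {
        // "mixed" holds one reachable record and references "old", whose
        // content is otherwise entirely unreachable.
        Map<String, List<String>> refs = Map.of(
                "mixed", List.of("old"),
                "old", List.of());
        Set<String> keep = retained(refs, Set.of("mixed"));
        // "old" survives cleanup despite containing no reachable content.
        System.out.println(keep.contains("old")); // true
    }
}
```

Compaction breaks this chain by copying the reachable records out of "mixed" into fresh segments, after which neither "mixed" nor "old" is retained.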

The compaction feature works both offline with the oak-run tool and
online with the FileStore.gc() method, and apart from a few smaller
issues like OAK-1917 and OAK-1927, it generally works fine. However,
based on some real-world usage I've identified one bigger issue that
still needs solving:

The compaction code already accounts for the possibility of
concurrent commits during compaction: such commits get automatically
rebased onto the compacted state to prevent them from referencing
data segments from before the compaction. Unfortunately, since the
compactor shares its SegmentWriter with normal repository updates,
the pre-rebase commits typically end up in the same segments as
compacted content. Such segments become troublesome mixed segments
that still contain references to data segments from before the
compaction, and thus prevent those older segments from being cleaned
up.

That problem should be solvable by using a separate SegmentWriter
instance for the compaction. I'm looking at this now.
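To illustrate why a separate writer should help, here is a toy packer in plain Java. It is not the Oak SegmentWriter API; it only models the relevant behavior, namely that a writer packs incoming records into segments in arrival order, so interleaved producers end up sharing segments.

```java
import java.util.*;

// Toy writer: packs incoming records into fixed-size segments in arrival order.
class WriterDemo {
    static final int RECORDS_PER_SEGMENT = 4;

    static List<List<String>> pack(List<String> records) {
        List<List<String>> segments = new ArrayList<>();
        for (int i = 0; i < records.size(); i += RECORDS_PER_SEGMENT) {
            segments.add(records.subList(i,
                    Math.min(i + RECORDS_PER_SEGMENT, records.size())));
        }
        return segments;
    }

    // A segment is "mixed" if it holds both compacted content and records
    // from a concurrent commit (which reference pre-compaction segments).
    static boolean hasMixedSegment(List<List<String>> segments) {
        for (List<String> s : segments) {
            boolean compacted = s.stream().anyMatch(r -> r.startsWith("compacted"));
            boolean commit = s.stream().anyMatch(r -> r.startsWith("commit"));
            if (compacted && commit) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        // One shared writer: compaction output and a concurrent commit interleave.
        List<String> shared = List.of(
                "compacted-1", "commit-1", "compacted-2", "commit-2");
        System.out.println(hasMixedSegment(pack(shared)));   // true

        // Separate writers: each stream gets segments of its own.
        List<List<String>> separate = new ArrayList<>();
        separate.addAll(pack(List.of("compacted-1", "compacted-2")));
        separate.addAll(pack(List.of("commit-1", "commit-2")));
        System.out.println(hasMixedSegment(separate));       // false
    }
}
```

With a dedicated SegmentWriter for compaction, the compacted segments contain only copied content, so a commit that still references pre-compaction segments can no longer contaminate them.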

*) Bulk content consists of binaries larger than 16kB. They get stored
in bulk segments (or in a data store, if so configured), and are just
referenced from the tree structure stored in data segments.

BR,

Jukka Zitting