Posted to oak-dev@jackrabbit.apache.org by Alex Parvulescu <al...@gmail.com> on 2014/08/07 11:13:48 UTC

TarMK compaction, proposed update

Hi,

Playing with the TarMK compaction lately, I realized that the process may
create additional files even if globally there is no real need to do so
(not enough garbage to justify running compaction).

The way it works now: you manually trigger the compaction process, which
starts copying content (via a diff) to new files so that the old tar files
can be GC'ed. Once that is done, the cleanup process starts. It looks at
each tar file and, if the file has more than 25% garbage, cleans it up (a
new generation is created containing only the relevant content, no
garbage).
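
To make the cleanup rule concrete, here is a minimal sketch of the per-file
decision described above. It is only an illustration: TarFileStats and
files_to_clean are hypothetical names, not Oak classes, and the byte counts
are assumed to be known already.

```python
from dataclasses import dataclass

GARBAGE_THRESHOLD = 0.25  # the 25% figure from the description above


@dataclass
class TarFileStats:
    """Hypothetical per-file summary (assumption, not an Oak class)."""
    name: str
    total_bytes: int
    garbage_bytes: int  # bytes held by segments nothing references any more

    @property
    def garbage_ratio(self) -> float:
        # Guard against empty files to avoid a division by zero.
        return self.garbage_bytes / self.total_bytes if self.total_bytes else 0.0


def files_to_clean(files):
    """Return the files that exceed the threshold and would be rewritten."""
    return [f for f in files if f.garbage_ratio > GARBAGE_THRESHOLD]
```

A file sitting exactly at 25% is left alone in this sketch, since the rule
is "more than 25%".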

The disconnect between the compaction and the cleanup can cause even a
clean repo to grow (each new file has a fixed size of 256MB): if
compaction adds 256MB but the cleanup doesn't find anything to reclaim,
your repo will go up by 256MB for no real reason. Over time this will
stabilize, but the first-time increase can be a bit unexpected. And the
bigger the repository, the bigger the increase.

I'm proposing a solution to alleviate this problem: first check whether
there is enough garbage in the repo to justify running compaction. Check
each tar file, and only if at least one of them needs cleanup (>25%
garbage) allow the compaction & cleanup to go through. This should
stabilize the size of a repo that hasn't changed much since the last
compaction run.
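
The proposed gate could look roughly like the sketch below. Everything in
it is an assumption made for illustration: should_run_compaction and
maintenance are made-up names, the tar files are modelled as a plain
mapping of name -> (total_bytes, garbage_bytes), and compact/cleanup are
passed in as callables rather than being real Oak operations.

```python
GARBAGE_THRESHOLD = 0.25  # same 25% cut-off the cleanup uses


def should_run_compaction(tar_files):
    """Return True only if at least one tar file exceeds the threshold.

    tar_files: mapping of file name -> (total_bytes, garbage_bytes).
    """
    for total, garbage in tar_files.values():
        if total > 0 and garbage / total > GARBAGE_THRESHOLD:
            return True
    return False


def maintenance(tar_files, compact, cleanup):
    """Run compaction followed by cleanup, but only if the pre-check passes."""
    if not should_run_compaction(tar_files):
        return "skipped"  # a clean repo is left untouched, so it does not grow
    compact()
    cleanup()
    return "ran"
```

With this gate a repo whose files are all below the threshold is skipped
entirely, so no new 256MB file gets written for it.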

I've created OAK-2019 to track this.

Opinions are highly welcome!

alex

Re: TarMK compaction, proposed update

Posted by Jukka Zitting <ju...@zitting.name>.
Hi,

Note that in many cases it's only the compaction operation that makes
no-longer-used space available for collection by the cleanup operation.
Thus such a pre-check will likely require a full repository traversal to
find all bulk segments that are still being referenced. That should be
doable, though not as simple as just checking each tar file for already
existing garbage. (If there already exists "easily collectable" garbage in
the tar files, you can just run cleanup directly without compaction to
release the space.)
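
As a rough illustration of the traversal mentioned above, the sketch below
walks a toy reference graph from a set of roots and estimates how much of
the stored data is no longer reachable. The model is an assumption made
for the example: segments are plain ids, references and sizes are
dictionaries, and reachable_segments / reclaimable_fraction are
hypothetical helpers, not part of the TarMK code.

```python
def reachable_segments(roots, references):
    """Return the ids of all segments reachable from the given roots.

    references: mapping of segment id -> iterable of referenced segment ids.
    Uses an explicit stack so deep graphs do not hit the recursion limit.
    """
    seen = set()
    stack = list(roots)
    while stack:
        seg = stack.pop()
        if seg in seen:
            continue
        seen.add(seg)
        stack.extend(references.get(seg, ()))
    return seen


def reclaimable_fraction(roots, references, sizes):
    """Fraction of stored bytes held by segments that are not reachable.

    sizes: mapping of segment id -> size in bytes, for every stored segment.
    """
    total = sum(sizes.values())
    if total == 0:
        return 0.0
    live = reachable_segments(roots, references)
    dead = sum(size for seg, size in sizes.items() if seg not in live)
    return dead / total
```

The cost is the point here: every reachable segment has to be visited once,
which is why this is more work than inspecting each tar file on its own.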

Some related ideas:

- The inlined small blobs currently used by the Lucene index take up quite
a lot of space that needs to be copied around on each compaction. It might
be worth exploring whether increasing the Lucene blob size to prevent
inlining would justify the extra access overhead. Alternatively the Lucene
index, when running on TarMK, could be made to leverage the TarMK support
for in-place updates to binaries.

- We could keep track of how much space was needed for the last compaction
and only trigger the next one after at least, say, 25% more space has been
used.
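
The growth-based trigger from the last item could be sketched as follows.
This reads "space needed for the last compaction" as the repository size
recorded right after that compaction, which is an interpretation; the
class name CompactionTrigger and its methods are hypothetical.

```python
class CompactionTrigger:
    """Allow the next compaction only after the repo has grown enough."""

    def __init__(self, growth_factor=0.25):
        self.growth_factor = growth_factor  # the "say 25%" from the idea above
        self.size_after_last_compaction = None

    def record_compaction(self, size_bytes):
        """Remember the repository size measured right after a compaction."""
        self.size_after_last_compaction = size_bytes

    def should_compact(self, current_size_bytes):
        if self.size_after_last_compaction is None:
            # Assumption: with no earlier run on record, allow the first one.
            return True
        threshold = self.size_after_last_compaction * (1 + self.growth_factor)
        return current_size_bytes >= threshold
```

For example, after a compaction that leaves the repo at 1000 units, the
next run would be held back until the repo reaches 1250 units.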

BR,

Jukka Zitting
