Posted to oak-dev@jackrabbit.apache.org by Julian Sedding <js...@apache.org> on 2016/10/03 07:59:29 UTC

Datastore GC only possible after Tar Compaction

Hi all

I just became aware that on a system configured with SegmentNodeStore
and FileDatastore a Datastore garbage collection can only free up
space *after* a Tar Compaction was run.

This behaviour is not immediately intuitive to me.

I would like to discuss whether it is desirable to require a Tar
Compaction prior to a DS GC. If someone knows about the rationale
behind this behaviour, I would also appreciate these insights!

The alternative behaviour, which I would have expected, is to collect
only binaries that are referenced from the root NodeState or any of
the checkpoint's root NodeStates (i.e. "live" NodeStates).

From an implementation perspective, I assume that the current
behaviour can be implemented with better performance than a solution
that checks only "live" NodeStates. However, IMHO that should not be
the only relevant factor in the discussion.
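
To make the alternative concrete, here is a minimal sketch of a mark phase that walks only the "live" roots (head state plus checkpoints) and a sweep phase that treats every other stored binary as garbage. The Node class, the blob ids and the method names are hypothetical stand-ins for illustration, not Oak API, and the sketch ignores everything that makes the real problem hard (repository size, concurrent writes, blob age).

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class LiveRootBlobGc {

    /** Hypothetical stand-in for a node state: referenced blob ids plus child nodes. */
    static final class Node {
        final List<String> blobIds;
        final List<Node> children;

        Node(List<String> blobIds, List<Node> children) {
            this.blobIds = blobIds;
            this.children = children;
        }
    }

    /** Mark phase: gather every blob id reachable from the given live roots. */
    static Set<String> mark(List<Node> liveRoots) {
        Set<String> referenced = new HashSet<>();
        Deque<Node> pending = new ArrayDeque<>(liveRoots);
        while (!pending.isEmpty()) {
            Node node = pending.pop();
            referenced.addAll(node.blobIds);
            node.children.forEach(pending::push);
        }
        return referenced;
    }

    /** Sweep phase: any stored blob that was not marked is a deletion candidate. */
    static Set<String> sweep(Set<String> storedBlobIds, Set<String> referenced) {
        Set<String> garbage = new HashSet<>(storedBlobIds);
        garbage.removeAll(referenced);
        return garbage;
    }

    public static void main(String[] args) {
        Node head = new Node(List.of("blob-a"),
                List.of(new Node(List.of("blob-b"), List.of())));
        Node checkpoint = new Node(List.of("blob-c"), List.of());
        // "blob-old" is only referenced by a revision that is no longer live.
        Set<String> stored = Set.of("blob-a", "blob-b", "blob-c", "blob-old");
        System.out.println(sweep(stored, mark(List.of(head, checkpoint))));
    }
}
```

The mark phase visits every node below every live root, which is where I would expect the cost of this approach to come from.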

I'm looking forward to your feedback!

Regards
Julian

Re: Datastore GC only possible after Tar Compaction

Posted by Michael Dürig <md...@apache.org>.

On 5.10.16 10:14 , Julian Sedding wrote:
> Would it be possible to improve the heuristic without traversing the
> node tree? I.e. do the segment tar files contain sufficient
> information in their indexes to safely determine that some binary
> references are dead? I'm looking for no false positives but possibly
> many false negatives.

Oak Segment Tar now has an index of binaries in the tar files. See 
OAK-4201. This avoids having to traverse for reachability.

It suffers the same issue though: DSGC is only effective after a
revision gc. Since Oak Segment Tar implements a retention time based
revision gc model, this will also affect the DSGC. I agree this should
be better documented and I hope to do so once Oak Segment Tar is
sufficiently stabilised. See OAK-4292. Patches are always welcome though ;-)
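
The dependency on revision gc can be illustrated with a small sketch. It assumes, purely for illustration, that the mark phase is the union of the per-tar-file binary reference indexes; the class, the method names and the tar file names are hypothetical and not the actual oak-segment-tar API. A binary referenced only from an old revision stays "referenced" for as long as the tar file holding that revision exists.

```java
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

public class TarIndexBlobGc {

    /** Mark phase: union of the binary reference indexes of all tar files still present. */
    static Set<String> markFromIndexes(Map<String, Set<String>> indexByTarFile) {
        Set<String> referenced = new HashSet<>();
        indexByTarFile.values().forEach(referenced::addAll);
        return referenced;
    }

    /** Sweep phase: any stored blob that no index mentions is a deletion candidate. */
    static Set<String> sweep(Set<String> storedBlobIds, Set<String> referenced) {
        Set<String> garbage = new HashSet<>(storedBlobIds);
        garbage.removeAll(referenced);
        return garbage;
    }

    public static void main(String[] args) {
        // Hypothetical tar files: the first one holds only outdated revisions.
        Map<String, Set<String>> tarFiles = new LinkedHashMap<>();
        tarFiles.put("old-revisions.tar", Set.of("blob-old"));
        tarFiles.put("current-revisions.tar", Set.of("blob-a"));
        Set<String> stored = Set.of("blob-a", "blob-old");

        // Before revision gc: the stale index entry keeps blob-old alive.
        System.out.println(sweep(stored, markFromIndexes(tarFiles)));

        // After revision gc has reclaimed the old tar file, blob-old becomes collectable.
        tarFiles.remove("old-revisions.tar");
        System.out.println(sweep(stored, markFromIndexes(tarFiles)));
    }
}
```

In this model the index never reports a live binary as dead, but it reports dead binaries as live until the retention time has passed and revision gc removes the file.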

Michael

Re: Datastore GC only possible after Tar Compaction

Posted by Julian Sedding <js...@gmail.com>.
Thanks Amit for your insights.

Is it documented that DS GC is ineffective if no prior tar compaction
is performed? IMHO we should make this as clear as possible, because
the behaviour deviates from JR2 and thus has the potential to trip up
lots of users. We could even mention it as a possible reason in the
log message when DS GC was ineffective.

Would it be possible to improve the heuristic without traversing the
node tree? I.e. do the segment tar files contain sufficient
information in their indexes to safely determine that some binary
references are dead? I'm looking for no false positives but possibly
many false negatives.

Regards
Julian


On Mon, Oct 3, 2016 at 10:37 AM, Amit Jain <am...@ieee.org> wrote:
> Hi,
>
> On Mon, Oct 3, 2016 at 1:29 PM, Julian Sedding <js...@apache.org> wrote:
>
>> I just became aware that on a system configured with SegmentNodeStore
>> and FileDatastore a Datastore garbage collection can only free up
>> space *after* a Tar Compaction was run.
>>
>>
> Yes that is a pre-requisite.
>
>
>> I would like to discuss whether it is desirable to require a Tar
>> Compaction prior to a DS GC. If someone knows about the rationale
>> behind this behaviour, I would also appreciate these insights!
>>
>> The alternative behaviour, which I would have expected, is to collect
>> only binaries that are referenced from the root NodeState or any of
>> the checkpoint's root NodeStates (i.e. "live" NodeStates).
>>
>> From an implementation perspective, I assume that the current
>> behaviour can be implemented with better performance than a solution
>> that checks only "live" NodeStates. However, IMHO that should not be
>> the only relevant factor in the discussion.
>>
>
> I believe the performance impact of loading all nodes to check whether a
> node has a binary property is quite high. What you are referring to is how
> it was implemented in Jackrabbit, where the reference collection phase took
> days on larger repositories. With the NodeStore-specific implementation of
> blob reference collection, this phase takes only a few hours. There is also
> an enhancement already implemented in oak-segment-tar that maintains an
> index of binary references (OAK-4201).
>
> Thanks
> Amit

Re: Datastore GC only possible after Tar Compaction

Posted by Amit Jain <am...@ieee.org>.
Hi,

On Mon, Oct 3, 2016 at 1:29 PM, Julian Sedding <js...@apache.org> wrote:

> I just became aware that on a system configured with SegmentNodeStore
> and FileDatastore a Datastore garbage collection can only free up
> space *after* a Tar Compaction was run.
>
>
Yes that is a pre-requisite.


> I would like to discuss whether it is desirable to require a Tar
> Compaction prior to a DS GC. If someone knows about the rationale
> behind this behaviour, I would also appreciate these insights!
>
> The alternative behaviour, which I would have expected, is to collect
> only binaries that are referenced from the root NodeState or any of
> the checkpoint's root NodeStates (i.e. "live" NodeStates).
>
> From an implementation perspective, I assume that the current
> behaviour can be implemented with better performance than a solution
> that checks only "live" NodeStates. However, IMHO that should not be
> the only relevant factor in the discussion.
>

I believe the performance impact of loading all nodes to check whether a
node has a binary property is quite high. What you are referring to is how
it was implemented in Jackrabbit, where the reference collection phase took
days on larger repositories. With the NodeStore-specific implementation of
blob reference collection, this phase takes only a few hours. There is also
an enhancement already implemented in oak-segment-tar that maintains an
index of binary references (OAK-4201).

Thanks
Amit