Posted to oak-dev@jackrabbit.apache.org by Julian Sedding <js...@apache.org> on 2016/10/03 07:59:29 UTC
Datastore GC only possible after Tar Compaction
Hi all
I just became aware that on a system configured with SegmentNodeStore
and FileDatastore a Datastore garbage collection can only free up
space *after* a Tar Compaction was run.
This behaviour is not immediately intuitive to me.
I would like to discuss whether it is desirable to require a Tar
Compaction prior to a DS GC. If someone knows about the rationale
behind this behaviour, I would also appreciate these insights!
The alternative behaviour, which I would have expected, is to collect
only binaries that are referenced from the root NodeState or any of
the checkpoint's root NodeStates (i.e. "live" NodeStates).
From an implementation perspective, I assume that the current
behaviour can be implemented with better performance than a solution
that checks only "live" NodeStates. However, IMHO that should not be
the only relevant factor in the discussion.
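The "live" NodeStates approach I have in mind could be sketched roughly as follows. This is a toy Python model for illustration only; the node layout, function names and the mark/sweep split are my own assumptions, not Oak's actual API:

```python
# Conceptual mark-and-sweep datastore GC where the mark phase follows only
# "live" roots (the current head plus each checkpoint's root). Toy model,
# not Oak code: nodes are plain dicts with "blobs" and "children" keys.

def collect_live_blob_ids(roots):
    """Walk each live root NodeState and gather every referenced blob id."""
    live = set()
    stack = list(roots)
    while stack:
        node = stack.pop()
        live.update(node.get("blobs", []))      # binary properties on this node
        stack.extend(node.get("children", []))  # descend into child nodes
    return live

def sweep(datastore, live):
    """Return the blob ids the mark phase did not reach (i.e. garbage)."""
    return {blob_id for blob_id in datastore if blob_id not in live}

# Example: head references b1, a checkpoint still references b2, b3 is
# referenced by nothing live and would be collected.
head = {"blobs": ["b1"], "children": []}
checkpoint = {"blobs": ["b2"], "children": []}
datastore = {"b1", "b2", "b3"}
garbage = sweep(datastore, collect_live_blob_ids([head, checkpoint]))
```

In this model a binary becomes collectable as soon as no live root reaches it, independent of whether old segments still physically contain a reference to it.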
I'm looking forward to your feedback!
Regards
Julian
Re: Datastore GC only possible after Tar Compaction
Posted by Michael Dürig <md...@apache.org>.
On 5.10.16 10:14, Julian Sedding wrote:
> Would it be possible to improve the heuristic without traversing the
> node tree? I.e. do the segment tar files contain sufficient
> information in their indexes to safely determine that some binary
> references are dead? I'm looking for no false positives but possibly
> many false negatives.
Oak Segment Tar now has an index of binaries in the tar files. See
OAK-4201. This avoids having to traverse for reachability.
It suffers from the same issue though: DSGC is only effective after a
revision gc. Since Oak Segment Tar implements a retention-time-based
revision gc model, this will also affect the DSGC. I agree this should
be better documented and I hope to do so once Oak Segment Tar is
sufficiently stabilised. See OAK-4292. Patches are always welcome though ;-)
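To illustrate what such an index buys us (the data layout below is hypothetical, not OAK-4201's actual on-disk format): each tar file carries an index of the blob ids its segments reference, and DSGC can union those indexes instead of traversing the node tree. It also shows why revision gc must run first: a stale segment keeps its entries in the index until the segment itself is reclaimed.

```python
# Sketch only: model each tar file as carrying an index of the blob ids
# referenced by the segments it contains (layout is my assumption, not
# OAK-4201's real format).

def referenced_blob_ids(tar_files):
    """Union the per-tar binary-reference indexes; no tree traversal needed."""
    refs = set()
    for tar in tar_files:
        refs |= tar["binary_refs"]
    return refs

# Before revision gc, a stale segment still pins "b_old" via the index,
# so datastore GC must treat it as referenced.
tars_before = [{"binary_refs": {"b_live", "b_old"}}]
# Revision gc removes the stale segment; only then does the reference
# disappear from the index and DSGC can collect "b_old".
tars_after = [{"binary_refs": {"b_live"}}]
```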
Michael
Re: Datastore GC only possible after Tar Compaction
Posted by Julian Sedding <js...@gmail.com>.
Thanks Amit for your insights.
Is it documented that DS GC is ineffective if no prior tar compaction
is performed? IMHO we should make this as clear as possible, because
the behaviour deviates from JR2 and thus has the potential to throw
off a lot of users. We could possibly even mention it as a likely
cause in the log message when DS GC turns out to be ineffective.
Would it be possible to improve the heuristic without traversing the
node tree? I.e. do the segment tar files contain sufficient
information in their indexes to safely determine that some binary
references are dead? I'm looking for no false positives but possibly
many false negatives.
Regards
Julian
On Mon, Oct 3, 2016 at 10:37 AM, Amit Jain <am...@ieee.org> wrote:
> Hi,
>
> On Mon, Oct 3, 2016 at 1:29 PM, Julian Sedding <js...@apache.org> wrote:
>
>> I just became aware that on a system configured with SegmentNodeStore
>> and FileDatastore a Datastore garbage collection can only free up
>> space *after* a Tar Compaction was run.
>>
>>
> Yes, that is a prerequisite.
>
>
>> I would like to discuss whether it is desirable to require a Tar
>> Compaction prior to a DS GC. If someone knows about the rationale
>> behind this behaviour, I would also appreciate these insights!
>>
>> The alternative behaviour, which I would have expected, is to collect
>> only binaries that are referenced from the root NodeState or any of
>> the checkpoint's root NodeStates (i.e. "live" NodeStates).
>>
>> From an implementation perspective, I assume that the current
>> behaviour can be implemented with better performance than a solution
>> that checks only "live" NodeStates. However, IMHO that should not be
>> the only relevant factor in the discussion.
>>
>
> I believe the performance impact of loading all nodes to check whether
> a node has a binary property is quite high. What you are referring to
> is how it was implemented in Jackrabbit 2, where the reference
> collection phase took days on larger repositories. With the
> NodeStore-specific implementation of blob reference collection, this
> phase takes only a few hours. There is also an enhancement already
> implemented in oak-segment-tar to maintain an index of binary
> references, see OAK-4201.
>
> Thanks
> Amit
Re: Datastore GC only possible after Tar Compaction
Posted by Amit Jain <am...@ieee.org>.
Hi,
On Mon, Oct 3, 2016 at 1:29 PM, Julian Sedding <js...@apache.org> wrote:
> I just became aware that on a system configured with SegmentNodeStore
> and FileDatastore a Datastore garbage collection can only free up
> space *after* a Tar Compaction was run.
>
>
Yes, that is a prerequisite.
> I would like to discuss whether it is desirable to require a Tar
> Compaction prior to a DS GC. If someone knows about the rationale
> behind this behaviour, I would also appreciate these insights!
>
> The alternative behaviour, which I would have expected, is to collect
> only binaries that are referenced from the root NodeState or any of
> the checkpoint's root NodeStates (i.e. "live" NodeStates).
>
> From an implementation perspective, I assume that the current
> behaviour can be implemented with better performance than a solution
> that checks only "live" NodeStates. However, IMHO that should not be
> the only relevant factor in the discussion.
>
I believe the performance impact of loading all nodes to check whether
a node has a binary property is quite high. What you are referring to
is how it was implemented in Jackrabbit 2, where the reference
collection phase took days on larger repositories. With the
NodeStore-specific implementation of blob reference collection, this
phase takes only a few hours. There is also an enhancement already
implemented in oak-segment-tar to maintain an index of binary
references, see OAK-4201.
Thanks
Amit