Posted to oak-dev@jackrabbit.apache.org by Jukka Zitting <ju...@gmail.com> on 2012/03/31 12:39:08 UTC

Lifetime of revision identifiers

Hi,

The revision identifiers returned by methods like
MicroKernel.getHeadRevision() are plain strings so in theory I could
write one down on a piece of paper, lock it in a safe, and come back
ten years later expecting the identifier to give me access to the
repository content as it existed a decade ago.

Currently there's nothing in the documented MicroKernel contract that
prevents me from expecting that the above use case would work. This is
troublesome as it means that *no* past state of the repository should
ever be automatically cleaned out as garbage.

To allow automatic garbage collection without unexpectedly breaking
client expectations, we should define some rules on the expected
lifetime of revision identifiers. Without rules like that a client
can't even do the following without worrying about potential
interference from the garbage collector:

    String revision = mk.getHeadRevision();
    String root = mk.getNodes("/", revision);

Since the revision identifiers are plain strings, we can't leverage
the standard garbage collector of the JVM and simply declare that all
revision identifiers will remain valid for at least as long as they
are being referenced by some client. Thus a lease mechanism like
"revision identifiers remain valid for at least N minutes since last
access" may be needed. A client like a long-lived JCR Session would
then need to either periodically refresh to the latest revision or
extend its "lease" on an earlier revision.
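
To make the lease idea concrete, here is a minimal sketch in Java. The
class RevisionLeases and its methods are hypothetical and not part of
the MicroKernel API. It only assumes that every read through a revision
calls touch() and that the collector asks isCollectable() before
removing anything.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical helper, not part of the MicroKernel API.
final class RevisionLeases {
    private final Duration leaseTime;
    // Revision identifier -> time of the most recent access.
    private final Map<String, Instant> lastAccess = new ConcurrentHashMap<>();

    RevisionLeases(Duration leaseTime) {
        this.leaseTime = leaseTime;
    }

    // Called whenever a client reads through a revision; renews the lease.
    void touch(String revisionId, Instant now) {
        lastAccess.merge(revisionId, now, (old, cur) -> cur.isAfter(old) ? cur : old);
    }

    // A revision may be collected only once its lease has run out.
    // Revisions that were never accessed are treated as collectable here.
    boolean isCollectable(String revisionId, Instant now) {
        Instant last = lastAccess.get(revisionId);
        return last == null || !now.isBefore(last.plus(leaseTime));
    }
}
```

With a 10 minute lease, a revision touched at 12:00 and again at 12:09
stays protected until 12:19. A real implementation would also have to
exempt the head revision and drop entries for collected revisions.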

Or we could combine these approaches by defining a Revision interface
for local Java clients and an accompanying Revision-String mapping
with defined lease handling for remote access.

WDYT?

BR,

Jukka Zitting

Re: Lifetime of revision identifiers

Posted by Michael Dürig <md...@apache.org>.
Hi,

Couldn't we just block a revision from being garbage collected for a 
certain (configurable) time each time it is accessed? That would 
correspond to the lease model where acquiring/extending the lease is 
implicitly done on access.

In general I wouldn't worry too much about garbage collection until we 
have some figures on how much garbage actually accumulates.

Michael

On 31.3.12 11:39, Jukka Zitting wrote:
> Hi,
>
> The revision identifiers returned by methods like
> MicroKernel.getHeadRevision() are plain strings so in theory I could
> write one down on a piece of paper, lock it in a safe, and come back
> ten years later expecting the identifier to give me access to the
> repository content as it existed a decade ago.
>
> Currently there's nothing in the documented MicroKernel contract that
> prevents me from expecting that the above use case would work. This is
> troublesome as it means that *no* past state of the repository should
> ever be automatically cleaned out as garbage.
>
> To allow automatic garbage collection without unexpectedly breaking
> client expectations, we should define some rules on the expected
> lifetime of revision identifiers. Without rules like that a client
> can't even do the following without worrying about potential
> interference from the garbage collector:
>
>      String revision = mk.getHeadRevision();
>      String root = mk.getNodes(revision, "/");
>
> Since the revision identifiers are plain strings, we can't leverage
> the standard garbage collector of the JVM and simply declare that all
> revision identifiers will remain valid for at least as long as they
> are being referenced by some client. Thus a lease mechanism like
> "revision identifiers remain valid for at least N minutes since last
> access" may be needed. A client like a long-lived JCR Session would
> then need to either periodically refresh to the latest revision or
> extend its "lease" on an earlier revision.
>
> Or we could combine these approaches by defining a Revision interface
> for local Java clients and an accompanying Revision-String mapping
> with defined lease handling for remote access.
>
> WDYT?
>
> BR,
>
> Jukka Zitting


Re: Lifetime of revision identifiers

Posted by Michael Dürig <md...@apache.org>.

On 3.4.12 12:19, Dominique Pfister wrote:
> Hi,
>
> On Apr 3, 2012, at 12:50 PM, Jukka Zitting wrote:
>
>> Hi,
>>
>> On Tue, Apr 3, 2012 at 11:56 AM, Dominique Pfister<dp...@adobe.com>  wrote:
>>> On Apr 3, 2012, at 11:51 AM, Jukka Zitting wrote:
>>>> You'd drop revision identifiers from the MicroKernel interface? That's
>>>> a pretty big design change...
>>>
>>> No, I probably did not make myself clear: I would not keep a revision
>>> (and all its nodes) reachable in terms of garbage collection, simply
>>> because it was accessed by a client some time ago.
>>
>> If that's the case, I'm worried about what could happen to code like this:
>>
>>     String revision = mk.getHeadRevision();
>>     String root = getNodes("/", revision);
>>
>> Suppose someone else makes a commit in between the two calls and the
>> garbage collector gets triggered. The result then would be that the
>> getNodes() call will fail because the given revision identifier is no
>> longer available.
>
> If we have a delay of 10 minutes for revisions getting garbage collected, this would imply that 10 minutes passed between the first call and the second call, right? This seems rather unlikely.

This does actually *not* imply that 10 minutes pass between the calls. 
The first call might happen an arbitrarily short time before the garbage 
collector decides to remove that revision. The second call might thus 
try to retrieve a revision which has in the meantime been removed.
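
A small worked example of that timing, assuming (as a simplification of
the proposal) that a revision becomes collectable once it is 10 minutes
old and is no longer the head. The class CreateTimeExpiry is purely
illustrative and does not exist in the code base.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical model of a collector that expires revisions by creation time only.
final class CreateTimeExpiry {
    static final Duration MAX_AGE = Duration.ofMinutes(10);

    // True once the revision is at least MAX_AGE old, regardless of when it was last read.
    static boolean isExpired(Instant createdAt, Instant now) {
        return Duration.between(createdAt, now).compareTo(MAX_AGE) >= 0;
    }

    // Time left between a client's first call and the moment the revision expires.
    static Duration safeWindow(Instant createdAt, Instant firstCall) {
        Duration left = MAX_AGE.minus(Duration.between(createdAt, firstCall));
        return left.isNegative() ? Duration.ZERO : left;
    }
}
```

For a revision created at 12:00:00.000 that is still the head at
12:09:59.999, safeWindow() yields one millisecond. A commit plus a GC
run right after the first call is then enough to make the second call
fail.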

Michael


>
>>
>> And if you consider that an unlikely enough scenario, consider a case
>> where I want to then page through a potentially large list of the
>> child nodes:
>>
>>     int page_size = 10;
>>     long count = getChildNodeCount(root);
>>     for (long offset = 0; offset < count; offset += page_size) {
>>         String children = mk.getNodes("/", revision, 1, offset,
>>                                       page_size, null);
>>     }
>>
>> That could take a potentially long time, during which the revision
>> might well get garbage-collected. How should a client prepare for such
>> a situation?
>
> If simply iterating over this large list takes longer than the 10 minutes mentioned above, you'd REALLY have a lot of child nodes. And if the client does some work in between (or waits for some other user interaction to continue paging), I guess it must be able to handle this situation gracefully.
>
> I'm just worried about the other extreme: if you have a lot of such clients requesting large child node lists on different head revisions, the garbage collector will never be able to actually collect a revision and space will run out soon.
>
> Dominique
>
>>
>> BR,
>>
>> Jukka Zitting
>

Re: Lifetime of revision identifiers

Posted by Michael Dürig <md...@apache.org>.
On 3.4.12 12:19, Dominique Pfister wrote:
> Hi,
>
> On Apr 3, 2012, at 12:50 PM, Jukka Zitting wrote:
>
>> Hi,
>>
>> On Tue, Apr 3, 2012 at 11:56 AM, Dominique Pfister<dp...@adobe.com>  wrote:
>>> On Apr 3, 2012, at 11:51 AM, Jukka Zitting wrote:
>>>> You'd drop revision identifiers from the MicroKernel interface? That's
>>>> a pretty big design change...
>>>
>>> No, I probably did not make myself clear: I would not keep a revision
>>> (and all its nodes) reachable in terms of garbage collection, simply
>>> because it was accessed by a client some time ago.
>>
>> If that's the case, I'm worried about what could happen to code like this:
>>
>>     String revision = mk.getHeadRevision();
>>     String root = getNodes("/", revision);
>>
>> Suppose someone else makes a commit in between the two calls and the
>> garbage collector gets triggered. The result then would be that the
>> getNodes() call will fail because the given revision identifier is no
>> longer available.
>
> If we have a delay of 10 minutes for revisions getting garbage collected, this would imply that 10 minutes passed between the first call and the second call, right? This seems rather unlikely.

10 minutes (like any value) seems quite arbitrary to me. I wouldn't want 
to fix deployments by fiddling around with this. Rather, clients should 
be empowered to specify how long they need a certain revision (e.g. by a 
lease model as Jukka proposed).

>
>>
>> And if you consider that an unlikely enough scenario, consider a case
>> where I want to then page through a potentially large list of the
>> child nodes:
>>
>>     int page_size = 10;
>>     long count = getChildNodeCount(root);
>>     for (long offset = 0; offset < count; offset += page_size) {
>>         String children = mk.getNodes("/", revision, 1, offset,
>>                                       page_size, null);
>>     }
>>
>> That could take a potentially long time, during which the revision
>> might well get garbage-collected. How should a client prepare for such
>> a situation?
>
> If simply iterating over this large list takes longer than the 10 minutes mentioned above, you'd REALLY have a lot of child nodes. And if the client does some work in between (or waits for some other user interaction to continue paging), I guess it must be able to handle this situation gracefully.
>
> I'm just worried about the other extreme: if you have a lot of such clients requesting large child node lists on different head revisions, the garbage collector will never be able to actually collect a revision and space will run out soon.

Do we have evidence on how fast things will grow? To me this feels very 
much like premature optimisation.

If a deployment runs out of space because the client application holds 
on to too many revisions for too long, this can be fixed by optimising 
the client and adjusting the store size to the actual client's 
requirements.

If OTOH clients fail because of an overly eager garbage collector, you 
will have to roll the dice with that 10 minute interval *and* increase 
the store size.

Michael

>
> Dominique
>
>>
>> BR,
>>
>> Jukka Zitting
>

Re: Lifetime of revision identifiers

Posted by Dominique Pfister <dp...@adobe.com>.
Hi,

On Apr 3, 2012, at 12:50 PM, Jukka Zitting wrote:

> Hi,
> 
> On Tue, Apr 3, 2012 at 11:56 AM, Dominique Pfister <dp...@adobe.com> wrote:
>> On Apr 3, 2012, at 11:51 AM, Jukka Zitting wrote:
>>> You'd drop revision identifiers from the MicroKernel interface? That's
>>> a pretty big design change...
>> 
>> No, I probably did not make myself clear: I would not keep a revision
>> (and all its nodes) reachable in terms of garbage collection, simply
>> because it was accessed by a client some time ago.
> 
> If that's the case, I'm worried about what could happen to code like this:
> 
>    String revision = mk.getHeadRevision();
>    String root = getNodes("/", revision);
> 
> Suppose someone else makes a commit in between the two calls and the
> garbage collector gets triggered. The result then would be that the
> getNodes() call will fail because the given revision identifier is no
> longer available.

If we have a delay of 10 minutes for revisions getting garbage collected, this would imply that 10 minutes passed between the first call and the second call, right? This seems rather unlikely.

> 
> And if you consider that an unlikely enough scenario, consider a case
> where I want to then page through a potentially large list of the
> child nodes:
> 
>    int page_size = 10;
>    long count = getChildNodeCount(root);
>    for (long offset = 0; offset < count; offset += page_size) {
>        String children = mk.getNodes("/", revision, 1, offset,
>                                      page_size, null);
>    }
> 
> That could take a potentially long time, during which the revision
> might well get garbage-collected. How should a client prepare for such
> a situation?

If simply iterating over this large list takes longer than the 10 minutes mentioned above, you'd REALLY have a lot of child nodes. And if the client does some work in between (or waits for some other user interaction to continue paging), I guess it must be able to handle this situation gracefully.

I'm just worried about the other extreme: if you have a lot of such clients requesting large child node lists on different head revisions, the garbage collector will never be able to actually collect a revision and space will run out soon.

Dominique

> 
> BR,
> 
> Jukka Zitting


Re: Lifetime of revision identifiers

Posted by Jukka Zitting <ju...@gmail.com>.
Hi,

On Tue, Apr 3, 2012 at 11:56 AM, Dominique Pfister <dp...@adobe.com> wrote:
> On Apr 3, 2012, at 11:51 AM, Jukka Zitting wrote:
>> You'd drop revision identifiers from the MicroKernel interface? That's
>> a pretty big design change...
>
> No, I probably did not make myself clear: I would not keep a revision
> (and all its nodes) reachable in terms of garbage collection, simply
> because it was accessed by a client some time ago.

If that's the case, I'm worried about what could happen to code like this:

    String revision = mk.getHeadRevision();
    String root = mk.getNodes("/", revision);

Suppose someone else makes a commit in between the two calls and the
garbage collector gets triggered. The result then would be that the
getNodes() call will fail because the given revision identifier is no
longer available.

And if you consider that an unlikely enough scenario, consider a case
where I want to then page through a potentially large list of the
child nodes:

    int page_size = 10;
    long count = getChildNodeCount(root);
    for (long offset = 0; offset < count; offset += page_size) {
        String children = mk.getNodes("/", revision, 1, offset,
                                      page_size, null);
    }

That could take a potentially long time, during which the revision
might well get garbage-collected. How should a client prepare for such
a situation?

BR,

Jukka Zitting

Re: Lifetime of revision identifiers

Posted by Dominique Pfister <dp...@adobe.com>.
Hi,

On Apr 3, 2012, at 11:51 AM, Jukka Zitting wrote:

> Hi,
> 
> On Tue, Apr 3, 2012 at 11:47 AM, Dominique Pfister <dp...@adobe.com> wrote:
>> I made a second thought, and I'm no longer sure I would allow
>> a revision to be reachable by some client interaction.
> 
> You'd drop revision identifiers from the MicroKernel interface? That's
> a pretty big design change...

No, I probably did not make myself clear: I would not keep a revision (and all its nodes) reachable in terms of garbage collection, simply because it was accessed by a client some time ago.

Dominique

> 
> BR,
> 
> Jukka Zitting


Re: Lifetime of revision identifiers

Posted by Jukka Zitting <ju...@gmail.com>.
Hi,

On Tue, Apr 3, 2012 at 11:47 AM, Dominique Pfister <dp...@adobe.com> wrote:
> I made a second thought, and I'm no longer sure I would allow
> a revision to be reachable by some client interaction.

You'd drop revision identifiers from the MicroKernel interface? That's
a pretty big design change...

BR,

Jukka Zitting

Re: Lifetime of revision identifiers

Posted by Dominique Pfister <dp...@adobe.com>.
Hi,

I've had second thoughts, and I'm no longer sure I would allow a revision to be reachable by some client interaction. In the current design, the GC will copy the head revision to the "to store" plus all the revisions that are either newly created (by some commit call coming in later) or still being manipulated (by a commit that started earlier but whose internal commit builder has not finished yet). I'd extend this design by also copying all revisions that were created in some fixed interval (e.g. 10 minutes) before the head revision was created, and see whether this will suffice.
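
The extended rule could be sketched roughly as follows. This is only an
illustration: RetentionRule and revisionsToCopy are made-up names,
creation times are assumed to be available per revision, and commits
that are still in progress are not modelled separately (they only show
up here if they already carry a creation time at or after the cutoff).

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch of the proposed retention rule, not actual GC code.
final class RetentionRule {

    // createdAt maps revision identifier -> creation time and must contain headId.
    static Set<String> revisionsToCopy(Map<String, Instant> createdAt,
                                       String headId, Duration window) {
        Instant headCreated = createdAt.get(headId);
        if (headCreated == null) {
            throw new IllegalArgumentException("unknown head revision: " + headId);
        }
        Instant cutoff = headCreated.minus(window);
        Set<String> keep = new TreeSet<>();
        for (Map.Entry<String, Instant> e : createdAt.entrySet()) {
            // Keep the head, anything created after it, and anything
            // created at most `window` before it.
            if (!e.getValue().isBefore(cutoff)) {
                keep.add(e.getKey());
            }
        }
        return keep;
    }
}
```

With a 10 minute window and a head created at 12:00, a revision from
11:55 would be copied while one from 11:30 would be dropped, no matter
when a client last read it.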

Regards
Dominique

On Apr 3, 2012, at 11:05 AM, Dominique Pfister wrote:

> Hi,
> 
> On Apr 2, 2012, at 7:28 PM, Jukka Zitting wrote:
> 
>> Hi,
>> 
>> On Mon, Apr 2, 2012 at 6:34 PM, Stefan Guggisberg
>> <st...@gmail.com> wrote:
>>> i don't think that we should allow clients to explicitly extend the life span
>>> of a specific revision. this would IMO unnecessarily complicate the GC
>>> logic and it would allow misbehaved clients to compromise the stability
>>> of the mk.
>> 
>> This would notably complicate things in oak-core and higher up. Any
>> large batch operations would have to worry about the underlying
>> revisions becoming unavailable unless they are continuously updated to
>> the latest head revision.
>> 
>> I don't think allowing lease extensions would complicate garbage
>> collection too much. All I'm asking is that the collector should look
>> at the "last access time" instead of the "create time" of a revision
>> to determine whether it's still referenceable or not.
> 
> Sounds reasonable, as long as you explicitly access the revision first, and then the nodes it contains. Things get more complicated if you'd "hang on" to some node in some revision and then expect that this revision stays alive.
> 
> Regards
> Dominique
> 
>> 
>> BR,
>> 
>> Jukka Zitting
> 


Re: Lifetime of revision identifiers

Posted by Dominique Pfister <dp...@adobe.com>.
Hi,

On Apr 2, 2012, at 7:28 PM, Jukka Zitting wrote:

> Hi,
> 
> On Mon, Apr 2, 2012 at 6:34 PM, Stefan Guggisberg
> <st...@gmail.com> wrote:
>> i don't think that we should allow clients to explicitly extend the life span
>> of a specific revision. this would IMO unnecessarily complicate the GC
>> logic and it would allow misbehaved clients to compromise the stability
>> of the mk.
> 
> This would notably complicate things in oak-core and higher up. Any
> large batch operations would have to worry about the underlying
> revisions becoming unavailable unless they are continuously updated to
> the latest head revision.
> 
> I don't think allowing lease extensions would complicate garbage
> collection too much. All I'm asking is that the collector should look
> at the "last access time" instead of the "create time" of a revision
> to determine whether it's still referenceable or not.

Sounds reasonable, as long as you explicitly access the revision first, and then the nodes it contains. Things get more complicated if you'd "hang on" to some node in some revision and then expect that this revision stays alive.

Regards
Dominique

> 
> BR,
> 
> Jukka Zitting


Re: Lifetime of revision identifiers

Posted by Jukka Zitting <ju...@gmail.com>.
Hi,

On Mon, Apr 2, 2012 at 6:34 PM, Stefan Guggisberg
<st...@gmail.com> wrote:
> i don't think that we should allow clients to explicitly extend the life span
> of a specific revision. this would IMO unnecessarily complicate the GC
> logic and it would allow misbehaved clients to compromise the stability
> of the mk.

This would notably complicate things in oak-core and higher up. Any
large batch operations would have to worry about the underlying
revisions becoming unavailable unless they are continuously updated to
the latest head revision.

I don't think allowing lease extensions would complicate garbage
collection too much. All I'm asking is that the collector should look
at the "last access time" instead of the "create time" of a revision
to determine whether it's still referenceable or not.

BR,

Jukka Zitting

Re: Lifetime of revision identifiers

Posted by Stefan Guggisberg <st...@gmail.com>.
On Sat, Mar 31, 2012 at 12:39 PM, Jukka Zitting <ju...@gmail.com> wrote:
> Hi,
>
> The revision identifiers returned by methods like
> MicroKernel.getHeadRevision() are plain strings so in theory I could
> write one down on a piece of paper, lock it in a safe, and come back
> ten years later expecting the identifier to give me access to the
> repository content as it existed a decade ago.
>
> Currently there's nothing in the documented MicroKernel contract that
> prevents me from expecting that the above use case would work. This is
> troublesome as it means that *no* past state of the repository should
> ever be automatically cleaned out as garbage.

right, there's currently no mention of lifetime of revisions in the javadoc.
i agree that this needs to be clearly specified.

>
> To allow automatic garbage collection without unexpectedly breaking
> client expectations, we should define some rules on the expected
> lifetime of revision identifiers. Without rules like that a client
> can't even do the following without worrying about potential
> interference from the garbage collector:
>
>    String revision = mk.getHeadRevision();
>    String root = mk.getNodes(revision, "/");
>
> Since the revision identifiers are plain strings, we can't leverage
> the standard garbage collector of the JVM and simply declare that all
> revisions identifiers will remain valid for at least as long as they
> are being referenced by some client. Thus a lease mechanism like
> "revision identifiers remain valid for at least N minutes since last
> access" may be needed. A client like a long-lived JCR Session would
> then need to either periodically refresh to the latest revision or
> extend its "lease" on an earlier revision.

i think that we should specify that revisions have a certain guaranteed
life span, e.g. N minutes, and that a client cannot expect to be able
to read a revision exceeding that life span.

however, what should the guaranteed life span be?
1 minute, 1 hour, 1 day?

i don't think that we should allow clients to explicitly extend the life span
of a specific revision. this would IMO unnecessarily complicate the GC
logic and it would allow misbehaved clients to compromise the stability
of the mk.

cheers
stefan

>
> Or we could combine these approaches by defining a Revision interface
> for local Java clients and an accompanying Revision-String mapping
> with defined lease handling for remote access.
>
> WDYT?
>
> BR,
>
> Jukka Zitting