Posted to oak-dev@jackrabbit.apache.org by Ian Boston <ie...@tfd.co.uk> on 2016/02/12 17:11:49 UTC

DocumentStore question.

Hi,
I am looking at [1], and probably confused.

Is there an assumption that the revisions listed in _revisions are ordered ?

If not, then how is the order of the revisions determined, given that
the clocks on each node in a cluster will have different offsets ?

Best Regards
Ian


[1] http://jackrabbit.apache.org/oak/docs/nodestore/documentmk.html

Re: DocumentStore question.

Posted by Marcel Reutegger <mr...@adobe.com>.
Hi,

On 16/02/16 12:35, "ianboston@gmail.com on behalf of Ian Boston" wrote:
>Presumably, having a cluster node running behind real time will result in
>lower throughput, making it critical to run NTP on all cluster nodes to
>eliminate as much clock drift as possible ?

Yes, this is correct. Though, the clocks do not have to be
in perfect sync. Each DocumentNodeStore runs background
operations once a second. So, you probably won't notice
any delays even when the clock difference is one second.

>Also, does the current revision model behave with an eventually consistent
>storage mechanism, or does Oak require that the underlying storage is
>immediately consistent in nature ?

In general a DocumentStore implementation must be strongly
consistent, but there are API calls that make it possible to
leverage an eventually consistent backend. E.g. there is a
DocumentStore.find() variant with a maxCacheAge parameter.
This allows the current MongoDocumentStore implementation
to use a secondary for the read if applicable.

IMO the API will likely have to be changed to better support
such use cases. Related issues are OAK-2106 and OAK-3865.

Regards
 Marcel


Re: DocumentStore question.

Posted by Ian Boston <ie...@tfd.co.uk>.
Hi,
Thank you for the detailed explanation. I can now see how this works with a
consistent root document as the slow node effectively waits till its time
is ahead of the last root commit and it is clear to commit. This ensures
that all commits are sequential based on the revision timestamp.

Presumably, having a cluster node running behind real time will result in
lower throughput, making it critical to run NTP on all cluster nodes to
eliminate as much clock drift as possible ?

Also, does the current revision model behave with an eventually consistent
storage mechanism, or does Oak require that the underlying storage is
immediately consistent in nature ?

Best Regards
Ian



Re: DocumentStore question.

Posted by Marcel Reutegger <mr...@adobe.com>.
Hi,

On 16/02/16 09:56, "ianboston@gmail.com on behalf of Ian Boston" wrote:
>So, IIUC, (based on Revision.compareTo(Revision) used by
>StableRevisionComparator).

yes.

>If one instance within a cluster has a clock that is lagging the others,
>and all instances are making changes at the same time, then the changes
>that the other instances make will be used, even if the lagging instance
>makes changes after (in real synchronised time) the others ?

No, each cluster node has an equal chance of getting its change in,
but the other cluster node's change will then be rejected.

Let's assume we have two cluster nodes A and B and cluster node
A's clock is lagging 5 seconds. Now both cluster nodes try
to set a property P on document D. One of the cluster nodes will be
first to update document D. No matter which cluster node is first,
the second cluster node will see the previous change when it attempts
the commit and will consider the change as not yet visible and
in conflict with its own changes. The change of the second cluster
node will therefore be rolled back.

The behaviour of the cluster nodes will be different when external
changes are pulled in from other cluster nodes. The background
read operation of the DocumentNodeStore reads the most recent
root document and compares the _lastRev entries of the other cluster
nodes with its own clock (the _lastRev entries are the most recent
commits visible to other cluster nodes). Here we have two cases:

a) Cluster node A succeeded in committing its change on P

Cluster node A wrote a _lastRev on the root document for this
change: r75-0-a. Cluster node B picks up that change and compares
the revision with its own clock, which corresponds to r80-0-b
(for readability, assuming for now the timestamp is a decimal
and in seconds instead of milliseconds). Cluster node B will
consider r75-0-a as visible from now on, because the timestamp
of r80-0-b is newer than r75-0-a. From this point on Cluster
node B can overwrite P again because it is able to see the most
recent value set by A with r75-0-a.

b) Cluster node B succeeded in committing its change on P

Cluster node B wrote a _lastRev on the root document for this
change: r80-0-b. Cluster node A picks up that change and compares
the revision with its own clock, which corresponds to r75-0-a.
Cluster node A will still not consider r80-0-b as visible,
because its own clock is considered behind. It will wait until
its clock has passed r80-0-a. This ensures that a new change by A,
overwriting B's previous value of P, will have a newer timestamp
than the previously made visible change of B.

This means:

1) all changes considered visible can be compared with the
StableRevisionComparator without the need to take clock
differences into account.

2) a change will conflict if it is not the most recent
revision (using StableRevisionComparator) or the other
change is not yet visible but already committed.
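
The visibility rule behind cases a) and b) can be sketched roughly as
follows. This is only an illustration, not Oak code: the class and
method names are made up, the time unit is seconds as in the example
above, and the strict "lower than the local clock" boundary is an
assumption taken from the description in this thread.

```java
final class VisibilitySketch {

    /**
     * Hypothetical check: an external revision may be made visible only
     * when its timestamp is lower than the local clock.
     */
    static boolean canMakeVisible(long externalTimestamp, long localClock) {
        return externalTimestamp < localClock;
    }

    /**
     * Time units the local node still has to wait before the external
     * revision may become visible; 0 when it is visible right away.
     */
    static long waitBeforeVisible(long externalTimestamp, long localClock) {
        return canMakeVisible(externalTimestamp, localClock)
                ? 0
                : externalTimestamp - localClock + 1;
    }
}
```

With the numbers from the example, B (clock at 80) can make r75-0-a
visible immediately, while A (clock at 75) has to wait until its own
clock has moved past 80 before r80-0-b becomes visible.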


>I can see that this won't matter for the majority of nodes, as collisions
>are rare, but won't the lagging instance be always overridden in the root
>document _revisions list ?

Depending on usage, collisions are actually not that rare ;)

The _revisions map on the root document contains just
the commit entry. A cluster node cannot overwrite the
entry of another cluster node, because they use unique
revisions for commits. Each cluster node generates revisions
with a unique clusterId suffix.
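
A minimal sketch of why two cluster nodes cannot overwrite each
other's commit entry: the key always ends in the clusterId, so even
identical timestamps and counters give distinct keys. The names below
are hypothetical. The key shape follows the r75-0-a style used in this
thread; the hex encoding and the "c" value are assumptions for
illustration.

```java
import java.util.Map;
import java.util.TreeMap;

final class RevisionKeySketch {

    /** Builds a key of the form r<timestamp>-<counter>-<clusterId> (hex is an assumption). */
    static String key(long timestamp, int counter, int clusterId) {
        return "r" + Long.toHexString(timestamp)
                + "-" + Integer.toHexString(counter)
                + "-" + Integer.toHexString(clusterId);
    }

    /**
     * Simulates two cluster nodes (ids 10 and 11) committing with the
     * same timestamp and counter into a stand-in for the _revisions map.
     */
    static Map<String, String> commitBoth(long timestamp) {
        Map<String, String> revisions = new TreeMap<>();
        revisions.put(key(timestamp, 0, 10), "c"); // entry written by node "a"
        revisions.put(key(timestamp, 0, 11), "c"); // entry written by node "b"
        return revisions;
    }
}
```

Both entries survive in the map because the keys differ in their last
segment.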

>Are there any plans to maintain a clock difference vector for the cluster ?

Oak 1.0.x and 1.2.x still have something like this. See
RevisionComparator. However, it only maintains the clock
differences for the past 60 minutes.

Oak 1.4 introduced a RevisionVector, which is inspired by
version vectors [0].

Regards
 Marcel

[0] https://issues.apache.org/jira/browse/OAK-3646?focusedCommentId=15028698&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15028698

Re: DocumentStore question.

Posted by Ian Boston <ie...@tfd.co.uk>.
On 15 February 2016 at 14:49, Marcel Reutegger <mr...@adobe.com> wrote:

> Hi,
>
> On 12/02/16 17:11, "ianboston@gmail.com on behalf of Ian Boston" wrote:
> >Is there an assumption that the revisions listed in _revisions are
> >ordered ?
>
> There is no requirement that entries in the _revisions map
> are ordered at the storage layer, but the DocumentStore
> will order them when it reads the entries. The entries
> are sorted according to the timestamp of the revision,
> then revision counter and finally clusterId.
>
> >If not, then how is the order of the revisions determined, given that
> >the clocks on each node in a cluster will have different offsets ?
>
> Oak 1.0.x and 1.2.x maintain a revision table (in RevisionComparator)
> for each cluster node, which allows it to compare revisions across
> cluster nodes even when there are clock differences, at least for
> the 60-minute timeframe covered by the RevisionComparator.
>
> Oak 1.4 uses revision vectors and does not maintain a revision
> table anymore. See OAK-3646. At the same time it also simplifies
> how revisions are compared and how changes are pulled in from
> other cluster nodes. The background read operation ensures that
> external changes made visible all have a lower revision timestamp
> than the local clock. This ensures that all local changes from that
> point on will have a higher revision timestamp than externally
> visible changes. This part was also backported to 1.0 and 1.2.
> See OAK-3388.
>


So, IIUC, (based on Revision.compareTo(Revision) used by
StableRevisionComparator).

If one instance within a cluster has a clock that is lagging the others,
and all instances are making changes at the same time, then the changes
that the other instances make will be used, even if the lagging instance
makes changes after (in real synchronised time) the others ?

I can see that this won't matter for the majority of nodes, as collisions
are rare, but won't the lagging instance be always overridden in the root
document _revisions list ?

Are there any plans to maintain a clock difference vector for the cluster ?

Best Regards
Ian





Re: DocumentStore question.

Posted by Marcel Reutegger <mr...@adobe.com>.
Hi,

On 12/02/16 17:11, "ianboston@gmail.com on behalf of Ian Boston" wrote:
>Is there an assumption that the revisions listed in _revisions are
>ordered ?

There is no requirement that entries in the _revisions map
are ordered at the storage layer, but the DocumentStore
will order them when it reads the entries. The entries
are sorted according to the timestamp of the revision,
then revision counter and finally clusterId.
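
The ordering described above (timestamp, then counter, then clusterId)
can be sketched as a plain comparator. This is only an illustration:
the Rev class and the STABLE_ORDER name are made up for this example
and are not Oak's Revision or StableRevisionComparator classes.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class RevisionOrderSketch {

    /** Hypothetical stand-in for a revision; not Oak's Revision class. */
    static final class Rev {
        final long timestamp;
        final int counter;
        final int clusterId;

        Rev(long timestamp, int counter, int clusterId) {
            this.timestamp = timestamp;
            this.counter = counter;
            this.clusterId = clusterId;
        }

        @Override
        public String toString() {
            // Decimal output, for readability only.
            return "r" + timestamp + "-" + counter + "-" + clusterId;
        }
    }

    /** Orders by timestamp, then counter, then clusterId. */
    static final Comparator<Rev> STABLE_ORDER =
            Comparator.<Rev>comparingLong(r -> r.timestamp)
                    .thenComparingInt(r -> r.counter)
                    .thenComparingInt(r -> r.clusterId);

    /** Returns a sorted copy; the storage order of the input does not matter. */
    static List<Rev> sorted(List<Rev> entries) {
        List<Rev> copy = new ArrayList<>(entries);
        copy.sort(STABLE_ORDER);
        return copy;
    }
}
```

The point of sorting on read is that the storage layer is free to
return the map entries in any order.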

>If not, then how is the order of the revisions determined, given that
>the clocks on each node in a cluster will have different offsets ?

Oak 1.0.x and 1.2.x maintain a revision table (in RevisionComparator)
for each cluster node, which allows it to compare revisions across
cluster nodes even when there are clock differences, at least for
the 60-minute timeframe covered by the RevisionComparator.

Oak 1.4 uses revision vectors and does not maintain a revision
table anymore. See OAK-3646. At the same time it also simplifies
how revisions are compared and how changes are pulled in from
other cluster nodes. The background read operation ensures that
external changes made visible all have a lower revision timestamp
than the local clock. This ensures that all local changes from that
point on will have a higher revision timestamp than externally
visible changes. This part was also backported to 1.0 and 1.2.
See OAK-3388.

Regards
 Marcel