Posted to dev@couchdb.apache.org by Adam Kocoloski <ko...@apache.org> on 2011/02/18 03:45:22 UTC

idea for transitive replication checkpoints

Hi all, Paul and I were chatting at today's CouchDB Hack Night about a way to fast-forward replications (thanks Max for the prodding!).  It's non-trivial, but I think the benefit for big networks of CouchDB servers can be substantial.

The basic idea is that if A replicates with B, and B with C, then a new replication between A and C should not need to start from scratch.  I think we can accomplish this as follows:

1) Store the target update sequence along with the source sequence in the checkpoint document, at least in the checkpoint document on the target.  The following tuple is important: {Source, _local ID, Session ID, SourceSeq, TargetSeq}.  Using that syntax let's say we have the following replication records:

On A
{A, _local/Foo, Bar, 5, _TargetSeq} % we could omit the target sequence on the source

On B
{A, _local/Foo, Bar, 5, 10} % 5 on A corresponds to 10 on B
{B, _local/Baz, Bif, 15, _TargetSeq}

On C
{B, _local/Baz, Bif, 15, 7} % 15 on B corresponds to 7 on C

We know that A -> B happened before B -> C.

2) During the B -> C replication, when we reach source sequence number 10, the _changes feed from B will deliver some extra information like

{A, _local/Foo, Bar, 5}

which will be stored at C. This may require a new disk-resident btree keyed on update sequence, or at least an in-memory index constructed by walking the _local docs btree.
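
To make the shape of that index concrete, here is a rough sketch in Erlang; none of this is actual CouchDB code, the module and function names are invented, and it assumes the checkpoint tuples have the {Source, _local ID, Session ID, SourceSeq, TargetSeq} form described above.

-module(transitive_index).
-export([build/1, lookup/2]).

%% Checkpoints is a list of {Source, LocalId, SessionId, SourceSeq, TargetSeq}
%% tuples. Only entries where this node was the target of the replication
%% (i.e. TargetSeq is an actual sequence number) are worth indexing.
build(Checkpoints) ->
    lists:foldl(
        fun({Source, LocalId, SessionId, SourceSeq, TargetSeq}, Acc)
                when is_integer(TargetSeq) ->
            maps:put(TargetSeq, {Source, LocalId, SessionId, SourceSeq}, Acc);
           (_Other, Acc) ->
            Acc
        end, #{}, Checkpoints).

%% Consulted as the _changes feed walks the seq tree; returns the extra
%% record to deliver once Seq itself has been sent to the target.
lookup(Seq, Index) ->
    maps:find(Seq, Index).

With the example records, build/1 would file {A, _local/Foo, Bar, 5} under B's sequence 10, which is exactly the point at which the B -> C walk should emit it.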

3) When we trigger the A -> C replication, C will walk the full checkpoint records in its _local tree and find no mention of A, but then it will also consult the "transitive" checkpoints and find the {A, _local/Foo, Bar, 5} record.  It'll consult _local/Foo on A, find that the session ID Bar is still present, and conclude that it can fast-forward the replication and start from update sequence 5.  It will then remove that transitive checkpoint and replace it with a full regular checkpoint.
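
A rough sketch of that decision logic, again with invented names; FetchSessions stands in for however C would actually read the _local/Foo checkpoint doc from A (presumably over HTTP) and extract its list of session IDs.

-module(fast_forward).
-export([start_seq/2]).

%% Transitive is the {Source, LocalId, SessionId, SourceSeq} record C picked
%% up during an earlier replication. FetchSessions(LocalId) returns
%% {ok, SessionIds} or {error, not_found}.
start_seq({_Source, LocalId, SessionId, SourceSeq}, FetchSessions) ->
    case FetchSessions(LocalId) of
        {ok, SessionIds} ->
            case lists:member(SessionId, SessionIds) of
                true  -> SourceSeq;  % session still recorded: fast-forward
                false -> 0           % history diverged, e.g. restored backup
            end;
        {error, not_found} ->
            0
    end.

Returning 0 just means "no usable checkpoint, start from scratch", which is also the behavior the crash/restore case below relies on.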

If server A crashes after the A -> B replication and restores from a backup that was recorded before the replication, the session ID Bar will be missing from _local/Foo, so when we try to do the A -> C replication we won't fast-forward.  This is the correct behavior.

Hopefully this is comprehensible to someone other than me.  We spent some time trying to poke holes in it, but it's entirely possible there are other things we didn't consider that will prevent it from working.  Cheers,

Adam

Re: idea for transitive replication checkpoints

Posted by Paul Davis <pa...@gmail.com>.
On Fri, Feb 18, 2011 at 2:33 PM, Adam Kocoloski <ko...@apache.org> wrote:
> On Feb 18, 2011, at 11:16 AM, Paul Davis wrote:
>
>> On Fri, Feb 18, 2011 at 1:23 AM, Paul Davis <pa...@gmail.com> wrote:
>>> On Fri, Feb 18, 2011 at 1:23 AM, Paul Davis <pa...@gmail.com> wrote:
>>>> On Fri, Feb 18, 2011 at 1:15 AM, Paul Davis <pa...@gmail.com> wrote:
>>>>> On Thu, Feb 17, 2011 at 9:45 PM, Adam Kocoloski <ko...@apache.org> wrote:
>>>>>> [Adam's original proposal snipped; it is quoted in full at the top of this thread]
>>>>>
>>>>> What Adam said. Also, I was just doing a brain dump and I think I
>>>>> might've punched a gaping hole into the whole scenario. I'm not
>>>>> entirely certain yet, but it seems ungood. There's a section "Ruh Roh"
>>>>> towards the end where my brain dump froze up. It's late so maybe I'm
>>>>> just not seeing the easy way around it.
>>>>>
>>>>> There's also a picture of the end of our white board session at
>>>>> http://plixi.com/p/78268064 which probably means little to nothing
>>>>> without the context of having seen it drawn and the ensuing argument
>>>>> and wild gesticulations. But it's there for posterity.
>>>>>
>>>>> <brain_dump>
>>>>>
>>>>> Transitive Replication - The Idea
>>>>> =================================
>>>>>
>>>>> Consider the following scenario:
>>>>>
>>>>> 1. Replicate A -> B
>>>>> 2. Replicate B -> C
>>>>> 3. Replicate A -> C
>>>>>
>>>>> For simplicity's sake, assume no writes occur during this scenario. The
>>>>> question is why can't we short circuit step 3 to effectively be a no-op?
>>>>>
>>>>> Current Situation
>>>>> =================
>>>>>
>>>>> Replication state is based on a pair-wise state reflecting source and
>>>>> target information (and filter functions etc). For the above scenario to
>>>>> be anywhere near plausible a couple things need to happen. First, we'll
>>>>> obviously need to transfer data from B -> C during replication so it
>>>>> has knowledge about A. This information will have to be complete enough
>>>>> to short circuit (or skip part of) a replication from A.
>>>>>
>>>>> The information that B sends to C will need to enable a replication from
>>>>> A to C to occur without error in any sort of pathological state of A,
>>>>> regardless of what state C thinks A is in. Changes in state may include
>>>>> A "forgetting" some edits and resetting to a point in time before the
>>>>> state that C has (for instance, A crashed and was recovered to a previous
>>>>> point in time).
>>>>>
>>>>> C will also need to be able to uniquely identify A regardless of host or
>>>>> other transitory characteristics.
>>>>>
>>>>> An Old Proposition
>>>>> ==================
>>>>>
>>>>> There's been a proposal floated a few times for a few different reasons
>>>>> to give each database a UUID so that it is uniquely identifiable for
>>>>> various reasons (ETags come to mind). Such a UUID were it to exist would
>>>>> allow us to uniquely identify a database in the above scenario.
>>>>>
>>>>> The first issue with db UUID's that always pops up is that we have to
>>>>> address what happens when someone copies a database (perhaps to short
>>>>> circuit an initial replication, or to restore a db when a machine
>>>>> fails): the UUID may no longer be globally unique.
>>>>>
>>>>> This would need to be fixed for transitive replication to have any
>>>>> chance of working. One solution that was mentioned was to have each
>>>>> CouchDB node remember all UUID's that it knows about and if a db is
>>>>> opened with an unknown UUID, that db gets a new UUID assigned.
>>>>>
>>>>> This could be accomplished efficiently by storing _local docs in the
>>>>> replicator database that reference known UUID/dbname pairs. Then we
>>>>> just lookup the UUID on db open and if it doesn't match the db name
>>>>> we reset it.
>>>>>
>>>>> For upgrade compatibility and the ability to change UUID's often we
>>>>> could just store the UUID in the db header (as opposed to the first
>>>>> sixteen bytes of the file).
>>>>>
>>>>> Information Propagation Requirements
>>>>> ====================================
>>>>>
>>>>> When replication occurs we need to inform the target database of a
>>>>> few pieces of information so that it knows about transitive replications
>>>>> that it contains. We also need to make sure that the target db doesn't
>>>>> learn about this information before it contains the entire replica set
>>>>> and it needs to be processed in such a way that it doesn't require
>>>>> complete replications.
>>>>>
>>>>> These requirements pretty much lead us to the fact that the replica
>>>>> state will need to be beamed across as the target receives information
>>>>> from the source update sequence. I.e., when we iterate the _changes feed
>>>>> we get extra info when we've arrived at an update_seq that wholly contains
>>>>> some prior replication from an arbitrary node to the *source*.
>>>>>
>>>>> Information to Propagate
>>>>> ========================
>>>>>
>>>>> Now we need to consider what information needs to exist on a db in
>>>>> order to figure out if we *can* short circuit a replication as well as
>>>>> where we fast forward *to*.
>>>>>
>>>>> One obvious piece of information is the UUID of the database stream. A
>>>>> second piece would be the update_seq for that UUID. After some thought
>>>>> we also realize we need to store some more information to check if that
>>>>> UUID-update_seq pair is still valid when we go to fast-forward.
>>>>>
>>>>> The case that could invalidate a pair is if a database crashes and it
>>>>> needs to be restored. Consider if A replicates to B replicates to C. C
>>>>> has a state {A-UUID, A-update_seq}. Say A-update_seq is 10 for this
>>>>> thought experiment. Now at some point after C learns of A, A crashes and
>>>>> is restored from backup. Now A is at update_seq 5. Now we go on with
>>>>> our business and write 5 docs to A. But we also write 5 *different* docs
>>>>> than we wrote before the restore. This divergence in history would not
>>>>> be detectable without extra information.
>>>>>
>>>>> After much hand waving about rolling hashes, Adam decided to remember
>>>>> that we store a replication history between two db's. This can be
>>>>> represented as a _local doc id that includes information on the pair
>>>>> of db's as well as a random session id. If we include this data with
>>>>> the UUID-update_seq pair, when we check if a short circuit is possible
>>>>> we can check that this record still exists.
>>>>>
>>>>> In the case of the crash/restore even if we go and make the same edits
>>>>> and even have a similar replication history, the randomness of the
>>>>> session id will inform us that something has gone awry and we need to
>>>>> run a full replication to make sure we have all history.
>>>>>
>>>>>
>>>>> Information Required to Trigger Propagation
>>>>> ===========================================
>>>>>
>>>>> Along with the four pieces of information mentioned above, we also need
>>>>> to store what update_seq in the target database was the *result* of a
>>>>> replication. I.e., when we replicate A -> B, B needs to know the final
>>>>> update_seq of that replication transaction. This is so that when B
>>>>> replicates to C, it knows when to tell C about A. We can't do this at the
>>>>> very beginning because the replication might fail before all of the
>>>>> info from A is replicated. We also can't wait until the end because then
>>>>> C may never learn of A because of failure.
>>>>>
>>>>> This means that we need to know, for a given update_seq, whether after it
>>>>> has been replicated C can suddenly fast-forward a replication with someone
>>>>> other than B. To do this B will need to be able to stream its update
>>>>> sequence and efficiently check if that completes some replication record
>>>>> that C should know about.
>>>>>
>>>>> We might quickly assume that storing this in the existing update seq
>>>>> b+tree would be kosher, but it isn't. Consider the case where update_seq
>>>>> 6 on B is the end of the replication A -> B. Now consider that B starts
>>>>> replicating to C while someone starts updating the doc for update_seq
>>>>> 6 on B. It's possible that a series of events could lead to C never
>>>>> learning of A because the update_seq for the doc that was at 6 keeps jumping
>>>>> to the latest update_seq.
>>>>>
>>>>> The proper way to fix this would be to insert code that says "when an
>>>>> update_seq entry is updated, move its replication info to the next update
>>>>> seq" which sounds like it could get really quite wonky.
>>>>>
>>>>> So the solution would be to have some sort of indexed structure of
>>>>> replication records that can be scanned to know when to send out some
>>>>> replication finished....
>>>>>
>>>>> Ruh Roh
>>>>> =======
>>>>>
>>>>> I just realized something wonky with this whole plan. We *don't*
>>>>> necessarily know, from update sequences alone, when a replication ends. For
>>>>> instance, if we replicate A -> B, and then edit a doc from A on B, and then
>>>>> replicate B -> C, can we ever know when to short circuit a replication?
>>>>>
>>>>> This could be a huge gaping hole. Someone prove me wrong.
>>>>>
>>>>> Storing Replication State
>>>>> =========================
>>>>>
>>>>> With this new piece of information we'll also require some way to store
>>>>> replication state. This should hopefully be hand-wavy trivial by just
>>>>> storing replication records in _local docs very similarly to how they're
>>>>> currently stored.
>>>>>
>>>>> </brain_dump>
>>>>>
>>>>
>>>> The important point of my ruh roh that I failed to articulate: the
>>>> reason this is bad is that if we edit the doc on B before replicating
>>>> to C, C *can't* know what's on A until it gets to the new version of
>>>> the doc in B. This, coupled with the fact that we can edit anything on
>>>> B and that edited docs all jump to the end, makes me think that we'd
>>>> have to do some more extensive bookkeeping to make sure that C doesn't
>>>> know about B until after all of A's docs get pushed.
>>>>
>>>> Blargghhh....
>>>>
>>>
>>> Doesn't know about A until all of A's docs get pushed. It's late. I'm out.
>>>
>>
>> After sleeping on it, I think that this doesn't shoot the whole idea
>> out of the sky, but requires us to only send the info when a
>> replication manages to reach the end of the update_seq btree in a
>> single db snapshot. I'm not sure if that means that it'd be out of the
>> question for continuous replication or not.
>
> Hi Paul, thanks for this articulate writeup.  I think you're correct in this last email, we can only send these extra bits of information about other replications whenever we've reached the end of an MVCC snapshot from the current source.  That shouldn't be a problem for continuous replication, since under the hood it's implemented as a loop of "open / walk seq_tree / wait for new updates" calls.  We can just send any new transitive checkpoints that we encountered during the current walk just before going into the "wait for new updates" step.
>
> Adam

heh, articulate.

Sounds good for continuous replication. I wasn't certain if it just
held on to an open changes feed or not, but if not then it sounds like
it'll all be clean enough.

Re: idea for transitive replication checkpoints

Posted by Adam Kocoloski <ko...@apache.org>.
On Feb 18, 2011, at 11:16 AM, Paul Davis wrote:

> [nested quotes of earlier messages snipped; they appear in full elsewhere in this thread]
> 
> After sleeping on it, I think that this doesn't shoot the whole idea
> out of the sky, but requires us to only send the info when a
> replication manages to reach the end of the update_seq btree in a
> single db snapshot. I'm not sure if that means that it'd be out of the
> question for continuous replication or not.

Hi Paul, thanks for this articulate writeup.  I think you're correct in this last email, we can only send these extra bits of information about other replications whenever we've reached the end of an MVCC snapshot from the current source.  That shouldn't be a problem for continuous replication, since under the hood it's implemented as a loop of "open / walk seq_tree / wait for new updates" calls.  We can just send any new transitive checkpoints that we encountered during the current walk just before going into the "wait for new updates" step.
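
To sketch the loop shape I mean (hypothetical funs, not the actual couch_rep API):

-module(continuous_rep).
-export([loop/2]).

%% Db is a map of funs standing in for the real source/target operations:
%%   open_snapshot    : fun(() -> Snapshot)
%%   fold_changes     : fun((Snapshot, Since) -> {NewSince, Transitive})
%%   send_transitive  : fun((Transitive) -> ok)
%%   wait_for_updates : fun(() -> ok)
loop(#{open_snapshot := Open, fold_changes := Fold,
       send_transitive := Send, wait_for_updates := Wait} = Db, Since) ->
    Snapshot = Open(),
    {NewSince, Transitive} = Fold(Snapshot, Since),
    %% Only here, at the end of the MVCC snapshot walk, is it safe to pass
    %% along checkpoints describing other replications into the source.
    ok = Send(Transitive),
    ok = Wait(),
    loop(Db, NewSince).

The important property is just that Send runs after Fold has drained the snapshot, never in the middle of it.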

Adam

Re: idea for transitive replication checkpoints

Posted by Paul Davis <pa...@gmail.com>.
On Fri, Feb 18, 2011 at 1:23 AM, Paul Davis <pa...@gmail.com> wrote:
> [nested quotes of earlier messages snipped; they appear in full elsewhere in this thread]
>
> Doesn't know about A until all of A's docs get pushed. It's late. I'm out.
>

After sleeping on it, I think that this doesn't shoot the whole idea
out of the sky, but requires us to only send the info when a
replication manages to reach the end of the update_seq btree in a
single db snapshot. I'm not sure if that means that it'd be out of the
question for continuous replication or not.
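
In other words, something like this guard (invented names): only release the pending transitive checkpoint info if the walk actually covered the whole snapshot.

-module(snapshot_guard).
-export([maybe_flush/3]).

%% LastSentSeq: last sequence actually delivered to the target in this pass.
%% SnapshotSeq: committed update_seq of the snapshot that was walked.
%% Pending: transitive checkpoint info collected during the walk.
maybe_flush(LastSentSeq, SnapshotSeq, Pending) when LastSentSeq >= SnapshotSeq ->
    {send, Pending};
maybe_flush(_LastSentSeq, _SnapshotSeq, Pending) ->
    {hold, Pending}.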

Re: idea for transitive replication checkpoints

Posted by Paul Davis <pa...@gmail.com>.
On Fri, Feb 18, 2011 at 1:23 AM, Paul Davis <pa...@gmail.com> wrote:
> [nested quotes of earlier messages snipped; they appear in full elsewhere in this thread]
>
> The important point of my ruh roh that I failed to articulate: the
> reason this is bad is that if we edit the doc on B before replicating
> to C, C *can't* know what's on A until it gets to the new version of
> the doc in B. This, coupled with the fact that we can edit anything on
> B and that edited docs all jump to the end, makes me think that we'd
> have to do some more extensive bookkeeping to make sure that C doesn't
> know about B until after all of A's docs get pushed.
>
> Blargghhh....
>

Doesn't know about A until all of A's docs get pushed. It's late. I'm out.

Re: idea for transitive replication checkpoints

Posted by Paul Davis <pa...@gmail.com>.
On Fri, Feb 18, 2011 at 1:15 AM, Paul Davis <pa...@gmail.com> wrote:
> On Thu, Feb 17, 2011 at 9:45 PM, Adam Kocoloski <ko...@apache.org> wrote:
>> [Adam's original proposal snipped; it is quoted in full at the top of this thread]
>
> What Adam said. Also, I was just doing a brain dump and I think I
> might've punched a gaping hole into the whole scenario. I'm not
> entirely certain yet, but it seems ungood. There's a section "Ruh Roh"
> towards the end where my brain dump froze up. Its late so maybe I'm
> just not seeing the easy way around it.
>
> There's also a picture of the end of our whiteboard session at
> http://plixi.com/p/78268064 which probably means little to nothing
> without the context of having seen it drawn and the ensuing argument
> and wild gesticulations. But it's there for posterity.
>
> <brain_dump>
>
> Transitive Replication - The Idea
> =================================
>
> Consider the following scenario:
>
> 1. Replicate A -> B
> 2. Replicate B -> C
> 3. Replicate A -> C
>
> For simplicity's sake, assume no writes occur during this scenario. The
> question is: why can't we short-circuit step 3 so that it's effectively
> a no-op?
>
> Current Situation
> =================
>
> Replication state is currently pair-wise, reflecting source and
> target information (and filter functions etc). For the above scenario to
> be anywhere near plausible a couple of things need to happen. First, we'll
> obviously need to transfer data from B -> C during replication so that C
> has knowledge about A. This information will have to be complete enough
> to short-circuit (or skip part of) a replication from A.
>
> The information that B sends to C will need to enable a replication from
> A to C to occur without error in any sort of pathological state of A,
> regardless of what state C thinks A is in. Changes in state may include
> A "forgetting" some edits and resetting to a point in time earlier than
> the state that C has (for instance, A crashed and was recovered to a
> previous point in time).
>
> C will also need to be able to uniquely identify A regardless of host or
> other transitory characteristics.
>
> An Old Proposition
> ==================
>
> There's been a proposal floated a few times, for a few different
> reasons (ETags come to mind), to give each database a UUID so that it
> is uniquely identifiable. Such a UUID, were it to exist, would allow us
> to uniquely identify a database in the above scenario.
>
> The first issue with db UUIDs that always pops up is what happens when
> someone copies a database (perhaps to short-circuit an initial
> replication, or to restore a db when a machine fails): the UUID may no
> longer be globally unique.
>
> This would need to be fixed for transitive replication to have any
> chance of working. One solution that was mentioned was to have each
> CouchDB node remember all the UUIDs it knows about, and if a db is
> opened with an unknown UUID, that db gets a new UUID assigned.
>
> This could be accomplished efficiently by storing _local docs in the
> replicator database that reference known UUID/dbname pairs. Then we
> just look up the UUID on db open and, if it doesn't match the db name,
> we reset it.
>
> For upgrade compatibility, and the ability to change UUIDs often, we
> could just store the UUID in the db header (as opposed to the first
> sixteen bytes of the file).
>
> Information Propagation Requirements
> ====================================
>
> When replication occurs we need to inform the target database of a
> few pieces of information so that it knows about the transitive
> replications it contains. We also need to make sure that the target db
> doesn't learn about this information before it contains the entire
> replica set, and the information needs to be processed in such a way
> that it doesn't require complete replications.
>
> These requirements pretty much lead us to the fact that the replica
> state will need to be beamed across as the target receives information
> from the source update sequence. I.e., when we iterate the _changes feed
> we get extra info once we've arrived at an update_seq that wholly contains
> some prior replication from an arbitrary node to the *source*.
>
> Information to Propagate
> ========================
>
> Now we need to consider what information needs to exist on a db in
> order to figure out if we *can* short circuit a replication as well as
> where we fast forward *to*.
>
> One obvious piece of information is the UUID of the database stream. A
> second piece would be the update_seq for that UUID. After some thought
> we also realize we need to store some more information to check if that
> UUID-update_seq pair is still valid when we go to fast-forward.
>
> The case that could invalidate a pair is a database crashing and
> needing to be restored. Consider A replicating to B, and B to C. C
> has a state {A-UUID, A-update_seq}. Say A-update_seq is 10 for this
> thought experiment. At some point after C learns of A, A crashes and
> is restored from backup, leaving A at update_seq 5. We then go on with
> our business and write 5 docs to A, but they are 5 *different* docs
> from the ones we wrote before the restore. This divergence in history
> would not be detectable without extra information.
>
> After much hand waving about rolling hashes, Adam decided to remember
> that we store a replication history between two dbs. This can be
> represented as a _local doc id that includes information on the pair
> of dbs as well as a random session id. If we include this data with
> the UUID-update_seq pair, then when we check whether a short-circuit is
> possible we can also check that this record still exists.
>
> In the crash/restore case, even if we go and make the same edits
> and even have a similar replication history, the randomness of the
> session id will inform us that something has gone awry and that we need
> to run a full replication to make sure we have all of the history.
>
>
> Information Required to Trigger Propagation
> ===========================================
>
> Along with the four pieces of information mentioned above, we also need
> to store which update_seq in the target database was the *result* of a
> replication. I.e., when we replicate A -> B, B needs to know the final
> update_seq of that replication transaction. This is so that when B
> replicates to C, it knows when to tell C about A. We can't do this at the
> very beginning because the replication might fail before all of the
> info from A is replicated. We also can't wait until the very end,
> because then a failure could mean C never learns of A.
>
> This means that we need to know, for a given update_seq, whether, once it
> has been replicated, C can suddenly fast-forward a replication with someone
> other than B. To do this, B will need to be able to stream its update
> sequence and efficiently check whether a given seq completes some
> replication record that C should know about.
>
> We might quickly assume that storing this in the existing update_seq
> b+tree would be kosher, but it isn't. Consider the case where update_seq
> 6 on B is the end of the replication A -> B. Now consider that B starts
> replicating to C while someone starts updating the doc at update_seq
> 6 on B. It's possible that a series of events could lead to C never
> learning of A, because the update_seq for that doc id keeps jumping
> from 6 to the latest update_seq.
>
> The proper way to fix this would be to insert code that says "when an
> update_seq entry is updated, move its replication info to the next update
> seq" which sounds like it could get really quite wonky.
>
> So the solution would be to have some sort of indexed structure of
> replication records that can be scanned to know when to send out a
> "replication finished here" record...
>
> Ruh Roh
> =======
>
> I just realized something wonky with this whole plan. We *don't*
> necessarily know, in terms of update sequences, when a replication ends. For
> instance, if we replicate A -> B, and then edit a doc from A on B, and then
> replicate B -> C, can we ever know when to short-circuit a replication?
>
> This could be a huge gaping hole. Someone prove me wrong.
>
> Storing Replication State
> =========================
>
> With this new piece of information we'll also require some way to store
> replication state. This should hopefully be hand-wavy trivial by just
> storing replication records in _local docs very similarly to how they're
> currently stored.
>
> </brain_dump>
>

The important point of my ruh roh that I failed to articulate: the
reason this is bad is that if we edit the doc on B before replicating
to C, C *can't* know what's on A until it gets to the new version of
the doc on B. This, coupled with the fact that we can edit anything on
B and that edited docs all jump to the end of the _changes feed, makes
me think we'd have to do some more extensive bookkeeping to make sure
that C doesn't know about B until after all of A's docs get pushed.

Blargghhh....

Re: idea for transitive replication checkpoints

Posted by Paul Davis <pa...@gmail.com>.
On Thu, Feb 17, 2011 at 9:45 PM, Adam Kocoloski <ko...@apache.org> wrote:
> Hi all, Paul and I were chatting at today's CouchDB Hack Night about a way to fast-forward replications (thanks Max for the prodding!).  It's non-trivial, but I think the benefit for big networks of CouchDB servers can be substantial.
>
> The basic idea is that if A replicates with B, and B with C, then a new replication between A and C should not need to start from scratch.  I think we can accomplish this as follows:
>
> 1) Store the target update sequence along with the source sequence in the checkpoint document, at least in the checkpoint document on the target.  The following tuple is important: {Source, _local ID, Session ID, SourceSeq, TargetSeq}.  Using that syntax let's say we have the following replication records:
>
> On A
> {A, _local/Foo, Bar, 5, _TargetSeq} % we could omit the target sequence on the source
>
> On B
> {A, _local/Foo, Bar, 5, 10} % 5 on A corresponds to 10 on B
> {B, _local/Baz, Bif, 15, _TargetSeq}
>
> On C
> {B, _local/Baz, Bif, 15, 7} % 15 on B corresponds to 7 on C
>
> We know that A -> B happened before B -> C.
>
> 2) During the B -> C replication, when we reach source sequence number 10, the _changes feed from B will deliver some extra information like
>
> {A, _local/Foo, Bar, 5}
>
> which will be stored at C. This may require a new disk-resident btree keyed on update sequence, or at least an in-memory index constructed by walking the _local docs btree.
>
> 3) When we trigger the A -> C replication, C will walk the full checkpoint records in its _local tree and find no mention of A, but then it will also consult the "transitive" checkpoints and find the {A, _local/Foo, Bar, 5} record.  It'll consult _local/Foo on A, find that the session ID Bar is still present, and conclude that it can fast-forward the replication and start from update sequence 5.  It will then remove that transitive checkpoint and replace it with a full regular checkpoint.
>
> If server A crashes after the A -> B replication and restores from a backup that was recorded before the replication, the session ID Bar will be missing from _local/Foo, so when we try to do the A -> C replication we won't fast-forward.  This is the correct behavior.
>
> Hopefully this is comprehensible to someone other than me.  We spent some time trying to poke holes in it, but it's entirely possible there are other things we didn't consider that will prevent it from working.  Cheers,
>
> Adam

What Adam said. Also, I was just doing a brain dump and I think I
might've punched a gaping hole in the whole scenario. I'm not
entirely certain yet, but it seems ungood. There's a section "Ruh Roh"
towards the end where my brain dump froze up. It's late, so maybe I'm
just not seeing the easy way around it.

There's also a picture of the end of our whiteboard session at
http://plixi.com/p/78268064 which probably means little to nothing
without the context of having seen it drawn and the ensuing argument
and wild gesticulations. But it's there for posterity.

<brain_dump>

Transitive Replication - The Idea
=================================

Consider the following scenario:

1. Replicate A -> B
2. Replicate B -> C
3. Replicate A -> C

For simplicity's sake, assume no writes occur during this scenario. The
question is: why can't we short-circuit step 3 so that it's effectively
a no-op?

Current Situation
=================

Replication state is currently pair-wise, reflecting source and
target information (and filter functions etc). For the above scenario to
be anywhere near plausible a couple of things need to happen. First, we'll
obviously need to transfer data from B -> C during replication so that C
has knowledge about A. This information will have to be complete enough
to short-circuit (or skip part of) a replication from A.

The information that B sends to C will need to enable a replication from
A to C to occur without error in any sort of pathological state of A,
regardless of what state C thinks A is in. Changes in state may include
A "forgetting" some edits and resetting to a point in time earlier than
the state that C has (for instance, A crashed and was recovered to a
previous point in time).

C will also need to be able to uniquely identify A regardless of host or
other transitory characteristics.

An Old Proposition
==================

There's been a proposal floated a few times, for a few different
reasons (ETags come to mind), to give each database a UUID so that it
is uniquely identifiable. Such a UUID, were it to exist, would allow us
to uniquely identify a database in the above scenario.

The first issue with db UUIDs that always pops up is what happens when
someone copies a database (perhaps to short-circuit an initial
replication, or to restore a db when a machine fails): the UUID may no
longer be globally unique.

This would need to be fixed for transitive replication to have any
chance of working. One solution that was mentioned was to have each
CouchDB node remember all the UUIDs it knows about, and if a db is
opened with an unknown UUID, that db gets a new UUID assigned.

This could be accomplished efficiently by storing _local docs in the
replicator database that reference known UUID/dbname pairs. Then we
just look up the UUID on db open and, if it doesn't match the db name,
we reset it.
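
To make that concrete, here's a rough sketch of the lookup-and-reset
step (not CouchDB code; the map stands in for the _local docs in the
replicator db, and new_uuid/0 for whatever generator we'd really use):

    -module(uuid_reset_sketch).
    -export([open_db_uuid/3]).

    %% KnownUuids maps UUID -> DbName and stands in for the _local docs
    %% in the replicator database. Returns {UuidToUse, KnownUuids2}.
    open_db_uuid(DbName, HeaderUuid, KnownUuids) ->
        case maps:find(HeaderUuid, KnownUuids) of
            {ok, DbName} ->
                %% Same db we already know about; keep its UUID.
                {HeaderUuid, KnownUuids};
            {ok, _OtherDbName} ->
                %% The file was copied; hand the copy a fresh UUID so it
                %% can't masquerade as the original.
                assign_fresh(DbName, KnownUuids);
            error ->
                %% Unknown UUID; assign a fresh one and remember it.
                assign_fresh(DbName, KnownUuids)
        end.

    assign_fresh(DbName, KnownUuids) ->
        Fresh = new_uuid(),
        {Fresh, maps:put(Fresh, DbName, KnownUuids)}.

    new_uuid() ->
        Bin = crypto:strong_rand_bytes(16),
        iolist_to_binary([io_lib:format("~2.16.0b", [B]) || <<B>> <= Bin]).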

For upgrade compatibility, and the ability to change UUIDs often, we
could just store the UUID in the db header (as opposed to the first
sixteen bytes of the file).

Information Propagation Requirements
====================================

When replication occurs we need to inform the target database of a
few pieces of information so that it knows about the transitive
replications it contains. We also need to make sure that the target db
doesn't learn about this information before it contains the entire
replica set, and the information needs to be processed in such a way
that it doesn't require complete replications.

These requirements pretty much lead us to the fact that the replica
state will need to be beamed across as the target receives information
from the source update sequence. I.e., when we iterate the _changes feed
we get extra info once we've arrived at an update_seq that wholly contains
some prior replication from an arbitrary node to the *source*.

Information to Propagate
========================

Now we need to consider what information needs to exist on a db in
order to figure out if we *can* short circuit a replication as well as
where we fast forward *to*.

One obvious piece of information is the UUID of the database stream. A
second piece would be the update_seq for that UUID. After some thought
we also realize we need to store some more information to check if that
UUID-update_seq pair is still valid when we go to fast-forward.

The case that could invalidate a pair is a database crashing and
needing to be restored. Consider A replicating to B, and B to C. C
has a state {A-UUID, A-update_seq}. Say A-update_seq is 10 for this
thought experiment. At some point after C learns of A, A crashes and
is restored from backup, leaving A at update_seq 5. We then go on with
our business and write 5 docs to A, but they are 5 *different* docs
from the ones we wrote before the restore. This divergence in history
would not be detectable without extra information.

After much hand waving about rolling hashes, Adam decided to remember
that we store a replication history between two dbs. This can be
represented as a _local doc id that includes information on the pair
of dbs as well as a random session id. If we include this data with
the UUID-update_seq pair, then when we check whether a short-circuit is
possible we can also check that this record still exists.

In the crash/restore case, even if we go and make the same edits
and even have a similar replication history, the randomness of the
session id will inform us that something has gone awry and that we need
to run a full replication to make sure we have all of the history.
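
A sketch of that check, to make the fast-forward decision concrete. The
map shape and the "session_ids" field name are made up, and FetchLocalDoc
stands in for fetching _local/<Id> from the source over HTTP:

    -module(fast_forward_sketch).
    -export([start_seq/2]).

    %% TransCkpt is what C learned transitively about some source:
    %% #{uuid := SrcUuid, seq := SrcSeq,
    %%   local_id := LocalDocId, session_id := SessionId}.
    %% FetchLocalDoc(LocalDocId) -> {ok, Props} | not_found.
    start_seq(#{seq := SrcSeq, local_id := LocalDocId,
                session_id := SessionId}, FetchLocalDoc) ->
        case FetchLocalDoc(LocalDocId) of
            {ok, Props} ->
                Sessions = proplists:get_value(<<"session_ids">>, Props, []),
                case lists:member(SessionId, Sessions) of
                    true ->
                        %% The source still remembers this session, so its
                        %% history through SrcSeq is intact: fast-forward.
                        SrcSeq;
                    false ->
                        %% Session is gone (e.g. crash + restore from an
                        %% older backup): fall back to a full replication.
                        0
                end;
            not_found ->
                0
        end.

In Adam's example this is what would let the A -> C replication start
from A's update_seq 5 instead of 0.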


Information Required to Trigger Propagation
===========================================

Along with the four pieces of information mentioned above, we also need
to store which update_seq in the target database was the *result* of a
replication. I.e., when we replicate A -> B, B needs to know the final
update_seq of that replication transaction. This is so that when B
replicates to C, it knows when to tell C about A. We can't do this at the
very beginning because the replication might fail before all of the
info from A is replicated. We also can't wait until the very end,
because then a failure could mean C never learns of A.

This means that we need to know, for a given update_seq, whether, once it
has been replicated, C can suddenly fast-forward a replication with someone
other than B. To do this, B will need to be able to stream its update
sequence and efficiently check whether a given seq completes some
replication record that C should know about.

We might quickly assume that storing this in the existing update_seq
b+tree would be kosher, but it isn't. Consider the case where update_seq
6 on B is the end of the replication A -> B. Now consider that B starts
replicating to C while someone starts updating the doc at update_seq
6 on B. It's possible that a series of events could lead to C never
learning of A, because the update_seq for that doc id keeps jumping
from 6 to the latest update_seq.

The proper way to fix this would be to insert code that says "when an
update_seq entry is updated, move its replication info to the next update
seq" which sounds like it could get really quite wonky.

So the solution would be to have some sort of indexed structure of
replication records that can be scanned to know when to send out a
"replication finished here" record...
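
Something like the following, maybe. This just models the bookkeeping
with gb_trees (the real thing would presumably be a disk-resident
index), and it does nothing about the "Ruh Roh" below:

    -module(transitive_index_sketch).
    -export([new/0, replication_finished/3, emit_due/2]).

    new() ->
        gb_trees:empty().

    %% Called when a replication *into* this db checkpoints. EndSeq is
    %% the local update_seq that wholly contains what was received.
    %% One record per seq here, purely for simplicity.
    replication_finished(EndSeq, CkptInfo, Tree) ->
        gb_trees:enter(EndSeq, CkptInfo, Tree).

    %% Called as the outgoing _changes stream reaches Seq. Returns the
    %% checkpoint records that are now safe to hand to the target along
    %% with the remaining tree.
    emit_due(Seq, Tree) ->
        emit_due(Seq, Tree, []).

    emit_due(Seq, Tree, Acc) ->
        case gb_trees:is_empty(Tree) of
            true ->
                {lists:reverse(Acc), Tree};
            false ->
                case gb_trees:take_smallest(Tree) of
                    {EndSeq, CkptInfo, Rest} when EndSeq =< Seq ->
                        emit_due(Seq, Rest, [CkptInfo | Acc]);
                    _Beyond ->
                        %% Smallest key is past Seq; nothing else is due.
                        {lists:reverse(Acc), Tree}
                end
        end.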

Ruh Roh
=======

I just realized something wonky with this whole plan. We *don't*
necessarily know, in terms of update sequences, when a replication ends. For
instance, if we replicate A -> B, and then edit a doc from A on B, and then
replicate B -> C, can we ever know when to short-circuit a replication?

This could be a huge gaping hole. Someone prove me wrong.

Storing Replication State
=========================

With this new piece of information we'll also require some way to store
replication state. This should hopefully be hand-wavy trivial by just
storing replication records in _local docs very similarly to how they're
currently stored.
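
For what it's worth, the checkpoint doc might only need to grow a couple
of fields. Something like this EJSON body, using Adam's example numbers
(field names invented here, not the current replicator format):

    -module(checkpoint_doc_sketch).
    -export([example/0]).

    %% Checkpoint doc on B after A -> B, per Adam's example.
    example() ->
        {[
            {<<"_id">>, <<"_local/Foo">>},
            {<<"session_id">>, <<"Bar">>},
            {<<"source_uuid">>, <<"uuid-of-A">>},
            {<<"source_last_seq">>, 5},
            %% the new piece of information:
            {<<"target_last_seq">>, 10}
        ]}.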

</brain_dump>