Posted to dev@couchdb.apache.org by Randall Leeds <ra...@apache.org> on 2019/03/21 05:17:13 UTC

Re: _changes feed and replication optimizations

Adam, I think these ideas are good and it's valuable to have all this
context laid out. One of CouchDB's great strengths has always been that
replication works without any tweaking, but I think you're correctly identifying
common scenarios we could optimize for.

On Tue, Feb 19, 2019, 18:27 Adam Kocoloski <ko...@apache.org> wrote:

> (I suppose this is a [DISCUSS] - but isn’t that really the default for a
> *mailing list*? :)
>
> I’ve been having a few discussions with folks about how we might evolve to
> a more efficient replication protocol, and wanted to share some thoughts
> here. Apologies in advance, this one is kinda long.
>
> As a reminder, the current replication protocol goes something like this:
>
> 1. Read the _changes feed with the style=all_docs option to get a list of
> all leaf revisions for each document
> 2. Ask the target DB which revisions are missing using the _revs_diff
> endpoint (this is a batch operation across multiple docs)
> 3. If any document has missing revisions on the target, retrieve all of
> those revisions (and their full history) in a multipart request per doc
> 4. Save the missing revisions on the target. Documents with attachments
> are saved individually using the multipart API; otherwise _bulk_docs is used
>
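> To make those four steps concrete, here is a rough sketch in Python of what a
> replicator does over the HTTP API. It is deliberately simplified (no
> checkpoints, no attachment handling, a single batch, and placeholder database
> URLs):
>
>     import json
>     import requests
>
>     SOURCE = "http://127.0.0.1:5984/source-db"   # placeholder URLs
>     TARGET = "http://127.0.0.1:5984/target-db"
>
>     # 1. Read the _changes feed with style=all_docs to see every leaf
>     #    revision of every changed document.
>     changes = requests.get(SOURCE + "/_changes",
>                            params={"style": "all_docs", "since": "0"}).json()
>
>     # 2. Ask the target which of those revisions it is missing (one batched
>     #    _revs_diff call covering many documents).
>     revs = {row["id"]: [c["rev"] for c in row["changes"]]
>             for row in changes["results"]}
>     diff = requests.post(TARGET + "/_revs_diff", json=revs).json()
>
>     # 3. Fetch each missing revision plus its full history from the source
>     #    (the real replicator uses a multipart request per doc; plain JSON
>     #    is shown here for brevity).
>     docs = []
>     for doc_id, info in diff.items():
>         resp = requests.get(SOURCE + "/" + doc_id,
>                             params={"revs": "true",
>                                     "open_revs": json.dumps(info["missing"])},
>                             headers={"Accept": "application/json"}).json()
>         docs.extend(r["ok"] for r in resp if "ok" in r)
>
>     # 4. Save the missing revisions on the target verbatim, without
>     #    generating new revision identifiers.
>     requests.post(TARGET + "/_bulk_docs",
>                   json={"docs": docs, "new_edits": False})
>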
> I would posit that most replication jobs fall into one of two scenarios:
>
> Scenario A: First-time replication to a new database where we need to
> transfer *everything* — all revisions, and full paths for each
> Scenario B: Incremental replication; we likely only need to transfer the
> contents of the specific revision located at that seq in the feed, and a
> partial path is likely enough to merge the revision correctly into the
> target document.
>
> Our current protocol doesn’t quite handle either of those scenarios
> optimally, partly because it also wants to be efficient for the case where
> two peers have some overlapping set of edits that they’ve acquired by
> independent means and they want to avoid unnecessarily transferring
> document bodies. In practice I don’t think that happens very often. Here
> are some ideas for improvements aimed at making those two scenarios above
> more efficient. The summarized list I have in mind includes three “safe”
> optimizations:
>
> - Implement a shard-level _changes interface
> - Identify the document revision(s) actually associated with each sequence
> in the feed, and include their bodies and revision histories
> - Provide the output of _revs_diff in the response to a new_edits=false
> operation
>
> and a somewhat more exotic attempt at saving on bandwidth for replication
> of frequently-edited documents:
>
> - Include a configurable amount of revision history
> - Have the target DB report whether it was able to extend existing
> revision paths when saving with new_edits=false
>
> I’ll go into more detail on each of those below.
>
> ## Implement a shard-level _changes interface
>
> This one is applicable to both scenarios. The source cluster spends a lot
> of CPU cycles merging the feeds from individual shards and (especially)
> computing a cluster-wide sequence. This sequence obscures the actual shard
> that contributed the row in question; it doesn’t sort very cleanly, uses a
> boatload of bandwidth, and cannot be easily decoded by consumers. In
> general, yuck.
>
> Some applications will continue to benefit from having a single URL that
> provides the full database’s list of changes, but many consumers can deal
> quite well with a set of feeds to represent a database. Messaging systems
> like Kafka can be taught that a particular source is sharded, and each of
> the individual feeds has its own ordered sequence. Consuming feeds this way
> makes the dreaded “rewind” that can occur when a node fails a more obvious
> event, and an easier one for the external system to handle.
>
> Over the years we’ve grown to be quite careful about keeping enough
> metadata to ensure that we can uniquely identify a particular commit in the
> database. The full list is
>
> - shard range
> - file creation UUID
> - sequence number
> - node that owned the file for this “epoch” (a range of sequence numbers)
>
> The file creation UUID is important because a shard file might be
> destroyed on a node (hardware failure, operator error, etc.) and then
> rebuilt via internal replication, in which case the sequence order is not
> preserved. A slightly less robust alternative is the file creation
> timestamp embedded in the filename. The “epoch node” is important because
> operators have been known to copy database shard files around the cluster
> from time to time, and it’s thus possible for two different copies of a
> shard to actually share the not-so-unique UUID for a range of their
> sequences.
>
> I think that a shard-level _changes feed would do well to surface these
> bits of metadata directly, perhaps as properties in the JSON row. If we
> were concerned about bandwidth we could emit those properties in a row only
> when they changed compared to the previous sequence.
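>
> As a purely illustrative example (all of the field names and values below
> are invented, nothing here is a settled API), a per-shard row carrying that
> metadata might look something like this:
>
>     # Hypothetical per-shard _changes row; field names and values invented.
>     row = {
>         "seq": 1042,                               # plain per-shard sequence
>         "id": "mydoc",
>         "changes": [{"rev": "3-917fa2"}],
>         "range": ["80000000", "9fffffff"],         # shard range
>         "uuid": "b2c9f23d6a9d4f4f9fe5a3e71c0d1c2e", # file creation UUID
>         "node": "couchdb@db1.example.com",         # epoch owner
>     }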
>
> ## Identify the document revision(s) actually associated with each
> sequence in the feed, and include their bodies and revision histories
>
> I had to go spelunking in the codebase to check on the behavior here, and
> I was unpleasantly surprised to find that the _changes feed always includes
> the “winning” revision as the main revision in the row, *not* the
> revision(s) actually recorded into the database at this sequence. The leaves
> of the revision tree do record the sequence at which they were committed, so
> we do have this information.
>
> The reason we want to identify the revisions committed at this sequence is
> that those are the ones where we have the highest probability of needing
> to replicate the document body. The probability is so high that I think we
> should be proactively including them in the _changes feed. We can save
> quite a few lookups this way.
>
> I don’t want to be an alarmist here — 99% of the time the revision
> committed at a particular sequence *is* the winning revision. But it’s not
> required to be the case, and for replication purposes the winner is not the
> best version to include.
>
> Of course, there are other consumers of the _changes feed besides
> replication who may not be at all interested in edits that are not on the
> winning branch, so we likely need a flag that controls this behavior. An
> analytics system, for example, could use the high-performance shard-level
> feeds to grab a full snapshot of the database, and would want to receive
> just the winning revision of each document.
>
> Finally, if we know we’re in Scenario A we would want to receive the
> bodies of *all* leaf revisions ASAP, so that’s yet a third option.
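>
> To illustrate the difference (field values invented, and no option names are
> being proposed here), consider a conflicted document where the edit recorded
> at sequence 7 landed on the losing branch:
>
>     # Today: the row always carries the winning revision, even though it is
>     # not the revision that was committed at this sequence.
>     winner_row = {"seq": 7, "id": "mydoc", "changes": [{"rev": "4-aaa"}]}
>
>     # For replication we would rather see the revision actually committed at
>     # this sequence, together with its body and (possibly partial) history.
>     committed_row = {
>         "seq": 7,
>         "id": "mydoc",
>         "changes": [{"rev": "3-bbb"}],
>         "doc": {"_id": "mydoc", "_rev": "3-bbb",
>                 "_revisions": {"start": 3, "ids": ["bbb", "bb0", "ba9"]}},
>     }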
>
> ## Provide the output of _revs_diff in the response to a new_edits=false
> operation
>
> I think this one is relatively simple. We’re confident that the update(s)
> which just landed in our _changes feed need to be replicated, so we eagerly
> include the body and history in a new_edits=false request. We also include
> the other leaf revisions for the document when we do this; the server checks
> whether any of those revisions are missing as well and includes that
> information in the response. If there are additional revisions
> required, we have to go back and fetch them.
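>
> A hypothetical exchange (the request and response fields for the diff are
> invented purely to illustrate the idea) might look like:
>
>     # Push the newly-changed revision eagerly with new_edits=false, and also
>     # list the doc's other known leaf revisions so the target can diff them
>     # in the same round trip.
>     request_body = {
>         "new_edits": False,
>         "docs": [{"_id": "mydoc", "_rev": "5-ccc", "value": 42,
>                   "_revisions": {"start": 5, "ids": ["ccc", "bbb", "aaa"]}}],
>         "leaf_revs": {"mydoc": ["5-ccc", "4-zzz"]},   # invented field
>     }
>
>     # The target saves 5-ccc and reports that it has never seen the other
>     # leaf, so the replicator knows to go back and fetch 4-zzz.
>     response_body = {"missing": {"mydoc": ["4-zzz"]}}  # invented field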
>
> Now, onto the more speculative stuff ...
>
> ## Include a configurable amount of revision history
>
> If we’re in Scenario A, we want the whole revision history. In Scenario B,
> it’s often quite wasteful to include the full path (which can be > 16KB for
> a frequently-edited document using the default _revs_limit) when it’s
> likely that just the last few entries will suffice to merge the edit into
> the target. A smart replicator could usually use some basic heuristics to
> guess which scenario it’s most likely to find itself in, but I’m curious
> whether others have creative ideas about how to handle this.
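>
> For a rough sense of the numbers (assuming the default _revs_limit of 1000
> and 32-character hex revision hashes), the arithmetic looks like:
>
>     revs_limit = 1000
>     bytes_per_entry = 32 + 4   # hash plus quotes and a comma in the JSON array
>     full_path = revs_limit * bytes_per_entry   # ~36 KB per document
>     partial_path = 5 * bytes_per_entry         # ~180 bytes if 5 entries suffice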
>
> ## Have the target DB report whether it was able to extend existing
> revision paths when saving with new_edits=false
>
> If we request just a recent portion of the revision history, there will be
> times when we don’t have enough history and we accidentally create a
> spurious edit conflict on the target. Of course, this is exactly what
> happens if _revs_limit is set too low on the source, except in this
> circumstance we have a chance to correct things because we still have
> additional revision metadata sitting on the source that we omitted to
> conserve bandwidth. If the target server can let us know that a document
> revision could not be grafted onto an existing edit branch (and it started
> with a rev position > 1), then the replicator might try again, this time
> requesting the full revision history from the source and eliminating the
> spurious conflict.
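>
> Sketched as replicator logic (the status value below is invented; it stands
> in for whatever the target would actually report):
>
>     def save_with_history(doc_id, rev, history_depth):
>         # In a real replicator this would push the revision with
>         # new_edits=false and `history_depth` ancestors attached; here it is
>         # a stub that pretends the graft failed.
>         return "created_new_branch"
>
>     rev = "812-fff"
>     status = save_with_history("mydoc", rev, history_depth=5)
>     rev_pos = int(rev.split("-")[0])
>     if status == "created_new_branch" and rev_pos > 1:
>         # The target could not graft the edit onto an existing branch even
>         # though this is not a first-generation revision, so retry with the
>         # full path (history_depth=None meaning "send everything") to heal
>         # the spurious conflict.
>         save_with_history("mydoc", rev, history_depth=None)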
>
> … So that’s what I’ve got for now. Hopefully it makes sense. I haven’t
> tried to write down exact APIs yet but would save that for the RFC(s) that
> might come out of the discussion. Cheers,
>
> Adam
>
>