Posted to dev@couchdb.apache.org by Adam Kocoloski <ko...@apache.org> on 2019/03/06 00:04:10 UTC

Re: [DISCUSS] : things we need to solve/decide : changes feed

Dredging this thread back up with an eye towards moving to an RFC …

I was reading through the FoundationDB Record Layer preprint[1] a few weeks ago and noticed an enhancement to their version of _changes that I know would be beneficial to IBM and that I think is worth considering for inclusion in CouchDB directly. Quoting the paper:

> To implement a sync index, CloudKit leverages the total order on FoundationDB’s commit versions by using a VERSION index, mapping versions to record identifiers. To perform a sync, CloudKit simply scans the VERSION index.
> 
> However, commit versions assigned by different FoundationDB clusters are uncorrelated. This introduces a challenge when migrating data from one cluster to another; CloudKit periodically moves users to improve load balance and locality. The sync index must represent the order of updates across all clusters, so updates committed after the move must be sorted after updates committed before the move. CloudKit addresses this with an application-level per-user count of the number of moves, called the incarnation. Initially, the incarnation is 1, and CloudKit increments it each time the user’s data is moved to a different cluster. On every record update, we write the user’s current incarnation to the record’s header; these values are not modified during a move. The VERSION sync index maps (incarnation, version) pairs to changed records, sorting the changes first by incarnation, then by version.

One of our goals in adopting FoundationDB is to eliminate rewinds of the _changes feed; we make significant progress towards that goal simply by adopting FoundationDB versionstamps as sequence identifiers, but in cases where user data might be migrated from one FoundationDB cluster to another we can lose this total ordering and rewind (or worse, possibly skip updates). The “incarnation” trick of prefixing the versionstamp with an integer which gets bumped whenever a user is moved is a good way to mitigate that. I’ll give some thought to how the per-database incarnation can be recorded and what facility we might have for intelligently bumping it automatically, but I wanted to bring this to folks’ attention and resurrect this ML thread.
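
To make the ordering concrete, here is a minimal sketch of how an incarnation-prefixed sequence key might be built with the erlfdb tuple layer; the subspace name and field order are my working assumptions rather than settled design:

    %% Sketch only: keys packed by the tuple layer sort first by
    %% Incarnation, then by Versionstamp, so bumping the incarnation after
    %% a cluster move sorts every post-move update after every pre-move one.
    seq_key(DbPrefix, Incarnation) ->
        %% The all-ones placeholder is replaced by the real 80-bit
        %% versionstamp when the transaction commits.
        Placeholder = {versionstamp, 16#FFFFFFFFFFFFFFFF, 16#FFFF},
        erlfdb_tuple:pack_vs({<<"changes">>, Incarnation, Placeholder}, DbPrefix).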

Another thought I had this evening is to record the number of edit branches for a given document in the value of the index. The reason I’d do this is to optimize the popular `style=all_docs` queries to _changes to avoid an extra range read in the very common case where a document has only a single edit branch.

With the incarnation and branch count in place we’d be looking at a design where the KV pairs have the structure

(“changes”, Incarnation, Versionstamp) = (ValFormat, DocID, RevFormat, RevPosition, RevHash, BranchCount)

where ValFormat is an enumeration enabling schema evolution of the value format in the future, and RevFormat, RevPosition, RevHash are associated with the winning edit branch for the document (not necessarily the edit that occurred at this version, matching current CouchDB behavior) and carry the meanings defined in the revision storage RFC[2].

A regular _changes feed request can respond simply by scanning this index. A style=all_docs request can also be a simple scan if BranchCount is 1; if it’s greater than 1 we would need to do an additional range read of the “revisions” subspace to retrieve the leaf revision identifiers for the document in question. An include_docs=true request would need to do an additional range read in the document storage subspace for this revision.
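
For concreteness, here is a hedged Erlang sketch of the write and scan paths against the erlfdb bindings, reusing the seq_key sketch above; the function names and exact value layout are illustrative assumptions, though the erlfdb/erlfdb_tuple calls themselves are real:

    %% Sketch only. Writing one update: clear the document's previous entry
    %% in the "changes" subspace, then insert a new entry at the next
    %% versionstamp within the current incarnation.
    record_change(Tx, DbPrefix, Incarnation, OldSeqKey, Value) ->
        ok = erlfdb:clear(Tx, OldSeqKey),
        %% Value would be the packed (ValFormat, DocID, RevFormat,
        %% RevPosition, RevHash, BranchCount) tuple described above.
        ok = erlfdb:set_versionstamped_key(
            Tx, seq_key(DbPrefix, Incarnation), Value).

    %% A plain _changes request is then a single range read, starting just
    %% past the caller's since-sequence and running to the subspace's end.
    changes(Tx, DbPrefix, SinceSeqKey) ->
        {_Start, End} = erlfdb_tuple:range({<<"changes">>}, DbPrefix),
        %% strinc yields the first key after SinceSeqKey, making the scan
        %% exclusive of the last sequence the client has already seen.
        erlfdb:get_range(Tx, erlfdb_key:strinc(SinceSeqKey), End).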

I think both the incarnation and the branch count warrant a small update to the revision metadata RFC …

Adam

[1]: https://www.foundationdb.org/files/record-layer-paper.pdf
[2]: https://github.com/apache/couchdb-documentation/pull/397


> On Feb 5, 2019, at 12:20 PM, Mike Rhodes <co...@dx13.co.uk> wrote:
> 
> Solution (2) appeals to me for its conceptual simplicity -- and having a stateless CouchDB layer I feel is super important in simplifying overall CouchDB deployment going forward.
> 
> -- 
> Mike.
> 
> On Mon, 4 Feb 2019, at 20:11, Adam Kocoloski wrote:
>> Probably good to take a quick step back and note that FoundationDB’s 
>> versionstamps are an elegant and scalable solution to atomically 
>> maintaining the index of documents in the order in which they were most 
>> recently updated. I think that’s what you mean by the first part of the 
>> problem, but I want to make sure that on the ML here we collectively 
>> understand that FoundationDB actually nails this hard part of the 
>> problem *really* well.
>> 
>> When you say “notify CouchDB about new updates”, are you referring to 
>> the feed=longpoll or feed=continuous options to the _changes API? I 
>> guess I see three different routes that can be taken here.
>> 
>> One route is to use the same kind of machinery that we have in place today 
>> in CouchDB 2.x. As a reminder, the way this works is
>> 
>> - a client waiting for changes on a DB spawns one local process and 
>> also a rexi RPC process on each node hosting one of the DB shards of 
>> interest (see fabric_db_update_listener). 
>> - those RPC processes register as local couch_event listeners, where 
>> they receive {db_updated, ShardName} messages forwarded to them from 
>> the couch_db_updater processes.
>> 
>> Of course, in the FoundationDB design we don’t need to serialize 
>> updates in couch_db_updater processes, but individual writers could 
>> just as easily fire off those db_updated messages. This design is 
>> already heavily optimized for large numbers of listeners on large 
>> numbers of databases. The downside that I can see is it means the 
>> *CouchDB layer nodes would need to form a distributed Erlang cluster* 
>> in order for them to learn about the changes being committed from other 
>> nodes in the cluster.
>> 
>> So let’s say we *didn’t* want to do that, and instead we are trying to 
>> design for completely independent layer nodes that have no knowledge of 
>> or communication with one another save through FoundationDB. There’s 
>> definitely something to be said for that constraint. One very simple 
>> approach might be to just poll FoundationDB. If the database is under a 
>> heavy write load there’s no overhead to this approach; every time we 
>> finish sending the output of one range query against the versionstamp 
>> space and we re-acquire a new read version there will be new updates to 
>> consume. Where it gets inefficient is if we have a lot of listeners on 
>> the feed and a very low-throughput database. But one can fiddle with 
>> polling intervals, or else add a layer of indirection so that only one 
>> process on each layer node does the polling and then sends events 
>> to couch_event. I think this could scale quite far.
>> 
>> The other option (which I think is the one you’re homing in on) is to 
>> leverage FoundationDB’s watchers to get a push notification about 
>> updates to a particular key. I would be cautious about creating a 
>> specific key or set of keys specifically for this purpose, but, if we 
>> find that there’s some other bit of metadata that we are needing to 
>> maintain anyway, then this could work nicely. I think the same indirection 
>> that I described above (where each layer node has a maximum of one 
>> watcher per database, and it re-broadcasts messages to all interested 
>> clients via couch_event) would help us not be too constrained by the 
>> limit on watches.
>> 
>> So to recap, the three approaches:
>> 
>> 1. Writers publish db_updated events to couch_event, listeners use 
>> distributed Erlang to subscribe to all nodes
>> 2. Poll the _changes subspace, scale by nominating a specific process 
>> per node to do the polling
>> 3. Same as #2 but using a watch on DB metadata that changes with every 
>> update instead of polling
>> 
>> Adam
>> 
>>> On Feb 4, 2019, at 2:18 PM, Ilya Khlopotov <ii...@apache.org> wrote:
>>> 
>>> Hi, 
>>> 
>>> One of the features of CouchDB which doesn't map cleanly onto FoundationDB is the changes feed. The essence of the feature is: 
>>> - The subscriber of the feed wants to receive notifications when the database is updated. 
>>> - The notification includes the update_seq for the database and the list of changes which happened at that time. 
>>> - The change itself includes the docid and rev. 
>>> 
>>> There are multiple ways to easily solve this problem. Designing a scalable way to do it is much harder. 
>>> 
>>> There are at least two parts to this problem:
>>> - how to structure secondary indexes so we can provide what we need in notification event
>>> - how to notify CouchDB about new updates
>>> 
>>> For the second part of the problem we could set up a watcher on one of the keys we have to update on every transaction, for example the key which tracks the database_size or the key which tracks the number of documents, or we could add our own key. The problem is that at some point we would hit a capacity limit for atomic updates of a single key (FoundationDB doesn't redistribute the load among servers on a per-key basis). In that case we would have to distribute the counter among multiple keys to allow FoundationDB to split the hot range. Therefore, we would have to set up multiple watches. FoundationDB has a limit on the number of watches a client can set up (100,000), so we need to keep this number in mind when designing the feature. 
>>> 
>>> The single-key update rate problem is fairly theoretical and we might ignore it for the PoC version; then we can measure the impact and change the design accordingly. The reason I decided to bring it up is to see whether someone has a simple solution to avoid the bottleneck. 
>>> 
>>> best regards,
>>> iilyak
>> 
>> 


Re: [DISCUSS] : things we need to solve/decide : changes feed

Posted by Robert Newson <rn...@apache.org>.
+1 to both changes, will echo that in the PR.

-- 
  Robert Samuel Newson
  rnewson@apache.org

On Wed, 6 Mar 2019, at 00:04, Adam Kocoloski wrote:
> Dredging this thread back up with an eye towards moving to an RFC …
> [...]

Re: [DISCUSS] : things we need to solve/decide : changes feed

Posted by Adam Kocoloski <ko...@apache.org>.
FYI I moved this to an RFC at https://github.com/apache/couchdb-documentation/pull/401

Adam

> On Mar 18, 2019, at 10:47 PM, Adam Kocoloski <ko...@apache.org> wrote:
> [...]


Re: [DISCUSS] : things we need to solve/decide : changes feed

Posted by Adam Kocoloski <ko...@apache.org>.
> On Mar 18, 2019, at 9:03 PM, Alex Miller <al...@apple.com.INVALID> wrote:
> [...]

Hi Alex, thanks for that comment and for taking a close read. Option 1 could almost work here; we will be inserting up to two keys in a “revisions” subspace as part of the same transaction, and those keys could be read back and would include both the RevHash and the Versionstamp. The latest design for that subspace is here:

https://github.com/apache/couchdb-documentation/blob/5197cdffe1e2c08a7640dd646dd02909c0cf51ef/rfcs/001-fdb-revision-metadata-model.md

If I understand correctly, I think the edge case regarding `commit_unknown_result` that we’re not adequately guarding against is the following series of events:

1) Txn A tries to commit an edit and gets `commit_unknown_result`; in reality, the transaction failed
2) Txn B tries to commit an *identical* edit (save for the versionstamp) and succeeds
3) Txn A retries and finds that the entry in “revisions” for this `RevHash` exists and that the `Versionstamp` in “changes” for this DocID is higher than the one initially attempted

In this scenario we should report an edit conflict failure back to the client for Txn A, but the end result is indistinguishable from the case where 

1) Txn A tries to commit an edit and gets `commit_unknown_result`; in reality, the transaction *succeeds*
2) Txn B tries to edit a *different* branch of the document and succeeds (thereby replacing Txn A’s entry in “changes”)

which is a scenario where we need to report success for both Txn A and Txn B.

We could close this loophole by storing the Versionstamp alongside the RevHash for every edit in the “revisions” subspace, rather than only storing the Versionstamp of the latest edit to the document. Not cheap though. Will give it some thought. Thanks!
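
As a rough sketch of that fix (the key and value layout here are my assumption, extending the revision metadata RFC rather than quoting it):

    %% Hypothetical layout: every edit in the "revisions" subspace keeps
    %% the versionstamp under which it was written, not just the latest
    %% edit for the document:
    %%
    %%   ("revisions", DocID, RevPosition, RevHash) = (ValFormat, Versionstamp, ...)
    %%
    %% A retry after commit_unknown_result can then check for its own
    %% RevHash directly instead of inferring the outcome from whichever
    %% branch currently wins.
    is_committed(Tx, DbPrefix, DocId, RevPos, RevHash) ->
        Key = erlfdb_tuple:pack(
            {<<"revisions">>, DocId, RevPos, RevHash}, DbPrefix),
        erlfdb:wait(erlfdb:get(Tx, Key)) =/= not_found.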

Adam


Re: [DISCUSS] : things we need to solve/decide : changes feed

Posted by Alex Miller <al...@apple.com.INVALID>.
> On Mar 5, 2019, at 4:04 PM, Adam Kocoloski <ko...@apache.org> wrote:
> With the incarnation and branch count in place we’d be looking at a design where the KV pairs have the structure
> 
> (“changes”, Incarnation, Versionstamp) = (ValFormat, DocID, RevFormat, RevPosition, RevHash, BranchCount)
> 
> where ValFormat is an enumeration enabling schema evolution of the value format in the future, and RevFormat, RevPosition, RevHash are associated with the winning edit branch for the document (not necessarily the edit that occurred at this version, matching current CouchDB behavior) and carry the meanings defined in the revision storage RFC[2].
Do note that with versionstamped keys, and atomic operations in general, it’s important to keep in mind that committing a transaction might return `commit_unknown_result`.  Transaction loops will retry a `commit_unknown_result` error by default.  (Or they will, if your Erlang/Elixir bindings copy the behavior of the rest of the bindings.)  So you’ll need some way of making an insert into `changes` an idempotent operation.


I’ll volunteer three possible options:

1. The easiest case is if you happen to be inserting a known, fixed key (and preferably one that contains a versionstamped value) in the same transaction as a versionstamped key, as then you have a key to check in your database to tell if your commit happened or not.

2. If you’re doing an insert of just this key in a transaction, and your key space has relatively infrequent writes, then you might be able to get away with remembering the initial read version of your transaction, issuing a range scan from (“changes”, Incarnation, InitialReadVersion) -> (“changes”, infinity, infinity), and filtering through it looking for a value equal to what you tried to write.

3. Accept that you might write duplicate values at different versionstamped keys, and write your client code such that it will skip repeated values that it has already seen.
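
For what it’s worth, a minimal Erlang sketch of option 1 might look like the following; the per-document “doc_seqs” key and its layout are purely illustrative assumptions:

    %% Sketch of option 1: the transaction that writes the versionstamped
    %% "changes" key also writes a known, fixed per-document key whose
    %% *value* carries the versionstamp. On a retry after
    %% commit_unknown_result, reading that fixed key reveals whether the
    %% earlier attempt actually committed.
    write_change(Tx, DbPrefix, DocId, RevHash, ChangesKey, ChangesValue) ->
        DocSeqKey = erlfdb_tuple:pack({<<"doc_seqs">>, DocId}, DbPrefix),
        case erlfdb:wait(erlfdb:get(Tx, DocSeqKey)) of
            Bin when is_binary(Bin) ->
                case erlfdb_tuple:unpack(Bin) of
                    {_Versionstamp, RevHash} ->
                        %% Same RevHash already recorded: the earlier
                        %% attempt committed, so this retry is a no-op.
                        already_committed;
                    _ ->
                        do_write(Tx, DocSeqKey, RevHash, ChangesKey, ChangesValue)
                end;
            not_found ->
                do_write(Tx, DocSeqKey, RevHash, ChangesKey, ChangesValue)
        end.

    do_write(Tx, DocSeqKey, RevHash, ChangesKey, ChangesValue) ->
        Placeholder = {versionstamp, 16#FFFFFFFFFFFFFFFF, 16#FFFF},
        ok = erlfdb:set_versionstamped_key(Tx, ChangesKey, ChangesValue),
        %% The versionstamped *value* records which versionstamp this
        %% RevHash landed at.
        ok = erlfdb:set_versionstamped_value(
            Tx, DocSeqKey, erlfdb_tuple:pack_vs({Placeholder, RevHash})).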

I had filed an internal bug to complain about this long ago, which I’ve now copied over to GitHub[1].  So if this becomes absurdly difficult to work around, feel free to show up there to complain.

[1]: https://github.com/apple/foundationdb/issues/1321