Posted to user@couchdb.apache.org by Adam Kocoloski <ko...@apache.org> on 2019/03/07 01:15:39 UTC

[DISCUSS] On the _changes feed - how hard should we strive for exactly once semantics?

Hi all, as the project devs are working through the design for the _changes feed in FoundationDB we’ve come across a limitation that is worth discussing with the broader user community. FoundationDB currently imposes a 5 second limit on all transactions, and read versions from old transactions are inaccessible after that window. This means that, unlike a single CouchDB storage shard, it is not possible to grab a long-lived snapshot of the entire database.

In extant versions of CouchDB we rely on this long-lived snapshot behavior for a number of operations, some of which are user-facing. For example, it is possible to make a request to the _changes feed for a database of an arbitrary size and, if you’ve got the storage space and time to spare, you can pull down a snapshot of the entire database in a single request. That snapshot will contain exactly one entry for each document in the database. In CouchDB 1.x the documents appear in the order in which they were most recently updated. In CouchDB 2.x there is no guaranteed ordering, although in practice the documents are roughly ordered by most recent edit. Note that you really do have to complete the operation in a single HTTP request; if you chunk up the requests or have to retry because the connection was severed then the exactly-once guarantees disappear.

We have a couple of different options for how we can implement _changes with FoundationDB as a backing store. I’ll describe them below and discuss the tradeoffs.

## Option A: Single Version Index, long-running operations as multiple transactions

In this option the internal index has exactly one entry for each document at all times. A _changes request that cannot be satisfied within the 5 second limit will be implemented as multiple FoundationDB transactions under the covers. These transactions will have different read versions, and a document that gets updated in between those read versions will show up *multiple times* in the response body. The entire feed will be totally ordered, and later occurrences of a particular document are guaranteed to represent more recent edits than the earlier occurrences. In effect, it’s rather like the semantics of a feed=continuous request today, but with much better ordering and zero possibility of “rewinds” where large portions of the ID space get replayed because of issues in the cluster.

This option is very efficient internally and does not require any background maintenance. A future enhancement in FoundationDB’s storage engine is designed to enable longer-running read-only transactions, so we will likely be able to improve the semantics with this option over time.
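
To make the multi-transaction behavior concrete, here is a minimal sketch in Python using the FoundationDB bindings. The ("changes", seq) -> doc_id key layout is a made-up illustration, not the actual design; the point is the resume loop: when a transaction hits the 5 second limit, further reads fail with transaction_too_old (error code 1007), and we start a fresh transaction, at a newer read version, just past the last sequence we emitted. That newer read version is exactly where the duplicate entries described above can creep in.

    import fdb
    import fdb.tuple

    fdb.api_version(600)
    db = fdb.open()

    def stream_changes(emit, since=0):
        # Scan the (hypothetical) by-sequence index across as many
        # transactions as it takes, resuming after the last emitted seq.
        last_seq = since
        end = fdb.tuple.pack(("changes",)) + b'\xff'
        while True:
            tr = db.create_transaction()
            # begin just past ("changes", last_seq) so a retry does not
            # re-emit the last row we already streamed
            begin = fdb.tuple.pack(("changes", last_seq)) + b'\x00'
            try:
                for kv in tr.get_range(begin, end):
                    _, seq = fdb.tuple.unpack(kv.key)
                    emit({"seq": seq, "id": kv.value.decode("utf-8")})
                    last_seq = seq
                return last_seq  # reached the end of the index
            except fdb.FDBError as e:
                if e.code != 1007:  # 1007 = transaction_too_old
                    raise
                # fall through: the loop opens a new transaction at a
                # newer read version and resumes after last_seq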

## Option B: Multi-Version Index

In this design the internal index can contain multiple entries for a given document. Each entry includes the sequence at which the document edit was made, and may also include a sequence at which it was overwritten by a more recent edit.

The implementation of a _changes request would start by getting the current version of the datastore (call this the read version), and then as it examines entries in the index it would skip over any entries with a “tombstone” sequence less than or equal to the read version. Crucially, if the request needs to be implemented across multiple transactions, each transaction would use the same read version when deciding whether to include index entries in the _changes response. The readers would know to stop when and if they encounter an entry whose created sequence is greater than the read version. Perhaps a diagram helps to clarify; a simplified version of the internal index might look like

{“seq”: 1, “id”: “foo”}
{“seq”: 2, “id”: “bar”, “tombstone”: 5}
{“seq”: 3, “id”: “baz”}
{“seq”: 4, “id”: “bif”, “tombstone”: 6}
{“seq”: 5, “id”: “bar”}
{“seq”: 6, “id”: “bif”}

A _changes request which happens to commence when the database is at sequence 5 would return (ignoring the format of “seq” for simplicity)

{“seq”: 1, “id”: “foo”}
{“seq”: 3, “id”: “baz”}
{“seq”: 4, “id”: “bif”}
{“seq”: 5, “id”: “bar”}

i.e., the first instance of “bar” is skipped because a more recent version exists within the time horizon, but the first instance of “bif” is included because “seq”: 6 is outside our horizon.
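
Stated compactly: an entry is visible at a given read version when its sequence is at or below the read version and its tombstone (if any) is above it. A small self-contained Python sketch of that rule, reproducing the example above:

    def visible_changes(index, read_version):
        # Filter a multi-version index down to what a reader pinned
        # at read_version should see (Option B semantics).
        out = []
        for entry in index:
            if entry["seq"] > read_version:
                break  # created after our snapshot; stop scanning
            tombstone = entry.get("tombstone")
            if tombstone is not None and tombstone <= read_version:
                continue  # overwritten within our horizon; skip
            out.append({"seq": entry["seq"], "id": entry["id"]})
        return out

    index = [
        {"seq": 1, "id": "foo"},
        {"seq": 2, "id": "bar", "tombstone": 5},
        {"seq": 3, "id": "baz"},
        {"seq": 4, "id": "bif", "tombstone": 6},
        {"seq": 5, "id": "bar"},
        {"seq": 6, "id": "bif"},
    ]
    assert visible_changes(index, 5) == [
        {"seq": 1, "id": "foo"},
        {"seq": 3, "id": "baz"},
        {"seq": 4, "id": "bif"},
        {"seq": 5, "id": "bar"},
    ]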

The downside of this approach is that someone eventually has to clean up tombstoned index entries (or else provision lots and lots of storage space). One way we could do this (inside CouchDB) would be to have each _changes session record its read version somewhere, and then have a background process remove tombstoned entries where the tombstone is less than the earliest read version of any active request. It’s doable, but definitely more load on the server.
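
A sketch of that cleanup rule, under the same caveats as the earlier sketches: take the earliest read version across active _changes sessions and drop entries tombstoned before it, since no active (or future) reader can still see them.

    def prune_tombstones(index, active_read_versions, current_version):
        # Entries tombstoned before the oldest pinned read version are
        # invisible to every in-flight and future request; drop them.
        # current_version stands in when no read versions are pinned.
        horizon = min(active_read_versions, default=current_version)
        return [e for e in index
                if e.get("tombstone") is None or e["tombstone"] >= horizon]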

Also, note that this approach does not guarantee that the older versions of the documents referenced in those tombstoned entries are actually accessible. Much like today, the changes feed would include a revision identifier which, upon closer inspection, has been superseded by a more recent version of the document. Unlike today, that older version would be expunged from the database immediately if a descendant revision exists.

—

OK, so those are the two basic options. I’d particularly like to hear if the behavior described in Option A would prove problematic for certain use cases, as it’s the simpler and more efficient of the two options. Thanks!

Adam

Re: [DISCUSS] On the _changes feed - how hard should we strive for exactly once semantics?

Posted by Robert Newson <rn...@apache.org>.
I said a bunch of this on IRC also, but Adam has it. Further operations within the 'expired' txn just fail; we recognise that and start a new one. In the _changes case, we'd send a last_seq row and end the request, but this isn't going to be a great answer (at least, not a backward-compatible answer) for _view, _all_docs and _find.
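
A rough sketch of that behavior, with the same made-up key layout as the sketch in the original post: stream rows from a single transaction, and when it expires, emit a last_seq row and end the response so the client can reconnect with since=last_seq.

    import fdb
    import fdb.tuple

    fdb.api_version(600)
    db = fdb.open()

    def one_shot_changes(emit, since=0):
        # Serve one short-lived _changes response from a single FDB
        # transaction; end with last_seq when the 5s limit is hit.
        last_seq = since
        tr = db.create_transaction()
        begin = fdb.tuple.pack(("changes", since)) + b'\x00'
        end = fdb.tuple.pack(("changes",)) + b'\xff'
        try:
            for kv in tr.get_range(begin, end):
                _, seq = fdb.tuple.unpack(kv.key)
                emit({"seq": seq, "id": kv.value.decode("utf-8")})
                last_seq = seq
        except fdb.FDBError as e:
            if e.code != 1007:  # 1007 = transaction_too_old
                raise
        emit({"last_seq": last_seq})  # client reconnects with since=last_seq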

-- 
  Robert Samuel Newson
  rnewson@apache.org

Re: [DISCUSS] On the _changes feed - how hard should we strive for exactly once semantics?

Posted by Adam Kocoloski <ko...@apache.org>.
Bah, our “cue”, not our “queue” ;)

Adam

Re: [DISCUSS] On the _changes feed - how hard should we strive for exactly once semantics?

Posted by Adam Kocoloski <ko...@apache.org>.
Hi Garren,

In general we wouldn’t know ahead of time whether we can complete in five seconds. I believe the way it works is that we start a transaction, issue a bunch of reads, and after 5 seconds any additional reads will start to fail with something like “read version too old”. That’s our queue to start a new transaction. All the reads that completed successfully are fine, and the CouchDB API layer can certainly choose to start streaming as soon as the first read completes (~2ms after the beginning of the transaction).

Agree with Bob that steering towards a larger number of short-lived operations is the way to go in general. But I also want to balance that with backwards-compatibility where it makes sense.

Adam

Re: [DISCUSS] On the _changes feed - how hard should we strive for exactly once semantics?

Posted by Garren Smith <ga...@apache.org>.
I agree that option A seems the most sensible. I just want to understand
this comment:

>>  A _changes request that cannot be satisfied within the 5 second limit
will be implemented as multiple FoundationDB transactions under the covers

How will we know if a changes request cannot be completed in 5 seconds? Can we tell that beforehand? Or would we try to complete a changes request, have the transaction fail after 5 seconds, and then do multiple transactions to get the full changes? If that is the case, the response from CouchDB to the user will be really slow, as they have already waited 5 seconds and still not received anything. Or, if we start streaming a result back to the user in the first transaction (is this even possible?), then we would somehow need to know how to continue the changes feed after the transaction has failed.

Then Bob, from your comment:

>> Forcing clients to do short (<5s) requests feels like a general good, as
long as meaningful things can be done in that time-frame, which I strongly
believe from what we've said elsewhere that they can.

That makes sense, but how would we do that? How do you help a user to make
sure their request is under 5 seconds?

Cheers
Garren



Re: [DISCUSS] On the _changes feed - how hard should we strive for exactly once semantics?

Posted by Robert Newson <rn...@apache.org>.
Hi,

Given that option A is the behaviour of feed=continuous today (barring the initial whole-snapshot phase to catch up to "now"), I think that's the right move. I confess to not reading your option B too deeply, but I was there on IRC when the first spark was lit. We can build some sort of temporary multi-index on FDB today, that's clear, but it's equally clear that we should avoid doing so if at all possible.

Perhaps the future Redwood storage engine for FDB will, as you say, significantly improve on this, but, even if it does, I'm not 100% convinced we should expose it. Forcing clients to do short (<5s) requests feels like a general good, as long as meaningful things can be done in that time-frame, which I strongly believe from what we've said elsewhere that they can.

CouchDB's API, as we both know from rich (heh, and sometimes poor) experience in production, has a lot of endpoints of wildly varying performance characteristics. It's right that we evolve away from that where possible, and this seems a great candidate given the replicator in ~all versions of CouchDB will handle the change without blinking.

We have the same issue for _all_docs and _view and _find, in that the user might ask for more data back than can be sent within a single FDB transaction. I suggest that's a new thread, though.

-- 
  Robert Samuel Newson
  rnewson@apache.org
