Posted to dev@couchdb.apache.org by Adam Kocoloski <ko...@apache.org> on 2019/03/21 19:50:37 UTC

[DISCUSS] Implementing _all_docs on FoundationDB

Hi all, me again. This one will be shorter :) As I see it we have three different options for serving the _all_docs endpoint from FDB: 

## Option 1: Read the document data, discard the bodies

We likely will have the documents stored in docid order already; we could do range reads and discard everything but the ID and _rev by default. This can be a very efficient implementation of include_docs=true (though one needs to be careful about skipping the conflict bodies), but pretty wasteful otherwise.
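A minimal sketch of Option 1, not taken from the thread: a sorted Python list stands in for FoundationDB's ordered keyspace, and the key layout `("docs", doc_id, rev)` with a `winner` flag on each record is purely an assumption made for illustration. It shows one range read serving both response shapes, skipping conflict bodies along the way.

```python
from bisect import bisect_left

# Hypothetical layout (an assumption, not from the thread):
# ("docs", doc_id, rev) -> record, where the record flags the winning revision.
DOCS = sorted(
    [
        (("docs", "a", "1-abc"), {"winner": True, "body": {"x": 1}}),
        (("docs", "b", "2-aaa"), {"winner": False, "body": {"x": 3}}),  # conflict
        (("docs", "b", "2-def"), {"winner": True, "body": {"x": 2}}),
        (("other", "zzz"), {"winner": False, "body": {}}),  # unrelated subspace
    ],
    key=lambda kv: kv[0],
)


def range_read(store, prefix):
    """Yield (key, value) pairs whose key starts with prefix, in key order."""
    keys = [key for key, _ in store]
    start = bisect_left(keys, prefix)
    for key, value in store[start:]:
        if key[: len(prefix)] != prefix:
            break  # left the subspace
        yield key, value


def all_docs_from_docs(store, include_docs=False):
    """Build _all_docs rows by scanning the document subspace itself."""
    rows = []
    for (_, doc_id, rev), record in range_read(store, ("docs",)):
        if not record["winner"]:
            continue  # skip conflict bodies
        row = {"id": doc_id, "key": doc_id, "value": {"rev": rev}}
        if include_docs:
            row["doc"] = {"_id": doc_id, "_rev": rev, **record["body"]}
        rows.append(row)
    return rows


ROWS = all_docs_from_docs(DOCS)
ROWS_WITH_DOCS = all_docs_from_docs(DOCS, include_docs=True)
```

With include_docs=false every body in the range is still read and then thrown away, which is the waste described above.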

## Option 2: Read the “revisions” subspace

We also have an entry for every document in ID order in the “revisions” subspace. The disadvantage of this approach is that every deleted edit branch shows up there, too, and some databases will have lots of deleted documents. We may need to build skiplists to know how to scan efficiently. This subspace is also doing a lot of heavy lifting for us already, and if we wanted to toy with alternative revision history representations in the future it could get complicated.
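A rough illustration of the cost described here, under assumptions that are not from the thread: the key layout `("revs", doc_id, rev)` with a deleted flag as the value is hypothetical, and the winner rule is simplified to "highest revision position, then highest hash, among live branches". The scan has to touch every deleted branch even though none of them produces a row.

```python
from itertools import groupby

# Hypothetical layout (an assumption): ("revs", doc_id, rev) -> True when
# the leaf of that edit branch is deleted.
REVS = sorted(
    [
        (("revs", "a", "1-abc"), False),
        (("revs", "b", "2-def"), True),   # deleted document
        (("revs", "c", "3-aaa"), True),   # deleted branch
        (("revs", "c", "3-bbb"), True),   # deleted branch
        (("revs", "d", "2-aaa"), False),  # live conflict
        (("revs", "d", "2-zzz"), False),
    ]
)


def rev_sort_key(rev):
    """Order revisions by numeric position, then by hash."""
    pos, _, digest = rev.partition("-")
    return int(pos), digest


def all_docs_from_revs(store):
    """Return (rows, number of KVs scanned) for a full pass over the subspace."""
    rows, scanned = [], 0
    for doc_id, entries in groupby(store, key=lambda kv: kv[0][1]):
        live = []
        for (_, _, rev), deleted in entries:
            scanned += 1
            if not deleted:
                live.append(rev)
        if live:  # documents with only deleted branches emit nothing
            winner = max(live, key=rev_sort_key)
            rows.append({"id": doc_id, "key": doc_id, "value": {"rev": winner}})
    return rows, scanned


ROWS, SCANNED = all_docs_from_revs(REVS)
```

In this toy data six entries are scanned to emit two rows; a database dominated by deleted documents would skew that ratio much further.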

## Option 3: Add specific entries to support _all_docs

We can also write an extra KV containing the ID and winning _rev in a special subspace just to support this endpoint. It would be a blind write because we’re already coordinating concurrent transactions through reads on the “revisions” subspace. This would be conceptually quite clean and simple, and the fastest implementation for constructing the default response.
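One way to picture Option 3, as a sketch only: a plain dict stands in for the store, the subspace names `"docs"` and `"all_docs"` and the helper `apply_update` are hypothetical, and the caller is assumed to have already worked out the winning revision from its reads on the “revisions” subspace. The index entry is set or cleared without being read first.

```python
def apply_update(store, doc_id, rev, body, winning_rev):
    """Apply one edit plus its _all_docs entry (one transaction in a real store).

    winning_rev is assumed to come from the revisions read the update path
    already performs; None means the document is now deleted.
    """
    store[("docs", doc_id, rev)] = body
    index_key = ("all_docs", doc_id)
    if winning_rev is None:
        store.pop(index_key, None)      # clear the entry without reading it
    else:
        store[index_key] = winning_rev  # blind write: the key is never read


def all_docs(store):
    """Serve the default response from the dedicated subspace only."""
    index = sorted(
        (key[1], rev) for key, rev in store.items() if key[0] == "all_docs"
    )
    return [
        {"id": doc_id, "key": doc_id, "value": {"rev": rev}}
        for doc_id, rev in index
    ]


STORE = {}
apply_update(STORE, "b", "1-aaa", {"x": 1}, "1-aaa")
apply_update(STORE, "a", "1-abc", {"x": 2}, "1-abc")
apply_update(STORE, "b", "2-def", {"x": 3}, "2-def")  # new winner for "b"
apply_update(STORE, "a", "2-del", {}, None)           # "a" is deleted
ROWS = all_docs(STORE)
```

The read path never touches bodies or deleted branches, at the price of one extra small KV per live document.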

===

My sense is Option 2 is a non-starter, but I include it for completeness in case anyone else had the same idea. I think Option 3 is a reasonable space / efficiency / simplicity tradeoff, and it might also be worth testing out Option 1 as an optimized implementation for include_docs=true.

Thoughts? I imagine we can move quickly to an RFC for at least having the extra KVs for Option 3, and in that design also acknowledge the option for scanning the docs space directly to support include_docs.

Adam

Re: [DISCUSS] Implementing _all_docs on FoundationDB

Posted by Garren Smith <ga...@apache.org>.
A little behind on the discussion emails, but +1 to option 1 for
include_docs=true and option 3 for include_docs=false.

On Mon, Mar 25, 2019 at 12:26 PM Jan Lehnardt <ja...@apache.org> wrote:

> +1 on what Bob said.
>
> --
> Professional Support for Apache CouchDB:
> https://neighbourhood.ie/couchdb-support/
>
>

Re: [DISCUSS] Implementing _all_docs on FoundationDB

Posted by Jan Lehnardt <ja...@apache.org>.
+1 on what Bob said.

> On 21. Mar 2019, at 20:57, Robert Newson <rn...@apache.org> wrote:
> 
> Hi,
> 
> Thanks for pushing forward, and I owe feedback on other threads you've started.
> 
> Rather feebly, I'm just agreeing with you. Option 3 for include_docs=false and option 1 for include_docs=true sounds ideal. Both flavours are very common, so it makes sense to build a solution for each. At a pinch we can just do option 3 + async doc lookups in a first release and then circle back, but the RFC should propose 1 and 3 as our design intention.
> 
> -- 
>  Robert Samuel Newson
>  rnewson@apache.org

-- 
Professional Support for Apache CouchDB:
https://neighbourhood.ie/couchdb-support/


Re: [DISCUSS] Implementing _all_docs on FoundationDB

Posted by Robert Newson <rn...@apache.org>.
Hi,

Thanks for pushing forward, and I owe feedback on other threads you've started.

Rather feebly, I'm just agreeing with you. Option 3 for include_docs=false and option 1 for include_docs=true sounds ideal. Both flavours are very common, so it makes sense to build a solution for each. At a pinch we can just do option 3 + async doc lookups in a first release and then circle back, but the RFC should propose 1 and 3 as our design intention.
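The "option 3 + async doc lookups" fallback might look roughly like the following. This is a sketch under assumptions: a dict stands in for the store, `asyncio` stands in for whatever futures the real client offers, and the subspace names and the `get` helper are hypothetical.

```python
import asyncio

# Hypothetical subspaces (assumptions): ("all_docs", doc_id) -> winning rev,
# ("docs", doc_id, rev) -> body.
STORE = {
    ("all_docs", "a"): "1-abc",
    ("all_docs", "b"): "2-def",
    ("docs", "a", "1-abc"): {"x": 1},
    ("docs", "b", "2-def"): {"x": 2},
    ("docs", "b", "2-aaa"): {"x": 3},  # losing conflict, never fetched
}


async def get(store, key):
    """Stand-in for an asynchronous point read."""
    await asyncio.sleep(0)  # yield to the event loop, as a real read would
    return store.get(key)


async def all_docs_with_lookups(store):
    """Scan the index subspace, then fetch each winning body by point read."""
    index = sorted(
        (key[1], rev) for key, rev in store.items() if key[0] == "all_docs"
    )
    # Start every body lookup before awaiting any of them so they overlap.
    bodies = await asyncio.gather(
        *(get(store, ("docs", doc_id, rev)) for doc_id, rev in index)
    )
    return [
        {
            "id": doc_id,
            "key": doc_id,
            "value": {"rev": rev},
            "doc": {"_id": doc_id, "_rev": rev, **body},
        }
        for (doc_id, rev), body in zip(index, bodies)
    ]


ROWS = asyncio.run(all_docs_with_lookups(STORE))
```

Because the index already names the winning revision, conflict bodies are never read on this path.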

-- 
  Robert Samuel Newson
  rnewson@apache.org
