Posted to dev@couchdb.apache.org by Joan Touzet <wo...@apache.org> on 2018/08/01 13:07:26 UTC

Re: Shard level querying in CouchDB Proposal

Hi everyone,

Recently, Garren and Robert started making progress on this proposal via a PR:

https://github.com/apache/couchdb/pull/1480

and specifically:

https://github.com/apache/couchdb/pull/1480#issuecomment-409565736

which has led me to a strong -0.75 on the proposed endpoint of:

    /db/_partition/:partitionkey/_designdoc/name/_view/viewname 

Here's why.

We absolutely must get rid of port 5986, which is currently the only way to get to actual disk-level shards in CouchDB today. The route for that will probably look something like:

    /db/_shard/00000000-1fffffff/...
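
A minimal sketch of where a range such as 00000000-1fffffff comes from, assuming the 32-bit hash space is split into q equal, contiguous ranges (q = 8 gives exactly the range above). The function names and the /db/_shard/<range>/ route are illustrative only; the route is the proposed shape from this message, not an existing endpoint.

```python
def shard_ranges(q):
    """Return q contiguous hex ranges covering the 32-bit hash space.

    Assumption: ranges are equal-sized, so q must divide 2**32.
    """
    space = 2 ** 32
    if q <= 0 or space % q != 0:
        raise ValueError("q must be a positive divisor of 2**32")
    step = space // q
    # Each range is inclusive on both ends and zero-padded to 8 hex digits.
    return ["%08x-%08x" % (i * step, (i + 1) * step - 1) for i in range(q)]


def shard_path(db, shard_range):
    # Hypothetical route, following the "/db/_shard/<range>/..." shape above.
    return "/%s/_shard/%s/" % (db, shard_range)


if __name__ == "__main__":
    for r in shard_ranges(8):
        print(shard_path("db", r))
```

With q = 8 the first path is /db/_shard/00000000-1fffffff/ and the last range ends at ffffffff.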

This is critical for cluster-level administration, health checks, etc. and to fully remove the old couch_httpd code from the codebase (which is desperately overdue for happening, and must happen prior to a 3.0 release). I'm sad we don't have this code yet, especially since like children in a well-stocked larder we're rushing to the jams and pies before having our main courses, but such is the nature of shiny things.

Now I see we are introducing view partitions, which to me really should be below the view portion here:

    /db/_designdoc/name/_view/viewname/_partition/:partitionkey/

End users who are new to CouchDB 2.x are still just learning about shards. Partitions are only going to further muddy the waters.

As I said on IRC, "i 100% guarantee you that people will not understand the difference between a db shard and a db partition if we introduce both concepts without careful thought :)" To me, this is not carefully thought out.

Garren mentions this will also surface for find/index, and thus makes a case for it being farther up. But I argue that with /db/_shard and /db/_partition people will have no idea what they are doing.

Please help me disentangle this ball of yarn. And don't make the new feature "more important" than shard-level access, it's not.

-Joan

Re: Shard level querying in CouchDB Proposal

Posted by Mike Rhodes <co...@dx13.co.uk>.
Joan,

I'd like to reassure you that the question has been considered in reasonable depth. My wording was looser than it should have been.

However, the solution is non-obvious enough that a single "correct" way forward probably doesn't exist and reasonable people can argue in many directions. A perfect solution isn't possible within the constraints of the CouchDB API IMO, which leaves us with finding the least-bad solution in many ways, and trying to choose one where a path is open for improving longer term and reducing the necessary level of compromise.

For example, the move from calling it "shard local querying" (in the original proposal to the list) to "partitioned databases" was in large part driven directly from usability concerns and user mental model confusion -- and a desire to separate out via terminology what is more a developer level concern (logical data partitions) from admin level concerns (physical shard layout) such that the developer's abstraction has a clean and non-leaky mapping to what an admin is managing.

As to the exact API, my view is that making the feature very front-and-centre is less likely to be confusing than it being "hidden" lower down in the API structure, as it's an important concept to get straight early on if your use-case requires it. This also necessarily brings documentation further to the fore as we can't relegate discussion to the API reference as easily, which I also view as a positive.

-- 
Mike.

On Thu, 2 Aug 2018, at 12:17, Joan Touzet wrote:
> > I do agree with the confusion aspect of shards and partitions, and
> > I'm unsure exactly the way forward here yet :(
> 
> This is all I care about, and I'm cross that this hasn't been given
> more serious consideration, given CouchDB is already confusing for people
> coming from other databases. Make all the new features you want, but not
> at the expense of usability.
> 
> I've raised the issue but I don't have any great idea, other than to say
> I think /db/_shard/<range> is a suitable place for shard-specific operations
> to happen.
> 
> Help from this list is important before the new endpoint lands.
> 
> -Joan

Re: Shard level querying in CouchDB Proposal

Posted by Jan Lehnardt <ja...@apache.org>.

> On 2. Aug 2018, at 15:37, Garren Smith <ga...@apache.org> wrote:
> 
> There was a discussion on #couchdb-dev and the proposal is that instead of
> using _shards we would rather have _nodes so we could make requests like
> /_node/couchdb@192.168.0.3/shards%2f0000-1ffff%2ffoo/docid

Clarification: I was proposing to re-use the existing /_node endpoint, not add a new one :)

Best
Jan
—
> 
> That then allows us to use _partition as an endpoint for the partition work
> without us confusing end users with _shard and _partition.
> 
> Full IRC conversation below:
> 
> 14:25 W<+Wohali> i'd say /_node/$name/$db/_shards instead though
> 14:25 W<+Wohali> from the general to the specific?
> 14:45 R<@rnewson> I don't mind /_node too much
> 14:49 J<+jan____> Wohali: I’m thinking more /_node/$x/ somewhat analogous
> to 5986 conceptually, so everything on there is per-node-only, so no notion
> of a $db, and only shards
> 14:49 J<+jan____> very nitpick of course
> 14:49 R<@rnewson> ah sorry, I am not even thinking
> 14:50 R<@rnewson>  /_node/couchdb@192.168.0.3/shards%2f0000-1ffff%2ffoo/docid/attname (etc)
> 14:50 R<@rnewson> no need for _shards magic
> 14:50 R<@rnewson> the shards have proper database names
> 14:50 W<+Wohali> that's in line with our original thinking that under
> /_node/<nodename> was, effectively, couch_httpd
> 14:51 R<@rnewson> that is /_node/$nodename/$dbname where $dbname is what
> you'd use on port 5986 (the %2f encoding et al)
> 14:51 W<+Wohali> which would be the easiest way to get rid of 5986
> 14:51 R<@rnewson> right
> 14:51 R<@rnewson> yes
> 14:51 R<@rnewson> it would allow us to close the port but not (yet) delete
> couch_httpd_* modules
> 14:51 W<+Wohali> and makes a doc rewrite easy. so yeah, i like that
> 14:52 R<@rnewson> the things already exposed under /_node/$node don't mess
> us up? I think we were careful but it's been a while
> 14:52 R<@rnewson> in that I think it was just two global handlers
> 14:52 W<+Wohali> _stats and _system for sure are there
> 14:53 R<@rnewson> ok, those fit perfectly
> 14:53 W<+Wohali> _config too
> 14:53 R<@rnewson> ditto
> 14:53 W<+Wohali> we'll want _dbs and _nodes there
> 14:53 R<@rnewson> ok, so it really could just be everything
> 
> On Thu, Aug 2, 2018 at 1:17 PM, Joan Touzet <wo...@apache.org> wrote:
> 
>>> I do agree with the confusion aspect of shards and partitions, and
>>> I'm unsure exactly the way forward here yet :(
>> 
>> This is all I care about, and I'm cross that this hasn't been given
>> more serious consideration, given CouchDB is already confusing for people
>> coming from other databases. Make all the new features you want, but not
>> at the expense of usability.
>> 
>> I've raised the issue but I don't have any great idea, other than to say
>> I think /db/_shard/<range> is a suitable place for shard-specific
>> operations
>> to happen.
>> 
>> Help from this list is important before the new endpoint lands.
>> 
>> -Joan
>> 

-- 
Professional Support for Apache CouchDB:
https://neighbourhood.ie/couchdb-support/


Re: Shard level querying in CouchDB Proposal

Posted by Garren Smith <ga...@apache.org>.
There was a discussion on #couchdb-dev and the proposal is that instead of
using _shards we would rather have _nodes so we could make requests like
/_node/couchdb@192.168.0.3/shards%2f0000-1ffff%2ffoo/docid
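
A rough sketch of how such a node-local path could be assembled, assuming the shard's database name (here shards/0000-1ffff/foo, taken from the example above) is passed as a single percent-encoded path segment. The function name is hypothetical. Python's quote() emits uppercase %2F where this thread writes %2f; the two are equivalent percent-encodings.

```python
from urllib.parse import quote


def node_local_path(node, shard_db, *rest):
    """Build /_node/<node>/<encoded shard db>/<rest...>.

    Assumption: the shard database name contains "/" characters that
    must be encoded so the whole name stays one path segment.
    """
    parts = ["_node", quote(node, safe="@"), quote(shard_db, safe="")]
    # Remaining segments (doc id, attachment name, ...) are encoded too.
    parts.extend(quote(segment, safe="") for segment in rest)
    return "/" + "/".join(parts)


if __name__ == "__main__":
    print(node_local_path("couchdb@192.168.0.3",
                          "shards/0000-1ffff/foo", "docid"))
```

Extra segments such as an attachment name can be appended as further arguments, matching the docid/attname form mentioned in the IRC log.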

That then allows us to use _partition as an endpoint for the partition work
without us confusing end users with _shard and _partition.

Full IRC conversation below:

14:25 W<+Wohali> i'd say /_node/$name/$db/_shards instead though
14:25 W<+Wohali> from the general to the specific?
14:45 R<@rnewson> I don't mind /_node too much
14:49 J<+jan____> Wohali: I’m thinking more /_node/$x/ somewhat analogous
to 5986 conceptually, so everything on there is per-node-only, so no notion
of a $db, and only shards
14:49 J<+jan____> very nitpick of course
14:49 R<@rnewson> ah sorry, I am not even thinking
14:50 R<@rnewson>  /_node/couchdb@192.168.0.3/shards%2f0000-1ffff%2ffoo/docid/attname (etc)
14:50 R<@rnewson> no need for _shards magic
14:50 R<@rnewson> the shards have proper database names
14:50 W<+Wohali> that's in line with our original thinking that under
/_node/<nodename> was, effectively, couch_httpd
14:51 R<@rnewson> that is /_node/$nodename/$dbname where $dbname is what
you'd use on port 5986 (the %2f encoding et al)
14:51 W<+Wohali> which would be the easiest way to get rid of 5986
14:51 R<@rnewson> right
14:51 R<@rnewson> yes
14:51 R<@rnewson> it would allow us to close the port but not (yet) delete
couch_httpd_* modules
14:51 W<+Wohali> and makes a doc rewrite easy. so yeah, i like that
14:52 R<@rnewson> the things already exposed under /_node/$node don't mess
us up? I think we were careful but it's been a while
14:52 R<@rnewson> in that I think it was just two global handlers
14:52 W<+Wohali> _stats and _system for sure are there
14:53 R<@rnewson> ok, those fit perfectly
14:53 W<+Wohali> _config too
14:53 R<@rnewson> ditto
14:53 W<+Wohali> we'll want _dbs and _nodes there
14:53 R<@rnewson> ok, so it really could just be everything

On Thu, Aug 2, 2018 at 1:17 PM, Joan Touzet <wo...@apache.org> wrote:

> > I do agree with the confusion aspect of shards and partitions, and
> > I'm unsure exactly the way forward here yet :(
>
> This is all I care about, and I'm cross that this hasn't been given
> more serious consideration, given CouchDB is already confusing for people
> coming from other databases. Make all the new features you want, but not
> at the expense of usability.
>
> I've raised the issue but I don't have any great idea, other than to say
> I think /db/_shard/<range> is a suitable place for shard-specific
> operations
> to happen.
>
> Help from this list is important before the new endpoint lands.
>
> -Joan
>

Re: Shard level querying in CouchDB Proposal

Posted by Joan Touzet <wo...@apache.org>.
> I do agree with the confusion aspect of shards and partitions, and
> I'm unsure exactly the way forward here yet :(

This is all I care about, and I'm cross that this hasn't been given
more serious consideration, given CouchDB is already confusing for people
coming from other databases. Make all the new features you want, but not
at the expense of usability.

I've raised the issue but I don't have any great idea, other than to say
I think /db/_shard/<range> is a suitable place for shard-specific operations
to happen.

Help from this list is important before the new endpoint lands.

-Joan

Re: Shard level querying in CouchDB Proposal

Posted by Mike Rhodes <co...@dx13.co.uk>.
Joan,

I'm in agreement that this feature isn't "more important" -- but neither is it "less important". They're both vital for different sets of users (app devs vs. admins I think).

Disclosure: this URL scheme was mostly my thinking and is partly based around a future that is more partition aware.

For users with large databases -- think tens/hundreds of shards -- this is vitally important and front-and-centre in the data model and development mindset. Therefore, I think my argument will in part be the same as yours -- that I think partitions are, for a developer, a new but totally first-class aspect of the data model, in the same way that shards are for a more admin-based perspective. So I think both _shards and _partitions make sense, and both concepts are important to developers and admins -- different people, or the same people wearing different hats at different times.

This is because it's the primary data that's partitioned; it's not just a view index partition or a Mango partition. From this I see a partition as a further logical subdivision of documents within a CouchDB instance -- in the same way that a database is a logical subdivision of documents.

Therefore having partition as a first-class part of the URL rather than a secondary part makes sense to me. The CouchDB path hierarchy currently is of the form /<data subdivision>/<index within data subdivision> which in the world view above implies the logical /<data subdivision>/<further data subdivision>/<index within data subdivision> to maintain consistency.

I will admit that there is a certain awkwardness in shoe-horning this new concept onto an existing API (e.g., should /db/_partition/foo/dockey do a document GET? should POST /db/_partition/foo auto-generate a dockey?) but I feel that having the defined namespace allows us to make those choices without radically changing the API and would allow future expansion of the first-class nature of this API.

In addition, when developing I think it makes sense to a user (well, it does to me anyway) that the notion of "requests made to endpoints under the _partition namespace are more performant and preferred for large scale databases" is easier to consume than "you can use endpoints X, Y, Z in a scalable manner if you also provide this bit of path on the end". As new endpoints become partition aware -- if that makes sense, which I suspect they will end up being, not least something like /db/_partition/foo/_info for partition size, doc count etc. -- they have a natural place to live within the existing path hierarchy.

I do agree with the confusion aspect of shards and partitions, and I'm unsure exactly the way forward here yet :(

-- 
Mike.

On Wed, 1 Aug 2018, at 14:07, Joan Touzet wrote:
> Hi everyone,
> 
> Recently, Garren and Robert started making progress on this proposal via a PR:
> 
> https://github.com/apache/couchdb/pull/1480
> 
> and specifically:
> 
> https://github.com/apache/couchdb/pull/1480#issuecomment-409565736
> 
> which has led me to a strong -0.75 on the proposed endpoint of:
> 
>     /db/_partition/:partitionkey/_designdoc/name/_view/viewname 
> 
> Here's why.
> 
> We absolutely must get rid of port 5986, which is currently the only way 
> to get to actual disk-level shards in CouchDB today. The route for that 
> will probably look something like:
> 
>     /db/_shard/00000000-1fffffff/...
> 
> This is critical for cluster-level administration, health checks, etc. 
> and to fully remove the old couch_httpd code from the codebase (which is 
> desperately overdue for happening, and must happen prior to a 3.0 
> release). I'm sad we don't have this code yet, especially since like 
> children in a well-stocked larder we're rushing to the jams and pies 
> before having our main courses, but such is the nature of shiny things.
> 
> Now I see we are introducing view partitions, which to me really should 
> be below the view portion here:
> 
>     /db/_designdoc/name/_view/viewname/_partition/:partitionkey/
> 
> End users who are new to CouchDB 2.x are still just learning about 
> shards. Partitions are only going to further muddy the waters.
> 
> As I said on IRC, "i 100% guarantee you that people will not understand 
> the difference between a db shard and a db partition if we introduce 
> both concepts without careful thought :)" To me, this is not carefully 
> thought out.
> 
> Garren mentions this will also surface for find/index, and thus makes a 
> case for it being farther up. But I argue that with /db/_shard and 
> /db/_partition people will have no idea what they are doing.
> 
> Please help me disentangle this ball of yarn. And don't make the new 
> feature "more important" than shard-level access, it's not.
> 
> -Joan