Posted to dev@couchdb.apache.org by Geoffrey Cox <re...@gmail.com> on 2018/01/22 15:27:38 UTC

Re: Shard level querying in CouchDB Proposal

Hey Mike,

I've been thinking more about your proposal above, and when it is combined
with the new access-per-db enhancement it should greatly reduce the need
for db-per-user. One thing I'm left wondering, though, is whether there is
any consideration of different shard keys per doc. From what I gather in
your notes above, each doc would only have a single shard key, and I think
implementing this alone will take significant work. However, if there were
a way to have multiple shard keys per doc, then you could avoid having
duplicated data.

For example, assume a database of student work:

   1. Each doc has a `username` that corresponds with the owner of the doc
   2. Each doc has a `classId` that corresponds with the class for which
   the assignment was submitted

Ideally, you'd be able to issue a query with a shard key specific to the
`username` to get a student's work and yet another query with a shard key
specific to the `classId` to get the work from a teacher's
perspective. Would your proposal allow for something like this?

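To make that concrete, here is a rough sketch of the two queries. The
proposal doesn't define a query API, so the `db.find` client call and the
`shardKey` option below are purely hypothetical placeholders:

    // Hypothetical client API -- `shardKey` is a stand-in for "query
    // scoped to a single shard key", not part of the actual proposal.
    db.find({
      selector: { type: 'assignment' },
      shardKey: 'alice'     // student's view: one shard-local query
    });
    db.find({
      selector: { type: 'assignment' },
      shardKey: 'math-101'  // teacher's view: needs a second shard key
    });
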
If not, I think you'd have to do something like duplicate the data, e.g.
add another doc that has the username of the teacher so that you could
query from the teacher's perspective. This of course could get pretty messy
when you consider more complicated scenarios as you could easily end up
with a lot of duplicated data.

Thanks!

Geoff

On Tue, Nov 28, 2017 at 5:35 AM Mike Rhodes <mr...@linux.vnet.ibm.com>
wrote:

>
> > On 25 Nov 2017, at 15:45, Adam Kocoloski <ko...@apache.org> wrote:
> >
> > Yes indeed Jan :) Thanks Mike for writing this up! A couple of comments
> > on the proposal:
> >
> >>      • For databases where this is enabled, every document needs a
> >> shard key.
> >
> > What would happen if this constraint were relaxed, and documents without
> > a “:” in their ID simply used the full ID as the shard key as is done now?
>
> I think that practically it's not that awful. Documents without shard keys
> end up spread reasonably, albeit uncontrollably, across shards.
>
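> For illustration, extraction might look something like this (a sketch
> only, not actual CouchDB code):
>
>     // Under the relaxed rule: the text before ":" is the shard key; an
>     // ID with no ":" falls back to the full ID, hashed as it is today.
>     function shardKeyOf(docId) {
>       const i = docId.indexOf(':');
>       return i === -1 ? docId : docId.slice(0, i);
>     }
>     shardKeyOf('alice:todo-123'); // -> 'alice' (placed by shard key)
>     shardKeyOf('plain-doc-id');   // -> 'plain-doc-id' (spread as now)
>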
> But I think from a usability perspective, forcing this to be all or
> nothing for a database makes sense. It makes sure that every document in
> the database behaves the same way rather than having a bunch of stuff that
> behaves one way and a bunch of stuff that behaves a different way (i.e.,
> you can find some documents via shard local queries, whereas others are
> only visible at a global level).
>
> I think that if people want documents to behave that differently,
> enforcing different databases is helpful. It reinforces the point that
> these databases work well for use-cases where partitioning data using the
> shard key makes sense, which is a different method of data modelling than
> having one huge undifferentiated pool. Perhaps there are heretofore
> unthought-of optimisations that only make sense if we can make this
> assumption too :)
>
> >
> >>      • Query results are restricted to documents with the shard key
> >> specified. Which makes things harder but leaves the door open for future
> >> things like shard-splitting without changing result sets. And it seems like
> >> what one would expect!
> >
> > I agree this is important. It took me a minute to remember the point
> > here, which is that a query specifying a shard key needs to filter out
> > results from different shard keys that happen to be colocated on the same
> > shard.
> >
> > Does the current query functionality still work as it did before in a
> > database without shard keys? That is, can I still issue a query without
> > specifying a shard key and have it collate a response from the full
> > dataset? I think this is worth addressing explicitly. My assumption is that
> > it does, although I’m worried that there may be a problematic interaction
> > if one tried to use the same physical index to satisfy both a “global”
> > query and a query specifying a shard key.
>
> I think this is an interesting question.
>
> To start with, I guess the basic thing is that to efficiently use an index
> you'd imagine that you'd prefix the index's columns with the shard key --
> at least that's the thing I've been thinking, which likely means cleverer
> options are available :)
>
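> For instance, a Mango index over ["status"] might instead be defined over
> ["shardkey", "status"], treating the shard key as if it were a field (the
> field names here are just placeholders):
>
>     { "index": { "fields": ["shardkey", "status"] },
>       "name": "status-by-shard", "type": "json" }
>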
> My first thought is that the naive approach to filtering documents not
> matching a shard key is just that -- a node hosting a replica of a shard
> does a query on an index as normal and then there's some extra code that
> filters based on ID. Not actually super-awful -- we don't have to actually
> read the document itself for example -- but for any use-case where there
> are many shard keys associated with a given shard it feels like one can do
> better. But as long as the node querying the index is doing it, it feels
> pretty fast.
>
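> Something like the following, say (illustrative only; `rows` stands in
> for whatever the index scan on the shard returns):
>
>     // post-filter index rows using the shard key embedded in the doc ID
>     function filterRows(rows, shardKey) {
>       return rows.filter(r => r.id.startsWith(shardKey + ':'));
>     }
>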
> I would wonder whether some more generally useful work on Mango could help
> reduce the amount of special case code going on:
>
> - Push index selection down to each shard.
> - Allow Mango to use multiple indexes to satisfy a query (even if this is
> simply for AND relationships).
>
> Then for any database with the shard key bit set true, the shards also
> create a JSON index based on the shard key, and we can append an `AND
> shardkey=foo` to the users' Mango selector. As our shard keys are in the
> doc ID, I don't think this is any faster at all. It would be if the shard
> key was more complicated, say a field in the doc, so we didn't have it to
> hand all the time. But it would certainly make the alteration for the shard
> local path much more contained and have very wide utility beyond this case.
>
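> Concretely, if the shard key were surfaced as a `shardkey` field (again,
> a placeholder name), a user's selector such as
>
>     { "status": "submitted" }
>
> would be rewritten on each shard as
>
>     { "$and": [{ "status": "submitted" }, { "shardkey": "foo" }] }
>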
> For views, I'm less sure there's anything smart you can do that doesn't
> add tonnes of overhead -- like making two indexes per view, one that's
> prefixed with the shard key and one which is not. This approach has all
> sorts of nasty interactions with things like reverse=true I imagine,
> however.
>
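> As a sketch, the prefixed copy of a view could behave as though the map
> function had emitted a compound key (the doc fields here are invented):
>
>     function (doc) {
>       var shardKey = doc._id.split(':')[0];
>       // prefixed copy, used for shard-local queries:
>       emit([shardKey, doc.submittedAt], null);
>       // the unprefixed copy would emit(doc.submittedAt, null) as today
>     }
>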
> Mike.
>

Re: Shard level querying in CouchDB Proposal

Posted by Geoffrey Cox <re...@gmail.com>.
I don’t think I have misunderstood the design. IMO, one of the most
difficult pieces of dealing with db-per-user is that you have to replicate
data between DBs. In the simplest example, if two users have access to
the same doc, then that doc needs to be in both users’ DBs. The shard key
proposal is great, but it wouldn’t replace the need for this “replication”,
as you’d still need, say, 2 docs in the same DB, each doc with a shard key
that corresponds to one of the users. In the real world, the use case gets
even more complicated, so you may need even more types of “replications”.

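For example, the same logical doc ends up stored twice, once under each
user’s shard key (the doc IDs below are just illustrative):

    { "_id": "alice:essay-1", "title": "Essay 1" }
    { "_id": "bob:essay-1",   "title": "Essay 1" }
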
The shard key proposal should go a long way with certain designs, but I
don’t think it will help much with the “replication” complexity present
with db-per-user, as these “replications” would still be needed in a
shard-key design.

Re: Shard level querying in CouchDB Proposal

Posted by Mike Rhodes <mr...@linux.vnet.ibm.com>.
I think there's a misunderstanding somewhere here. In terms of query, it should totally allow moving away from db-per-user -- db-per-user is essentially the same as a shard key being the user name (with the caveat that the direct equivalent is one shard per user database, which limits the amount of data per user to what one shard can hold).

Mike.

Re: Shard level querying in CouchDB Proposal

Posted by Geoffrey Cox <re...@gmail.com>.
Thanks for the clarification, Mike. I figured that having multiple shard
keys per doc would be a big change, but I was hoping I was wrong ;). I
still think your proposed solution will add a lot of value to CouchDB.
Unfortunately, it isn't going to be the silver bullet that makes it
feasible to step away from db-per-user in many cases, as in the end you
need shards/partitions that correspond with your users' queries; otherwise,
a query has to visit every shard, which wouldn't be very scalable on a
very large database with lots of nodes.

In my mind, either the database has to "replicate" the data to the right
shards/partitions or you have to do this manually.

Re: Shard level querying in CouchDB Proposal

Posted by Mike Rhodes <mr...@linux.vnet.ibm.com>.
Geoff,

Apologies for taking ages to reply.

You can only have a single shard key per document, because the shard key directly affects where the document ends up being written to disk and, modulo shard replicas, a document is only in one place within the primary data shards.

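Placement could be pictured like this (an illustration only -- CouchDB's
actual hashing and shard-range mechanics differ):

    // every doc sharing a shard key hashes to the same shard
    function shardFor(docId, numShards) {
      const shardKey = docId.split(':')[0];
      let h = 0;
      for (const ch of shardKey) h = (h * 31 + ch.charCodeAt(0)) >>> 0;
      return h % numShards;
    }
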
I think what you are thinking about is a major change to global view queries (i.e., what CouchDB has now) which affects how indexes are sharded. Instead of an index shard being essentially the index for a given primary data shard, you shard the index based on the keys emitted in the view map function. This enables global view queries to be directed to a single shard based on the keys involved.

This is a massive change, and probably affects a bunch of the assumptions built into the way queries are executed by the clustering layer, as well as the actual mechanics of indexing primary data on one node but writing index data to another node (as the shard key for a document is no longer the same as the shard key for the indexed data, which it implicitly is in the current clustering model).

My proposal avoids this kettle of fish, but your idea is a really nice way to speed up queries in a simpler global model. I think it ends up being something that works well with my proposal but which is pretty orthogonal in terms of implementation.

Mike.

Re: Shard level querying in CouchDB Proposal

Posted by Adam Kocoloski <ko...@apache.org>.
The way I understand the proposal, you could satisfy at most one of those requests (probably the *username* one) with a local query. The other one would have to be a global query, but the proposal does allow for a mix of local and global queries against the same dataset.

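In other words, sketching with the same hypothetical `shardKey` option as
earlier in the thread:

    // the `username` query: shard-local (shard key = the student)
    db.find({ selector: { type: 'assignment' }, shardKey: 'alice' });
    // the `classId` query: global -- fans out to every shard and collates
    db.find({ selector: { classId: 'math-101' } });
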
Adam
