Posted to notifications@couchdb.apache.org by GitBox <gi...@apache.org> on 2022/05/31 08:21:58 UTC

[GitHub] [couchdb] nkosi23 opened a new issue, #4044: Add support for collections

nkosi23 opened a new issue, #4044:
URL: https://github.com/apache/couchdb/issues/4044

   
   ## Summary
   
   Map/reduce indexes may take a lot of time to build on very large databases. In addition to working on the Query Server, another way to mitigate this problem would be to reduce the number of documents that a view needs to process. At least for me, each of my views begins with an "if" statement checking the document type, so it would probably make sense to give CouchDB a way to know that only specific documents need to be passed to a certain view.
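
   For illustration, here is a minimal sketch of the pattern I mean (the database name, field names and credentials are made-up examples):

   ```python
   import requests  # assumes a local CouchDB; the credentials below are made up

   design_doc = {
       "_id": "_design/sensors",
       "views": {
           "by_time": {
               # Every document in the database is fed to this map function,
               # even though only "sensor_reading" docs contribute any rows.
               "map": """function (doc) {
                   if (doc.type === 'sensor_reading') {
                     emit(doc.timestamp, doc.value);
                   }
                 }""",
           }
       },
   }
   requests.put(
       "http://localhost:5984/mydb/_design/sensors",
       json=design_doc,
       auth=("admin", "password"),
   )
   ```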
   
   ## Desired Behaviour
   
   
   Exactly the same API as the one offered by the partitions feature.
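
   Concretely, something along these lines; the first request below is the existing partition API, while the second is purely hypothetical and only illustrates the shape of the API I have in mind:

   ```python
   import requests

   base = "http://localhost:5984/mydb"  # made-up database

   # Existing CouchDB API: query a view scoped to a single partition.
   requests.get(f"{base}/_partition/sensor-42/_design/readings/_view/by_time")

   # Hypothetical collections API mirroring it: query a view scoped to a
   # collection, so the view only ever sees documents of that collection.
   requests.get(f"{base}/_collection/sensor_readings/_design/readings/_view/by_time")
   ```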
   
   ## Possible Solution
   
   I believe we can take inspiration from the implementation of partitions. In particular, the fact that partitions can have their own design documents applying only to a specific partition is exactly the spirit of this proposal. Unfortunately, there is a major caveat making partitions unsuitable for this use case, as per the documentation:
   
   > A good partition has two basic properties. First, it should have a high cardinality. That is, a large partitioned database should have many more partitions than documents in any single partition. A database that has a single partition would be an anti-pattern for this feature. Secondly, the amount of data per partition should be “small”. The general recommendation is to limit individual partitions to less than ten gigabytes (10 GB) of data. Which, for the example of sensor documents, equates to roughly 60,000 years of data.
   
   In the use case I describe, there would instead be a small number of partitions, each holding a large number of documents. This would be the main difference between partitions and collections: partitions are meant to hold the data of a specific sensor, while collections would be meant to hold the data of all documents having the sensor type. This would still be a major improvement over passing every document in the database to the view server to build an index.
   
   An interesting nice-to-have feature would be the ability to partition collections, along with conveniences making it easy for users to start with a non-partitioned collection and then transition to a partitioned collection where relevant. Maybe a collection could have a "partition_key_properties" configuration option that can only be defined at creation time and accepts either a single property or a composite key. A document not containing these properties would not be allowed into the partitioned collection, while a compliant document would automatically be added to the matching partition.
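
   As a purely hypothetical sketch (neither the endpoint nor the option name exists today), creating such a collection could look like this:

   ```python
   import requests

   # Hypothetical endpoint: create a partitioned collection whose partition key
   # is derived from document properties (a composite key in this sketch).
   requests.put(
       "http://localhost:5984/mydb/_collection/sensor_readings",
       json={"partition_key_properties": ["region", "sensor_id"]},
       auth=("admin", "password"),  # made-up credentials
   )
   # A document missing "region" or "sensor_id" would be rejected; a compliant
   # document would automatically land in the matching partition.
   ```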
   
   From that point, enabling users to replicate documents from one collection to another one on the same database would make it easy to transition from a non-partitioned collection to a partitioned one. This is important since it is very difficult to anticipate whether partitioning should be used or not; in practice, the decision will often be made only to optimize performance after real-world measurements.
   
   Finally, to ease the transition for existing databases (extremely important), it should be possible to add an option to filtered replication allowing users to indicate that the filtered documents should be added to a specific collection of the destination database.
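
   As a hypothetical sketch: the `selector` filter below is the existing replication API, while the `target_collection` option is the addition proposed here and does not exist today:

   ```python
   import requests

   requests.post(
       "http://localhost:5984/_replicate",
       json={
           "source": "http://localhost:5984/legacydb",
           "target": "http://localhost:5984/mydb",
           # Existing filtered replication: only replicate matching documents.
           "selector": {"type": "sensor_reading"},
           # Hypothetical option proposed here: place the filtered documents
           # into a specific collection of the destination database.
           "target_collection": "sensor_readings",
       },
       auth=("admin", "password"),  # made-up credentials
   )
   ```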
   
   




[GitHub] [couchdb] nickva commented on issue #4044: Add support for collections

Posted by GitBox <gi...@apache.org>.
nickva commented on issue #4044:
URL: https://github.com/apache/couchdb/issues/4044#issuecomment-1150307821

   Thank you for your proposal and for reaching out, @nkosi23!
   
   Yeah, I think something like that would be nice to have. Or, more generally, it would just be nice to remove the partitioned shard size caveat and have it all work as expected. Then partitions and collections would just work the same.
   
   However, given how partitions are currently implemented, that would be a bit hard to achieve. The partition prefix is used to decide which shard a document belongs to. In the normal, non-partitioned case, the scheme is `hash(doc1) -> 1.couch`, `hash(doc2) -> 2.couch`, etc. In the partitioned case, document `x:a` does `hash(x)` and that ends up on `1.couch`; document `y:b` does `hash(y)` and the hash result puts it on shard `2.couch`. So, if partition `x` has 1B documents, it would get quite large, and essentially each partition is a bit like a `q=1` database; hence the warning in the docs. Of course, there could also be multiple partitions mapping to the same shard, e.g. `z:c` with `hash(z)` mapping to `1.couch` as well.
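
   A rough sketch of the difference (purely illustrative; the shard names and hash function are made up, not CouchDB's actual implementation):

   ```python
   import zlib

   SHARDS = ["1.couch", "2.couch", "3.couch", "4.couch"]

   def shard_for(doc_id: str, partitioned: bool) -> str:
       # Stand-in hash; only the *input* to the hash differs between the modes.
       key = doc_id.split(":", 1)[0] if partitioned else doc_id
       return SHARDS[zlib.crc32(key.encode()) % len(SHARDS)]

   shard_for("doc1", partitioned=False)  # whole id hashed: docs spread evenly
   shard_for("x:a", partitioned=True)    # only "x" hashed: all of partition x...
   shard_for("x:b", partitioned=True)    # ...lands on the same shard file
   ```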
   
   Now, the idea might be to have a multi-level shard mapping scheme: first based on the collection/partition, then another level based on the rest of the doc id. So a `hash(x)` would be followed by a `hash(a)`, which would determine whether the document should live on the `1-1.couch`, `1-2.couch`, ... `1-N.couch` shard files, etc. And that gets a bit tricky, as you can imagine.
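
   Continuing the illustrative sketch above, the two-level idea would look roughly like this (the sub-shard count and file names are made up):

   ```python
   import zlib

   N = 4  # hypothetical number of sub-shards per partition

   def two_level_shard_for(doc_id: str) -> str:
       partition, rest = doc_id.split(":", 1)
       first = zlib.crc32(partition.encode()) % N   # picks the "1" in 1-1.couch
       second = zlib.crc32(rest.encode()) % N       # picks the "-1", "-2", ...
       return f"{first + 1}-{second + 1}.couch"

   two_level_shard_for("x:a")  # e.g. "1-2.couch"
   two_level_shard_for("x:b")  # same first level, possibly a different second
   ```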
   
   At a higher level, this comes down to the fundamental sharding strategies used and their associated trade-offs:
      1) **random sharding (using hashing)** which is what CouchDB uses. This favors fast writes with uniformly distributed load across shards. However, it makes reads cumbersome as they have to spread the query to all the shard ranges and then wait for all to respond.
   2) **range-based sharding** which is what other databases use (FoundationDB, for instance), where the sorted key range is split. So it might end up with range `a..r` on file `1.couch` and `s..z` on `2.couch`. This scheme makes it easy to end up with a single hot shard during incrementing key updates (inserting 1M docs starting with "user15..."). And of course, as shown on purpose in the example, the shards easily become imbalanced (a lot more `a..r` docs than `s..z` ones). However, reads become really efficient, because now start/end key ranges can pick out just the relevant shards (see the sketch after this list).
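
   A rough illustration of option 2 (again made up, not how any particular database implements its range mapping):

   ```python
   import bisect

   # Sorted split points: ids below "s" go to 1.couch, the rest to 2.couch.
   SPLIT_POINTS = ["s"]
   SHARDS = ["1.couch", "2.couch"]

   def range_shard_for(doc_id: str) -> str:
       return SHARDS[bisect.bisect_right(SPLIT_POINTS, doc_id)]

   range_shard_for("apple")   # -> "1.couch"
   range_shard_for("user15")  # -> "2.couch", like every "user15..." insert: hot shard
   # A start/end key read over "a".."m" only has to touch 1.couch.
   ```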
   
   Of course, you can have multi-level schemes: hash-based sharding followed by range-based sharding, or vice versa.
   
   Another interesting idea we had been kicking around is to have the main db shards use the random hashing scheme but the indices use a range-based one, to allow more efficient start/end key queries on views. That would be nice, but it would have the same hot shard issue when building the views, since that won't be parallelized as easily.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@couchdb.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org