Posted to commits@couchdb.apache.org by ga...@apache.org on 2019/05/13 14:31:16 UTC

[couchdb-documentation] branch rfc/008-map-indexes created (now 266200b)

This is an automated email from the ASF dual-hosted git repository.

garren pushed a change to branch rfc/008-map-indexes
in repository https://gitbox.apache.org/repos/asf/couchdb-documentation.git.


      at 266200b  add rfc for map indexes

This branch includes the following new commits:

     new 266200b  add rfc for map indexes

The 1 revisions listed above as "new" are entirely new to this
repository and will be described in separate emails.  The revisions
listed as "add" were already present in the repository and have only
been added to this reference.



[couchdb-documentation] 01/01: add rfc for map indexes

Posted by ga...@apache.org.

garren pushed a commit to branch rfc/008-map-indexes
in repository https://gitbox.apache.org/repos/asf/couchdb-documentation.git

commit 266200bec031c0085790c7f264763afd9f4c42e8
Author: Garren Smith <ga...@gmail.com>
AuthorDate: Mon May 13 16:30:30 2019 +0200

    add rfc for map indexes
---
 rfcs/008-map-indexes.md | 159 ++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 159 insertions(+)

diff --git a/rfcs/008-map-indexes.md b/rfcs/008-map-indexes.md
new file mode 100644
index 0000000..961eb0b
--- /dev/null
+++ b/rfcs/008-map-indexes.md
@@ -0,0 +1,159 @@
+# Map indexes RFC
+
+---
+name: Formal RFC
+about: Submit a formal Request For Comments for consideration by the team.
+title: 'Map indexes on FoundationDB'
+labels: rfc, discussion
+assignees: ''
+
+---
+
+## Introduction
+
+This document describes the data model and index management for building and querying map indexes.
+
+## Abstract
+
+Map indexes will have their own data model stored in FoundationDB. The model groups map indexes by their design document's view signature. Each index stores its key/value pairs along with the last sequence number from the changes feed that was used to update it.
+
+Indexes will use the changes feed and be updated via the background tasks queue. If the index only needs a very small update, the update can happen in the request instead of via the background job queue.
+
+There will be new size limits on the keys (10 KB) and values (100 KB) emitted from a map function.
+
+## Requirements Language
+
+[NOTE]: # ( Do not alter the section below. Follow its instructions. )
+
+The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
+"SHOULD", "SHOULD NOT", "RECOMMENDED",  "MAY", and "OPTIONAL" in this
+document are to be interpreted as described in
+[RFC 2119](https://www.rfc-editor.org/rfc/rfc2119.txt).
+
+## Terminology
+
+`Sequence`: a 13-byte value formed by combining the current `Incarnation` of the database and the `Versionstamp` of the transaction. Sequences are monotonically increasing even when a database is relocated across FoundationDB clusters. See RFC002 [LINK TBD] for a full explanation.
+
+`View Signature`: an MD5 hash of the views, options, and view language defined in a design document.
+
+---
+
+## Detailed Description
+
+CouchDB views are used to create secondary indexes for the documents stored in a CouchDB database. An index is defined by creating map/reduce functions in a design document. This document describes building the map indexes on top of FoundationDB (FDB).
+
+### Data model
+
+A map index is created via a design document. An example is shown below:
+
+```json
+{
+  "_id": "_design/design-doc-id",
+  "_rev": "1-8d361a23b4cb8e213f0868ea3d2742c2",
+  "views": {
+    "map-view": {
+      "map": "function (doc) {\n  emit(doc._id, 1);\n}"
+    }
+  },
+  "language": "javascript"
+}
+```
+
+The view’s map function will be used to generate keys and values, which will be stored in FDB as the secondary index. The format for storing the key/value pairs is:
+
+```text
+{<database>, ?VIEWS, <view_signature>, ?VIEWS, <view_id>, ?MAP, <keys>, <_id>} -> <emitted_value>
+```
+
+Where each field is defined as:
+
+* `<database>` is the specific database namespace
+* `?VIEWS` is the standard views namespace
+* `<view_signature>` is the design document's `View Signature`
+* `<view_id>` is the name of a view defined in the design document
+* `?MAP` is the standard map namespace
+* `<keys>` are the emitted keys from the map function
+* `<_id>` is the id of the document that emitted the keys
+* `<emitted_value>` is the emitted value from the map function
+
+### Key ordering
+
+FoundationDB orders keys by byte value, which is not how CouchDB currently orders keys. To maintain the way CouchDB currently does view collation, a type value will need to be prepended to each key so that the correct sort order of null < boolean < numbers < strings < arrays < objects is maintained.
+
+Strings need additional handling because they are compared using ICU collation. An ICU sort string will be generated up front and added to the string key. This value will be used to sort the string in FDB. The original string will also be stored so that it can be used when returning the keys to the user.
+
+CouchDB allows duplicate keys to be emitted for an index. To allow for that, a counter value will be added to the end of the keys.
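+
+As a rough illustration of the type-prefix idea, the Python sketch below sorts emitted keys by a type tag first and by value second. The tag constants, `encode_key`, and `sort_string` are hypothetical names that do not come from this RFC, and `sort_string` uses `str.casefold` only as a stand-in for a real ICU sort key.
+
+```python
+from typing import Any, Tuple
+
+# Hypothetical type tags; only their relative order matters:
+# null < boolean < numbers < strings < arrays < objects.
+TYPE_NULL, TYPE_BOOL, TYPE_NUMBER, TYPE_STRING, TYPE_ARRAY, TYPE_OBJECT = range(6)
+
+
+def sort_string(value: str) -> str:
+    """Stand-in for an ICU sort key (assumption: real code would call ICU)."""
+    return value.casefold()
+
+
+def encode_key(key: Any) -> Tuple:
+    """Turn an emitted JSON key into a tuple that sorts in collation order."""
+    if key is None:
+        return (TYPE_NULL,)
+    if isinstance(key, bool):  # must come before the number check: bool is an int
+        return (TYPE_BOOL, key)
+    if isinstance(key, (int, float)):
+        return (TYPE_NUMBER, float(key))
+    if isinstance(key, str):
+        # Sort on the precomputed sort string, keep the original for output.
+        return (TYPE_STRING, sort_string(key), key)
+    if isinstance(key, list):
+        return (TYPE_ARRAY, tuple(encode_key(item) for item in key))
+    if isinstance(key, dict):
+        pairs = tuple((encode_key(k), encode_key(v)) for k, v in key.items())
+        return (TYPE_OBJECT, pairs)
+    raise TypeError(f"unsupported key type: {type(key).__name__}")
+
+
+emitted = ["b", [1], None, 2, True, {"a": 1}, "a", 1.5, False]
+ordered = sorted(emitted, key=encode_key)
+```
+
+Sorting by these tuples places null first, then booleans, numbers, strings, arrays, and objects. A real implementation would also have to pack the tuples into bytes for FDB and append the duplicate-key counter, both of which this sketch leaves out.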
+
+### Emitting document
+
+In a map function it is possible to emit the full document as the value. This will cause an issue if the document size is larger than FDB’s value limit of 100 KB. We can handle this in two possible ways. The first is to keep the hard limit of only allowing values of up to 100 KB to be emitted, so if a document exceeds that limit CouchDB will return an error. This is the preferred option.
+
+The second option is to detect that a map function is emitting the full document and then add in a foreign key reference back to the document subspace. The issue here is that CouchDB would only be able to return the latest version of the document, which would cause consistency issues when combined with the `update=false` argument.
+
+### Index Management
+
+For every document that needs to be processed for an index, we have to run the document through the JavaScript query server to get the emitted keys and values. This means that it won’t be possible to update a map/reduce index in the same transaction in which a document is updated. To account for this, we will need to keep an `id index` similar to the `id tree` that is currently kept in CouchDB. This index will hold the document id as the key and the value would be the keys that were emitted [...]
+
+`{?DATABASE, ?VIEWS, <view_signature>, ?VIEWS, ?ID_INDEX, <_id>, <view_id>} -> [emitted keys]`
+
+Each index will be built and updated via the background job queue [RFC Link TBD]. When a request for a view is received, the request process will add a job item onto the background queue for the index to be updated. A worker will take the item off the queue and update the index. Once the index has been built, the request will return with the results. This process can also be optimised in two ways. Firstly, using a new couch_events system to listen for document changes in a database and [...]
+
+Initially, an index will be built by a single worker running through the changes feed and creating the index. Ideally, that work would be parallelised so that multiple workers can build the index at the same time, which will reduce build times. This can be done by fetching the boundary keys for the changes feed and splitting those key ranges amongst different workers to build different parts of the index. This will require that for each document update processed, the worker [...]
+
+### View clean up
+
+When a design document is changed, new indexes will be built and grouped under a new `View Signature`. The old map indexes will still be in FDB. CouchDB will need to run a clean up process that scans all the `View Signature`s and links them to a design document. If a `View Signature` is not linked to any design doc, that `View Signature` namespace can be removed. This will be done via the background jobs scheduler.
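+
+The clean up scan can be pictured with the following Python sketch, offered as an illustration only. It computes an MD5 signature per design document and reports stored signatures that no current design document produces. The function names, the canonical-JSON serialisation, and the `javascript` default language are assumptions of this sketch; the RFC does not specify how the hashed input is encoded.
+
+```python
+import hashlib
+import json
+from typing import Iterable, Set
+
+
+def view_signature(ddoc: dict) -> str:
+    """Hash the views, options, and language of a design document.
+
+    Assumption: canonical JSON is used as the hash input for this sketch.
+    """
+    payload = {
+        "views": ddoc.get("views", {}),
+        "options": ddoc.get("options", {}),
+        "language": ddoc.get("language", "javascript"),  # assumed default
+    }
+    encoded = json.dumps(payload, sort_keys=True, separators=(",", ":"))
+    return hashlib.md5(encoded.encode("utf-8")).hexdigest()
+
+
+def unused_signatures(stored: Iterable[str], design_docs: Iterable[dict]) -> Set[str]:
+    """Return stored signatures that no current design document produces."""
+    live = {view_signature(ddoc) for ddoc in design_docs}
+    return set(stored) - live
+```
+
+Each signature returned by `unused_signatures` would correspond to a `View Signature` namespace that the background job could remove.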
+
+### Stale = “ok” and stable = true
+
+With the consistency guarantees CouchDB will get from FDB, `stable = true` will no longer be an option that CouchDB supports. Similarly, `stale = "ok"` will now be translated to `update = false`.
+
+### Size limits
+
+* Emitted keys will not be able to exceed 10 KB
+* Values cannot exceed 100 KB
+* In rare cases, the number of key-value pairs emitted by a map function could lead to a transaction either exceeding 10 MB in size, which isn’t allowed, or exceeding 5 MB, which impacts the performance of the cluster. Ideally, CouchDB will detect these situations and split the transaction into smaller transactions
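+
+A minimal Python sketch of how these limits might be enforced is shown below. It rejects oversized keys and values and groups the remaining rows into batches that stay under the 5 MB threshold. The names `check_row` and `split_rows` are hypothetical, and the byte constants assume decimal units (1 KB = 1,000 bytes).
+
+```python
+from typing import Iterable, Iterator, List, Tuple
+
+# Assumed byte values for the limits listed above (decimal units).
+MAX_KEY_BYTES = 10_000
+MAX_VALUE_BYTES = 100_000
+SOFT_TXN_BYTES = 5_000_000
+
+Row = Tuple[bytes, bytes]
+
+
+def check_row(key: bytes, value: bytes) -> None:
+    """Raise ValueError if an emitted key or value is over its size limit."""
+    if len(key) > MAX_KEY_BYTES:
+        raise ValueError("emitted key exceeds 10 KB")
+    if len(value) > MAX_VALUE_BYTES:
+        raise ValueError("emitted value exceeds 100 KB")
+
+
+def split_rows(rows: Iterable[Row], max_bytes: int = SOFT_TXN_BYTES) -> Iterator[List[Row]]:
+    """Yield batches of rows whose combined size stays within max_bytes."""
+    batch: List[Row] = []
+    size = 0
+    for key, value in rows:
+        check_row(key, value)
+        row_size = len(key) + len(value)
+        # Start a new batch when adding this row would cross the threshold.
+        if batch and size + row_size > max_bytes:
+            yield batch
+            batch, size = [], 0
+        batch.append((key, value))
+        size += row_size
+    if batch:
+        yield batch
+```
+
+Each yielded batch would then be written in its own transaction. The sketch counts only key and value bytes, so a real implementation would need some headroom for other per-transaction overhead.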
+
+## Advantages
+
+* Map indexes will work on FoundationDB with the same behaviour as current CouchDB 2.x
+* Options like `stale = "ok"` and `stable = true` will no longer be needed
+
+## Disadvantages
+
+* Size limits on keys and values
+
+## Key Changes
+
+* Indexes are stored in FoundationDB
+* Indexes will be built via the background job queue
+* ICU sort strings will be generated ahead of time for each key that is a string
+
+## Applications and Modules affected
+
+* couch_mrview will be removed and replaced with a new indexing OTP application
+
+## HTTP API additions
+
+The API will remain the same.
+
+## HTTP API deprecations
+
+No deprecations.
+
+## Security Considerations
+
+None have been identified.
+
+## References
+
+* TBD link to background tasks RFC
+* [Original mailing list discussion](https://lists.apache.org/thread.html/5cb6e1dbe9d179869576b6b2b67bca8d86b30583bced9924d0bbe122@%3Cdev.couchdb.apache.org%3E)
+
+## Acknowledgements
+
+Thanks to everyone who participated in the mailing list discussion:
+
+* @janl
+* @kocolosk
+* @willholley
+* @mikerhodes