You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@couchdb.apache.org by da...@apache.org on 2019/01/23 20:15:44 UTC
[couchdb-documentation] branch feature/database-partitions updated:
WIP: Docs for partitioned dbs
This is an automated email from the ASF dual-hosted git repository.
davisp pushed a commit to branch feature/database-partitions
in repository https://gitbox.apache.org/repos/asf/couchdb-documentation.git
The following commit(s) were added to refs/heads/feature/database-partitions by this push:
new 6c52725 WIP: Docs for partitioned dbs
6c52725 is described below
commit 6c52725f2fbf9b3220d471a804caeb9c8ac7849c
Author: Paul J. Davis <pa...@gmail.com>
AuthorDate: Wed Jan 23 14:15:19 2019 -0600
WIP: Docs for partitioned dbs
---
src/api/database/common.rst | 5 +
src/index.rst | 1 +
src/partitioned-dbs/index.rst | 368 ++++++++++++++++++++++++++++++++++++++++++
templates/pages/index.html | 9 ++
4 files changed, 383 insertions(+)
diff --git a/src/api/database/common.rst b/src/api/database/common.rst
index 308ab69..6cf9d24 100644
--- a/src/api/database/common.rst
+++ b/src/api/database/common.rst
@@ -86,6 +86,8 @@
:>json string update_seq: An opaque string that describes the state
of the database. Do not rely on this string for counting the number
of updates.
+ :>json boolean props.partitioned: (optional) If present and true this
+ indicates the the database is a partitioned database.
:code 200: Request completed successfully
:code 404: Requested database not found
@@ -126,6 +128,7 @@
"other": {
"data_size": 66982448
},
+ "props": {},
"purge_seq": 0,
"sizes": {
"active": 65031503,
@@ -159,6 +162,8 @@
:query integer n: Replicas. The number of copies of the database in the
cluster. The default is 3, unless overridden in the
:config:option:`cluster config <cluster/n>` .
+ :query boolean partitioned: Whether to create a partitioned database.
+ Default is false.
:<header Accept: - :mimetype:`application/json`
- :mimetype:`text/plain`
:>header Content-Type: - :mimetype:`application/json`
diff --git a/src/index.rst b/src/index.rst
index a592539..9251a31 100644
--- a/src/index.rst
+++ b/src/index.rst
@@ -47,6 +47,7 @@ Apache CouchDB
api/index
json-structure
query-server/index
+ partitioned-dbs/index
.. toctree::
:caption: Other
diff --git a/src/partitioned-dbs/index.rst b/src/partitioned-dbs/index.rst
new file mode 100644
index 0000000..805f68d
--- /dev/null
+++ b/src/partitioned-dbs/index.rst
@@ -0,0 +1,368 @@
+.. Licensed under the Apache License, Version 2.0 (the "License"); you may not
+.. use this file except in compliance with the License. You may obtain a copy of
+.. the License at
+..
+.. http://www.apache.org/licenses/LICENSE-2.0
+..
+.. Unless required by applicable law or agreed to in writing, software
+.. distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
+.. WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
+.. License for the specific language governing permissions and limitations under
+.. the License.
+
+.. _partitioned-dbs:
+
+=====================
+Partitioned Databases
+=====================
+
+As a means to introducing partitioned databases we'll consider a motivating
+use case to describe the benefits of this feature. For this example we'll
+consider a database that stores readings from a large network of soil
+moisture sensors.
+
+.. note::
+ Before reading this document you should be familiar with the
+ :ref:`theory <cluster/theory>` of :ref:`sharding <cluster/sharding>`
+ in CouchDB.
+
+
+Traditionally, a document in this database may have something like the
+following structure:
+
+.. code-block:: javascript
+
+ {
+ "_id": "sensor-reading-ca33c748-2d2c-4ed1-8abf-1bca4d9d03cf",
+ "_rev":"1-14e8f3262b42498dbd5c672c9d461ff0",
+ "sensor_id": "sensor-260",
+ "location": [41.6171031, -93.7705674],
+ "field_name": "Bob's Corn Field #5",
+ "readings": [
+ ["2019-01-21T00:00:00", 0.15],
+ ["2019-01-21T06:00:00", 0.14],
+ ["2019-01-21T12:00:00", 0.16],
+ ["2019-01-21T18:00:00", 0.11]
+ ]
+ }
+
+
+.. note::
+ While this example uses IoT sensors, the main thing to consider is that
+ there is a logical grouping of documents. Similar use cases might be
+ documents grouped by user or scientific data grouped by experiment.
+
+
+So we've got a bunch of sensors, all grouped by the field they monitor
+along with their readouts for a given day (or other appropriate time period).
+
+Along with our documents we might expect to have two secondary indexes
+for querying our database that might look something like:
+
+.. code-block:: javascript
+
+ function(doc) {
+ if(doc._id.indexOf("sensor-reading-") != 0) {
+ return;
+ }
+ for(var r in doc.readings) {
+ emit([doc.sensor_id, r[0]], r[1])
+ }
+ }
+
+and:
+
+.. code-block:: javascript
+
+ function(doc) {
+ if(doc._id.indexOf("sensor-reading-") != 0) {
+ return;
+ }
+ emit(doc.field_name, doc.sensor_id)
+ }
+
+With these two indexes defined we can easily find all requests for a given
+sensor or list all sensors in a given field.
+
+Unfortunately, in CouchDB when we read from either of these indexes it
+requires finding a copy of every shard and asking for any documents related
+to the particular sensor or field. This means that as our database scales
+up the number of shards, the more work every index request must perform.
+Fortunately for you dear reader, partitioned databases were created to solve
+this precise problem.
+
+
+What is a partition?
+====================
+
+In the previous section we introduced a hypothetical database that contains
+sensor readings from an IoT field monitoring service. In this particular
+use case it's quite logical to group all documents by their ``sensor_id``
+field. In this case we would call the sensor_id the partition.
+
+A good partition has two basic properties. First, it should have a high
+cardinality. That is, there is a large number of values for the partition.
+A database that has a single partition would be an anti-pattern for this
+feature. Secondly, the amount of data per partition should be "small". The
+general recommendation is to limit individual partitions to less than ten
+gigabytes of data. Which for the example sensor documents equates to roughly
+60,000 years of data.
+
+
+Why use partitions?
+===================
+
+Speed!
+
+
+Partitions By Example
+=====================
+
+To create a partitioned database we simply need to pass a query string
+parameter.
+
+.. code-block:: bash
+
+ shell> curl -X PUT http://127.0.0.1:5984/my_new_db?partitioned=true
+ {"ok":true}
+
+To see that our database is partitioned we can look at the database
+information:
+
+.. code-block:: bash
+
+ shell> curl http://127.0.0.1:5984/my_new_db
+ {
+ "cluster": {
+ "n": 3,
+ "q": 8,
+ "r": 2,
+ "w": 2
+ },
+ "compact_running": false,
+ "data_size": 0,
+ "db_name": "my_new_db",
+ "disk_format_version": 7,
+ "disk_size": 66784,
+ "doc_count": 0,
+ "doc_del_count": 0,
+ "instance_start_time": "0",
+ "other": {
+ "data_size": 0
+ },
+ "props": {
+ "partitioned": true
+ },
+ "purge_seq": "0-g1AAAAFDeJzLYWBg4M...",
+ "sizes": {
+ "active": 0,
+ "external": 0,
+ "file": 66784
+ },
+ "update_seq": "0-g1AAAAFDeJzLYWBg4M..."
+ }
+
+
+You'll now see that the ``"props"`` member contains ``"partitioned": true``.
+
+.. note::
+
+ The format for document ids in a partitioned database is
+ ``partition:docid``. Every regular document (i.e., everything
+ except design and local documents) in a partitioned database
+ must follow this format.
+
+Now that we've created a partitioned database its time to add some documents.
+Using our earlier example we could do this as such:
+
+.. code-block:: bash
+
+ shell> cat doc.json
+ {
+ "_id": "sensor-260:sensor-reading-ca33c748-2d2c-4ed1-8abf-1bca4d9d03cf",
+ "sensor_id": "sensor-260",
+ "location": [41.6171031, -93.7705674],
+ "field_name": "Bob's Corn Field #5",
+ "readings": [
+ ["2019-01-21T00:00:00", 0.15],
+ ["2019-01-21T06:00:00", 0.14],
+ ["2019-01-21T12:00:00", 0.16],
+ ["2019-01-21T18:00:00", 0.11]
+ ]
+ }
+ shell> $ curl -X POST -H "Content-Type: application/json" \
+ http://127.0.0.1:5984/my_new_db -d @doc.json
+ {
+ "ok": true,
+ "id": "sensor-260:sensor-reading-ca33c748-2d2c-4ed1-8abf-1bca4d9d03cf",
+ "rev": "1-05ed6f7abf84250e213fcb847387f6f5"
+ }
+
+The only change required to the first example document is that we are now
+including the partition name in the document id by prepending the
+old id separated by a colon.
+
+.. note::
+
+ The partition name in the document id is not magical. Internally
+ the database is simply using only the partition for hashing
+ the document to a given shard instead of the entire document id.
+
+Working with documents in a partitioned database is no different than
+a non-partitioned database. All APIs are available and existing client
+code will all work seamlessly.
+
+Now that we have created a document we can get some info about the partition
+containing the document:
+
+.. code-block:: bash
+
+ shell> curl http://127.0.0.1:5984/my_new_db/_partition/sensor-260
+ {
+ "db_name": "my_new_db",
+ "doc_count": 1,
+ "doc_del_count": 0,
+ "partition": "sensor-260",
+ "sizes": {
+ "active": 244,
+ "external": 347
+ }
+ }
+
+And we can also list all documents in a partition:
+
+.. code-block: bash
+ shell> curl http://127.0.0.1:5984/my_new_db/_partition/sensor-260/_all_docs
+ {"total_rows": 1, "offset": 0, "rows":[
+ {
+ "id":"sensor-260:sensor-reading-ca33c748-2d2c-4ed1-8abf-1bca4d9d03cf",
+ "key":"sensor-260:sensor-reading-ca33c748-2d2c-4ed1-8abf-1bca4d9d03cf",
+ "value": {"rev": "1-05ed6f7abf84250e213fcb847387f6f5"}
+ }
+ ]}
+
+Note that we can use all of the normal bells and whistles available to
+``_all_docs`` requests. Accessing ``_all_docs`` through the
+``/dbname/_partition/name/_all_docs`` endpoint is mostly a convenience
+so that requests are guaranteed to be scoped to a given partition. Users
+are free to use the normal ``/dbname/_all_docs`` to read documents from
+multiple partitions.
+
+Next, we'll create a design document containing our index for
+getting all readings from a given sensor. The map function is similar to
+our earlier example except we've accounted for the change in the document
+id.
+
+.. code-block:: javascript
+
+ function(doc) {
+ if(doc._id.indexOf(":sensor-reading-") < 0) {
+ return;
+ }
+ for(var r in doc.readings) {
+ emit([doc.sensor_id, r[0]], r[1])
+ }
+ }
+
+We can go ahead and upload our design document and try out a partitioned
+query:
+
+.. code-block:: bash
+
+ shell> cat ddoc.json
+ {
+ "_id": "_design/sensor-readings",
+ "views": {
+ "by_sensor": {
+ "map": "function(doc) { ... }"
+ }
+ }
+ }
+ shell> curl http://127.0.0.1:5984/my_new_db/_partition/sensor-260/_design/sensor-readings/_view/by_sensor
+ {"total_rows":4,"offset":0,"rows":[
+ {"id":"sensor-260:sensor-reading-ca33c748-2d2c-4ed1-8abf-1bca4d9d03cf","key":["sensor-260","0"],"value":null},
+ {"id":"sensor-260:sensor-reading-ca33c748-2d2c-4ed1-8abf-1bca4d9d03cf","key":["sensor-260","1"],"value":null},
+ {"id":"sensor-260:sensor-reading-ca33c748-2d2c-4ed1-8abf-1bca4d9d03cf","key":["sensor-260","2"],"value":null},
+ {"id":"sensor-260:sensor-reading-ca33c748-2d2c-4ed1-8abf-1bca4d9d03cf","key":["sensor-260","3"],"value":null}
+ ]}
+
+Hooray! Our first partitioned query. For experienced users that may not
+be the most exciting development given that the only things that have
+changed are a slight tweak to the document id and accessing views with
+a slightly different path. However, for anyone that likes performance
+improvements its actually a big deal. By knowing that the view results
+are all located within the provided partition name our partitioned
+queries now perform nearly as fast as document lookups!
+
+The last thing we'll look at is how to query data across multiple partitions.
+For that we'll implement the example sensors by field query from our
+initial example. The map function will use the same update to account
+for the new document id format, but is otherwise identical to the previous
+version:
+
+.. code-block:: javascript
+
+ function(doc) {
+ if(doc._id.indexOf(":sensor-reading-") < 0) {
+ return;
+ }
+ emit(doc.field_name, doc.sensor_id)
+ }
+
+Next we'll create a new design doc with this function. Be sure to notice
+that the ``"options"`` member contains ``"partitioned": false``.
+
+.. code-block:: bash
+
+ shell> cat ddoc2.json
+ {
+ "_id": "_design/all_sensors",
+ "options": {
+ "partitioned": false
+ },
+ "views": {
+ "by_field": {
+ "map": "function(doc) { ... }"
+ }
+ }
+ }
+ shell> $ curl -X POST -H "Content-Type: application/json" http://127.0.0.1:5984/my_new_db -d @ddoc2.json
+ {
+ "ok": true,
+ "id": "_design/all_sensors",
+ "rev": "1-4a8188d80fab277fccf57bdd7154dec1"
+ }
+
+.. note::
+
+ Design documents in a partitioned database default to being
+ partitioned. Design documents that contain views for queries
+ across multiple partitions must contain the ``"partitioned": false``
+ member in the ``"options"`` object.
+
+.. note::
+
+ Design documents are either partitioned or global. They cannot
+ contain a mix of partitioned and global indexes.
+
+And to see a request showing us all sensors in a field we would use a
+request like:
+
+.. code-block:: bash
+
+ shell> curl -u adm:pass http://127.0.0.1:15984/my_new_db/_design/all_sensors/_view/by_field
+ {"total_rows":1,"offset":0,"rows":[
+ {"id":"sensor-260:sensor-reading-ca33c748-2d2c-4ed1-8abf-1bca4d9d03cf","key":"Bob's Corn Field #5","value":"sensor-260"}
+ ]}
+
+Notice that we're not using the ``/dbname/_partition/...`` path for global
+queries. This is because global queries by definition to not cover a single
+partition. Other than having the `"partitioned": false` parameter in the
+design document, global design documents and queries are identical in
+behavior to design documents on non-partitioned databases.
+
+.. warning::
+
+ To be clear, this means that global queries perform identically to
+ queries on non-partitioned databases. Only partitioned queries
+ on a partitioned database benefit from the performance improvements.
diff --git a/templates/pages/index.html b/templates/pages/index.html
index 4c26bfc..5902bf0 100644
--- a/templates/pages/index.html
+++ b/templates/pages/index.html
@@ -169,6 +169,15 @@ specific language governing permissions and limitations under the License.
how to take care of your CouchDB
</span>
</p>
+ <p class="biglink">
+ <a class="biglink" href="{{ pathto("partitioned-dbs/index") }}">
+ Partitioned Databases
+ </a>
+ <br />
+ <span class="linkdescr">
+ how to use Partitioned Databases in CouchDB
+ </span>
+ </p>
</td>
</tr>
</table>