Posted to commits@couchdb.apache.org by da...@apache.org on 2019/01/23 20:15:44 UTC

[couchdb-documentation] branch feature/database-partitions updated: WIP: Docs for partitioned dbs

This is an automated email from the ASF dual-hosted git repository.

davisp pushed a commit to branch feature/database-partitions
in repository https://gitbox.apache.org/repos/asf/couchdb-documentation.git


The following commit(s) were added to refs/heads/feature/database-partitions by this push:
     new 6c52725  WIP: Docs for partitioned dbs
6c52725 is described below

commit 6c52725f2fbf9b3220d471a804caeb9c8ac7849c
Author: Paul J. Davis <pa...@gmail.com>
AuthorDate: Wed Jan 23 14:15:19 2019 -0600

    WIP: Docs for partitioned dbs
---
 src/api/database/common.rst   |   5 +
 src/index.rst                 |   1 +
 src/partitioned-dbs/index.rst | 368 ++++++++++++++++++++++++++++++++++++++++++
 templates/pages/index.html    |   9 ++
 4 files changed, 383 insertions(+)

diff --git a/src/api/database/common.rst b/src/api/database/common.rst
index 308ab69..6cf9d24 100644
--- a/src/api/database/common.rst
+++ b/src/api/database/common.rst
@@ -86,6 +86,8 @@
     :>json string update_seq: An opaque string that describes the state
       of the database. Do not rely on this string for counting the number
       of updates.
+    :>json boolean props.partitioned: (optional) If present and true, this
+      indicates that the database is a partitioned database.
     :code 200: Request completed successfully
     :code 404: Requested database not found
 
@@ -126,6 +128,7 @@
             "other": {
                 "data_size": 66982448
             },
+            "props": {},
             "purge_seq": 0,
             "sizes": {
                 "active": 65031503,
@@ -159,6 +162,8 @@
     :query integer n: Replicas. The number of copies of the database in the
       cluster. The default is 3, unless overridden in the
       :config:option:`cluster config <cluster/n>` .
+    :query boolean partitioned: Whether to create a partitioned database.
+      Default is false.
     :<header Accept: - :mimetype:`application/json`
                      - :mimetype:`text/plain`
     :>header Content-Type: - :mimetype:`application/json`
diff --git a/src/index.rst b/src/index.rst
index a592539..9251a31 100644
--- a/src/index.rst
+++ b/src/index.rst
@@ -47,6 +47,7 @@ Apache CouchDB
     api/index
     json-structure
     query-server/index
+    partitioned-dbs/index
 
 .. toctree::
     :caption: Other
diff --git a/src/partitioned-dbs/index.rst b/src/partitioned-dbs/index.rst
new file mode 100644
index 0000000..805f68d
--- /dev/null
+++ b/src/partitioned-dbs/index.rst
@@ -0,0 +1,368 @@
+.. Licensed under the Apache License, Version 2.0 (the "License"); you may not
+.. use this file except in compliance with the License. You may obtain a copy of
+.. the License at
+..
+..   http://www.apache.org/licenses/LICENSE-2.0
+..
+.. Unless required by applicable law or agreed to in writing, software
+.. distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
+.. WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
+.. License for the specific language governing permissions and limitations under
+.. the License.
+
+.. _partitioned-dbs:
+
+=====================
+Partitioned Databases
+=====================
+
+To introduce partitioned databases we'll consider a motivating use case
+that illustrates the benefits of this feature. For this example we'll
+consider a database that stores readings from a large network of soil
+moisture sensors.
+
+.. note::
+    Before reading this document you should be familiar with the
+    :ref:`theory <cluster/theory>` of :ref:`sharding <cluster/sharding>`
+    in CouchDB.
+
+
+Traditionally, a document in this database may have something like the
+following structure:
+
+.. code-block:: javascript
+
+    {
+        "_id": "sensor-reading-ca33c748-2d2c-4ed1-8abf-1bca4d9d03cf",
+        "_rev":"1-14e8f3262b42498dbd5c672c9d461ff0",
+        "sensor_id": "sensor-260",
+        "location": [41.6171031, -93.7705674],
+        "field_name": "Bob's Corn Field #5",
+        "readings": [
+            ["2019-01-21T00:00:00", 0.15],
+            ["2019-01-21T06:00:00", 0.14],
+            ["2019-01-21T12:00:00", 0.16],
+            ["2019-01-21T18:00:00", 0.11]
+        ]
+    }
+
+
+.. note::
+    While this example uses IoT sensors, the main thing to consider is that
+    there is a logical grouping of documents. Similar use cases might be
+    documents grouped by user or scientific data grouped by experiment.
+
+
+So we've got a bunch of sensors, all grouped by the field they monitor
+along with their readouts for a given day (or other appropriate time period).
+
+Along with our documents we might expect to have two secondary indexes
+for querying our database that might look something like:
+
+.. code-block:: javascript
+
+    function(doc) {
+        if(doc._id.indexOf("sensor-reading-") != 0) {
+            return;
+        }
+        doc.readings.forEach(function(r) {
+            emit([doc.sensor_id, r[0]], r[1]);
+        });
+    }
+
+and:
+
+.. code-block:: javascript
+
+    function(doc) {
+        if(doc._id.indexOf("sensor-reading-") != 0) {
+            return;
+        }
+        emit(doc.field_name, doc.sensor_id)
+    }
+
+With these two indexes defined we can easily find all readings for a given
+sensor or list all sensors in a given field.
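+
+To check what rows a map function like the first one produces, it can be
+run outside CouchDB against a sample document. The following is a minimal
+sketch rather than part of CouchDB itself: the ``rows`` array and the local
+``emit`` are stand-ins for what the view server provides, ``bySensor`` is
+just a name chosen for this example, and the function walks ``readings`` by
+index so that each ``r`` is a ``[timestamp, value]`` pair.
+
+.. code-block:: javascript
+
+    // Stand-in for the emit() that the CouchDB view server provides.
+    // It collects rows in an array so they can be inspected locally.
+    var rows = [];
+    function emit(key, value) {
+        rows.push({key: key, value: value});
+    }
+
+    // By-sensor map function, walking the readings array by index.
+    function bySensor(doc) {
+        if(doc._id.indexOf("sensor-reading-") != 0) {
+            return;
+        }
+        for(var i = 0; i < doc.readings.length; i++) {
+            var r = doc.readings[i];  // r is a [timestamp, value] pair
+            emit([doc.sensor_id, r[0]], r[1]);
+        }
+    }
+
+    // Shortened copy of the example document from above.
+    bySensor({
+        "_id": "sensor-reading-ca33c748-2d2c-4ed1-8abf-1bca4d9d03cf",
+        "sensor_id": "sensor-260",
+        "readings": [
+            ["2019-01-21T00:00:00", 0.15],
+            ["2019-01-21T06:00:00", 0.14]
+        ]
+    });
+    console.log(JSON.stringify(rows));
+
+Run in a JavaScript shell such as Node.js, this prints the two collected
+rows, each keyed by ``[sensor_id, timestamp]``.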
+
+Unfortunately, reading from either of these indexes in CouchDB requires
+finding a copy of every shard and asking it for any rows related to the
+particular sensor or field. This means that as the number of shards in our
+database grows, every index request has to perform more work.
+Fortunately for you, dear reader, partitioned databases were created to solve
+this precise problem.
+
+
+What is a partition?
+====================
+
+In the previous section we introduced a hypothetical database that contains
+sensor readings from an IoT field monitoring service. In this particular
+use case it's quite logical to group all documents by their ``sensor_id``
+field. In this case we would call the ``sensor_id`` the partition.
+
+A good partition has two basic properties. First, it should have a high
+cardinality. That is, there is a large number of values for the partition.
+A database that has a single partition would be an anti-pattern for this
+feature. Secondly, the amount of data per partition should be "small". The
+general recommendation is to limit individual partitions to less than ten
+gigabytes of data, which for the example sensor documents equates to roughly
+60,000 years of data.
+
+
+Why use partitions?
+===================
+
+Speed!
+
+
+Partitions By Example
+=====================
+
+To create a partitioned database we simply need to pass a query string
+parameter.
+
+.. code-block:: bash
+
+    shell> curl -X PUT http://127.0.0.1:5984/my_new_db?partitioned=true
+    {"ok":true}
+
+To see that our database is partitioned we can look at the database
+information:
+
+.. code-block:: bash
+
+    shell> curl http://127.0.0.1:5984/my_new_db
+    {
+      "cluster": {
+        "n": 3,
+        "q": 8,
+        "r": 2,
+        "w": 2
+      },
+      "compact_running": false,
+      "data_size": 0,
+      "db_name": "my_new_db",
+      "disk_format_version": 7,
+      "disk_size": 66784,
+      "doc_count": 0,
+      "doc_del_count": 0,
+      "instance_start_time": "0",
+      "other": {
+        "data_size": 0
+      },
+      "props": {
+        "partitioned": true
+      },
+      "purge_seq": "0-g1AAAAFDeJzLYWBg4M...",
+      "sizes": {
+        "active": 0,
+        "external": 0,
+        "file": 66784
+      },
+      "update_seq": "0-g1AAAAFDeJzLYWBg4M..."
+    }
+
+
+You'll now see that the ``"props"`` member contains ``"partitioned": true``.
+
+.. note::
+
+    The format for document ids in a partitioned database is
+    ``partition:docid``. Every regular document (i.e., everything
+    except design and local documents) in a partitioned database
+    must follow this format.
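+
+The id format can be illustrated with a small client-side helper. This is
+only a sketch: ``splitDocId`` is a hypothetical name, not a CouchDB API, and
+it assumes that the partition is everything before the first colon in the
+id. It is meant for regular documents only, since design and local documents
+do not follow the format.
+
+.. code-block:: javascript
+
+    // Hypothetical helper (not part of CouchDB): split an id of the form
+    // "partition:docid" at the first colon.
+    function splitDocId(id) {
+        var idx = id.indexOf(":");
+        // Reject ids with no colon, an empty partition or an empty docid.
+        if(idx < 1 || idx === id.length - 1) {
+            throw new Error("expected an id of the form partition:docid");
+        }
+        return {
+            partition: id.slice(0, idx),
+            docid: id.slice(idx + 1)
+        };
+    }
+
+    var parts = splitDocId(
+        "sensor-260:sensor-reading-ca33c748-2d2c-4ed1-8abf-1bca4d9d03cf");
+    console.log(parts.partition);
+    console.log(parts.docid);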
+
+Now that we've created a partitioned database it's time to add some documents.
+Using our earlier example we could do this as follows:
+
+.. code-block:: bash
+
+    shell> cat doc.json
+    {
+        "_id": "sensor-260:sensor-reading-ca33c748-2d2c-4ed1-8abf-1bca4d9d03cf",
+        "sensor_id": "sensor-260",
+        "location": [41.6171031, -93.7705674],
+        "field_name": "Bob's Corn Field #5",
+        "readings": [
+            ["2019-01-21T00:00:00", 0.15],
+            ["2019-01-21T06:00:00", 0.14],
+            ["2019-01-21T12:00:00", 0.16],
+            ["2019-01-21T18:00:00", 0.11]
+        ]
+    }
+    shell> curl -X POST -H "Content-Type: application/json" \
+                http://127.0.0.1:5984/my_new_db -d @doc.json
+    {
+        "ok": true,
+        "id": "sensor-260:sensor-reading-ca33c748-2d2c-4ed1-8abf-1bca4d9d03cf",
+        "rev": "1-05ed6f7abf84250e213fcb847387f6f5"
+    }
+
+The only change required to the first example document is that we now
+include the partition name in the document id by prepending it to the
+old id, separated by a colon.
+
+.. note::
+
+    The partition name in the document id is not magical. Internally
+    the database is simply using only the partition for hashing
+    the document to a given shard instead of the entire document id.
+
+Working with documents in a partitioned database is no different from
+working with them in a non-partitioned database. All APIs are available
+and existing client code will work seamlessly.
+
+Now that we have created a document we can get some info about the partition
+containing the document:
+
+.. code-block:: bash
+
+    shell> curl http://127.0.0.1:5984/my_new_db/_partition/sensor-260
+    {
+      "db_name": "my_new_db",
+      "doc_count": 1,
+      "doc_del_count": 0,
+      "partition": "sensor-260",
+      "sizes": {
+        "active": 244,
+        "external": 347
+      }
+    }
+
+And we can also list all documents in a partition:
+
+.. code-block:: bash
+
+    shell> curl http://127.0.0.1:5984/my_new_db/_partition/sensor-260/_all_docs
+    {"total_rows": 1, "offset": 0, "rows":[
+        {
+            "id":"sensor-260:sensor-reading-ca33c748-2d2c-4ed1-8abf-1bca4d9d03cf",
+            "key":"sensor-260:sensor-reading-ca33c748-2d2c-4ed1-8abf-1bca4d9d03cf",
+            "value": {"rev": "1-05ed6f7abf84250e213fcb847387f6f5"}
+        }
+    ]}
+
+Note that we can use all of the normal bells and whistles available to
+``_all_docs`` requests. Accessing ``_all_docs`` through the
+``/dbname/_partition/name/_all_docs`` endpoint is mostly a convenience
+so that requests are guaranteed to be scoped to a given partition. Users
+are free to use the normal ``/dbname/_all_docs`` to read documents from
+multiple partitions.
+
+Next, we'll create a design document containing our index for
+getting all readings from a given sensor. The map function is similar to
+our earlier example except we've accounted for the change in the document
+id.
+
+.. code-block:: javascript
+
+    function(doc) {
+        if(doc._id.indexOf(":sensor-reading-") < 0) {
+            return;
+        }
+        doc.readings.forEach(function(r) {
+            emit([doc.sensor_id, r[0]], r[1]);
+        });
+    }
+
+We can go ahead and upload our design document and try out a partitioned
+query:
+
+.. code-block:: bash
+
+    shell> cat ddoc.json
+    {
+        "_id": "_design/sensor-readings",
+        "views": {
+            "by_sensor": {
+                "map": "function(doc) { ... }"
+            }
+        }
+    }
+    shell> curl http://127.0.0.1:5984/my_new_db/_partition/sensor-260/_design/sensor-readings/_view/by_sensor
+    {"total_rows":4,"offset":0,"rows":[
+    {"id":"sensor-260:sensor-reading-ca33c748-2d2c-4ed1-8abf-1bca4d9d03cf","key":["sensor-260","2019-01-21T00:00:00"],"value":0.15},
+    {"id":"sensor-260:sensor-reading-ca33c748-2d2c-4ed1-8abf-1bca4d9d03cf","key":["sensor-260","2019-01-21T06:00:00"],"value":0.14},
+    {"id":"sensor-260:sensor-reading-ca33c748-2d2c-4ed1-8abf-1bca4d9d03cf","key":["sensor-260","2019-01-21T12:00:00"],"value":0.16},
+    {"id":"sensor-260:sensor-reading-ca33c748-2d2c-4ed1-8abf-1bca4d9d03cf","key":["sensor-260","2019-01-21T18:00:00"],"value":0.11}
+    ]}
+
+Hooray! Our first partitioned query. For experienced users, that may not
+be the most exciting development, given that the only things that have
+changed are a slight tweak to the document id and accessing views with
+a slightly different path. However, for anyone who likes performance
+improvements, it's actually a big deal. Because the view results are
+known to be located within the provided partition, our partitioned
+queries now perform nearly as fast as document lookups!
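+
+Partitioned view requests take the usual view parameters, so a client can
+narrow a query to a range of keys. The following sketch builds such a URL.
+``partitionViewUrl`` is a hypothetical helper, not a CouchDB API, and the
+example range assumes the view keys are ``[sensor_id, timestamp]`` pairs,
+as the map function intends.
+
+.. code-block:: javascript
+
+    // Hypothetical helper: build the URL for a partitioned view query.
+    function partitionViewUrl(base, db, partition, ddoc, view, params) {
+        var url = base + "/" + encodeURIComponent(db) +
+            "/_partition/" + encodeURIComponent(partition) +
+            "/_design/" + encodeURIComponent(ddoc) +
+            "/_view/" + encodeURIComponent(view);
+        var query = Object.keys(params || {}).map(function(name) {
+            // View keys are JSON values, so JSON-encode them before
+            // URL-encoding.
+            var value = JSON.stringify(params[name]);
+            return name + "=" + encodeURIComponent(value);
+        });
+        return query.length ? url + "?" + query.join("&") : url;
+    }
+
+    var url = partitionViewUrl(
+        "http://127.0.0.1:5984", "my_new_db", "sensor-260",
+        "sensor-readings", "by_sensor", {
+            startkey: ["sensor-260", "2019-01-21T06:00:00"],
+            endkey: ["sensor-260", "2019-01-21T18:00:00"]
+        });
+    console.log(url);
+
+The resulting URL can be passed to ``curl`` or any HTTP client in the same
+way as the unparameterized request above.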
+
+The last thing we'll look at is how to query data across multiple partitions.
+For that we'll implement the sensors-by-field query from our initial
+example. The map function uses the same change to account for the new
+document id format, but is otherwise identical to the previous version:
+
+.. code-block:: javascript
+
+    function(doc) {
+        if(doc._id.indexOf(":sensor-reading-") < 0) {
+            return;
+        }
+        emit(doc.field_name, doc.sensor_id)
+    }
+
+Next we'll create a new design doc with this function. Be sure to notice
+that the ``"options"`` member contains ``"partitioned": false``.
+
+.. code-block:: bash
+
+    shell> cat ddoc2.json
+    {
+      "_id": "_design/all_sensors",
+      "options": {
+        "partitioned": false
+      },
+      "views": {
+        "by_field": {
+          "map": "function(doc) { ... }"
+        }
+      }
+    }
+    shell> curl -X POST -H "Content-Type: application/json" http://127.0.0.1:5984/my_new_db -d @ddoc2.json
+    {
+        "ok": true,
+        "id": "_design/all_sensors",
+        "rev": "1-4a8188d80fab277fccf57bdd7154dec1"
+    }
+
+.. note::
+
+    Design documents in a partitioned database default to being
+    partitioned. Design documents that contain views for queries
+    across multiple partitions must contain the ``"partitioned": false``
+    member in the ``"options"`` object.
+
+.. note::
+
+    Design documents are either partitioned or global. They cannot
+    contain a mix of partitioned and global indexes.
+
+And to see a request showing us all sensors in a field we would use a
+request like:
+
+.. code-block:: bash
+
+    shell> curl http://127.0.0.1:5984/my_new_db/_design/all_sensors/_view/by_field
+    {"total_rows":1,"offset":0,"rows":[
+    {"id":"sensor-260:sensor-reading-ca33c748-2d2c-4ed1-8abf-1bca4d9d03cf","key":"Bob's Corn Field #5","value":"sensor-260"}
+    ]}
+
+Notice that we're not using the ``/dbname/_partition/...`` path for global
+queries. This is because global queries by definition do not cover a single
+partition. Other than having the ``"partitioned": false`` parameter in the
+design document, global design documents and queries are identical in
+behavior to design documents on non-partitioned databases.
+
+.. warning::
+
+    To be clear, this means that global queries perform identically to
+    queries on non-partitioned databases. Only partitioned queries
+    on a partitioned database benefit from the performance improvements.
diff --git a/templates/pages/index.html b/templates/pages/index.html
index 4c26bfc..5902bf0 100644
--- a/templates/pages/index.html
+++ b/templates/pages/index.html
@@ -169,6 +169,15 @@ specific language governing permissions and limitations under the License.
           how to take care of your CouchDB
         </span>
       </p>
+      <p class="biglink">
+        <a class="biglink" href="{{ pathto("partitioned-dbs/index") }}">
+          Partitioned Databases
+        </a>
+        <br />
+        <span class="linkdescr">
+          how to use Partitioned Databases in CouchDB
+        </span>
+      </p>
     </td>
   </tr>
 </table>