Posted to commits@couchdb.apache.org by va...@apache.org on 2020/08/12 23:02:35 UTC

[couchdb-documentation] 01/01: [RFC] Replicator Implementation for CouchDB 4.x

This is an automated email from the ASF dual-hosted git repository.

vatamane pushed a commit to branch replicator-rfc
in repository https://gitbox.apache.org/repos/asf/couchdb-documentation.git

commit b6260c5048818d3274c9af8e66a0ed8bb1d88afc
Author: Nick Vatamaniuc <va...@apache.org>
AuthorDate: Wed Aug 12 19:01:34 2020 -0400

    [RFC] Replicator Implementation for CouchDB 4.x
---
 rfcs/016-fdb-replicator.md | 337 +++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 337 insertions(+)

diff --git a/rfcs/016-fdb-replicator.md b/rfcs/016-fdb-replicator.md
new file mode 100644
index 0000000..5b7e9b2
--- /dev/null
+++ b/rfcs/016-fdb-replicator.md
@@ -0,0 +1,337 @@
+---
+name: Formal RFC
+about: Submit a formal Request For Comments for consideration by the team.
+title: 'Replicator Implementation On FDB'
+labels: rfc, discussion
+assignees: ''
+
+---
+
+# Introduction
+
+This document describes the design of the replicator application for CouchDB
+4.x. The replicator will rely on the `couch_jobs` application
+([RFC](https://github.com/apache/couchdb-documentation/blob/master/rfcs/007-background-jobs.md))
+for centralized scheduling and monitoring of replication jobs.
+
+## Abstract
+
+The CouchDB replicator is the application which runs replication jobs.
+Replication jobs can be created from documents in `_replicator` databases, or
+by `POST`-ing requests to the HTTP `/_replicate` endpoint. Previously, in
+CouchDB 2 and 3, replication jobs were mapped to individual cluster nodes and a
+scheduler component would run up to `max_jobs` replication jobs at a time on
+each node. The new design proposes using `couch_jobs`, as described in this
+[RFC](https://github.com/apache/couchdb-documentation/blob/master/rfcs/007-background-jobs.md),
+to have a central, FDB-based queue of replication jobs. The `couch_jobs`
+application would manage job scheduling and coordination. The new design also
+proposes using heterogeneous node types, as defined in this
+[RFC](https://github.com/apache/couchdb-documentation/blob/master/rfcs/013-node-types.md),
+such that replication jobs will be created only on `api_frontend` nodes and
+run only on `replication` nodes.
+
+## Requirements Language
+
+The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD",
+"SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be
+interpreted as described in [RFC
+2119](https://www.rfc-editor.org/rfc/rfc2119.txt).
+
+## Terminology
+
+`_replicator` databases : Databases which are either named `_replicator` or
+end with the `/_replicator` suffix.
+
+`transient` replications : Replication jobs created by `POST`-ing to the
+`/_replicate` endpoint.
+
+`persistent` replications : Replication jobs created from a document in a
+`_replicator` database.
+
+`continuous` replications : Replication jobs created with the `"continuous":
+true` parameter. When such a job reaches the end of the changes feed, it will
+continue waiting for new changes in a loop until it is removed by the user.
+
+`normal` replications : Replication jobs which are not `continuous`. If the
+`"continuous": true` parameter is not specified, replication jobs default to
+`normal`.
+
+`api_frontend node` : Database node which has the `api_frontend` type set to
+`true` as described in this
+[RFC](https://github.com/apache/couchdb-documentation/blob/master/rfcs/013-node-types.md).
+Replication jobs can only be created on these nodes.
+
+`replication node` : Database node which has the `replication` type set to
+`true` as described in this
+[RFC](https://github.com/apache/couchdb-documentation/blob/master/rfcs/013-node-types.md).
+Replication jobs can only be run on these nodes.
+
+`filtered` replications : Replications which use a user-defined filter on the
+source endpoint to filter the source's changes feed.
+
+`replication_id` : A unique ID defined for replication jobs which is a hash of
+ the source and target endpoints, some of the options, and for filtered
+ replications, the contents of the filter from the source endpoint. The
+ replication ID will change if, for example, a filter on the source endpoint
+ changes. Computing this value may require a network round-trip to the source
+ endpoint.
+
+`job_id` : A replication job ID derived from the database and document IDs for
+persistent replications, and from the source and target endpoints, the user
+name, and some options for transient replications. Computing a `job_id`,
+unlike a `replication_id`, doesn't require making any network requests. A
+filtered replication with a given `job_id` may change its `replication_id`
+during its lifetime when the filter contents change on the source.
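+
+To make the distinction concrete, below is a minimal, hypothetical sketch of
+how the two IDs could be derived. The hash inputs follow the definitions
+above; the exact fields, the hash function, and the `fetch_filter/1` helper
+are illustrative assumptions, not the actual implementation.
+
+```erlang
+%% job_id for a persistent replication: derived from local information only
+%% (db name + doc id), so no network calls are needed.
+job_id(DbName, DocId) ->
+    couch_util:to_hex(crypto:hash(sha256, term_to_binary({DbName, DocId}))).
+
+%% replication_id additionally hashes the filter contents, which live on the
+%% source endpoint, so computing it may require an HTTP round-trip.
+replication_id(Source, Target, Options) ->
+    Filter = fetch_filter(Source),
+    Term = {Source, Target, Options, Filter},
+    couch_util:to_hex(crypto:hash(sha256, term_to_binary(Term))).
+
+%% Hypothetical helper standing in for the HTTP request which fetches the
+%% user-defined filter function from the source endpoint.
+fetch_filter(_Source) ->
+    <<"function(doc, req) { return true; }">>.
+```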
+
+`max_jobs` : Configuration parameter which specifies up to how many
+replication jobs to run on each `replication` node.
+
+`max_churn` : Configuration parameter which specifies up to how many new jobs
+to spawn during a configurable scheduling interval.
+
+`min_backoff_penalty` : Configuration parameter specifying the minimum (the
+base) penalty applied to jobs which crash repeatedly.
+
+`max_backoff_penalty` : Configuration parameter specifying the maximum penalty
+applied to jobs which crash repeatedly.
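+
+For illustration, these parameters could be set in the config as shown below.
+The parameter names come from this document; the `[replicator]` section name
+matches the <= 3.x replicator, and the values and units shown are only
+examples, not proposed defaults:
+
+```ini
+[replicator]
+; maximum number of replication jobs running on each replication node
+max_jobs = 500
+; maximum number of new jobs to spawn per scheduling interval
+max_churn = 20
+; base penalty for jobs which crash repeatedly
+min_backoff_penalty = 32000
+; cap on the backoff penalty for crashing jobs
+max_backoff_penalty = 172800000
+```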
+
+---
+
+# Detailed Description
+
+The replicator application's job creation and scheduling work roughly as
+follows:
+
+ 1) `Persistent` jobs are created by document updates in `_replicator` dbs and
+ `transient` jobs are created by `POST`-ing to the `/_replicate` endpoint. In
+ both cases a new `couch_jobs` job is created in a separate `couch_jobs`
+ replication subspace. This happens on `api_frontend` nodes only. These are
+ also the nodes which process all the regular HTTP CRUD requests.
+
+   a) Persistent jobs are driven by an EPI callback mechanism which notifies
+   the `couch_replicator` application when documents in `_replicator` dbs are
+   updated, and when `_replicator` dbs are created or deleted.
+
+   b) Transient jobs are created from the `/_replicate` HTTP handler directly.
+
+ In both cases, newly created jobs start out in the `pending` state.
+
+ 2) Each `replication` node spawns a small (configurable) number of acceptor
+ processes which wait in `couch_jobs:accept/2` for jobs scheduled to run at a
+ time less than or equal to the current time.
+
+ 3) After a job is accepted, its state is updated to `running`, and then a
+ gen_server process monitoring these replication jobs will spawn another
+ acceptor to replace it. That happens until `max_jobs` running jobs have
+ been spawned. The same gen_server will periodically check if there are any
+ pending jobs in the queue, and spawn up to some `max_churn` number of new
+ acceptors. These new acceptors may start new jobs, and if they do, for each
+ one of them, the oldest running job will be stopped and re-enqueued as
+ pending. This largely follows the logic of the replication scheduler in
+ CouchDB <= 3.x, except that it uses `couch_jobs` as the central queueing and
+ scheduling mechanism.
+
+ 4) After the job is marked as `running`, it computes its `replication_id`,
+ initializes an internal replication state record from the JSON job data,
+ and starts replicating.
+
+ 5) As the job runs, it periodically recomputes its `replication_id` when it
+ checkpoints. In the case of filtered replications the `replication_id` may
+ change. In that case the job is stopped and moved to the `pending` state so it
+ can be restarted with the new `replication_id`. Also, during checkpointing
+ attempts, the `couch_jobs` job data is updated such that the job stays active
+ and doesn't get re-enqueued by the `couch_jobs` activity monitor.
+
+ 6) If the job crashes, it reschedules itself in `gen_server:terminate/2` via
+ `couch_jobs:resubmit/3` to run at some future time, defined roughly as `now +
+ min(min_backoff_penalty * 2^consecutive_errors, max_backoff_penalty)`. To
+ avoid permanently penalizing jobs which had intermittent errors in the past
+ but have since "healed", the `consecutive_errors` count is reset back to 0
+ if the job runs successfully for some period of time, configurable via
+ `health_threshold`. A sketch of this accept/run/reschedule cycle is shown
+ after this list.
+
+ 7) If the node where a replication job runs crashes, or the job is manually
+ killed via `exit(Pid, kill)`, the `couch_jobs` activity monitor will
+ automatically re-enqueue the job as `pending`.
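+
+Below is a minimal sketch of the accept/run/reschedule cycle from steps 2, 4
+and 6 above. The `couch_jobs` calls mirror the API named in this document and
+in the background jobs RFC; the option map, the job type constant, the error
+count field name, the penalty values, and the helper function are illustrative
+assumptions, not the actual implementation:
+
+```erlang
+-module(replicator_cycle_sketch).
+-export([acceptor_loop/0]).
+
+-define(REP_JOBS, <<"replication_jobs">>).   %% assumed job type
+-define(MIN_PENALTY_SEC, 32).                %% assumed min_backoff_penalty
+-define(MAX_PENALTY_SEC, 86400).             %% assumed max_backoff_penalty
+
+%% Step 2: block in couch_jobs:accept/2 until a job scheduled to run at a
+%% time less than or equal to the current time becomes available. A new
+%% acceptor replacing this one would be spawned by the monitoring gen_server.
+acceptor_loop() ->
+    {ok, Job, JobData} = couch_jobs:accept(?REP_JOBS, #{}),
+    try
+        %% Step 4: compute the replication_id, build the internal state
+        %% record and start replicating (placeholder below).
+        run_replication(Job, JobData)
+    catch
+        _Tag:_Err ->
+            %% Step 6: reschedule with an exponential backoff, capped at
+            %% the maximum penalty.
+            Errors = maps:get(<<"consecutive_errors">>, JobData, 0) + 1,
+            Penalty = min(?MIN_PENALTY_SEC * (1 bsl Errors), ?MAX_PENALTY_SEC),
+            RetryAt = erlang:system_time(second) + Penalty,
+            couch_jobs:resubmit(undefined, Job, RetryAt)
+    end.
+
+run_replication(_Job, _JobData) ->
+    ok.
+```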
+
+## Replicator Job States
+
+### Description
+
+The new replicator in CouchDB 4.x will have fewer replication job states than
+before. The new set of states will be:
+
+ * `pending` : A job is marked as `pending` in these cases:
+    - As soon as a job is created from an `api_frontend` node
+    - When a job is stopped to let other replication jobs run
+    - When a filtered replication notices its `replication_id` has changed
+
+ * `running` : The job was accepted by the `couch_jobs:accept/2` call. This
+   generally means the job is actually running on a node; however, if a node
+   crashes, the job may still indicate `running` on that node until the
+   `couch_jobs` activity monitor re-enqueues the job.
+
+ * `crashing` : The job was running but crashed with an intermittent error.
+   The job data has an error count which is bumped, a backoff penalty is
+   computed, and the job is rescheduled to try again at some point in the
+   future.
+
+ * `completed` : Normal replications which have completed
+
+ * `failed` : This can happen when:
+    - A replication job could not be parsed from a replication document or a
+      `/_replicate` POST body. For example, if the user has not specified a
+      `"source"` field.
+    - There is another persistent replication job running or pending with the
+      same `replication_id`.
+
+
+### Differences from CouchDB <= 3.x
+
+The differences from before are:
+ * The `initializing` state was folded into `pending`
+ * The `error` state was folded into `crashing`
+
+
+### Mapping Between couch_jobs States and Replication States
+
+The mapping from these states to `couch_jobs`'s states of (`pending`,
+`running`, `finished`) looks as follows:
+
+replicator state | couch_jobs state
+-----------------|-----------------
+pending          | pending
+running          | running
+crashing         | pending
+completed        | finished
+failed           | finished
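+
+In code, this mapping could be expressed as a simple function, sketched below.
+The finer-grained replicator state would live in the job's data, while
+`couch_jobs` itself tracks only its three states:
+
+```erlang
+couch_jobs_state(pending)   -> pending;
+couch_jobs_state(running)   -> running;
+couch_jobs_state(crashing)  -> pending;   %% re-enqueued to retry later
+couch_jobs_state(completed) -> finished;
+couch_jobs_state(failed)    -> finished.
+```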
+
+### State Transition Diagram
+
+```
++-----+       +-------+
+|start+------>+pending|
++-----+       +-------+
+                  ^
+                  |
+                  |
+                  v
+              +---+---+      +--------+
+    +---------+running+<---->+crashing|
+    |         +---+---+      +--------+
+    |             |
+    |             |
+    v             v
++------+     +---------+
+|failed|     |completed|
++------+     +---------+
+```
+
+
+## Replication ID and Job ID Mapping
+
+Multiple replication jobs may specify replications which map to the same
+`replication_id`. To handle duplicates and collisions there is a `(...,
+LayerPrefix, ?REPLICATION_IDS, ReplicationID) -> JobID` subspace which keeps
+track of mappings from `replication_id` to `job_id`. After the `replication_id`
+is computed, each replication job checks if there is already another job
+pending or running with the same `replication_id`. If the other job is
+transient, then the current job will reschedule itself as `crashing`. If the
+other job is persistent, the current job will fail permanently as `failed`.
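+
+A hedged sketch of that duplicate check, using the `erlfdb` tuple layer, is
+shown below. The subspace layout follows the description above; the
+`?REPLICATION_IDS` value, the transaction plumbing, and the return values are
+illustrative assumptions:
+
+```erlang
+-define(REPLICATION_IDS, 100).  %% assumed subspace tag
+
+check_replication_id(Tx, LayerPrefix, RepId, JobId) ->
+    Key = erlfdb_tuple:pack({?REPLICATION_IDS, RepId}, LayerPrefix),
+    case erlfdb:wait(erlfdb:get(Tx, Key)) of
+        not_found ->
+            %% No other job owns this replication_id yet; claim it.
+            erlfdb:set(Tx, Key, JobId),
+            ok;
+        JobId ->
+            %% The mapping already points at this very job.
+            ok;
+        OtherJobId ->
+            %% Another job owns it: reschedule as `crashing` if that job is
+            %% transient, or fail permanently if it is persistent.
+            {error, {duplicate_replication_id, OtherJobId}}
+    end.
+```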
+
+## Transient Jobs Behavior
+
+In CouchDB <= 3.x transient replication jobs ran in memory on the nodes where
+they were assigned to run. If the node where the replication job ran crashed,
+the job would simply disappear without a trace. It was up to the user's code to
+periodically monitor the job status and re-create the job. With the current
+design, `transient` jobs are saved as `couch_jobs` records in FDB and so will
+survive dbcore nodes crashing and restarting. Technically, the new design makes
+it possible to keep transient job results even for failed and completed jobs,
+perhaps for some configurable amount of time, to allow users to retrieve their
+status after they stop running. However, in the initial implementation, in
+order to keep maximum compatibility with the <= 3.x replicator, transient jobs
+will be removed when they fail or complete.
+
+
+## Monitoring Endpoints
+
+The `_active_tasks`, `_scheduler/jobs` and `_scheduler/docs` endpoints are
+handled by traversing the replication jobs' data using a new
+`couch_jobs:fold_jobs/4` function and returning subsets of the jobs' data. The
+`_active_tasks` implementation already works that way and the `_scheduler/*`
+endpoints will work similarly.
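+
+A rough sketch of how an endpoint could gather the jobs' data with the new
+fold function follows. The callback arity and the shape of the accumulated
+data are assumptions, since `couch_jobs:fold_jobs/4` is newly proposed here:
+
+```erlang
+scheduler_jobs() ->
+    FoldFun = fun(_JTx, JobId, JobState, JobData, Acc) ->
+        %% Keep only the subset of the job's data needed for the response.
+        [#{id => JobId, state => JobState, info => JobData} | Acc]
+    end,
+    Jobs = couch_jobs:fold_jobs(undefined, <<"replication_jobs">>, FoldFun, []),
+    lists:reverse(Jobs).
+```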
+
+
+# Advantages and Disadvantages
+
+Advantages:
+
+ * Simplicity: re-using `couch_jobs` means having a lot less code to maintain
+   in `couch_replicator`
+
+ * Simpler endpoint and monitoring implementations
+
+ * Fewer replication job states to keep track of
+
+ * Improved behavior for transient replications; they now survive node
+   restarts
+
+ * Using node types allows tightening firewall rules such that only
+   `replication` nodes may make arbitrary requests outside the cluster.
+
+Disadvantages:
+
+ * Behavior changes for transient jobs
+
+ * A centralized job queue might mean handling some number of conflicts
+   generated in the FDB backend when jobs are accepted. These are mitigated
+   using an `accept_jitter` configuration parameter and a configurable number
+   of max acceptors per node.
+
+ * In monitoring API responses, the `running` job state might not immediately
+   reflect the state of the actual process running on the remote node. If the
+   node crashes, it might take up to a minute or two until the job is
+   re-enqueued by the `couch_jobs` activity monitor.
+
+# Key Changes
+
+ * Behavior changes for transient jobs
+
+ * A delay in `running` state as reflected in monitoring API responses
+
+## Applications and Modules affected
+
+ * couch_jobs : API to fold jobs and get a pending jobs count estimate
+ * fabric2_db : Adding EPI db create/delete callbacks
+ * couch_replicator :
+    - Remove `couch_replicator_scheduler*` modules
+    - Remove `couch_replicator_doc_processor_*` modules
+    - `couch_replicator` : handles creating jobs and serves as the API entry point for the application
+    - `couch_replicator_job` : runs each replication job
+    - `couch_replicator_job_server` : simpler replication job monitoring gen_server
+    - `couch_replicator_parse` : parses replication document and HTTP `_replicate` POST bodies
+
+## HTTP API additions
+
+N/A
+
+## HTTP API deprecations
+
+N/A
+
+# Security Considerations
+
+None
+
+# References
+
+[couch_jobs RFC](https://github.com/apache/couchdb-documentation/blob/master/rfcs/007-background-jobs.md)
+[node_types RFC](https://github.com/apache/couchdb-documentation/blob/master/rfcs/013-node-types.md)
+[CouchDB 3.x replicator implementation](https://github.com/apache/couchdb/blob/3.x/src/couch_replicator/README.md)
+
+# Co-authors
+
+ * @davisp
+
+# Acknowledgements
+
+ * @davisp