Posted to notifications@couchdb.apache.org by GitBox <gi...@apache.org> on 2020/08/13 03:58:18 UTC

[GitHub] [couchdb-documentation] nickva commented on a change in pull request #581: [RFC] Replicator Implementation for CouchDB 4.x

nickva commented on a change in pull request #581:
URL: https://github.com/apache/couchdb-documentation/pull/581#discussion_r469682126



##########
File path: rfcs/016-fdb-replicator.md
##########
@@ -0,0 +1,337 @@
+---
+name: Formal RFC
+about: Submit a formal Request For Comments for consideration by the team.
+title: 'Replicator Implementation On FDB'
+labels: rfc, discussion
+assignees: ''
+
+---
+
+# Introduction
+
+This document describes the design of the replicator application for CouchDB
+4.x. The replicator will rely on `couch_jobs`
+[RFC](https://github.com/apache/couchdb-documentation/blob/master/rfcs/007-background-jobs.md)
+for centralized scheduling and monitoring of replication jobs.
+
+## Abstract
+
+CouchDB replicator is the CouchDB application which runs replication jobs.
+Replication jobs can be created from documents in `_replicator` databases, or
+by `POST`-ing requests to the HTTP `/_replicate` endpoint. Previously, in
+CouchDB 2 and 3, replication jobs were mapped to individual cluster nodes and a
+scheduler component would run up to `max_jobs` replication jobs at a
+time on each node. The new design proposes using `couch_jobs`, as described in
+this
+[RFC](https://github.com/apache/couchdb-documentation/blob/master/rfcs/007-background-jobs.md),
+to have a central, FDB-based queue of replication jobs. The `couch_jobs`
+application would manage job scheduling and coordination. The new design also
+proposes using heterogeneous node types, as defined in this
+[RFC](https://github.com/apache/couchdb-documentation/blob/master/rfcs/013-node-types.md),
+such that replication jobs will be created only on `api_frontend` nodes and
+run only on `replication` nodes.
+
+## Requirements Language
+
+The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD",
+"SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be
+interpreted as described in [RFC
+2119](https://www.rfc-editor.org/rfc/rfc2119.txt).
+
+## Terminology
+
+`_replicator` databases : Databases that are either named `_replicator` or end
+with the `/_replicator` suffix.
+
+`transient` replications : Replication jobs created by `POST`-ing to the
+`/_replicate` endpoint.
+
+`persistent` replications : Replication jobs created from a document in a
+`_replicator` database.
+
+`continuous` replications : Replication jobs created with the `"continuous":
+true` parameter. When this job reaches the end of the changes feed it will
+continue waiting for new changes in a loop until the user removes the job.
+
+`normal` replications : Replication jobs which are not `continuous`. If the
+`"continuous":true` parameter is not specified, by default, replication jobs
+will be `normal`.
+
+`api_frontend node` : Database node which has the `api_frontend` type set to
+`true` as described in this
+[RFC](https://github.com/apache/couchdb-documentation/blob/master/rfcs/013-node-types.md).
+Replication jobs can only be created on these nodes.
+
+`replication node` : Database node which has the `replication` type set to
+`true` as described in this
+[RFC](https://github.com/apache/couchdb-documentation/blob/master/rfcs/013-node-types.md).
+Replication jobs can only be run on these nodes.
+
+`filtered` replications: Replications which use a user-defined filter on the
+source endpoint to filter its changes feed.
+
+`replication_id` : A unique ID defined for replication jobs which is a hash of
+ the source and target endpoints, some of the options, and, for filtered
+ replications, the contents of the filter from the source endpoint. The
+ replication ID will change if, for example, a filter on the source endpoint
+ changes. Computing this value may require a network round-trip to the source
+ endpoint.
+
+`job_id` : A replication job ID derived from the database and document IDs for
+persistent replications, and from the source and target endpoints, user name,
+and some options for transient replications. Computing a `job_id`, unlike a
+`replication_id`, doesn't require making any network requests. A filtered
+replication with a given `job_id` may, during its lifetime, change its
+`replication_id` when the filter contents change on the source.
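+
+To make the distinction concrete, here is a minimal Erlang sketch. It is
+illustrative only: the hashing scheme, the input shapes, and the
+`repl_ids_sketch` module name are assumptions for this example, not the
+actual implementation.
+
+```erlang
+%% Sketch: job_id needs only locally available inputs, while replication_id
+%% also folds in the filter code fetched from the source endpoint (hence the
+%% possible network round-trip).
+-module(repl_ids_sketch).
+-export([job_id/2, replication_id/4]).
+
+job_id(DbName, DocId) ->
+    crypto:hash(sha256, term_to_binary({DbName, DocId})).
+
+replication_id(Source, Target, Options, FilterCode) ->
+    %% FilterCode is fetched from the source endpoint by the caller.
+    crypto:hash(sha256, term_to_binary({Source, Target, Options, FilterCode})).
+```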
+
+`max_jobs` : Configuration parameter which specifies up to how many
+replication jobs to run on each `replication` node.
+
+`max_churn` : Configuration parameter which specifies up to how many new jobs
+to spawn during a configurable scheduling interval.
+
+`min_backoff_penalty` : Configuration parameter specifying the minimum (base)
+penalty applied to jobs which crash repeatedly.
+
+`max_backoff_penalty` : Configuration parameter specifying the maximum penalty
+applied to jobs which crash repeatedly.
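+
+For illustration, a small Erlang sketch of how a scheduler process might read
+these parameters. The `"replicator"` config section name, the exact key names,
+and the default values are assumptions here, not settings defined by this RFC.
+
+```erlang
+%% Hypothetical sketch: fetch scheduling parameters from the config app.
+scheduling_params() ->
+    #{max_jobs    => config:get_integer("replicator", "max_jobs", 500),
+      max_churn   => config:get_integer("replicator", "max_churn", 20),
+      min_backoff => config:get_integer("replicator", "min_backoff_penalty_sec", 32),
+      max_backoff => config:get_integer("replicator", "max_backoff_penalty_sec", 86400)}.
+```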
+
+---
+
+# Detailed Description
+
+The replicator application's job creation and scheduling work roughly as follows.
+
+ 1) `Persistent` jobs are created by document updates in `_replicator` dbs and
+ `transient` jobs are created by POST-ing to the `/_replicate` endpoint. In
+ both cases a new `couch_jobs` job is created in a separate `couch_jobs`
+ replication subspace. This happens on `api_frontend` nodes only. These are
+ also the nodes which process all the regular HTTP CRUD requests.
+
+   a) Persistent jobs are driven by an EPI callback mechanism which notifies
+   the `couch_replicator` application when documents in `_replicator` dbs are
+   updated, or when `_replicator` dbs are created and deleted.
+
+   b) Transient jobs are created from the `_replicate` HTTP handler directly.
+
+ In both cases, newly created jobs start in the `pending` state.
+
+ 2) Each `replication` node spawns a small (configurable) number of acceptor
+ processes which wait in `couch_jobs:accept/2` for jobs scheduled to run at a
+ time less than or equal to the current time.
+
+ 3) After a job is accepted, its state is updated to `running`, and then a
+ gen_server process monitoring these replication jobs will spawn another
+ acceptor to replace it. That happens until `max_jobs` running jobs have
+ been spawned. The same gen_server will periodically check if there are any
+ pending jobs in the queue, and spawn up to `max_churn` new
+ acceptors. These new acceptors may start new jobs, and if they do, for each
+ one of them, the oldest running job will be stopped and re-enqueued as
+ pending. This largely follows the logic of the replication scheduler in
+ CouchDB <= 3.x, except that it uses `couch_jobs` as the central queueing and
+ scheduling mechanism.
+
+ 4) After the job is marked as `running`, it computes its `replication_id`,
+ initializes an internal replication state record from the JSON job data, and
+ starts replicating.
+
+ 5) As the job runs and checkpoints, it periodically recomputes its
+ `replication_id`. In the case of filtered replications the `replication_id`
+ may change. In that case the job is stopped and moved to the `pending` state
+ so it can be restarted with the new `replication_id`. Also during
+ checkpointing attempts, the `couch_jobs` job data is updated so that the job
+ stays active and doesn't get re-enqueued by the `couch_jobs` activity
+ monitor.
+
+ 6) If the job crashes, in `gen_server:terminate/2` it reschedules itself via
+ `couch_jobs:resubmit/3` to run at some future time, defined roughly as `now +
+ min(min_backoff_penalty * 2^consecutive_errors, max_backoff_penalty)` (see
+ the sketch after this list). In order to avoid penalizing jobs which had
+ intermittent errors in the past but have since "healed", the
+ `consecutive_errors` count is reset back to 0 if the job runs successfully
+ for some period of time, configurable as `health_threshold`.
+
+ 7) If the node where the replication job runs crashes, or the job is manually
+ killed via `exit(Pid, kill)`, the `couch_jobs` activity monitor will
+ automatically re-enqueue the job as `pending`.
+
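+A rough Erlang sketch of the backoff logic from step 6. Only
+`couch_jobs:resubmit/3` and the penalty formula come from this document; the
+function and argument names below are illustrative assumptions.
+
+```erlang
+%% Sketch: exponential backoff, doubling the base penalty per consecutive
+%% error and capping it at max_backoff_penalty.
+reschedule_on_error(JTx, Job, ErrCount, MinPenalty, MaxPenalty) ->
+    Penalty = min(MaxPenalty, MinPenalty * (1 bsl ErrCount)),
+    NowSec = erlang:system_time(second),
+    couch_jobs:resubmit(JTx, Job, NowSec + Penalty).
+
+%% Sketch: forget past intermittent errors once the job has run successfully
+%% for longer than health_threshold.
+maybe_reset_errors(ErrCount, LastStartSec, HealthThresholdSec) ->
+    case erlang:system_time(second) - LastStartSec >= HealthThresholdSec of
+        true  -> 0;
+        false -> ErrCount
+    end.
+```
+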
+## Replicator Job States
+
+### Description
+
+The new replicator in CouchDB 4.x will have fewer replication job states than
+before. The new set of states will be:
+
+ * `pending` : A job is marked as `pending` in these cases:
+    - As soon as a job is created from an `api_frontend` node
+    - Job is stopped to let other replication jobs run
+    - A filtered replication noticed its `replication_id` has changed
+
+ * `running` : The job was accepted by the `couch_jobs:accept/2` call. This
+   generally means the job is actually running on a node; however, if a node
+   crashes right after accepting a job, the job may still appear as `running`
+   on that node until the `couch_jobs` activity monitor re-enqueues it.
+
+ * `crashing` : The job was running, but then crashed with an intermittent
+   error. The job data has an error count which is bumped, a backoff penalty
+   is computed, and the job is rescheduled to try again at some point in the
+   future.
+
+ * `completed` : Normal replications which have completed
+
+ * `failed` : This can happen when:
+    - A replication job could not be parsed from a replication document or a
+      `/_replicate` POST body. For example, the user has not specified a
+      `"source"` field.
+    - There is another persistent replication job running or pending with the
+      same `replication_id`.
+
+
+### Differences from CouchDB <= 3.x
+
+The differences from before are:
+ * `initializing` state was folded into `pending`
+ * `error` state folded into `crashing`
+
+
+### Mapping Between couch_jobs States and Replication States
+
+The mapping from these states to the `couch_jobs` states (`pending`,
+`running`, `finished`) is as follows:
+
+| replicator state | couch_jobs state |
+|------------------|------------------|
+| pending          | pending          |
+| running          | running          |
+| crashing         | pending          |
+| completed        | finished         |
+| failed           | finished         |
+
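+In code, this mapping could be expressed as a simple function; the atom names
+below follow the table and are illustrative.
+
+```erlang
+%% Sketch: map a replicator job state to the couch_jobs state it is stored as.
+to_couch_jobs_state(pending)   -> pending;
+to_couch_jobs_state(running)   -> running;
+to_couch_jobs_state(crashing)  -> pending;
+to_couch_jobs_state(completed) -> finished;
+to_couch_jobs_state(failed)    -> finished.
+```
+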
+### State Transition Diagram
+
+```
++-----+       +-------+
+|start+------>+pending|
++-----+       +-------+
+                  ^
+                  |
+                  |
+                  v
+              +---+---+      +--------+
+    +---------+running+----->|crashing|
+    |         +---+---+      +--------+
+    |             ^
+    |             |
+    v             v
++------+     +---------+
+|failed|     |completed|
++------+     +---------+
+```
+
+
+## Replication ID and Job IDs mapping
+
+Multiple replication jobs may specify replications which map to the same
+`replication_id`. To handle duplicates and collisions there is a `(...,
+LayerPrefix, ?REPLICATION_IDS, ReplicationID) -> JobID` subspace which keeps
+track of mappings from `replication_id` to `job_id`. After the `replication_id`
+is computed, each replication job checks if there is already another job
+pending or running with the same `replication_id`. If the other job is
+transient, then the current job will reschedule itself as `crashing`. If the
+other job is persistent, the current job will fail permanently as `failed`.
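+
+A sketch of that duplicate check in Erlang. The `(..., LayerPrefix,
+?REPLICATION_IDS, ReplicationID)` key layout is from this document, but the
+`erlfdb` calls and the `check_duplicate/4` helper shown here are illustrative
+assumptions about how the mapping could be read and claimed.
+
+```erlang
+%% Sketch: look up the replication_id mapping; claim it if free, otherwise
+%% report the conflicting job so the caller can pick `crashing` or `failed`.
+check_duplicate(Tx, LayerPrefix, RepId, JobId) ->
+    Key = erlfdb_tuple:pack({?REPLICATION_IDS, RepId}, LayerPrefix),
+    case erlfdb:wait(erlfdb:get(Tx, Key)) of
+        not_found ->
+            erlfdb:set(Tx, Key, JobId),
+            ok;
+        JobId ->
+            ok;  % the mapping already points at this job
+        OtherJobId ->
+            {error, {duplicate_replication_id, OtherJobId}}
+    end.
+```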
+
+## Transient Jobs Behavior
+
+In CouchDB <= 3.x transient replication jobs ran in memory on the nodes they
+were assigned to. If the node where the replication job ran crashed, the job
+would simply disappear without a trace. It was up to the user's code to
+periodically monitor the job status and re-create the job. With the current
+design, `transient` jobs are saved as `couch_jobs` records in FDB and so would
+survive dbcore nodes crashing and restarting. Technically, with the new design
+it becomes possible to keep transient job results, even for failed and
+completed jobs, perhaps for some configurable amount of time, to allow users
+to retrieve their status after they stop running. However, in the initial
+implementation, in order to keep maximum compatibility with the <= 3.x
+replicator, transient jobs will be removed when they fail and when they
+complete.
+
+
+## Monitoring Endpoints
+
+The `_active_tasks`, `_scheduler/jobs` and `_scheduler/docs` endpoints are
+handled by traversing the replication jobs' data using a new
+`couch_jobs:fold_jobs/4` function and returning subsets of that data. The
+`_active_tasks` implementation already works that way and the `_scheduler/*`
+endpoints will work similarly.
+
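+A sketch of how `_scheduler/jobs` might be assembled on top of the proposed
+fold. The RFC only names `couch_jobs:fold_jobs/4`; the argument order, the
+shape of the fold callback, the `?REP_JOBS` type macro, and `format_job/3`
+are all hypothetical here.
+
+```erlang
+%% Sketch: accumulate a formatted entry per replication job.
+scheduler_jobs() ->
+    FoldFun = fun(_JTx, JobId, JobState, JobData, Acc) ->
+        [format_job(JobId, JobState, JobData) | Acc]
+    end,
+    couch_jobs:fold_jobs(undefined, ?REP_JOBS, FoldFun, []).
+```
+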
+
+# Advantages and Disadvantages
+
+Advantages:
+
+ * Simplicity: re-using `couch_jobs` means having a lot less code to maintain
+   in `couch_replicator`
+
+ * Simpler endpoint and monitoring implementations
+
+ * Fewer replication job states to keep track of
+
+ * Improved behavior for transient replications: they now survive node

Review comment:
       Thanks!
   




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org