Posted to notifications@couchdb.apache.org by GitBox <gi...@apache.org> on 2019/06/03 19:15:07 UTC

[GitHub] [couchdb-documentation] davisp commented on a change in pull request #409: RFC for CouchDB background workers

URL: https://github.com/apache/couchdb-documentation/pull/409#discussion_r289965995
 
 

 ##########
 File path: rfcs/007-background-jobs.md
 ##########
 @@ -0,0 +1,350 @@
+---
+name: Formal RFC
+about: Submit a formal Request For Comments for consideration by the team.
+title: 'Background jobs with FoundationDB'
+labels: rfc, discussion
+assignees: ''
+
+---
+
+[NOTE]: # ( ^^ Provide a general summary of the RFC in the title above. ^^ )
+
+# Introduction
+
+This document describes a data model, implementation, and an API for running
+CouchDB background jobs with FoundationDB.
+
+## Abstract
+
+CouchDB background jobs are used for things like index building, replication
+and couch-peruser processing. We present a generalized model which allows
+creation, running, and monitoring of these jobs.
+
+The document first describes the framework API in Erlang pseudo-code, then
+presents the data model, followed by the implementation details.
+
+## Requirements Language
+
+[NOTE]: # ( Do not alter the section below. Follow its instructions. )
+
+The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
+"SHOULD", "SHOULD NOT", "RECOMMENDED",  "MAY", and "OPTIONAL" in this
+document are to be interpreted as described in
+[RFC 2119](https://www.rfc-editor.org/rfc/rfc2119.txt).
+
+## Terminology
+
+---
+
+`Job`: A unit of work, identified by a `JobId` and also having a `Type`.
+
+`Worker`: A language-specific execution unit that runs the job. It could be an
+Erlang process, a thread, or just a function.
+
+`Job table`: An FDB subspace holding the list of jobs.
+
+`Pending job`: A job that is waiting to run.
+
+`Pending queue`: A queue of pending jobs ordered by priority.
+
+`Running job`: A job which is currently executing. To be considered "running"
+the worker must periodically update the job's state in the global job table.
+
+`Priority`: A job's priority specifies its order in the pending queue. Priority
+can be any term that can be encoded as a key in FoundationDB's tuple layer. The
+exact value of `Priority` is job-type specific. It MAY be a rough timestamp, a
+`Sequence`, a list of tags, etc.
+
+`Job re-submission`: Re-submitting a job means putting a previously running
+job back into the pending queue.
+
+`Activity monitor`: Functionality implemented by the framework which checks
+job liveness (activity). If workers don't update their status often enough,
+the activity monitor will re-enqueue their jobs as pending. This ensures jobs
+make progress even if some workers terminate unexpectedly.
+
+`JobState`: Describes the current state of the job. The possible values are
+`"running"`, `"pending"`, and `"finished"`. These are the minimal set of
+states needed to describe a job's behavior with respect to this framework. Each
+job type MAY have additional, type-specific states, such as `"failed"`,
+`"error"`, `"retrying"`, etc.
+
+`Sequence`: A 13-byte value formed by combining the current `Incarnation` of
+the database and the `Versionstamp` of the transaction. Sequences are
+monotonically increasing even when a database is relocated across FoundationDB
+clusters. See RFC002 for a full explanation.
+
+---
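Why combining `Incarnation` with `Versionstamp` keeps sequences monotonic can be illustrated with a small Python sketch. The byte layout below (one Incarnation byte followed by a 12-byte tuple-layer Versionstamp, i.e. a 10-byte commit version plus a 2-byte user version) is an assumption chosen only so the total is 13 bytes; RFC002 is the authority on the real encoding, and `make_sequence` is a hypothetical helper.

```python
import struct


def make_sequence(incarnation: int, commit_version: int, user_version: int = 0) -> bytes:
    """Build a 13-byte Sequence under an ASSUMED layout (not taken from RFC002):
    1-byte Incarnation + 10-byte commit version + 2-byte user version."""
    if not 0 <= incarnation <= 0xFF:
        raise ValueError("incarnation must fit in one byte in this sketch")
    # Big-endian fields make byte-wise comparison match numeric comparison,
    # and putting Incarnation first makes it the most significant part.
    return (
        bytes([incarnation])
        + commit_version.to_bytes(10, "big")
        + struct.pack(">H", user_version)
    )


# A database on its original cluster reaches a large commit version...
old_cluster_seq = make_sequence(incarnation=1, commit_version=9_000_000_000)
# ...then moves to a new cluster whose commit versions may be much lower.
# Bumping the Incarnation keeps the new Sequence greater than the old one.
new_cluster_seq = make_sequence(incarnation=2, commit_version=1_000)
print(new_cluster_seq > old_cluster_seq)  # True
```

Because Python compares `bytes` lexicographically, the comparison above behaves like an ordered key comparison, which is the property the pending and activity subspaces would rely on.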
+
+# Framework API
+
+This section describes the job creation and worker implementation APIs. It doesn't
+describe how the framework is implemented. The intended audience is CouchDB
+developers using this framework to implement background jobs for indexing,
+replication, and couch-peruser.
+
+Both the job creation and the worker implementation APIs use a `JobOpts` map to
+represent a job. It MAY contain the following top-level fields:
+
+  * `"priority"`: The `Priority` value of the job. `Priority` is job-type
+    specific.
+  * `"data"`: A map that is opaque from the framework's point of view and
+    contains job-type-specific data. It MAY contain an update sequence or an
+    error message, for example.
+  * `"cancel"`: Boolean field defaulting to `false`. If `true`, it indicates
+    that the user intends to stop the job's execution.
+  * `"resubmit"`: Boolean field defaulting to `false`. If `true`, it indicates
+    that the job should be re-submitted.
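A job-type implementation may want to apply the documented defaults before reading a `JobOpts` map. A minimal Python sketch, with `normalize_job_opts` as a hypothetical helper rather than part of the framework API: it fills in `"cancel"` and `"resubmit"` as `False`, leaves `"priority"` and `"data"` uninterpreted, and only checks that `"data"`, when present, is a map.

```python
from typing import Any, Dict

# Defaults for the optional boolean JobOpts fields.
_DEFAULTS: Dict[str, Any] = {"cancel": False, "resubmit": False}


def normalize_job_opts(opts: Dict[str, Any]) -> Dict[str, Any]:
    """Return a new map with defaults applied; the input is not modified."""
    normalized = dict(_DEFAULTS)
    normalized.update(opts)
    # "data" is opaque to the framework, so only its type is checked.
    if "data" in normalized and not isinstance(normalized["data"], dict):
        raise TypeError('"data" must be a map')
    for flag in ("cancel", "resubmit"):
        if not isinstance(normalized[flag], bool):
            raise TypeError(f'"{flag}" must be a boolean')
    return normalized


opts = normalize_job_opts({"priority": 5, "data": {"seq": 10}})
print(opts["cancel"], opts["resubmit"])  # False False
```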
+
+### Job Creation API
+
+```
+add(Type, JobId, JobOpts) -> ok | {error, Error}
+```
+ - Add a job to be executed by a background worker.
+
+```
+remove(Type, JobId) -> ok | not_found
+```
+ - Remove a job. If the job is running, it is stopped first and then removed
+   from the job table.
+
+```
+resubmit(Type, JobId) -> ok | not_found
+```
+ - Indicates that the job should be re-submitted for execution.
+
+```
+get_job(Type, JobId) -> {ok, JobOpts, JobState}
+```
+ - Return the job's `JobOpts` and its `JobState`. `JobState` MAY be one of:
+   * `"pending"`: The job is waiting to run.
+   * `"running"`: The job is currently running.
+   * `"finished"`: The job has finished running and is not pending.
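The creation API can be modelled with a toy, single-process job table to make the return values concrete. This Python sketch is not the framework's implementation: `ToyJobTable` is hypothetical, Erlang atoms such as `ok` and `not_found` are represented as strings, and two behaviors the RFC leaves open are assumptions here: `add()` on an existing `JobId` replaces it as pending, and `get_job()` on a missing job raises `KeyError`.

```python
from typing import Any, Dict, List, Tuple


class ToyJobTable:
    """Single-process stand-in for the job table; not backed by FoundationDB."""

    def __init__(self) -> None:
        # (Type, JobId) -> [JobOpts, JobState]
        self._jobs: Dict[Tuple[str, str], List[Any]] = {}

    def add(self, job_type: str, job_id: str, job_opts: Dict[str, Any]) -> str:
        # Assumption: re-adding an existing job replaces it as pending.
        self._jobs[(job_type, job_id)] = [dict(job_opts), "pending"]
        return "ok"

    def remove(self, job_type: str, job_id: str) -> str:
        # There are no workers in this toy; stopping a running job is
        # left to the real framework.
        if self._jobs.pop((job_type, job_id), None) is None:
            return "not_found"
        return "ok"

    def resubmit(self, job_type: str, job_id: str) -> str:
        entry = self._jobs.get((job_type, job_id))
        if entry is None:
            return "not_found"
        entry[0]["resubmit"] = True  # the flag described under JobOpts
        return "ok"

    def get_job(self, job_type: str, job_id: str) -> Tuple[str, Dict[str, Any], str]:
        # Assumption: a missing job raises KeyError.
        opts, state = self._jobs[(job_type, job_id)]
        return "ok", dict(opts), state


jobs = ToyJobTable()
jobs.add("views", "db1-ddoc1", {"priority": 0, "data": {}})
jobs.resubmit("views", "db1-ddoc1")
status, opts, state = jobs.get_job("views", "db1-ddoc1")
print(status, state, opts["resubmit"])  # ok pending True
removed = jobs.remove("views", "db1-ddoc1")
removed_again = jobs.remove("views", "db1-ddoc1")
print(removed, removed_again)  # ok not_found
```

The job type `"views"` and the job ID are illustrative values only.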
+
+### Worker Implementation API
+
+This API is used when implementing workers for the various job types. The
+general pattern is to call `accept()` from something like a job manager, spawn
+a worker to execute each accepted job, and then resume calling `accept()` to
+get other jobs. While a job is running, the worker MUST periodically call
+`update()` to prevent the activity monitor from re-enqueueing it. When the
+worker decides to stop running a job, it MUST call `finish()` to indicate that
+the job has finished running.
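The accept-and-spawn pattern described above might look like this in Python, using one thread per accepted job. `run_job_manager`, `fake_accept`, and `fake_worker` are hypothetical; returning `None` stands in for `not_found`, and a real manager would keep waiting for new jobs instead of exiting when the queue is empty.

```python
import threading
import uuid
from typing import Callable, List, Optional, Tuple

# Sketch stand-ins: accept() returns (JobId, WorkerLockId), or None in place of
# `not_found`; worker(JobId, WorkerLockId) runs a single job to completion.
AcceptFn = Callable[[], Optional[Tuple[str, str]]]
WorkerFn = Callable[[str, str], None]


def run_job_manager(accept: AcceptFn, worker: WorkerFn) -> List[threading.Thread]:
    """Accept jobs until none are pending, spawning one thread per job."""
    threads: List[threading.Thread] = []
    while True:
        accepted = accept()
        if accepted is None:
            # A long-lived manager would wait for new jobs here instead of exiting.
            break
        job_id, lock_id = accepted
        thread = threading.Thread(target=worker, args=(job_id, lock_id))
        thread.start()
        threads.append(thread)
    return threads


# Usage with an in-memory list standing in for the pending queue.
pending = ["db1-index", "db2-index", "db3-index"]
done: List[str] = []
done_lock = threading.Lock()


def fake_accept() -> Optional[Tuple[str, str]]:
    # Only the manager thread calls this, so no locking is needed here.
    if not pending:
        return None
    return pending.pop(0), str(uuid.uuid4())


def fake_worker(job_id: str, lock_id: str) -> None:
    # A real worker would call update() periodically and finish() at the end.
    with done_lock:
        done.append(job_id)


for t in run_job_manager(fake_accept, fake_worker):
    t.join()
```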
+
+```
+accept(Type[, MaxPriority]) -> {ok, JobId, WorkerLockId} | not_found
+```
+
+ - Dequeue a job from the pending queue. `WorkerLockId` is a UUID indicating
+   that the job is owned exclusively by the worker. `WorkerLockId` will be
+   passed as an argument to other API functions below and will be used to
+   verify that the current worker is still the only worker executing that job.
+   `MaxPriority` is an optional value which can limit the maximum job priority
+   that will be accepted. Jobs with priorities higher than that will not be
+   accepted. The intended usage is to allow `Priority` to be used as a way to
+   schedule a job for execution at a future date. For example, `Priority` MAY
+   indicate that a replication job which has been repeatedly failing should
+   not execute any sooner than one hour from now.
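The effect of `MaxPriority` can be shown with a toy pending queue for a single `Type`, kept sorted by `(Priority, JobId)` to mirror the ordering of the `("couch_jobs", "pending", Type, Priority, JobId)` keys in the data model. In this Python sketch, `PendingQueue` is hypothetical, lower priority values are assumed to be dequeued first, priorities are plain integer timestamps (any mutually comparable values would do), `None` stands in for `not_found`, and the `WorkerLockId` is a random UUID string.

```python
import bisect
import uuid
from typing import Any, List, Optional, Tuple


class PendingQueue:
    """Toy pending queue for one job Type, sorted by (Priority, JobId)."""

    def __init__(self) -> None:
        self._entries: List[Tuple[Any, str]] = []

    def enqueue(self, priority: Any, job_id: str) -> None:
        bisect.insort(self._entries, (priority, job_id))

    def accept(self, max_priority: Any = None) -> Optional[Tuple[str, str]]:
        if not self._entries:
            return None
        priority, job_id = self._entries[0]
        # Jobs whose priority exceeds MaxPriority are left in the queue.
        if max_priority is not None and priority > max_priority:
            return None
        self._entries.pop(0)
        return job_id, str(uuid.uuid4())  # (JobId, WorkerLockId)


now = 1_000_000  # illustrative timestamp in seconds
q = PendingQueue()
q.enqueue(now - 5, "replication-a")     # ready to run
q.enqueue(now + 3600, "replication-b")  # backed off for an hour
first = q.accept(max_priority=now)
second = q.accept(max_priority=now)     # None: replication-b is not due yet
later = q.accept(max_priority=now + 3600)
```

Passing the current time as `MaxPriority` is what turns a timestamp priority into a "do not run before" schedule.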
+
+```
+finish(Tx, Type, JobId, JobOpts, WorkerLockId) -> ok | worker_conflict | canceled
+```
+
+ - Called by the worker when the job has finished running. The `"data"` field
+   in `JobOpts` MAY contain a final result field or information about a fatal
+   error.
+
+```
+resubmit(Tx, Type, JobId, WorkerLockId) -> ok | worker_conflict | canceled
+```
+ - The worker MAY call this function in order to mark the job for
+   re-submission. This MAY be used to penalize jobs which have been running for
+   too long, or jobs which have been repeatedly failing. Note that both the
+   user and a worker can request a job to be re-submitted.
+
+```
+update(Tx, Type, JobId, JobOpts, WorkerLockId) -> ok | worker_conflict | canceled
+```
+ - This MAY be called to update a job's progress (update sequence, number of
+   changes left, etc.). This function MUST be called at least once per
+   `ActivityTimeout` interval so that the activity monitor does not re-enqueue
+   the job as pending due to inactivity.
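The `ActivityTimeout` requirement exists because of the activity monitor. A minimal Python sketch of one monitor pass, assuming each job records the time of its last `accept()` or `update()`: `JobRecord` and `activity_monitor_pass` are hypothetical names, and the data model tracks activity in an `activity` subspace keyed by `Sequence`, so using wall-clock timestamps here is purely a simplification.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class JobRecord:
    state: str                     # "pending", "running", or "finished"
    lock_id: Optional[str] = None  # WorkerLockId of the owning worker, if any
    last_update: float = 0.0       # time of the last accept() or update()


def activity_monitor_pass(
    jobs: Dict[str, JobRecord], now: float, activity_timeout: float
) -> List[str]:
    """Re-enqueue running jobs not updated within activity_timeout seconds."""
    requeued = []
    for job_id, job in jobs.items():
        if job.state == "running" and now - job.last_update > activity_timeout:
            job.state = "pending"
            # Clearing the lock models invalidating the old worker: a lock
            # check against its WorkerLockId would now fail.
            job.lock_id = None
            requeued.append(job_id)
    return requeued


jobs = {
    "a": JobRecord("running", "lock-a", last_update=100.0),
    "b": JobRecord("running", "lock-b", last_update=150.0),
    "c": JobRecord("finished", None, last_update=0.0),
}
print(activity_monitor_pass(jobs, now=170.0, activity_timeout=30.0))  # ['a']
```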
+
+When the functions above return `worker_conflict`, it means the activity
+monitor has already re-enqueued the job. It is now either pending or being
+executed by another worker. In that case, the caller MUST stop running the job
+and MUST NOT update the job's entry any longer.
+
+If `canceled` is returned, it means the user has requested the job to stop
+executing. In that case, the worker MUST stop running the job.
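These return-value rules can be sketched as a worker loop in Python. `run_worker` and its `update`/`finish` callables (assumed to be already bound to `Tx`, `Type`, `JobId`, and `WorkerLockId`) are hypothetical, the "work" is just summing batch sizes, and return atoms are modelled as strings. Whether a canceled job should still call `finish()` is not spelled out above; this sketch simply stops.

```python
from typing import Any, Callable, Dict, Iterable, List

# Hypothetical callables already bound to (Tx, Type, JobId, WorkerLockId);
# each returns "ok", "worker_conflict", or "canceled".
UpdateFn = Callable[[Dict[str, Any]], str]
FinishFn = Callable[[Dict[str, Any]], str]


def run_worker(batches: Iterable[int], update: UpdateFn, finish: FinishFn) -> str:
    processed = 0
    for batch in batches:
        processed += batch  # stand-in for real indexing or replication work
        result = update({"data": {"processed": processed}})
        if result in ("worker_conflict", "canceled"):
            # Stop at once and make no further writes to the job entry.
            return result
    return finish({"data": {"processed": processed}})


calls: List[int] = []


def fake_update(job_opts: Dict[str, Any]) -> str:
    calls.append(job_opts["data"]["processed"])
    # Simulate the user canceling the job during the second update.
    return "canceled" if len(calls) == 2 else "ok"


result = run_worker([10, 20, 30], fake_update, lambda job_opts: "ok")
print(result, calls)  # canceled [10, 30]
```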
+
+#### Mutual Exclusion
+
+If each worker updates its job status on time, and the activity monitor is
+running correctly, the same job SHOULD be executed by no more than one worker
+at a time. However, if a worker process is blocked for too long (for instance,
+in an overload scenario), it may fail to update its status often enough, and
+the activity monitor MAY re-enqueue the job. Even in such cases, the mutual
+exclusion constraint can be maintained as long as the `update(Tx, ...)`
+function is called in the same transaction as the type-specific DB writes, and
+its result is checked for a `worker_conflict` return.
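The mutual-exclusion argument hinges on performing the lock check and the type-specific writes atomically. The Python sketch below uses a mutex-guarded in-memory store as a stand-in for a FoundationDB transaction; `ToyStore`, `update`, and `write_batch` are hypothetical names, and this `update` only checks the lock rather than recording progress. In FoundationDB itself, the equivalent protection would come from transaction isolation: if the activity monitor changes the lock after the worker's transaction has read it, that transaction fails to commit.

```python
import threading
import uuid
from typing import Callable, Dict, TypeVar

T = TypeVar("T")


class ToyStore:
    """Stand-in for FoundationDB: a mutex makes each transaction atomic."""

    def __init__(self) -> None:
        self._tx_mutex = threading.Lock()
        self.job_lock_ids: Dict[str, str] = {}  # JobId -> current WorkerLockId
        self.index_rows: Dict[str, int] = {}    # type-specific data

    def transact(self, body: Callable[["ToyStore"], T]) -> T:
        with self._tx_mutex:
            return body(self)


def update(tx: ToyStore, job_id: str, lock_id: str) -> str:
    """Hypothetical lock check: "ok" only if lock_id still owns job_id."""
    return "ok" if tx.job_lock_ids.get(job_id) == lock_id else "worker_conflict"


def write_batch(store: ToyStore, job_id: str, lock_id: str, key: str, value: int) -> str:
    def body(tx: ToyStore) -> str:
        result = update(tx, job_id, lock_id)
        if result != "ok":
            return result           # the type-specific write is never made
        tx.index_rows[key] = value  # same transaction as the lock check
        return "ok"
    return store.transact(body)


store = ToyStore()
old_lock = str(uuid.uuid4())
store.job_lock_ids["db1"] = old_lock
first = write_batch(store, "db1", old_lock, "k1", 1)

# The activity monitor re-enqueues the job and a new worker accepts it.
store.job_lock_ids["db1"] = str(uuid.uuid4())
stale = write_batch(store, "db1", old_lock, "k2", 2)
```

The stale worker's write is rejected because it is checked against the current lock in the same atomic step, so only the new owner can modify the type-specific data.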
+
+# Implementation Details
+
+## Data Model
+
+`("couch_jobs", "data", Type, JobId) = (Sequence, WorkerLockId, JobOpts)`
+`("couch_jobs", "pending", Type, Priority, JobId) = ""`
+`("couch_jobs", "watches", Type) = Sequence`
+`("couch_jobs", "activity_timeout", Type) = ActivityTimeout`
+`("couch_jobs", "activity", Type, Sequence) = JobId`
 
 Review comment:
   These aren't rendered properly as a list. Pretty sure you need to add asterisks to make it an unordered list.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services