Posted to commits@cassandra.apache.org by jo...@apache.org on 2020/02/25 05:22:58 UTC

[cassandra] branch trunk updated: Add documentation of dynamo

This is an automated email from the ASF dual-hosted git repository.

jolynch pushed a commit to branch trunk
in repository https://gitbox.apache.org/repos/asf/cassandra.git


The following commit(s) were added to refs/heads/trunk by this push:
     new 0a675d5  Add documentation of dynamo
0a675d5 is described below

commit 0a675d5a769b5f27c5b0dfa1fd355db69fb4ef90
Author: dvohra <dv...@yahoo.com>
AuthorDate: Mon Jan 6 18:19:39 2020 -0800

    Add documentation of dynamo
    
    Patch by Deepak Vohra; Reviewed by Joseph Lynch for CASSANDRA-15486
---
 CHANGES.txt                               |   1 +
 doc/Dockerfile                            |   2 +-
 doc/source/architecture/dynamo.rst        | 535 +++++++++++++++++++++++++-----
 doc/source/architecture/images/ring.svg   |  11 +
 doc/source/architecture/images/vnodes.svg |  11 +
 doc/source/architecture/overview.rst      |  96 +++++-
 doc/source/operating/hints.rst            |  12 +-
 doc/source/operating/read_repair.rst      |   2 +-
 doc/source/operating/snitch.rst           |   2 +
 9 files changed, 574 insertions(+), 98 deletions(-)

diff --git a/CHANGES.txt b/CHANGES.txt
index 1c626f2..0730bff 100644
--- a/CHANGES.txt
+++ b/CHANGES.txt
@@ -1,4 +1,5 @@
 4.0-alpha4
+ * Add documentation of dynamo (CASSANDRA-15486)
  * Added documentation for Guarantees (CASSANDRA-15482)
  * Added documentation for audit logging (CASSANDRA-15474)
  * Unset GREP_OPTIONS (CASSANDRA-14487)
diff --git a/doc/Dockerfile b/doc/Dockerfile
index 0f109b7..8999464 100644
--- a/doc/Dockerfile
+++ b/doc/Dockerfile
@@ -18,5 +18,5 @@ RUN wget -qO - https://adoptopenjdk.jfrog.io/adoptopenjdk/api/gpg/key/public | a
 
 RUN apt-get clean
 
-CMD ant gen-doc \
+CMD CASSANDRA_USE_JDK11=true ant gen-doc \
     && echo "The locally built documentation can be found here:\n\n    build/html/index.html\n\n"
diff --git a/doc/source/architecture/dynamo.rst b/doc/source/architecture/dynamo.rst
index 380abc2..5b17d9a 100644
--- a/doc/source/architecture/dynamo.rst
+++ b/doc/source/architecture/dynamo.rst
@@ -15,102 +15,322 @@
 .. limitations under the License.
 
 Dynamo
-------
+======
+
+Apache Cassandra relies on a number of techniques from Amazon's `Dynamo
+<http://courses.cse.tamu.edu/caverlee/csce438/readings/dynamo-paper.pdf>`_
+distributed key-value storage system. Each node in the Dynamo system has three
+main components:
+
+- Request coordination over a partitioned dataset
+- Ring membership and failure detection
+- A local persistence (storage) engine
+
+Cassandra primarily draws from the first two clustering components,
+while using a storage engine based on a Log Structured Merge Tree
+(`LSM <http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.44.2782&rep=rep1&type=pdf>`_).
+In particular, Cassandra relies on Dynamo style:
+
+- Dataset partitioning using consistent hashing
+- Multi-master replication using versioned data and tunable consistency
+- Distributed cluster membership and failure detection via a gossip protocol
+- Incremental scale-out on commodity hardware
+
+Cassandra was designed this way to meet large-scale (PiB+) business-critical
+storage requirements. In particular, as applications demanded full global
+replication of petabyte scale datasets along with always available low-latency
+reads and writes, it became imperative to design a new kind of database model
+as the relational database systems of the time struggled to meet the new
+requirements of global scale applications.
+
+Dataset Partitioning: Consistent Hashing
+----------------------------------------
+
+Cassandra achieves horizontal scalability by
+`partitioning <https://en.wikipedia.org/wiki/Partition_(database)>`_
+all data stored in the system using a hash function. Each partition is replicated
+to multiple physical nodes, often across failure domains such as racks and even
+datacenters. As every replica can independently accept mutations to every key
+that it owns, every key must be versioned. Unlike in the original Dynamo paper
+where deterministic versions and vector clocks were used to reconcile concurrent
+updates to a key, Cassandra uses a simpler last write wins model where every
+mutation is timestamped (including deletes) and then the latest version of data
+is the "winning" value. Formally speaking, Cassandra uses a Last-Write-Wins Element-Set
+conflict-free replicated data type for each CQL row (a.k.a `LWW-Element-Set CRDT
+<https://en.wikipedia.org/wiki/Conflict-free_replicated_data_type#LWW-Element-Set_(Last-Write-Wins-Element-Set)>`_)
+to resolve conflicting mutations on replica sets.
+
+.. _consistent-hashing-token-ring:
+
+Consistent Hashing using a Token Ring
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Cassandra partitions data over storage nodes using a special form of hashing
+called `consistent hashing <https://en.wikipedia.org/wiki/Consistent_hashing>`_.
+In naive data hashing, you typically allocate keys to buckets by taking a hash
+of the key modulo the number of buckets. For example, if you want to distribute
+data to 100 nodes using naive hashing you might assign every node a bucket
+number between 0 and 99, hash the input key modulo 100, and store the data in
+the associated bucket. In this naive scheme, however, adding a single node might
+invalidate almost all of the mappings.
+
+Cassandra instead maps every node to one or more tokens on a continuous hash
+ring, and defines ownership by hashing a key onto the ring and then "walking"
+the ring in one direction, similar to the `Chord
+<https://pdos.csail.mit.edu/papers/chord:sigcomm01/chord_sigcomm.pdf>`_
+algorithm. The main difference between consistent hashing and naive data hashing is
+that when the number of nodes (buckets) to hash into changes, consistent
+hashing only has to move a small fraction of the keys.
+
+For example, if we have an eight node cluster with evenly spaced tokens, and
+a replication factor (RF) of 3, then to find the owning nodes for a key we
+first hash that key to generate a token (which is just the hash of the key),
+and then we "walk" the ring in a clockwise fashion until we encounter three
+distinct nodes, at which point we have found all the replicas of that key.
+This example of an eight node cluster with `RF=3` can be visualized as follows:
+
+.. figure:: images/ring.svg
+   :scale: 75 %
+   :alt: Dynamo Ring
+
+You can see that in a Dynamo-like system, ranges of keys, also known as **token
+ranges**, map to the same physical set of nodes. In this example, all keys that
+fall in the token range excluding token 1 and including token 2 (`range(t1, t2]`)
+are stored on nodes 2, 3 and 4.
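+
+This ring walk can be sketched in a few lines of Python. The sketch below is
+purely illustrative (it uses a toy hash and eight evenly spaced, hypothetical
+tokens) and is not Cassandra's implementation, but it shows the mechanism::
+
+    import bisect
+    import hashlib
+
+    # Hypothetical, evenly spaced tokens for an eight node cluster.
+    RING = sorted((t, n) for n, t in enumerate(range(0, 2**64, 2**61), start=1))
+    TOKENS = [t for t, _ in RING]
+
+    def token_for(key: bytes) -> int:
+        # Toy 64-bit hash; Cassandra actually uses Murmur3 by default.
+        return int.from_bytes(hashlib.md5(key).digest()[:8], "big")
+
+    def replicas(key: bytes, rf: int = 3) -> list:
+        """Walk the ring clockwise until rf distinct nodes are found."""
+        start = bisect.bisect_left(TOKENS, token_for(key)) % len(RING)
+        owners = []
+        for i in range(len(RING)):
+            node = RING[(start + i) % len(RING)][1]
+            if node not in owners:
+                owners.append(node)
+            if len(owners) == rf:
+                break
+        return owners
+
+    print(replicas(b"some-partition-key"))   # three distinct nodes, e.g. [3, 4, 5]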
+
+Multiple Tokens per Physical Node (a.k.a. `vnodes`)
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Simple single token consistent hashing works well if you have many physical
+nodes to spread data over, but with evenly spaced tokens and a small number of
+physical nodes, incremental scaling (adding just a few nodes of capacity) is
+difficult because there are no token selections for new nodes that can leave
+the ring balanced. Cassandra seeks to avoid token imbalance because uneven
+token ranges lead to uneven request load. For example, in the previous example
+there is no way to add a ninth token without causing imbalance; instead we
+would have to insert ``8`` tokens in the midpoints of the existing ranges.
+
+The Dynamo paper advocates for the use of "virtual nodes" to solve this
+imbalance problem. Virtual nodes solve the problem by assigning multiple
+tokens in the token ring to each physical node. By allowing a single physical
+node to take multiple positions in the ring, we can make small clusters look
+larger, so that adding even a single physical node behaves as if many nodes
+had joined, taking many smaller pieces of data from more ring neighbors.
+
+Cassandra introduces some nomenclature to handle these concepts:
+
+- **Token**: A single position on the Dynamo-style hash ring.
+- **Endpoint**: A single physical IP and port on the network.
+- **Host ID**: A unique identifier for a single "physical" node, usually
+  present at one `Endpoint` and containing one or more `Tokens`.
+- **Virtual Node** (or **vnode**): A `Token` on the hash ring owned by a
+  physical node; a node that owns several `Tokens` has several virtual nodes,
+  all sharing its `Host ID`.
+
+The mapping of **Tokens** to **Endpoints** gives rise to the **Token Map**
+where Cassandra keeps track of what ring positions map to which physical
+endpoints.  For example, in the following figure we can represent an eight node
+cluster using only four physical nodes by assigning two tokens to every node:
+
+.. figure:: images/vnodes.svg
+   :scale: 75 %
+   :alt: Virtual Tokens Ring
+
+
+Multiple tokens per physical node provide the following benefits:
+
+1. When a new node is added it accepts approximately equal amounts of data from
+   other nodes in the ring, resulting in equal distribution of data across the
+   cluster.
+2. When a node is decommissioned, it loses data roughly equally to other members
+   of the ring, again keeping equal distribution of data across the cluster.
+3. If a node becomes unavailable, query load (especially token-aware query load)
+   is evenly distributed across many other nodes.
+
+Multiple tokens, however, can also have disadvantages:
+
+1. Every token introduces up to ``2 * (RF - 1)`` additional neighbors on the
+   token ring, which means that there are more combinations of node failures
+   where we lose availability for a portion of the token ring. The more tokens
+   you have, `the higher the probability of an outage
+   <https://jolynch.github.io/pdf/cassandra-availability-virtual.pdf>`_.
+2. Cluster-wide maintenance operations are often slowed. For example, as the
+   number of tokens per node is increased, the number of discrete repair
+   operations the cluster must do also increases.
+3. Performance of operations that span token ranges could be affected.
+
+Note that in Cassandra ``2.x``, the only token allocation algorithm available
+was picking random tokens, which meant that to keep balance the default number
+of tokens per node had to be quite high, at ``256``. This had the effect of
+coupling many physical endpoints together, increasing the risk of
+unavailability. That is why in ``3.x +`` the new deterministic token allocator
+was added which intelligently picks tokens such that the ring is optimally
+balanced while requiring a much lower number of tokens per physical node.
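+
+The effect of random token selection on balance can be illustrated with a
+small, purely hypothetical simulation (this is not the allocator's actual
+algorithm): as the number of random tokens per node grows, the ratio between
+the most and least loaded node shrinks towards ``1.0``::
+
+    import random
+
+    RING_SIZE = 2**64
+
+    def ownership(tokens_per_node):
+        """Fraction of the ring owned by each node, given its token positions."""
+        ring = sorted((t, n) for n, ts in tokens_per_node.items() for t in ts)
+        owned = {n: 0.0 for n in tokens_per_node}
+        for i, (t, n) in enumerate(ring):
+            prev = ring[i - 1][0] if i else ring[-1][0] - RING_SIZE
+            owned[n] += (t - prev) / RING_SIZE
+        return owned
+
+    random.seed(7)
+    for num_tokens in (1, 4, 16, 256):
+        owned = ownership({n: [random.randrange(RING_SIZE) for _ in range(num_tokens)]
+                           for n in range(4)})
+        ratio = max(owned.values()) / min(owned.values())
+        print(f"{num_tokens:>3} random tokens/node -> max/min ownership {ratio:.2f}")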
+
+
+Multi-master Replication: Versioned Data and Tunable Consistency
+----------------------------------------------------------------
+
+Cassandra replicates every partition of data to many nodes across the cluster
+to maintain high availability and durability. When a mutation occurs, the
+coordinator hashes the partition key to determine the token range the data
+belongs to and then replicates the mutation to the replicas of that data
+according to the :ref:`Replication Strategy <replication-strategy>`.
+
+All replication strategies have the notion of a **replication factor** (``RF``),
+which indicates to Cassandra how many copies of the partition should exist.
+For example, with an ``RF=3`` keyspace, the data will be written to three
+distinct **replicas**. Replicas are always chosen such that they are distinct
+physical nodes, which is achieved by skipping virtual nodes if needed.
+Replication strategies may also choose to skip nodes present in the same failure
+domain such as racks or datacenters so that Cassandra clusters can tolerate
+failures of whole racks and even datacenters of nodes.
 
-.. _gossip:
-
-Gossip
-^^^^^^
-
-.. todo:: todo
-
-Failure Detection
-^^^^^^^^^^^^^^^^^
-
-.. todo:: todo
-
-.. _token-range:
-
-Token Ring/Ranges
-^^^^^^^^^^^^^^^^^
+.. _replication-strategy:
 
-.. todo:: todo
+Replication Strategy
+^^^^^^^^^^^^^^^^^^^^
 
-.. _replication-strategy:
+Cassandra supports pluggable **replication strategies**, which determine which
+physical nodes act as replicas for a given token range. Every keyspace of
+data has its own replication strategy. All production deployments should use
+the :ref:`network-topology-strategy`, while the :ref:`simple-strategy` replication
+strategy is useful only for testing clusters where you do not yet know the
+datacenter layout of the cluster.
 
-Replication
-^^^^^^^^^^^
+.. _network-topology-strategy:
 
-The replication strategy of a keyspace determines which nodes are replicas for a given token range. The two main
-replication strategies are :ref:`simple-strategy` and :ref:`network-topology-strategy`.
+``NetworkTopologyStrategy``
+~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+``NetworkTopologyStrategy`` allows a replication factor to be specified for each
+datacenter in the cluster. Even if your cluster only uses a single datacenter,
+``NetworkTopologyStrategy`` should be preferred over ``SimpleStrategy`` to make it
+easier to add new physical or virtual datacenters to the cluster later.
+
+In addition to allowing the replication factor to be specified individually by
+datacenter, ``NetworkTopologyStrategy`` also attempts to choose replicas within a
+datacenter from different racks as specified by the :ref:`Snitch <snitch>`. If
+the number of racks is greater than or equal to the replication factor for the
+datacenter, each replica is guaranteed to be chosen from a different rack.
+Otherwise, each rack will hold at least one replica, but some racks may hold
+more than one. Note that this rack-aware behavior has some potentially
+`surprising implications
+<https://issues.apache.org/jira/browse/CASSANDRA-3810>`_.  For example, if
+each rack does not hold an equal number of nodes, the data load on the
+smallest rack may be much higher.  Similarly, if a single node is bootstrapped
+into a brand new rack, it will be considered a replica for the entire ring.
+For this reason, many operators choose to configure all nodes in a single
+availability zone or similar failure domain as a single "rack".
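+
+As a concrete sketch, a keyspace spanning two hypothetical datacenters named
+``DC1`` and ``DC2`` with three replicas in each could be created as follows
+with the Python driver (the contact point and datacenter names are
+illustrative; datacenter names must match those reported by the snitch)::
+
+    from cassandra.cluster import Cluster
+
+    session = Cluster(["127.0.0.1"]).connect()
+    session.execute("""
+        CREATE KEYSPACE IF NOT EXISTS my_keyspace
+        WITH replication = {
+            'class': 'NetworkTopologyStrategy',
+            'DC1': 3,
+            'DC2': 3
+        }
+    """)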
 
 .. _simple-strategy:
 
-SimpleStrategy
-~~~~~~~~~~~~~~
+``SimpleStrategy``
+~~~~~~~~~~~~~~~~~~
 
-SimpleStrategy allows a single integer ``replication_factor`` to be defined. This determines the number of nodes that
+``SimpleStrategy`` allows a single integer ``replication_factor`` to be defined. This determines the number of nodes that
 should contain a copy of each row.  For example, if ``replication_factor`` is 3, then three different nodes should store
 a copy of each row.
 
-SimpleStrategy treats all nodes identically, ignoring any configured datacenters or racks.  To determine the replicas
+``SimpleStrategy`` treats all nodes identically, ignoring any configured datacenters or racks.  To determine the replicas
 for a token range, Cassandra iterates through the tokens in the ring, starting with the token range of interest.  For
 each token, it checks whether the owning node has been added to the set of replicas, and if it has not, it is added to
 the set.  This process continues until ``replication_factor`` distinct nodes have been added to the set of replicas.
 
-.. _network-topology-strategy:
-
-NetworkTopologyStrategy
-~~~~~~~~~~~~~~~~~~~~~~~
-
-NetworkTopologyStrategy allows a replication factor to be specified for each datacenter in the cluster.  Even if your
-cluster only uses a single datacenter, NetworkTopologyStrategy should be prefered over SimpleStrategy to make it easier
-to add new physical or virtual datacenters to the cluster later.
-
-In addition to allowing the replication factor to be specified per-DC, NetworkTopologyStrategy also attempts to choose
-replicas within a datacenter from different racks.  If the number of racks is greater than or equal to the replication
-factor for the DC, each replica will be chosen from a different rack.  Otherwise, each rack will hold at least one
-replica, but some racks may hold more than one. Note that this rack-aware behavior has some potentially `surprising
-implications <https://issues.apache.org/jira/browse/CASSANDRA-3810>`_.  For example, if there are not an even number of
-nodes in each rack, the data load on the smallest rack may be much higher.  Similarly, if a single node is bootstrapped
-into a new rack, it will be considered a replica for the entire ring.  For this reason, many operators choose to
-configure all nodes on a single "rack".
-
 .. _transient-replication:
 
 Transient Replication
 ~~~~~~~~~~~~~~~~~~~~~
 
-Transient replication allows you to configure a subset of replicas to only replicate data that hasn't been incrementally
-repaired. This allows you to decouple data redundancy from availability. For instance, if you have a keyspace replicated
-at rf 3, and alter it to rf 5 with 2 transient replicas, you go from being able to tolerate one failed replica to being
-able to tolerate two, without corresponding increase in storage usage. This is because 3 nodes will replicate all the data
-for a given token range, and the other 2 will only replicate data that hasn't been incrementally repaired.
-
-To use transient replication, you first need to enable it in ``cassandra.yaml``. Once enabled, both SimpleStrategy and
-NetworkTopologyStrategy can be configured to transiently replicate data. You configure it by specifying replication factor
-as ``<total_replicas>/<transient_replicas`` Both SimpleStrategy and NetworkTopologyStrategy support configuring transient
-replication.
-
-Transiently replicated keyspaces only support tables created with read_repair set to NONE and monotonic reads are not currently supported.
-You also can't use LWT, logged batches, and counters in 4.0. You will possibly never be able to use materialized views
-with transiently replicated keyspaces and probably never be able to use 2i with them.
-
-Transient replication is an experimental feature that may not be ready for production use. The expected audienced is experienced
-users of Cassandra capable of fully validating a deployment of their particular application. That means being able check
-that operations like reads, writes, decommission, remove, rebuild, repair, and replace all work with your queries, data,
+Transient replication is an experimental feature in Cassandra 4.0 not present
+in the original Dynamo paper. It allows you to configure a subset of replicas
+to only replicate data that hasn't been incrementally repaired. This allows you
+to decouple data redundancy from availability. For instance, if you have a
+keyspace replicated at ``RF=3``, and alter it to ``RF=5`` with 2 transient replicas,
+you go from being able to tolerate one failed replica to being able to tolerate
+two, without a corresponding increase in storage usage. This is because 3 nodes
+will replicate all the data for a given token range, and the other 2 will only
+replicate data that hasn't been incrementally repaired.
+
+To use transient replication, you first need to enable it in
+``cassandra.yaml``. Once enabled, both ``SimpleStrategy`` and
+``NetworkTopologyStrategy`` can be configured to transiently replicate data.
+You configure it by specifying the replication factor as
+``<total_replicas>/<transient_replicas>``.
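+
+For example, assuming transient replication has been enabled and a datacenter
+named ``DC1`` exists, a hypothetical keyspace with five total replicas, two of
+them transient, could be created with the Python driver like this::
+
+    from cassandra.cluster import Cluster
+
+    session = Cluster(["127.0.0.1"]).connect()
+    # '5/2' means 5 total replicas, 2 of which are transient.
+    session.execute("""
+        CREATE KEYSPACE IF NOT EXISTS transient_example
+        WITH replication = {'class': 'NetworkTopologyStrategy', 'DC1': '5/2'}
+    """)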
+
+Transiently replicated keyspaces only support tables created with ``read_repair``
+set to ``NONE``, and monotonic reads are not currently supported.  You also
+can't use ``LWT``, logged batches, or counters in 4.0. You will possibly never be
+able to use materialized views with transiently replicated keyspaces and
+probably never be able to use secondary indices with them.
+
+Transient replication is an experimental feature that may not be ready for
+production use. The expected audience is experienced users of Cassandra
+capable of fully validating a deployment of their particular application. That
+means being able to check that operations like reads, writes, decommission,
+remove, rebuild, repair, and replace all work with your queries, data,
 configuration, operational practices, and availability requirements.
 
-It is anticipated that 4.next will support monotonic reads with transient replication as well as LWT, logged batches, and
-counters.
+It is anticipated that ``4.next`` will support monotonic reads with transient
+replication as well as LWT, logged batches, and counters.
+
+Data Versioning
+^^^^^^^^^^^^^^^
+
+Cassandra uses mutation timestamp versioning to guarantee eventual consistency of
+data. Specifically, all mutations that enter the system do so with a timestamp
+provided either from a client clock or, absent a client-provided timestamp,
+from the coordinator node's clock. Updates resolve according to the conflict
+resolution rule of last write wins. Cassandra's correctness does depend on
+these clocks, so make sure a proper time synchronization process, such as NTP,
+is running.
+
+Cassandra applies separate mutation timestamps to every column of every row
+within a CQL partition. Rows are guaranteed to be unique by primary key, and
+each column in a row resolves concurrent mutations according to last-write-wins
+conflict resolution. This means that updates to different primary keys within a
+partition can actually resolve without conflict! Furthermore, the CQL collection
+types such as maps and sets use this same conflict-free mechanism, meaning
+that concurrent updates to maps and sets are guaranteed to resolve as well.
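+
+A toy sketch of this column-level last-write-wins reconciliation (ignoring
+deletes, tombstones, and many real-world details) looks like the following::
+
+    def reconcile(row_a, row_b):
+        """Merge two versions of the same CQL row.
+
+        Each version maps column name -> (timestamp, value); for every
+        column the value with the highest mutation timestamp wins.
+        """
+        merged = {}
+        for column in row_a.keys() | row_b.keys():
+            candidates = [r[column] for r in (row_a, row_b) if column in r]
+            merged[column] = max(candidates, key=lambda tv: tv[0])
+        return merged
+
+    # Two replicas saw different concurrent updates to the same row:
+    replica_1 = {"name": (10, "alice"), "email": (10, "a@example.com")}
+    replica_2 = {"name": (10, "alice"), "email": (12, "alice@example.com")}
+    print(reconcile(replica_1, replica_2))
+    # {'name': (10, 'alice'), 'email': (12, 'alice@example.com')}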
+
+Replica Synchronization
+~~~~~~~~~~~~~~~~~~~~~~~
 
+As replicas in Cassandra can accept mutations independently, it is possible
+for some replicas to have newer data than others. Cassandra has many best-effort
+techniques to drive convergence of replicas, including
+:ref:`replica read repair <read-repair>` in the read path and
+:ref:`hinted handoff <hints>` in the write path.
+
+These techniques are only best-effort, however, and to guarantee eventual
+consistency Cassandra implements :ref:`anti-entropy repair <repair>`, where
+replicas calculate hierarchical hash-trees over their datasets called `Merkle
+Trees <https://en.wikipedia.org/wiki/Merkle_tree>`_ that can then be compared
+across replicas to identify mismatched data. Like the original Dynamo paper,
+Cassandra supports "full" repairs where replicas hash their entire dataset,
+create Merkle trees, send them to each other, and sync any ranges that don't
+match.
+
+Unlike the original Dynamo paper, Cassandra also implements sub-range repair
+and incremental repair. Sub-range repair allows Cassandra to increase the
+resolution of the hash trees (potentially down to the single partition level)
+by creating a larger number of trees that span only a portion of the data
+range.  Incremental repair allows Cassandra to only repair the partitions that
+have changed since the last repair.
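+
+A toy illustration of the Merkle tree comparison, with each leaf hashing one
+partition of a small range (real trees are far deeper and hash row contents)::
+
+    import hashlib
+
+    def h(*parts):
+        return hashlib.sha256(b"".join(parts)).digest()
+
+    def merkle(leaves):
+        """Build the tree as a list of levels, leaves first, root last.
+        Assumes a power-of-two number of leaves for brevity."""
+        levels = [[h(leaf) for leaf in leaves]]
+        while len(levels[-1]) > 1:
+            prev = levels[-1]
+            levels.append([h(prev[i], prev[i + 1]) for i in range(0, len(prev), 2)])
+        return levels
+
+    replica_a = [b"p1:v1", b"p2:v1", b"p3:v1", b"p4:v1"]
+    replica_b = [b"p1:v1", b"p2:v2", b"p3:v1", b"p4:v1"]   # p2 differs
+
+    tree_a, tree_b = merkle(replica_a), merkle(replica_b)
+    if tree_a[-1] != tree_b[-1]:
+        # Roots differ; comparing leaf hashes pinpoints the partitions to sync.
+        pairs = zip(tree_a[0], tree_b[0])
+        mismatched = [i for i, (x, y) in enumerate(pairs) if x != y]
+        print("partitions to stream:", mismatched)   # -> [1]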
 
 Tunable Consistency
 ^^^^^^^^^^^^^^^^^^^
 
-Cassandra supports a per-operation tradeoff between consistency and availability through *Consistency Levels*.
-Essentially, an operation's consistency level specifies how many of the replicas need to respond to the coordinator in
-order to consider the operation a success.
+Cassandra supports a per-operation tradeoff between consistency and
+availability through **Consistency Levels**. Cassandra's consistency levels
+are a version of Dynamo's ``R + W > N`` consistency mechanism where operators
+could configure the number of nodes that must participate in reads (``R``)
+and writes (``W``) to be larger than the replication factor (``N``). In
+Cassandra, you instead choose from a menu of common consistency levels which
+allow the operator to pick ``R`` and ``W`` behavior without knowing the
+replication factor. Generally writes will be visible to subsequent reads when
+the read consistency level contains enough nodes to guarantee a quorum intersection
+with the write consistency level.
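+
+The intersection property is simple arithmetic. A minimal check, taking
+``QUORUM`` to mean ``floor(RF / 2) + 1`` nodes, might look like::
+
+    def quorum(rf):
+        return rf // 2 + 1
+
+    def reads_see_writes(rf, w, r):
+        # Any write set of size w and read set of size r chosen from rf
+        # replicas must share at least one replica whenever w + r > rf.
+        return w + r > rf
+
+    rf = 3
+    w = r = quorum(rf)                       # 2
+    print(reads_see_writes(rf, w, r))        # True: 2 + 2 > 3
+    print(reads_see_writes(rf, 1, 1))        # False: ONE/ONE can miss a write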
 
 The following consistency levels are available:
 
@@ -144,23 +364,174 @@ The following consistency levels are available:
   attempt to replay the hint and deliver the mutation to the replicas.  This consistency level is only accepted for
   write operations.
 
-Write operations are always sent to all replicas, regardless of consistency level. The consistency level simply
-controls how many responses the coordinator waits for before responding to the client.
+Write operations **are always sent to all replicas**, regardless of consistency
+level. The consistency level simply controls how many responses the coordinator
+waits for before responding to the client.
 
-For read operations, the coordinator generally only issues read commands to enough replicas to satisfy the consistency
-level, with one exception. Speculative retry may issue a redundant read request to an extra replica if the other replicas
-have not responded within a specified time window.
+For read operations, the coordinator generally only issues read commands to
+enough replicas to satisfy the consistency level. The one exception is
+speculative retry, which may issue a redundant read request to an extra replica
+if the original replicas have not responded within a specified time window.
 
 Picking Consistency Levels
 ~~~~~~~~~~~~~~~~~~~~~~~~~~
 
-It is common to pick read and write consistency levels that are high enough to overlap, resulting in "strong"
-consistency.  This is typically expressed as ``W + R > RF``, where ``W`` is the write consistency level, ``R`` is the
-read consistency level, and ``RF`` is the replication factor.  For example, if ``RF = 3``, a ``QUORUM`` request will
-require responses from at least two of the three replicas.  If ``QUORUM`` is used for both writes and reads, at least
-one of the replicas is guaranteed to participate in *both* the write and the read request, which in turn guarantees that
-the latest write will be read. In a multi-datacenter environment, ``LOCAL_QUORUM`` can be used to provide a weaker but
-still useful guarantee: reads are guaranteed to see the latest write from within the same datacenter.
+It is common to pick read and write consistency levels such that the replica
+sets overlap, resulting in all acknowledged writes being visible to subsequent
+reads. This is typically expressed in the same terms Dynamo does, in that ``W +
+R > RF``, where ``W`` is the write consistency level, ``R`` is the read
+consistency level, and ``RF`` is the replication factor.  For example, if ``RF
+= 3``, a ``QUORUM`` request will require responses from at least two of the
+three replicas.  If ``QUORUM`` is used for both writes and reads, at least one of the
+replicas is guaranteed to participate in *both* the write and the read request,
+which in turn guarantees that the quorums will overlap and the write will be
+visible to the read.
+
+In a multi-datacenter environment, ``LOCAL_QUORUM`` can be used to provide a
+weaker but still useful guarantee: reads are guaranteed to see the latest write
+from within the same datacenter. This is often sufficient as clients homed to
+a single datacenter will read their own writes.
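+
+As a sketch with the Python driver (the contact point, keyspace, and ``users``
+table are assumptions for illustration), a ``LOCAL_QUORUM`` write followed by a
+``LOCAL_QUORUM`` read in the same datacenter satisfies ``W + R > RF`` for an
+``RF=3`` keyspace::
+
+    from cassandra import ConsistencyLevel
+    from cassandra.cluster import Cluster
+    from cassandra.query import SimpleStatement
+
+    session = Cluster(["127.0.0.1"]).connect("my_keyspace")
+
+    write = SimpleStatement(
+        "INSERT INTO users (id, email) VALUES (%s, %s)",
+        consistency_level=ConsistencyLevel.LOCAL_QUORUM)
+    session.execute(write, (42, "alice@example.com"))
+
+    read = SimpleStatement(
+        "SELECT email FROM users WHERE id = %s",
+        consistency_level=ConsistencyLevel.LOCAL_QUORUM)
+    print(session.execute(read, (42,)).one())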
+
+If this type of strong consistency isn't required, lower consistency levels
+like ``LOCAL_ONE`` or ``ONE`` may be used to improve throughput, latency, and
+availability. With replication spanning multiple datacenters, ``LOCAL_ONE`` is
+typically less available than ``ONE``, but is faster as a rule. Indeed, ``ONE``
+will succeed if a single replica is available in any datacenter.
+
+Distributed Cluster Membership and Failure Detection
+----------------------------------------------------
+
+The replication protocols and dataset partitioning rely on knowing which nodes
+are alive and dead in the cluster so that write and read operations can be
+optimally routed. In Cassandra, liveness information is shared in a distributed
+fashion through a failure detection mechanism based on a gossip protocol.
+
+.. _gossip:
+
+Gossip
+^^^^^^
 
-If this type of strong consistency isn't required, lower consistency levels like ``ONE`` may be used to improve
-throughput, latency, and availability.
+Gossip is how Cassandra propagates basic cluster bootstrapping information such
+as endpoint membership and internode network protocol versions. In Cassandra's
+gossip system, nodes exchange state information not only about themselves but
+also about other nodes they know about. This information is versioned with a
+vector clock of ``(generation, version)`` tuples, where the generation is a
+monotonic timestamp and the version is a logical clock that increments roughly
+every second. These logical clocks allow Cassandra gossip to ignore old versions of
+cluster state just by inspecting the logical clocks presented with gossip
+messages.
+
+Every node in the Cassandra cluster runs the gossip task independently and
+periodically. Every second, every node in the cluster (sketched below):
+
+1. Updates the local node's heartbeat state (the version) and constructs the
+   node's local view of the cluster gossip endpoint state.
+2. Picks a random other node in the cluster to exchange gossip endpoint state
+   with.
+3. Probabilistically attempts to gossip with any unreachable nodes (if any exist).
+4. Gossips with a seed node if that didn't happen in step 2.
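+
+Reduced to a very rough sketch (ignoring the real message exchange, digests,
+and acknowledgements), the loop above looks something like this::
+
+    import random
+
+    def gossip_round(node, live_peers, unreachable, seeds, p_unreachable=0.1):
+        """One second of the (simplified) gossip loop on a single node.
+
+        ``node`` is ``{"addr": ..., "state": {endpoint: (generation, version)}}``;
+        higher tuples are newer and win when two views are merged.
+        """
+        # 1. Bump the local heartbeat version in this node's own view.
+        generation, version = node["state"][node["addr"]]
+        node["state"][node["addr"]] = (generation, version + 1)
+
+        # 2. Exchange state with one random live peer.
+        partner = random.choice(live_peers)
+        merge(node, partner)
+
+        # 3. Occasionally try an unreachable node, in case it came back.
+        if unreachable and random.random() < p_unreachable:
+            merge(node, random.choice(unreachable))
+
+        # 4. If the partner was not a seed, also gossip with a seed node.
+        if partner not in seeds:
+            merge(node, random.choice(seeds))
+
+    def merge(a, b):
+        # Both sides keep the newest (generation, version) they have seen
+        # for every endpoint either of them knows about.
+        for endpoint in a["state"].keys() | b["state"].keys():
+            newest = max(n["state"][endpoint]
+                         for n in (a, b) if endpoint in n["state"])
+            a["state"][endpoint] = b["state"][endpoint] = newest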
+
+When an operator first bootstraps a Cassandra cluster, they designate certain
+nodes as "seed" nodes. Any node can be a seed node, and the only difference
+between seed and non-seed nodes is that seed nodes are allowed to bootstrap into the
+ring without seeing any other seed nodes. Furthermore, once a cluster is
+bootstrapped, seed nodes become "hotspots" for gossip due to step 4 above.
+
+As non-seed nodes must be able to contact at least one seed node in order to
+bootstrap into the cluster, it is common to include multiple seed nodes, often
+one for each rack or datacenter. Seed nodes are often chosen using existing
+off-the-shelf service discovery mechanisms.
+
+.. note::
+   Nodes do not have to agree on the seed nodes, and indeed once a cluster is
+   bootstrapped, newly launched nodes can be configured to use any existing
+   nodes as "seeds". The only advantage to picking the same nodes as seeds
+   is that it increases their usefulness as gossip hotspots.
+
+Currently, gossip also propagates token metadata and schema *version*
+information. This information forms the control plane for scheduling data
+movements and schema pulls. For example, if a node sees a mismatch in schema
+version in gossip state, it will schedule a schema sync task with the other
+nodes. As token information propagates via gossip, it is also the control plane
+for teaching nodes which endpoints own what data.
+
+Ring Membership and Failure Detection
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Gossip forms the basis of ring membership, but the **failure detector**
+ultimately makes decisions about whether nodes are ``UP`` or ``DOWN``. Every node in
+Cassandra runs a variant of the `Phi Accrual Failure Detector
+<https://www.computer.org/csdl/proceedings-article/srds/2004/22390066/12OmNvT2phv>`_,
+in which every node constantly makes an independent decision about whether its
+peer nodes are available or not. This decision is primarily based on received
+heartbeat state. For example, if a node does not see an increasing heartbeat
+from a peer for a certain amount of time, the failure detector "convicts" that
+node, at which point Cassandra will stop routing reads to it (writes will
+typically be written to hints). If/when the node starts heartbeating again,
+Cassandra will try to reach out and connect, and if it can open communication
+channels, it will mark that node as available.
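+
+A highly simplified sketch of the accrual idea, assuming exponentially
+distributed heartbeat inter-arrival times (the real detector keeps a sliding
+window of intervals and considerably more careful statistics)::
+
+    import math
+    import time
+
+    class NaivePhiDetector:
+        """Toy accrual detector: phi grows as heartbeats become overdue."""
+
+        def __init__(self, threshold=8.0):
+            self.threshold = threshold
+            self.intervals = []          # observed seconds between heartbeats
+            self.last_heartbeat = None
+
+        def heartbeat(self, now=None):
+            now = time.time() if now is None else now
+            if self.last_heartbeat is not None:
+                self.intervals.append(now - self.last_heartbeat)
+            self.last_heartbeat = now
+
+        def phi(self, now=None):
+            now = time.time() if now is None else now
+            mean = sum(self.intervals) / len(self.intervals)
+            elapsed = now - self.last_heartbeat
+            # Under an exponential model, P(no heartbeat yet) = exp(-elapsed/mean)
+            # and phi is the negative log10 of that probability.
+            return elapsed / (mean * math.log(10))
+
+        def is_up(self, now=None):
+            return self.phi(now) < self.threshold
+
+    detector = NaivePhiDetector()
+    for t in (0.0, 1.0, 2.0, 3.0):           # heartbeats arriving every second
+        detector.heartbeat(now=t)
+    print(detector.is_up(now=4.0))           # True: barely overdue
+    print(detector.is_up(now=60.0))          # False: phi crossed the threshold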
+
+.. note::
+   UP and DOWN state are local node decisions and are not propagated with
+   gossip. Heartbeat state is propagated with gossip, but nodes will not
+   consider each other as "UP" until they can successfully message each other
+   over an actual network channel.
+
+Cassandra will never remove a node from gossip state without explicit
+instruction from an operator via a decommission operation or a new node
+bootstrapping with a ``replace_address_first_boot`` option. This choice is
+intentional to allow Cassandra nodes to temporarily fail without causing data
+to needlessly re-balance. This also helps to prevent simultaneous range
+movements, where multiple replicas of a token range are moving at the same
+time, which can violate monotonic consistency and can even cause data loss.
+
+Incremental Scale-out on Commodity Hardware
+--------------------------------------------
+
+Cassandra scales out to meet the requirements of growth in data size and
+request rates. Scaling out means adding additional nodes to the ring, and
+every additional node brings linear improvements in compute and storage. In
+contrast, scaling up implies adding more capacity to the existing database
+nodes. Cassandra is also capable of scaling up, and in certain environments it
+may be preferable depending on the deployment. Cassandra gives operators the
+flexibility to choose either scale-out or scale-up.
+
+One key aspect of Dynamo that Cassandra follows is to attempt to run on
+commodity hardware, and many engineering choices are made under this
+assumption. For example, Cassandra assumes nodes can fail at any time,
+auto-tunes to make the best use of CPU and memory resources available and makes
+heavy use of advanced compression and caching techniques to get the most
+storage out of limited memory and storage capabilities.
+
+Simple Query Model
+^^^^^^^^^^^^^^^^^^
+
+Cassandra, like Dynamo, chooses not to provide cross-partition transactions
+that are common in SQL Relational Database Management Systems (RDBMS). This
+both gives the programmer a simpler read and write API, and allows Cassandra to
+more easily scale horizontally since multi-partition transactions spanning
+multiple nodes are notoriously difficult to implement and typically incur
+high latency.
+
+Instead, Cassandra chooses to offer fast, consistent latency at any scale for
+single-partition operations, allowing retrieval of entire partitions or only
+subsets of partitions based on primary key filters. Furthermore, Cassandra does
+support single partition compare and swap functionality via the lightweight
+transaction CQL API.
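+
+For example, a hypothetical compare-and-swap with the Python driver (the
+contact point, keyspace, and ``users`` table are illustrative only)::
+
+    from cassandra.cluster import Cluster
+
+    session = Cluster(["127.0.0.1"]).connect("my_keyspace")
+
+    # Only one of two concurrent registrations for the same id can succeed.
+    result = session.execute(
+        "INSERT INTO users (id, email) VALUES (%s, %s) IF NOT EXISTS",
+        (42, "alice@example.com"))
+    print(result.was_applied)   # False if the row already existed
+
+    # Conditional update: only applied if the current value still matches.
+    session.execute(
+        "UPDATE users SET email = %s WHERE id = %s IF email = %s",
+        ("alice@new.example.com", 42, "alice@example.com"))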
+
+Simple Interface for Storing Records
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Cassandra, in a slight departure from Dynamo, chooses a storage interface that
+is more sophisticated than "simple key value" stores but significantly less
+complex than SQL relational data models.  Cassandra presents a wide-column
+store interface, where partitions of data contain multiple rows, each of which
+contains a flexible set of individually typed columns. Every row is uniquely
+identified by the partition key and one or more clustering keys, and every row
+can have as many columns as needed.
+
+This allows users to flexibly add new columns to existing datasets as new
+requirements surface. Schema changes involve only metadata changes and run
+fully concurrently with live workloads. Therefore, users can safely add columns
+to existing Cassandra databases while remaining confident that query
+performance will not degrade.
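+
+A minimal sketch of such a schema change with the Python driver (the keyspace
+and table are hypothetical); the new column is usable as soon as the ``ALTER``
+completes::
+
+    from cassandra.cluster import Cluster
+
+    session = Cluster(["127.0.0.1"]).connect("my_keyspace")
+
+    session.execute("""
+        CREATE TABLE IF NOT EXISTS sensor_readings (
+            sensor_id uuid,          -- partition key
+            reading_time timestamp,  -- clustering key
+            value double,
+            PRIMARY KEY ((sensor_id), reading_time)
+        )
+    """)
+
+    # Metadata-only change; live reads and writes continue while it runs.
+    session.execute("ALTER TABLE sensor_readings ADD unit text")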
diff --git a/doc/source/architecture/images/ring.svg b/doc/source/architecture/images/ring.svg
new file mode 100644
index 0000000..d0db8c5
--- /dev/null
+++ b/doc/source/architecture/images/ring.svg
@@ -0,0 +1,11 @@
+<svg xmlns="http://www.w3.org/2000/svg" width="651" height="709.4583740234375" style="
+        width:651px;
+        height:709.4583740234375px;
+        background: transparent;
+        fill: none;
+">
+        
+        
+        <svg xmlns="http://www.w3.org/2000/svg" class="role-diagram-draw-area"><g class="shapes-region" style="stroke: black; fill: none;"><g class="composite-shape"><path class="real" d=" M223.5,655 C223.5,634.84 239.84,618.5 260,618.5 C280.16,618.5 296.5,634.84 296.5,655 C296.5,675.16 280.16,691.5 260,691.5 C239.84,691.5 223.5,675.16 223.5,655 Z" style="stroke-width: 1; stroke: rgb(103, 148, 135); fill: rgb(103, 148, 135);"/></g><g class="composite-shape"><path class="real" d=" M229.26 [...]
+        <svg xmlns="http://www.w3.org/2000/svg" width="649" height="707.4583740234375" style="width:649px;height:707.4583740234375px;font-family:Asana-Math, Asana;background:transparent;"><g><g><g><g><g><g style="transform:matrix(1,0,0,1,12.171875,40.31333587646485);"><path d="M175 386L316 386L316 444L175 444L175 571L106 571L106 444L19 444L19 386L103 386L103 119C103 59 117 -11 186 -11C256 -11 307 14 332 27L316 86C290 65 258 53 226 53C189 53 175 83 175 136ZM829 220C829 354 729 461 610 461 [...]
+</svg>
diff --git a/doc/source/architecture/images/vnodes.svg b/doc/source/architecture/images/vnodes.svg
new file mode 100644
index 0000000..71b4fa2
--- /dev/null
+++ b/doc/source/architecture/images/vnodes.svg
@@ -0,0 +1,11 @@
+<svg xmlns="http://www.w3.org/2000/svg" width="651" height="384.66668701171875" style="
+        width:651px;
+        height:384.66668701171875px;
+        background: transparent;
+        fill: none;
+">
+        
+        
+        <svg xmlns="http://www.w3.org/2000/svg" class="role-diagram-draw-area"><g class="shapes-region" style="stroke: black; fill: none;"><g class="composite-shape"><path class="real" d=" M40.4,190 C40.4,107.38 107.38,40.4 190,40.4 C272.62,40.4 339.6,107.38 339.6,190 C339.6,272.62 272.62,339.6 190,339.6 C107.38,339.6 40.4,272.62 40.4,190 Z" style="stroke-width: 1; stroke: rgba(0, 0, 0, 0.52); fill: none; stroke-dasharray: 1.125, 3.35;"/></g><g class="composite-shape"><path class="real"  [...]
+        <svg xmlns="http://www.w3.org/2000/svg" width="649" height="382.66668701171875" style="width:649px;height:382.66668701171875px;font-family:Asana-Math, Asana;background:transparent;"><g><g><g><g><g><g style="transform:matrix(1,0,0,1,178.65625,348.9985620117188);"><path d="M125 390L69 107C68 99 56 61 56 31C56 6 67 -9 86 -9C121 -9 156 11 234 74L265 99L255 117L210 86C181 66 161 56 150 56C141 56 136 64 136 76C136 102 150 183 179 328L192 390L299 390L310 440C272 436 238 434 200 434C216  [...]
+</svg>
diff --git a/doc/source/architecture/overview.rst b/doc/source/architecture/overview.rst
index eba8668..e5fcbe3 100644
--- a/doc/source/architecture/overview.rst
+++ b/doc/source/architecture/overview.rst
@@ -17,20 +17,98 @@
 .. _overview:
 
 Overview
-==========
+========
 
+Apache Cassandra is an open source, distributed, NoSQL database. It presents
+a partitioned wide column storage model with eventually consistent semantics.
 
-Apache Cassandra is an open source, distributed, NoSQL database based on the key/value data storage model. Cassandra is based on a query-driven data model.  Data is stored in tables in rows and columns.  Cassandra is designed to meet the requirements of web-scale storage in which reliability, scalability and high availability have the highest priority. Apache Cassandra is eventually consistent, which implies that updates are eventually received by all replicas. Cassandra has provision to [...]
+Apache Cassandra was initially designed at `Facebook
+<https://www.cs.cornell.edu/projects/ladis2009/papers/lakshman-ladis2009.pdf>`_
+using a staged event-driven architecture (`SEDA
+<http://www.sosp.org/2001/papers/welsh.pdf>`_) to implement a combination of
+Amazon’s `Dynamo
+<http://courses.cse.tamu.edu/caverlee/csce438/readings/dynamo-paper.pdf>`_
+distributed storage and replication techniques combined with Google's `Bigtable
+<https://static.googleusercontent.com/media/research.google.com/en//archive/bigtable-osdi06.pdf>`_
+data and storage engine model. Dynamo and Bigtable were both developed to meet
+emerging requirements for scalable, reliable and highly available storage
+systems, but each had areas that could be improved.
 
-Cassandra supports lightweight transactions with compare and set but does not support ACID (Atomicity, Consistency, Isolation, Durability) transaction properties. Cassandra does not support joins, foreign keys and has no concept of referential integrity.
+Cassandra was designed as a best-in-class combination of both systems to meet
+emerging large-scale storage requirements, both in data footprint and in query
+volume. As applications began to require full global replication and
+always available low-latency reads and writes, it became imperative to design a
+new kind of database model as the relational database systems of the time
+struggled to meet the new requirements of global scale applications.
 
-Apache Cassandra data definition supports keyspace as a top-level database object that contains tables  and other database objects such as materialized views, user-defined types, functions and aggregates. Replication is managed at the keyspace level at a given data center.  Data definition language (DDL) is provided for keyspaces and tables.
+Systems like Cassandra are designed for these challenges and seek the
+following design objectives:
 
-Cassandra supports the Cassandra Query Language (CQL), an SQL-like language that provides data definition syntax for all the database objects including keyspaces, tables, materialized views, indexes and snapshots. CQL statements are run from the cqlsh interface.
+- Full multi-master database replication
+- Global availability at low latency
+- Scaling out on commodity hardware
+- Linear throughput increase with each additional processor
+- Online load balancing and cluster growth
+- Partitioned key-oriented queries
+- Flexible schema
 
-Apache Cassandra configuration settings are configured in cassandra.yaml file that may be edited in a vi editor. Cassandra cluster must be restarted after modifying ``cassandra.yaml``.
+Features
+--------
 
-Cassandra supports backups using snapshots. Automatic Incremental backups may also be used.
-Apache Cassandra 4.0 has added several new features including virtual tables. transient replication, audit logging, full query logging, and support for Java 11. Two of these features are experimental: transient replication and Java 11 support.
+Cassandra provides the Cassandra Query Language (CQL), an SQL-like language,
+to create and update database schema and access data. CQL allows users to
+organize data within a cluster of Cassandra nodes using:
 
-Cassandra provides tools for managing a cluster. The Nodetool is used to manage the cluster as a whole. The ``nodetool`` command line options override the configuration settings in ``cassandra.yaml``.  The ``auditlogviewer`` is used to view the audit logs. The  ``fqltool`` is used to view, replay and compare full query logs.  The ``auditlogviewer`` and ``fqltool`` are new tools in Apache Cassandra 4.0.
+- **Keyspace**: defines how a dataset is replicated, for example in which
+  datacenters and how many copies. Keyspaces contain tables.
+- **Table**: defines the typed schema for a collection of partitions. Cassandra
+  tables allow new columns to be added flexibly, with zero downtime.
+  Tables contain partitions, which contain rows, which contain columns.
+- **Partition**: defines the mandatory part of the primary key all rows in
+  Cassandra must have. All performant queries supply the partition key in
+  the query.
+- **Row**: contains a collection of columns identified by a unique primary key
+  made up of the partition key and optionally additional clustering keys.
+- **Column**: A single typed datum that belongs to a row.
+
+CQL supports numerous advanced features over a partitioned dataset, such as
+the following (a short example follows this list):
+
+- Single partition lightweight transactions with atomic compare and set
+  semantics.
+- User-defined types, functions and aggregates
+- Collection types including sets, maps, and lists.
+- Local secondary indices
+- (Experimental) materialized views
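+
+The following hypothetical keyspace and table (all names are illustrative)
+show how keyspaces, tables, partition keys, clustering keys, columns, and
+collection types fit together, here issued through the Python driver::
+
+    from cassandra.cluster import Cluster
+
+    session = Cluster(["127.0.0.1"]).connect()
+
+    session.execute("""
+        CREATE KEYSPACE IF NOT EXISTS shopping
+        WITH replication = {'class': 'NetworkTopologyStrategy', 'DC1': 3}
+    """)
+    session.execute("""
+        CREATE TABLE IF NOT EXISTS shopping.carts (
+            user_id   uuid,                  -- partition key
+            added_at  timestamp,             -- clustering key
+            item      text,
+            tags      set<text>,             -- collection column
+            PRIMARY KEY ((user_id), added_at)
+        )
+    """)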
+
+Cassandra explicitly chooses not to implement operations that require
+cross-partition coordination, as they are typically slow and make it hard to
+provide highly available global semantics. For example, Cassandra does not
+support:
+
+- Cross partition transactions
+- Distributed joins
+- Foreign keys or referential integrity.
+
+Operating
+---------
+
+Apache Cassandra configuration settings are configured in the ``cassandra.yaml``
+file that can be edited by hand or with the aid of configuration management tools.
+Some settings can be manipulated live using an online interface, but others
+require a restart of the database to take effect.
+
+Cassandra provides tools for managing a cluster. The ``nodetool`` command
+interacts with Cassandra's live control interface, allowing runtime manipulation
+of many settings from ``cassandra.yaml``. The ``auditlogviewer`` is used
+to view the audit logs. The ``fqltool`` is used to view, replay and compare
+full query logs.  The ``auditlogviewer`` and ``fqltool`` are new tools in
+Apache Cassandra 4.0.
+
+In addition, Cassandra supports out of the box atomic snapshot functionality,
+which presents a point in time snapshot of Cassandra's data for easy
+integration with many backup tools. Cassandra also supports incremental backups
+where data can be backed up as it is written.
+
+Apache Cassandra 4.0 has added several new features including virtual tables,
+transient replication, audit logging, full query logging, and support for Java
+11. Two of these features are experimental: transient replication and Java 11
+support.
diff --git a/doc/source/operating/hints.rst b/doc/source/operating/hints.rst
index 94ff16f..55c42a4 100644
--- a/doc/source/operating/hints.rst
+++ b/doc/source/operating/hints.rst
@@ -16,6 +16,8 @@
 
 .. highlight:: none
 
+.. _hints:
+
 Hints
 =====
 
@@ -31,11 +33,11 @@ and do not guarantee eventual consistency like :ref:`anti-entropy repair
 
 Hints are useful because of how Apache Cassandra replicates data to provide
 fault tolerance, high availability and durability. Cassandra :ref:`partitions
-data across the cluster <token-range>` using consistent hashing, and then
-replicates keys to multiple nodes along the hash ring. To guarantee
-availability, all replicas of a key can accept mutations without consensus, but
-this means it is possible for some replicas to accept a mutation while others
-do not. When this happens an inconsistency is introduced.
+data across the cluster <consistent-hashing-token-ring>` using consistent
+hashing, and then replicates keys to multiple nodes along the hash ring. To
+guarantee availability, all replicas of a key can accept mutations without
+consensus, but this means it is possible for some replicas to accept a mutation
+while others do not. When this happens an inconsistency is introduced.
 
 Hints are one of the three ways, in addition to read-repair and
 full/incremental anti-entropy repair, that Cassandra implements the eventual
diff --git a/doc/source/operating/read_repair.rst b/doc/source/operating/read_repair.rst
index 86872af..d280162 100644
--- a/doc/source/operating/read_repair.rst
+++ b/doc/source/operating/read_repair.rst
@@ -16,7 +16,7 @@
 
 .. highlight:: none
 
-.. read_repair:
+.. _read-repair:
 
 Read repair
 ==============
diff --git a/doc/source/operating/snitch.rst b/doc/source/operating/snitch.rst
index 5f6760a..85bd9e8 100644
--- a/doc/source/operating/snitch.rst
+++ b/doc/source/operating/snitch.rst
@@ -16,6 +16,8 @@
 
 .. highlight:: none
 
+.. _snitch:
+
 Snitch
 ------
 


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@cassandra.apache.org
For additional commands, e-mail: commits-help@cassandra.apache.org