Posted to commits@cassandra.apache.org by ru...@apache.org on 2020/02/14 19:22:56 UTC

[cassandra] branch trunk updated: Improved read repair documentation.

This is an automated email from the ASF dual-hosted git repository.

rustyrazorblade pushed a commit to branch trunk
in repository https://gitbox.apache.org/repos/asf/cassandra.git


The following commit(s) were added to refs/heads/trunk by this push:
     new ff6aa65  Improved read repair documentation.
ff6aa65 is described below

commit ff6aa659571102b0a27172a482146f11da7cba31
Author: dvohra <dv...@yahoo.com>
AuthorDate: Mon Jan 6 18:19:39 2020 -0800

    Improved read repair documentation.
    
    Patch by Deepak Vohra; Reviewed by Jon Haddad for CASSANDRA-15485
---
 CHANGES.txt                                   |   1 +
 doc/source/new/Figure_1.jpg                   | Bin 0 -> 27827 bytes
 doc/source/new/Figure_2.jpg                   | Bin 0 -> 36650 bytes
 doc/source/operating/Figure_1_read_repair.jpg | Bin 0 -> 36919 bytes
 doc/source/operating/Figure_2_read_repair.jpg | Bin 0 -> 45595 bytes
 doc/source/operating/Figure_3_read_repair.jpg | Bin 0 -> 43021 bytes
 doc/source/operating/Figure_4_read_repair.jpg | Bin 0 -> 43021 bytes
 doc/source/operating/Figure_5_read_repair.jpg | Bin 0 -> 42560 bytes
 doc/source/operating/Figure_6_read_repair.jpg | Bin 0 -> 57489 bytes
 doc/source/operating/read_repair.rst          | 151 +++++++++++++++++++++++++-
 doc/source/operating/repair.rst               | 101 +++++++++++++++++
 11 files changed, 251 insertions(+), 2 deletions(-)

diff --git a/CHANGES.txt b/CHANGES.txt
index 8762a02..c4481ae 100644
--- a/CHANGES.txt
+++ b/CHANGES.txt
@@ -1,4 +1,5 @@
 4.0-alpha4
+ * Added documentation for read repair and an example of full repair (CASSANDRA-15485)
  * Make cqlsh and cqlshlib Python 2 & 3 compatible (CASSANDRA-10190)
  * Added documentation for Full Query Logging (CASSANDRA-15475)
  * Added documentation for backups (CASSANDRA-15479)
diff --git a/doc/source/new/Figure_1.jpg b/doc/source/new/Figure_1.jpg
new file mode 100644
index 0000000..ccaec67
Binary files /dev/null and b/doc/source/new/Figure_1.jpg differ
diff --git a/doc/source/new/Figure_2.jpg b/doc/source/new/Figure_2.jpg
new file mode 100644
index 0000000..099e15f
Binary files /dev/null and b/doc/source/new/Figure_2.jpg differ
diff --git a/doc/source/operating/Figure_1_read_repair.jpg b/doc/source/operating/Figure_1_read_repair.jpg
new file mode 100644
index 0000000..d771550
Binary files /dev/null and b/doc/source/operating/Figure_1_read_repair.jpg differ
diff --git a/doc/source/operating/Figure_2_read_repair.jpg b/doc/source/operating/Figure_2_read_repair.jpg
new file mode 100644
index 0000000..29a912b
Binary files /dev/null and b/doc/source/operating/Figure_2_read_repair.jpg differ
diff --git a/doc/source/operating/Figure_3_read_repair.jpg b/doc/source/operating/Figure_3_read_repair.jpg
new file mode 100644
index 0000000..f5cc189
Binary files /dev/null and b/doc/source/operating/Figure_3_read_repair.jpg differ
diff --git a/doc/source/operating/Figure_4_read_repair.jpg b/doc/source/operating/Figure_4_read_repair.jpg
new file mode 100644
index 0000000..25bdb34
Binary files /dev/null and b/doc/source/operating/Figure_4_read_repair.jpg differ
diff --git a/doc/source/operating/Figure_5_read_repair.jpg b/doc/source/operating/Figure_5_read_repair.jpg
new file mode 100644
index 0000000..d9c0485
Binary files /dev/null and b/doc/source/operating/Figure_5_read_repair.jpg differ
diff --git a/doc/source/operating/Figure_6_read_repair.jpg b/doc/source/operating/Figure_6_read_repair.jpg
new file mode 100644
index 0000000..6bb4d1e
Binary files /dev/null and b/doc/source/operating/Figure_6_read_repair.jpg differ
diff --git a/doc/source/operating/read_repair.rst b/doc/source/operating/read_repair.rst
index 0e52bf5..86872af 100644
--- a/doc/source/operating/read_repair.rst
+++ b/doc/source/operating/read_repair.rst
@@ -16,7 +16,154 @@
 
 .. highlight:: none
 
+.. _read_repair:
+
 Read repair
------------
+==============
+Read Repair is the process of repairing data replicas during a read request. If all replicas involved in a read request at the given read consistency level are consistent, the data is returned to the client and no read repair is needed. But if the replicas involved in a read request at the given consistency level are not consistent, a read repair is performed to make them consistent. The most up-to-date data is returned to the client. The read repair runs in the foreground and is blocking, in that a read request does not return until the read repair has completed.
+
+Expectation of Monotonic Quorum Reads
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+Cassandra uses a blocking read repair to provide the guarantee of "monotonic quorum reads": given two successive quorum reads, the second is guaranteed not to return a value older than the one returned by the first, even if a failed quorum write left the most up-to-date value on only a minority of replicas. "Quorum" means a majority of the nodes holding replicas.
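+
+A brief cqlsh sketch of the guarantee (the table name and rows are illustrative assumptions, using the ``ks.tbl`` schema shown later in this document):
+
+::
+
+ cqlsh> CONSISTENCY QUORUM;
+ Consistency level set to QUORUM.
+ cqlsh> SELECT v FROM ks.tbl WHERE k = 0 AND c = 0;  -- returns some value
+ cqlsh> SELECT v FROM ks.tbl WHERE k = 0 AND c = 0;  -- never returns a value older than the first read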
+
+Table level configuration of monotonic reads
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+Cassandra 4.0 adds support for table level configuration of monotonic reads (`CASSANDRA-14635
+<https://issues.apache.org/jira/browse/CASSANDRA-14635>`_). The ``read_repair`` table option has been added to the table schema, with the options ``blocking`` (default) and ``none``.
+
+The ``read_repair`` option configures the read repair behavior to allow tuning for various performance and consistency behaviors. Two consistency properties are affected by read repair behavior.
+
+- Monotonic Quorum Reads: Provided by ``BLOCKING``. Monotonic quorum reads prevent reads from appearing to go back in time in some circumstances. When monotonic quorum reads are not provided and a write fails to reach a quorum of replicas, the write may be visible in one read and then disappear in a subsequent read.
+- Write Atomicity: Provided by ``NONE``. Write atomicity prevents reads from returning partially applied writes. Cassandra attempts to provide partition level write atomicity, but since only the data covered by a ``SELECT`` statement is repaired by a read repair, read repair can break write atomicity when data is read at a more granular level than it is written. For example, read repair can break write atomicity if you write multiple rows to a clustered partition in a batch, but then select a single row by specifying the clustering column in a ``SELECT`` statement. A sketch of this caveat follows this list.
+
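+A hypothetical sketch of the write atomicity caveat noted in the second bullet above (table and values are illustrative assumptions):
+
+::
+
+ BEGIN BATCH
+   INSERT INTO ks.tbl (k, c, v) VALUES (0, 0, 0);
+   INSERT INTO ks.tbl (k, c, v) VALUES (0, 1, 1);
+ APPLY BATCH;
+
+ -- Reading a single row repairs only that row, so a concurrent
+ -- reader may observe one row of the batch without the other.
+ SELECT v FROM ks.tbl WHERE k = 0 AND c = 1;
+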
+The available read repair settings are:
+
+Blocking
+*********
+The default setting. When ``read_repair`` is set to ``BLOCKING`` and a read repair is started, the read blocks on the writes sent to the other replicas until enough of those writes succeed to meet the consistency level. Provides monotonic quorum reads, but not partition level write atomicity.
+
+None
+*********
+When ``read_repair`` is set to ``NONE``, the coordinator will reconcile any differences between replicas, but will not attempt to repair them. Provides partition level write atomicity, but not monotonic quorum reads.
+
+An example of using the ``NONE`` setting for the ``read_repair`` option is as follows:
+
+::
+
+ CREATE TABLE ks.tbl (k INT, c INT, v INT, PRIMARY KEY (k, c)) WITH read_repair = 'NONE';
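+
+The option can also be changed on an existing table with ``ALTER TABLE``; a minimal sketch:
+
+::
+
+ ALTER TABLE ks.tbl WITH read_repair = 'BLOCKING';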
+
+Read Repair Example
+^^^^^^^^^^^^^^^^^^^^^^^^^^
+To illustrate read repair with an example, consider that a client sends a read request with read consistency level ``TWO`` to a 5-node cluster as illustrated in Figure 1. Read consistency level determines how many replica nodes must return a response before the read request is considered successful.
+
+
+.. figure:: Figure_1_read_repair.jpg
+
+
+Figure 1. Client sends read request to a 5-node Cluster
+
+Three nodes host replicas for the requested data as illustrated in Figure 2. With a read consistency level of ``TWO``, two replica nodes must return a response for the read request to be considered successful. If the node the client sends the request to hosts a replica of the requested data, only one other replica node needs to be sent a read request. But if the receiving node does not host a replica for the requested data, the receiving node becomes a coordinator node and forwards the read request to the fastest replica node, as determined by the dynamic snitch. This first read request is a direct read request: a full read that returns the requested data.
+
+.. figure:: Figure_2_read_repair.jpg
+
+Figure 2. Direct Read Request sent to Fastest Replica Node
+
+Next, the coordinator node sends the requisite number of additional requests to satisfy the consistency level, which is ``TWO``. The coordinator node needs to send one more read request, for a total of two. All read requests additional to the first direct read request are digest read requests. A digest read request is not a full read and only returns the hash value of the data; only a hash value is returned to reduce network traffic. In the example being discussed, the coordinator node sends one digest read request to a node hosting a replica, as illustrated in Figure 3.
+
+.. figure:: Figure_3_read_repair.jpg
+
+Figure 3. Coordinator Sends a Digest Read Request
+
+The coordinator node has received a full copy of data from one node and a hash value for the data from another node. To compare the two responses, a hash value is calculated for the full copy of data and the two hash values are compared. If the hash values are the same, no read repair is needed and the full copy of the requested data is returned to the client. The coordinator node performed a total of only two replica read requests because the read consistency level is ``TWO`` in the example. If the read consistency level were higher, such as ``THREE``, additional digest read requests would be sent, and all of the hash values would have to match the hash value of the full copy for no read repair to be needed.
+
+But if a hash value from a digest read request does not match the hash value of the data from the full read request of the first replica node, the replicas are inconsistent. To fix the inconsistency a read repair is performed.
+
+For example, consider that the digest read request returns a hash value that is not the same as the hash value of the data from the direct full read request. To make the replicas consistent, the coordinator node sends a direct (full) read request to the replica node it sent a digest read request to earlier, as illustrated in Figure 4.
+
+.. figure:: Figure_4_read_repair.jpg
+
+Figure 4. Coordinator sends Direct Read Request to Replica Node it had sent Digest Read Request to
+
+After receiving the data from the second replica node, the coordinator has data from two of the replica nodes. It only needs two replicas because the read consistency level is ``TWO`` in the example. Data from the two replicas is compared and, based on the timestamps, the most recent replica is selected. Data may need to be merged to construct an up-to-date copy of data if one replica has data for only some of the columns. In the example, if the data from the first direct read request is found to be outdated and the data from the second read request to be the latest, a read repair is performed: the coordinator sends a write of the most recent data to the replica with the outdated data, as illustrated in Figure 5.
+
+.. figure:: Figure_5_read_repair.jpg
+
+Figure 5. Coordinator performs Read Repair
+
+
+The most up-to-date data is returned to the client as illustrated in Figure 6. Of the three replicas, Replica 1 is not read at all and thus not repaired. Replica 2 is repaired. Replica 3 is the most up-to-date and its data is returned to the client.
+
+.. figure:: Figure_6_read_repair.jpg
+
+Figure 6. Most up-to-date Data returned to Client
+
+Read Consistency Level and Read Repair
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+The read consistency level is the most significant factor in determining whether a read repair needs to be performed. As shown in Table 1, a read repair is not needed at all of the consistency levels.
+
+Table 1. Read Repair based on Read Consistency Level
+
++----------------------+-------------------------------------------+
+|Read Consistency Level| Description                               |
++----------------------+-------------------------------------------+
+| ONE                  |Read repair is not performed as the        |
+|                      |data from the first direct read request    |
+|                      |satisfies the consistency level ONE.       |
+|                      |No digest read requests are involved       |
+|                      |in finding mismatches in the data.         |
++----------------------+-------------------------------------------+
+| TWO                  |Read repair is performed if inconsistencies|
+|                      |in data are found as determined by the     |
+|                      |direct and digest read requests.           |
++----------------------+-------------------------------------------+
+| THREE                |Read repair is performed if inconsistencies|
+|                      |in data are found as determined by the     |
+|                      |direct and digest read requests.           |
++----------------------+-------------------------------------------+
+|LOCAL_ONE             |Read repair is not performed as the data   |
+|                      |from the direct read request from the      |
+|                      |closest replica satisfies the consistency  |
+|                      |level LOCAL_ONE. No digest read requests   |
+|                      |are involved in finding mismatches in data.|
++----------------------+-------------------------------------------+
+|LOCAL_QUORUM          |Read repair is performed if inconsistencies|
+|                      |in data are found as determined by the     |
+|                      |direct and digest read requests.           |
++----------------------+-------------------------------------------+
+|QUORUM                |Read repair is performed if inconsistencies|
+|                      |in data are found as determined by the     |
+|                      |direct and digest read requests.           |
++----------------------+-------------------------------------------+
+
+If a read repair is performed, it is made only to the replicas that are not up-to-date and that are involved in the read request. The number of replicas involved in a read request is determined by the read consistency level; in the example it is two.
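+
+Whether digest reads were sent, and whether a digest mismatch triggered a read repair, can be observed for an individual request with tracing in ``cqlsh``; a brief sketch (``ks.tbl`` as before; the trace lists the direct read, any digest reads, and any mismatch):
+
+::
+
+ cqlsh> CONSISTENCY TWO;
+ cqlsh> TRACING ON;
+ cqlsh> SELECT * FROM ks.tbl WHERE k = 0;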
+
+Improved Read Repair Blocking Behavior in Cassandra 4.0
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Cassandra 4.0 makes two improvements to read repair blocking behavior (`CASSANDRA-10726
+<https://issues.apache.org/jira/browse/CASSANDRA-10726>`_).
+
+1. Speculative Retry of Full Data Read Requests. Cassandra 4.0 makes use of speculative retry in sending read requests (full, not digest) to replicas if a full data response is not received, whether in the initial full read request or in a full data read request during read repair. With speculative retry, if it looks like a response may not be received from the initial set of replicas Cassandra sent messages to in order to satisfy the consistency level, it speculatively sends an additional read request to a different replica (a configuration sketch follows this list).
+
+2. Only blocks on Full Data Responses to satisfy the Consistency Level. Cassandra 4.0 only blocks on what is needed to resolve the digest mismatch, waiting for enough full data responses to meet the consistency level, whether the full data reads resulted from speculative retry or from read repair. For example, if it looks like Cassandra might not receive full data responses in time from the replicas initially contacted, it sends additional full read requests to replicas not contacted in the initial full data read. If the set of nodes that ends up responding in time is different from the set contacted by the original full data read, the read still returns successfully.
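+
+The speculation threshold is governed by the existing ``speculative_retry`` table option; a minimal configuration sketch (``ks.tbl`` as before):
+
+::
+
+ ALTER TABLE ks.tbl WITH speculative_retry = '99PERCENTILE';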
+
+Diagnostic Events for Read Repairs
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Cassandra 4.0 adds diagnostic events for read repair (`CASSANDRA-14668
+<https://issues.apache.org/jira/browse/CASSANDRA-14668>`_) that can be used for exposing information such as:
+
+- Contacted endpoints
+- Digest responses by endpoint
+- Affected partition keys
+- Speculated reads / writes
+- Update oversized
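+
+Diagnostic events are disabled by default; a sketch of enabling them via the ``diagnostic_events_enabled`` setting in ``cassandra.yaml`` (setting available as of Cassandra 4.0):
+
+::
+
+ # cassandra.yaml
+ diagnostic_events_enabled: true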
+
+Background Read Repair
+^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Background read repair, which was configured using the ``read_repair_chance`` and ``dclocal_read_repair_chance`` table options, is removed in Cassandra 4.0 (`CASSANDRA-13910
+<https://issues.apache.org/jira/browse/CASSANDRA-13910>`_).
 
-.. todo:: todo
+Read repair is not an alternative to other kinds of repair, such as full repair, or to replacing a node that keeps failing. Even after a read repair has been performed, the data returned may not be the most up-to-date unless the consistency level requires a response from all replicas.
diff --git a/doc/source/operating/repair.rst b/doc/source/operating/repair.rst
index 97115dc..94fdc11 100644
--- a/doc/source/operating/repair.rst
+++ b/doc/source/operating/repair.rst
@@ -105,3 +105,104 @@ Other Options
 
 .. seealso::
     :ref:`nodetool repair docs <nodetool_repair>`
+
+Full Repair Example
+^^^^^^^^^^^^^^^^^^^^
+Full repair is typically needed to redistribute data after increasing the replication factor of a keyspace or after adding a node to the cluster. Full repair involves streaming SSTables. To demonstrate a full repair, start with a three node cluster.
+
+::
+
+ [ec2-user@ip-10-0-2-238 ~]$ nodetool status
+ Datacenter: us-east-1
+ =====================
+ Status=Up/Down
+ |/ State=Normal/Leaving/Joining/Moving
+ --  Address     Load        Tokens  Owns  Host ID                               Rack
+ UN  10.0.1.115  547 KiB     256     ?     b64cb32a-b32a-46b4-9eeb-e123fa8fc287  us-east-1b
+ UN  10.0.3.206  617.91 KiB  256     ?     74863177-684b-45f4-99f7-d1006625dc9e  us-east-1d
+ UN  10.0.2.238  670.26 KiB  256     ?     4dcdadd2-41f9-4f34-9892-1f20868b27c7  us-east-1c
+
+Create a keyspace with replication factor 3:
+
+::
+
+ cqlsh> DROP KEYSPACE cqlkeyspace;
+ cqlsh> CREATE KEYSPACE CQLKeyspace
+   ... WITH replication = {'class': 'SimpleStrategy', 'replication_factor' : 3};
+
+Add a table to the keyspace:
+
+::
+
+ cqlsh> use cqlkeyspace;
+ cqlsh:cqlkeyspace> CREATE TABLE t (
+            ...   id int,
+            ...   k int,
+            ...   v text,
+            ...   PRIMARY KEY (id)
+            ... );
+
+Add table data:
+
+::
+
+ cqlsh:cqlkeyspace> INSERT INTO t (id, k, v) VALUES (0, 0, 'val0');
+ cqlsh:cqlkeyspace> INSERT INTO t (id, k, v) VALUES (1, 1, 'val1');
+ cqlsh:cqlkeyspace> INSERT INTO t (id, k, v) VALUES (2, 2, 'val2');
+
+A query lists the data added:
+
+::
+
+ cqlsh:cqlkeyspace> SELECT * FROM t;
+
+ id | k | v
+ ----+---+------
+  1 | 1 | val1
+  0 | 0 | val0
+  2 | 2 | val2
+ (3 rows)
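+
+The replica nodes holding a given partition key can be listed with ``nodetool getendpoints``; a brief sketch for key ``1`` (which nodes own the key is illustrative):
+
+::
+
+ [ec2-user@ip-10-0-2-238 ~]$ nodetool getendpoints cqlkeyspace t 1
+ 10.0.1.115
+ 10.0.3.206
+ 10.0.2.238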
+
+Make the following changes to a three node cluster:
+
+1. Increase the replication factor from 3 to 4.
+2. Add a 4th node to the cluster.
+
+When the replication factor is increased, the following warning is output indicating that a full repair is needed, as per (`CASSANDRA-13079
+<https://issues.apache.org/jira/browse/CASSANDRA-13079>`_):
+
+::
+
+ cqlsh:cqlkeyspace> ALTER KEYSPACE CQLKeyspace
+            ... WITH replication = {'class': 'SimpleStrategy', 'replication_factor' : 4};
+ Warnings :
+ When increasing replication factor you need to run a full (-full) repair to distribute the
+ data.
+
+Perform a full repair on the keyspace ``cqlkeyspace`` table ``t`` with the following command:
+
+::
+
+ nodetool repair -full cqlkeyspace t
+
+Full repair completes in about a second as indicated by the output:
+
+::
+
+ [ec2-user@ip-10-0-2-238 ~]$ nodetool repair -full cqlkeyspace t
+ [2019-08-17 03:06:21,445] Starting repair command #1 (fd576da0-c09b-11e9-b00c-1520e8c38f00), repairing keyspace cqlkeyspace with repair options (parallelism: parallel, primary range: false, incremental: false, job threads: 1, ColumnFamilies: [t], dataCenters: [], hosts: [], previewKind: NONE, # of ranges: 1024, pull repair: false, force repair: false, optimise streams: false)
+ [2019-08-17 03:06:23,059] Repair session fd8e5c20-c09b-11e9-b00c-1520e8c38f00 for range [(-8792657144775336505,-8786320730900698730], (-5454146041421260303,-5439402053041523135], (4288357893651763201,4324309707046452322], ... , (4350676211955643098,4351706629422088296]] finished (progress: 0%)
+ [2019-08-17 03:06:23,077] Repair completed successfully
+ [2019-08-17 03:06:23,077] Repair command #1 finished in 1 second
+ [ec2-user@ip-10-0-2-238 ~]$
+
+The ``nodetool tpstats`` command should list the completed repair; the ``Repair-Task`` row's ``Completed`` column shows a value of 1:
+
+::
+
+ [ec2-user@ip-10-0-2-238 ~]$ nodetool tpstats
+ Pool Name              Active  Pending  Completed  Blocked  All time blocked
+ ReadStage                   0        0         99        0                 0
+ …
+ Repair-Task                 0        0          1        0                 0
+ RequestResponseStage        0        0       2078        0                 0

