You are viewing a plain text version of this content. The canonical link for it is here.
Posted to pr@cassandra.apache.org by GitBox <gi...@apache.org> on 2022/05/11 21:13:14 UTC
[GitHub] [cassandra] maedhroz commented on a diff in pull request #1294: CASSANDRA-6936: Byte-comparable API

maedhroz commented on code in PR #1294:
URL: https://github.com/apache/cassandra/pull/1294#discussion_r870759638


##########
src/java/org/apache/cassandra/utils/bytecomparable/ByteComparable.md:
##########
@@ -0,0 +1,689 @@
+<!---
+ Licensed to the Apache Software Foundation (ASF) under one
+ or more contributor license agreements.  See the NOTICE file
+ distributed with this work for additional information
+ regarding copyright ownership.  The ASF licenses this file
+ to you under the Apache License, Version 2.0 (the
+ "License"); you may not use this file except in compliance
+ with the License.  You may obtain a copy of the License at
+ 
+     http://www.apache.org/licenses/LICENSE-2.0
+ 
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ See the License for the specific language governing permissions and
+ limitations under the License.
+-->
+
+# Byte-comparable translation of types (ByteComparable/ByteSource)
+
+## Problem / Motivation
+
+Cassandra has a very heavy reliance on comparisons — they are used throughout read and write paths, coordination,
+compaction, etc. to be able to order and merge results. It also supports a range of types which often require the 
+compared object to be completely in memory to order correctly, which in turn has necessitated interfaces where 
+comparisons can only be applied if the compared objects are completely loaded.
+
+This has some rather painful implications on the performance of the database, both in terms of the time it takes to load,
+compare and garbage collect, as well as in terms of the space required to hold complete keys in on-disk indices and
+deserialized versions in in-memory data structures. In addition to this, the reliance on comparisons forces Cassandra to
+use only comparison-based structures, which aren’t the most efficient.
+
+There is no way to escape the need to compare and order objects in Cassandra, but the machinery for doing this can be
+done much more smartly if we impose some simple structure in the objects we deal with — byte ordering.
+
+The term “byte order” as used in this document refers to the property of being ordered via lexicographic compare on the
+unsigned values of the byte contents. Some of the types in Cassandra already have this property (e.g. strings, blobs),
+but other most heavily used ones (e.g. integers, uuids) don’t.
+
+When byte order is universally available for the types used for keys, several key advantages can be put to use:
+- Comparisons can be done using a single simple method, core machinery doesn’t need to know anything about types.
+- Prefix differences are enough to define order; unique prefixes can be used instead of complete keys.
+- Tries can be used to store, query and iterate over ranges of keys, providing fast lookup and prefix compression.
+- Merging can be performed by merging tries, significantly reducing the number of necessary comparisons.
+
+## Ordering the types
+
+As we want to keep all existing functionality in Cassandra, we need to be able to deal with existing
+non-byte-order-comparable types. This requires some form of conversion of each value to a sequence of bytes that can be 
+byte-order compared (also called "byte-comparable"), as well as the inverse conversion from byte-comparable to value.
+
+As one of the main advantages of byte order is the ability to decide comparisons early, without having to read the whole
+of the input sequence, byte-ordered interpretations of values are represented as sources of bytes with unknown length, 
+using the interface `ByteSource`. The interface declares one method, `next()` which produces the next byte of the
+stream, or `ByteSource.END_OF_STREAM` if the stream is exhausted.
+
+`END_OF_STREAM` is chosen as `-1` (`(int) -1`, which is outside the range of possible byte values), to make comparing 
+two byte sources as trivial (and thus fast) as possible.
+  
+To be able to completely abstract type information away from the storage machinery, we also flatten complex types into
+single byte sequences. To do this, we add separator bytes in front, between components, and at the end and do some 
+encoding of variable-length sequences.
+
+The other interface provided by this package `ByteComparable`, is an entity whose byte-ordered interpretation can be
+requested. The interface is implemented by `DecoratedKey`, and can be requested for clustering keys and bounds using
+`ClusteringComparator.asByteComparable`. The inverse translation is provided by 
+`Buffer/NativeDecoratedKey.fromByteComparable` and `ClusteringComparator.clustering/bound/boundaryFromByteComparable`.
+
+The (rather technical) paragraphs below detail the encoding we have chosen for the various types. For simplicity we
+only discuss the bidirectional `OSS41` version of the translation. The implementations in code of the various mappings
+are in the releavant `AbstractType` subclass.
+
+### Desired properties
+
+Generally, we desire the following two properties from the byte-ordered translations of values we use in the database:
+- Comparison equivalence (1):  
+    <math xmlns="http://www.w3.org/1998/Math/MathML">
+      <semantics>
+        <mstyle displaystyle="true">
+          <mo>&#x2200;</mo>
+          <mi>x</mi>
+          <mo>,</mo>
+          <mi>y</mi>
+          <mo>&#x2208;</mo>
+          <mi>T</mi>
+          <mo>,</mo>
+          <mrow>
+            <mtext>compareBytesUnsigned</mtext>
+          </mrow>
+          <mrow>
+            <mo>(</mo>
+            <mi>T</mi>
+            <mo>.</mo>
+            <mrow>
+              <mtext>byteOrdered</mtext>
+            </mrow>
+            <mrow>
+              <mo>(</mo>
+              <mi>x</mi>
+              <mo>)</mo>
+            </mrow>
+            <mo>,</mo>
+            <mi>T</mi>
+            <mo>.</mo>
+            <mrow>
+              <mtext>byteOrdered</mtext>
+            </mrow>
+            <mrow>
+              <mo>(</mo>
+              <mi>y</mi>
+              <mo>)</mo>
+            </mrow>
+            <mo>)</mo>
+          </mrow>
+          <mo>=</mo>
+          <mi>T</mi>
+          <mo>.</mo>
+          <mrow>
+            <mtext>compare</mtext>
+          </mrow>
+          <mrow>
+            <mo>(</mo>
+            <mi>x</mi>
+            <mo>,</mo>
+            <mi>y</mi>
+            <mo>)</mo>
+          </mrow>
+        </mstyle>
+        <!-- <annotation encoding="text/x-asciimath">forall x,y in T, "compareBytesUnsigned"(T."byteOrdered"(x), T."byteOrdered"(y))=T."compare"(x, y)</annotation> -->
+      </semantics>
+    </math>
+- Prefix-freedom (2):  
+    <math xmlns="http://www.w3.org/1998/Math/MathML">
+      <semantics>
+        <mstyle displaystyle="true">
+          <mo>&#x2200;</mo>
+          <mi>x</mi>
+          <mo>,</mo>
+          <mi>y</mi>
+          <mo>&#x2208;</mo>
+          <mi>T</mi>
+          <mo>,</mo>
+          <mi>T</mi>
+          <mo>.</mo>
+          <mrow>
+            <mtext>byteOrdered</mtext>
+          </mrow>
+          <mrow>
+            <mo>(</mo>
+            <mi>x</mi>
+            <mo>)</mo>
+          </mrow>
+          <mrow>
+            <mspace width="1ex" />
+            <mtext> is not a prefix of </mtext>
+            <mspace width="1ex" />
+          </mrow>
+          <mi>T</mi>
+          <mo>.</mo>
+          <mrow>
+            <mtext>byteOrdered</mtext>
+          </mrow>
+          <mrow>
+            <mo>(</mo>
+            <mi>y</mi>
+            <mo>)</mo>
+          </mrow>
+        </mstyle>
+        <!-- <annotation encoding="text/x-asciimath">forall x,y in T, T."byteOrdered"(x) " is not a prefix of " T."byteOrdered"(y)</annotation> -->
+      </semantics>
+    </math>
+   
+The former is the essential requirement, and the latter allows construction of encodings of sequences of multiple
+values, as well as a little more efficiency in the data structures.
+
+To more efficiently encode byte-ordered blobs, however, we use a slightly tweaked version of the above requirements:
+
+- Comparison equivalence (3):  
+    <math xmlns="http://www.w3.org/1998/Math/MathML">
+      <semantics>
+        <mstyle displaystyle="true">
+          <mo>&#x2200;</mo>
+          <mi>x</mi>
+          <mo>,</mo>
+          <mi>y</mi>
+          <mo>&#x2208;</mo>
+          <mi>T</mi>
+          <mo>,</mo>
+          <mo>&#x2200;</mo>
+          <msub>
+            <mi>b</mi>
+            <mn>1</mn>
+          </msub>
+          <mo>,</mo>
+          <msub>
+            <mi>b</mi>
+            <mn>2</mn>
+          </msub>
+          <mo>&#x2208;</mo>
+          <mrow>
+            <mo>[</mo>
+            <mn>0x10</mn>
+            <mo>-</mo>
+            <mn>0xEF</mn>
+            <mo>]</mo>
+          </mrow>
+          <mo>,</mo>
+            <mtext><br/></mtext>
+          <mrow>
+            <mtext>compareBytesUnsigned</mtext>
+          </mrow>
+          <mrow>
+            <mo>(</mo>
+            <mi>T</mi>
+            <mo>.</mo>
+            <mrow>
+              <mtext>byteOrdered</mtext>
+            </mrow>
+            <mrow>
+              <mo>(</mo>
+              <mi>x</mi>
+              <mo>)</mo>
+            </mrow>
+            <mo>+</mo>
+            <msub>
+              <mi>b</mi>
+              <mn>1</mn>
+            </msub>
+            <mo>,</mo>
+            <mi>T</mi>
+            <mo>.</mo>
+            <mrow>
+              <mtext>byteOrdered</mtext>
+            </mrow>
+            <mrow>
+              <mo>(</mo>
+              <mi>y</mi>
+              <mo>)</mo>
+            </mrow>
+            <mo>+</mo>
+            <msub>
+              <mi>b</mi>
+              <mn>2</mn>
+            </msub>
+            <mo>)</mo>
+          </mrow>
+          <mo>=</mo>
+          <mi>T</mi>
+          <mo>.</mo>
+          <mrow>
+            <mtext>compare</mtext>
+          </mrow>
+          <mrow>
+            <mo>(</mo>
+            <mi>x</mi>
+            <mo>,</mo>
+            <mi>y</mi>
+            <mo>)</mo>
+          </mrow>
+        </mstyle>
+        <!-- <annotation encoding="text/x-asciimath">forall x,y in T, forall b_1, b_2 in [0x10-0xEF],
+    "compareBytesUnsigned"(T."byteOrdered"(x)+b_1, T."byteOrdered"(y)+b_2)=T."compare"(x, y)</annotation> -->
+      </semantics>
+    </math>
+- Weak prefix-freedom (4):  
+    <math xmlns="http://www.w3.org/1998/Math/MathML">
+      <semantics>
+        <mstyle displaystyle="true">
+          <mo>&#x2200;</mo>
+          <mi>x</mi>
+          <mo>,</mo>
+          <mi>y</mi>
+          <mo>&#x2208;</mo>
+          <mi>T</mi>
+          <mo>,</mo>
+          <mo>&#x2200;</mo>
+          <mi>b</mi>
+          <mo>&#x2208;</mo>
+          <mrow>
+            <mo>[</mo>
+            <mn>0x10</mn>
+            <mo>-</mo>
+            <mn>0xEF</mn>
+            <mo>]</mo>
+          </mrow>
+          <mo>,</mo>
+            <mtext><br/></mtext>
+          <mrow>
+            <mo>(</mo>
+            <mi>T</mi>
+            <mo>.</mo>
+            <mrow>
+              <mtext>byteOrdered</mtext>
+            </mrow>
+            <mrow>
+              <mo>(</mo>
+              <mi>x</mi>
+              <mo>)</mo>
+            </mrow>
+            <mo>+</mo>
+            <mi>b</mi>
+            <mo>)</mo>
+          </mrow>
+          <mrow>
+            <mspace width="1ex" />
+            <mtext> is not a prefix of </mtext>
+            <mspace width="1ex" />
+          </mrow>
+          <mi>T</mi>
+          <mo>.</mo>
+          <mrow>
+            <mtext>byteOrdered</mtext>
+          </mrow>
+          <mrow>
+            <mo>(</mo>
+            <mi>y</mi>
+            <mo>)</mo>
+          </mrow>
+        </mstyle>
+        <!-- <annotation encoding="text/x-asciimath">forall x,y in T, forall b in [0x10-0xEF],
+    (T."byteOrdered"(x)+b) " is not a prefix of " T."byteOrdered"(y)</annotation> -->
+      </semantics>
+    </math>
+
+These versions allow the addition of a separator byte after each value, and guarantee that the combination with 
+separator fulfills the original requirements. (3) is somewhat stronger than (1) but is necessarily true if (2) is also 
+in force, while (4) trivially follows from (2).
+
+## Fixed length unsigned integers (Murmur token, date/time)
+
+This is the trivial case, as we can simply use the input bytes in big-endian order. The comparison result is the same, 
+and fixed length values are trivially prefix free, i.e. (1) and (2) are satisfied, and thus (3) and (4) follow from the
+observation above.
+
+## Fixed-length signed integers (byte, short, int, legacy bigint)
+
+As above, but we need to invert the sign bit of the number to put negative numbers before positives. This maps 
+`MIN_VALUE` to `0x00`..., `-1` to `0x7F…`, `0` to `0x80…`, and `MAX_VALUE` to `0xFF…`; comparing the resulting number 
+as an unsigned integer has the same effect as comparing the source signed.
+
+Examples:
+
+|Type and value|bytes|encodes as|
+|--------------|-----|----------|
+|int 1         |00 00 00 01|             80 00 00 01
+|short -1      |FF FF      |             7F FF
+|byte 0        |00         |             80
+|int MAX_VALUE |7F FF FF FF|             FF FF FF FF
+|long MIN_VALUE|80 00 00 00 00 00 00 00| 00 00 00 00 00 00 00 00
+
+## Variable-length encoding of integers (current bigint)
+
+Another way to encode integers that may save significant amounts of space when smaller numbers are often in use, but
+still permits large values to be efficiently encoded, is to use an encoding scheme similar to UTF-8.
+
+For unsigned numbers this can be done by starting the number with as many 1s in most significant bits as there are 
+additional bytes in the encoding, followed by a 0, and the bits of the number. Numbers between 0 and 127 are encoded
+in one byte, and each additional byte adds 7 more bits. Values that use all 8 bytes do not need a 9th bit of 0 and can
+thus fit 9 bytes. Because longer numbers have more 1s in their MSBs, they compare 
+higher than shorter ones (and we always use the shortest representation). Because the length is specified through these
+initial bits, no value can be a prefix of another.
+
+| Value            | bytes                   |encodes as|
+|------------------|-------------------------|----------|
+| 0                | 00 00 00 00 00 00 00 00 |             00
+| 1                | 00 00 00 00 00 00 00 01 |             01
+| 127 (2^7-1)      | 00 00 00 00 00 00 00 7F |             7F
+| 128 (2^7)        | 00 00 00 00 00 00 00 80 |             80 80
+| 16383 (2^14 - 1) | 00 00 00 00 00 00 3F FF |             BF FF
+| 16384 (2^14)     | 00 00 00 00 00 00 40 00 |             C0 40 00
+| 2^31 - 1         | 00 00 00 00 7F FF FF FF |         F0 7F FF FF FF
+| 2^31             | 00 00 00 00 80 00 00 00 |         F0 80 00 00 00
+| 2^56 - 1         | 00 FF FF FF FF FF FF FF | FE FF FF FF FF FF FF FF
+| 2^56             | 01 00 00 00 00 00 00 00 | FF 01 00 00 00 00 00 00 00
+| 2^64- 1          | FF FF FF FF FF FF FF FF | FF FF FF FF FF FF FF FF FF
+
+
+To encode signed numbers, we must start with the sign bit, and must also ensure that longer negative numbers sort 
+smaller than shorter ones. The first bit of the encoding is the inverted sign (i.e. 1 for positive, 0 for negative),
+followed by the length encoded as a sequence of bits that matches the inverted sign, followed by a bit that differs 
+(like above, not necessary for 9-byte encodings) and the bits of the number's two's complement.
+
+| Value             | bytes                    |encodes as|
+|-------------------|--------------------------|----------|
+| 1                 | 00 00 00 00 00 00 00 01  |             01
+| -1                | FF FF FF FF FF FF FF FF  |             7F
+| 0                 | 00 00 00 00 00 00 00 00  |             80
+| 63                | 00 00 00 00 00 00 00 3F  |             BF
+| -64               | FF FF FF FF FF FF FF C0  |             40
+| 64                | 00 00 00 00 00 00 00 40  |             C0 40
+| -65               | FF FF FF FF FF FF FF BF  |             3F BF
+| 8191              | 00 00 00 00 00 00 1F FF  | DF FF
+| 8192              | 00 00 00 00 00 00 20 00  | E0 20 00
+| Integer.MAX_VALUE | 00 00 00 00 7F FF FF FF  |             F8 7F FF FF FF
+| Long.MIN_VALUE    | 80 00 00 00 00 00 00 00  | 00 00 00 00 00 00 00 00 00
+
+
+## Fixed-size floating-point numbers (float, double)
+
+IEEE-754 was designed with byte-by-byte comparisons in mind, and provides an important guarantee about the bytes of a
+floating point number:  
+* If x and y are of the same sign, bytes(x) ≥ bytes(y) ⇔ |x| ≥ |y|.
+
+Thus, to be able to order floating point numbers as unsigned integers, we can:
+* Flip the sign bit so negatives are smaller than positive numbers.
+* If the number was negative, also flip all the other bits so larger magnitudes become smaller integers.
+
+This matches exactly the behaviour of `Double.compare`, which doesn’t fully agree with numerical comparisons (see spec) 
+in order to define a natural order over the floating point numbers.
+
+Examples:
+
+|Type and value|bytes|encodes as|
+|---|---|---|
+|float +1.0|            3F 80 00 00|               BF 80 00 00|
+|float +0.0|            00 00 00 00|               80 00 00 00|
+|float -0.0|            80 00 00 00|               7F FF FF FF|
+|float -1.0|            BF 80 00 00|               40 7F FF FF|
+|double +1.0|           3F F0 00 00 00 00 00 00|   BF F0 00 00 00 00 00 00|
+|double +Inf|           7F F0 00 00 00 00 00 00|   FF F0 00 00 00 00 00 00|
+|double -Inf|           FF F0 00 00 00 00 00 00|   00 0F FF FF FF FF FF FF|
+|double NaN|            7F F8 00 00 00 00 00 00|   FF F8 00 00 00 00 00 00|
+
+## UUIDs
+UUIDs are fixed-length unsigned integers, where the UUID version/type is compared first, and where bits need to be 
+reordered for the time UUIDs. To create a byte-ordered representation, we reorder the bytes: pull the version digit 
+first, then the rest of the digits, using the special time order if the version is equal to one.
+
+Examples:
+
+|Type and value|bytes|encodes as|
+|---|---|---|
+|Random (v4)|    cc520882-9507-44fb-8fc9-b349ecdee658 |    4cc52088295074fb8fc9b349ecdee658
+|Time (v1)  |    2a92d750-d8dc-11e6-a2de-cf8ecd4cf053 |    11e6d8dc2a92d750a2decf8ecd4cf053
+
+## Multi-component sequences (Partition or Clustering keys, tuples), bounds and nulls
+
+As mentioned above, we encode sequences by adding separator bytes in front, between components, and a terminator at the
+end. The values we chose for the separator and terminator are `0x40` and `0x38`, and they serve several purposes:
+- Permits partially specified bounds, with strict/exclusive or non-strict/inclusive semantics. This is done by finishing
+  a bound with a terminator value that is smaller/greater than the separator and terminator. We can use `0x20` for </≥
+  and `0x60` for ≤/>.

Review Comment:
   ```suggestion
     and `0x60` for `≤`/`>`.
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: pr-unsubscribe@cassandra.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: pr-unsubscribe@cassandra.apache.org
For additional commands, e-mail: pr-help@cassandra.apache.org