You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@kudu.apache.org by da...@apache.org on 2018/01/25 19:10:46 UTC

[2/2] kudu git commit: design-docs: improve cfile.md

design-docs: improve cfile.md

I've continually found the CFile abstraction difficult to place in
context with the higher-level DiskRowSet and lower-level BlockManager
abstractions. Every time I revisit the CFile code I have to re-research
these relationships and interactions. cfile.md is currently lacking in
these details, so this is an attempt to fix that.

Change-Id: I770028bba3f7a49c96f32893c285221c84be39ce
Reviewed-on: http://gerrit.cloudera.org:8080/8860
Tested-by: Kudu Jenkins
Reviewed-by: David Ribeiro Alves <da...@gmail.com>


Project: http://git-wip-us.apache.org/repos/asf/kudu/repo
Commit: http://git-wip-us.apache.org/repos/asf/kudu/commit/cd7166cb
Tree: http://git-wip-us.apache.org/repos/asf/kudu/tree/cd7166cb
Diff: http://git-wip-us.apache.org/repos/asf/kudu/diff/cd7166cb

Branch: refs/heads/master
Commit: cd7166cb778aed0a26f8e76d4023ff4054c58d76
Parents: 0f0e421
Author: Dan Burkert <da...@apache.org>
Authored: Sat Dec 16 14:12:33 2017 -0800
Committer: Dan Burkert <da...@cloudera.com>
Committed: Thu Jan 25 19:05:54 2018 +0000

----------------------------------------------------------------------
 docs/design-docs/cfile.md  | 398 +++++++++++++++++++++++++---------------
 docs/design-docs/tablet.md |   4 +-
 2 files changed, 253 insertions(+), 149 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/kudu/blob/cd7166cb/docs/design-docs/cfile.md
----------------------------------------------------------------------
diff --git a/docs/design-docs/cfile.md b/docs/design-docs/cfile.md
index 7f4a0f4..330385f 100644
--- a/docs/design-docs/cfile.md
+++ b/docs/design-docs/cfile.md
@@ -11,208 +11,312 @@ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 See the License for the specific language governing permissions and
 limitations under the License.
 -->
-CFile is a simple columnar format which stores multiple related B-Trees.
 
+# CFile
 
-File format
------------------
-```
-<header>
-<blocks>
-<btree root blocks>
-<footer>
-EOF
+CFile is an on-disk columnar storage format which holds data and associated
+B-Tree indexes. Within a [DiskRowSet](tablet.md) each column and DeltaFile will
+map to a single CFile. In addition, the DiskRowSet's bloomfilter will be stored
+in a CFile, and if the table has a compound primary key, then the associated
+ad-hoc index will be stored in a separate CFile.
 
+Despite the name, a CFile does not necessarily correspond one-to-one with a file
+in the filesystem. The mapping of CFiles to the filesystem is determined by the
+[BlockManager](../../src/kudu/fs/block_manager.h) abstraction. A CFile is
+written to a single BlockManager Block (not to be confused with cblocks, which
+are internal to CFiles, discussed below).
 
-Header
-------
+A CFile is composed of a header, a variable number of cblocks, and a footer.
+There are a few different types of cblocks, each with a different format: data
+cblocks, nullable data cblocks, index cblocks, and dictionary cblocks.
 
-<magic>: see below
-<header length>: 32-bit unsigned integer length of the header protobuf
-<header>: CFileHeaderPB protobuf
-<checksum>: An optional Crc32 checksum of the magic, length, and protobuf
+Data cblocks consist of a sequence of consecutive values. Each value is assigned
+an ordinal index, or offset, which is unique within the CFile. Thus, a cblock
+can be identified within a CFile by the range of ordinal indexes it contains;
+for instance, the first three cblocks of a CFile might have the ordinal index
+ranges `[0, 55]`, `[56, 132]`, and `[133, 511]`.
 
-Block
------
-<data>: see below
-<checksum>: An optional Crc32 checksum of the data
+In order to support efficient seeks to an arbitrary ordinal index within the
+CFile, a positional index may be included which contains a mapping of starting
+ordinal index to data cblock. See [CFile Index](#cfile-index) for more details.
 
-Footer
-------
+For CFiles which contain data in sorted order (for example, a CFile containing
+the Primary Key column), a value index may be created. The value index supports
+efficiently seeking to an arbitrary value in the CFile. See
+[CFile Index](#cfile-index) for more details.
 
-<checksum>: An optional Crc32 checksum of the protobuf, magic, and length
-<footer>: CFileFooterPB protobuf
-<magic>: see below
-<footer length>: 32-bit unsigned integer length of the footer protobuf
+## File Format
 
+The CFile header, cblocks, and footer are written consecutively to the file
+without padding.
 
-Magic strings
--------------
-CFile v1: the string 'kuducfil'
-CFile v2: the string 'kuducfl2'
-```
+### Header Layout
 
-The two versions are identical except where specifically noted in the source
-code or this document.
+| field | width (bytes) | notes |
+| --- | --- | --- |
+| magic | 8 | [see below](#versioning) |
+| length | 4 | 32-bit unsigned length of the following Protobuf message |
+| message | variable | encoded [`CFileHeaderPB`][cfile.proto] Protobuf message |
+| checksum | 4 | optional CRC-32 checksum of magic, length, and message |
 
-==============================
+### Footer Layout
 
-Data blocks:
+| field | width (bytes) | notes |
+| --- | --- | --- |
+| checksum | 4 | optional CRC-32 checksum of message, magic, and length |
+| magic | 8 | [see below](#versioning) |
+| message | variable | encoded [`CFileFooterPB`][cfile.proto] Protobuf message |
+| length | 4 | unsigned 32-bit length of the preceding Protobuf message |
 
-Data blocks are stored with various types of encodings.
+### Versioning
 
-* Prefix Encoding
+The CFile header and footer each contain a 'magic' string which indicates the
+CFile's version.
 
-Currently used for STRING blocks. This is based on the encoding used
-by LevelDB for its data blocks, more or less.
+| version | string |
+| --- | --- |
+| CFile v1 | `kuducfil` |
+| CFile v2 | `kuducfl2` |
 
-Starts with a header of four uint32s, group-varint coded:
-```
-  <num elements>       \
-  <ordinal position>   |
-  <restart interval>   |  group varint 32
-  <unused>             /
-```
-Followed by prefix-compressed values. Each value is stored relative
-to the value preceding it using the following format:
-```
-  shared_bytes: varint32
-  unshared_bytes: varint32
-  delta: char[unshared_bytes]
-```
-Periodically there will be a "restart point" which is necessary for
-faster binary searching. At a "restart point", shared_bytes is
-0 but otherwise the encoding is the same.
+CFile v2 was introduced in Kudu 1.3 in order to make a breaking change in the
+CFile format. At the same time, two sets of feature flags were introduced to the
+footer in order to make more granular forwards compatibility possible in the
+future. See commit [`f82cf6918c`] for details.
 
-At the end of the block is a trailer with the offsets of the
-restart points:
-```
-  restart_points[num_restarts]:  uint32
-  num_restarts: uint32
-```
-The restart points are offsets relative to the start of the block,
-including the header.
+### Data cblock Layout
 
+| field | width (bytes) | notes |
+| --- | --- | --- |
+| data | variable | [encoded](#data-encodings) data values |
+| checksum | 4 | optional CRC-32 checksum of the data |
 
-* Group Varint Frame-Of-Reference Encoding
+If the CFile is configured to use compression then the encoded data is
+compressed before the checksum is appended.
 
-Used for uint32 blocks.
+### Nullable Data cblock Layout
 
-Starts with a header:
-```
-<num elements>     \
-<min element>      |
-<ordinal position> | group varint 32
-<unused>           /
-```
-The ordinal position is the ordinal position of the first item in the
-file. For example, the first data block in the file has ordinal position
-0. If that block had 400 data entries, then the second data block would
-have ordinal position 400.
+Columns marked as nullable in the schema are stored in a nullable cblock, which
+includes a bitmap which tracks which rows (ordinal offsets) are null and not
+null.
 
-Followed by the actual data, each set of 4 integers using group-varint.
-The last group is padded with 0s.
-Each integer is relative to the min element in the header.
+| field | width (bytes) | notes |
+| --- | --- | --- |
+| value count | variable | [LEB128] encoded count of values |
+| bitmap length | variable | [LEB128] encoded length of the following null bitmap |
+| null bitmap | variable | [RLE] encoded bitmap |
+| data | variable | encoded non-null data values |
+| checksum | 4 | optional CRC-32 checksum of the value count, bitmap length, null bitmap, and data |
 
-==============================
+If the CFile is configured to use compression then the value count, bitmap
+length, null bitmap, and encoded data are compressed before the checksum is
+appended.
 
-Nullable Columns:
+### Checksums
 
-If a column is marked as nullable in the schema, a bitmap is used to keep track
-of the null and not null rows.
+Checksums can be optionally written and verified.
 
-The bitmap is added the begininning of the data block, and it uses RLE.
-```
-  <num elements in the block>   : vint
-  <null bitmap size>            : vint
-  <null bitmap>                 : RLE encoding
-  <encoded non-null values>     : encoded data
-```
-Data Block Example - 4 items, the first and last are nulls.
-```
-  4        Num Elements in the block
-  1        Null Bitmap Size
-  0110     Null Bitmap
-  v2       Value of row 2
-  v3       Value of row 3
-```
-==============================
+When checksums for the header, data, and footer are written in the CFile, the
+`incompatible_features` bitset in the CFileFooterPB message is used. A
+"checksum" bit is set to ensure the reader knows if checksums exist.
+
+When reading a CFile the footer should be read first to find if the file
+contains checksums. If the `incompatible_features` bitset indicates checksums
+exist, the reader can optionally validate them against the read data.
+
+## Data Encodings
+
+cblock data is encoded before being stored. If compression is enabled, then the
+cblock is encoded first, and then compressed.
+
+Data cblocks are best-effort size limited to `--cfile-default-block-size` bytes, at
+which point a new cblock will be added to the CFile.
+
+### Plain Encoding
+
+The simplest encoding type is plain encoding, which stores values in their
+natural representation.
+
+Plain encoding for fixed-width (integer) types consists of the little-endian
+values, followed by a trailer containing two little-endian `uint32`s: the count
+of values, and the ordinal position of the cblock.
+
+Plain encoding for BINARY or STRING (variable-width) columns is laid out as
+follows:
+
+| ordinal position | little-endian `uint32` |
+| --- | --- |
+| num values | little-endian `uint32` |
+| offsets position | little-endian `uint32` |
+| values | sequential values with no padding |
+| offsets | [group-varint frame-of-reference](#group-varint-frame-of-reference-encoding) encoded value offsets |
+
+### Dictionary Encoding
+
+[Dictionary encoding](dictionary-encoding) may be used for BINARY or STRING
+columns. All dictionary encoded cblocks in a CFile share the same dictionary.
+If the dictionary becomes full, subsequent cblocks in the CFile switch to plain
+encoding. The dictionary is stored as a plain encoded binary cblock, and the
+data codewords are stored as [bitshuffle encoded](#bitshuffle-encoding)
+`uint32`s.
+
+### Prefix Encoding
+
+Currently used for `BINARY` and `STRING` data cblocks. This is based on the
+encoding used by LevelDB for its data blocks, more or less.
 
-Index blocks:
+Starts with a header of four [Group Varint] encoded `uint32`s:
 
-The index blocks are organized in a B-Tree. As data blocks are written,
-they are appended to the end of a leaf index block. When a leaf index
-block reaches the configured block size, it is added to another index
-block higher up the tree, and a new leaf is started. If the intermediate
-index block fills, it will start a new intermediate block and spill into
-an even higher-layer internal block.
+| field |
+| --- |
+| number of elements |
+| ordinal position |
+| restart interval |
+| unused |
+
+Followed by prefix-compressed values. Each value is stored relative to the value
+preceding it using the following format:
+
+| field | type |
+| --- | --- |
+| `shared_bytes` | `varint32` |
+| `unshared_bytes` | `varint32` |
+| `delta` | `char[unshared_bytes]` |
+
+Periodically there will be a "restart point" which is necessary for faster
+binary searching. At a "restart point", `shared_bytes` is 0 but otherwise the
+encoding is the same.
+
+At the end of the cblock is a trailer with the offsets of the restart points:
+
+| field | type |
+| --- | --- |
+| `restart_points` | `uint32[num_restarts]` |
+| `num_restarts` | `uint32` |
+
+The restart points are offsets relative to the start of the cblock, including
+the header.
+
+### Run-length Encoding
+
+Integer and bool types may be [run-length encoded](RLE), which has a simply
+layout: the number of values and ordinal position of the cblock as little-endian
+`uint32`s, followed by the run-length encoded data. The encoder will
+automatically fall back to a bit-packed (literal) encoding if the data is not
+conducive to run-length encoding.
+
+### Bitshuffle Encoding
+
+Integer types may be [bitshuffle](bitshuffle) encoded. Bitshuffle-encoded
+cblocks are automatically LZ4 compressed, so additional compression is not
+recommended.
+
+### Group Varint Frame-of-Reference Encoding
+
+Used internally for `UINT32` blocks. Kudu does not support unsigned integer
+columns, so this encoding is not used for column data.
+
+Starts with a header of four [group-varint](group-varint) encoded `uint32`s:
+
+| field |
+| --- |
+| number of elements |
+| minimum element |
+| ordinal position |
+| unused |
+
+The ordinal position is the ordinal position of the first item in the file. For
+example, the first data cblock in the file has ordinal position 0. If that cblock
+had 400 values, then the second data cblock would have ordinal position
+400.
+
+Followed by the actual data, each set of 4 integers using group-varint. The last
+group is padded with 0s. Each integer is relative to the min element in the
+header.
+
+## CFile Index
+
+CFiles may optionally include a positional (ordinal) index and value index.
+Positional indexes are used to satisfy queries such as: "seek to the data cblock
+containing the Nth entry in this CFile". Value indexes are used to satisfy
+queries such as: "seek to the data cblock containing `123` in this CFile". Value
+indexes are only present in CFiles which contain data stored in sorted order
+(e.g. the primary key column).
+
+Ordinal and value indices are organized as a [B-Tree] of index cblocks. As data
+cblocks are written, entries are appended to the end of a leaf index cblock.
+When a leaf index cblock reaches the configured index cblock size, it is added
+to another index cblock higher up the tree, and a new leaf is started.  If the
+intermediate index cblock fills, it will start a new intermediate cblock and
+spill into an even higher-layer internal cblock.
 
 For example:
+
 ```
                       [Int 0]
-           ------------------------------
-           |                            |
-        [Int 1]                       [Int 2]
-    -----------------            --------------
-    |       |       |            |             |
-[Leaf 0]     ...   [Leaf N]   [Leaf N+1]    [Leaf N+2]
+           -----------------------------
+           |                           |
+        [Int 1]                     [Int 2]
+   -----------------             --------------
+   |       |       |             |            |
+[Leaf 0]  ...   [Leaf N]     [Leaf N+1]   [Leaf N+2]
 ```
 
-In this case, we wrote N leaf blocks, which filled up the node labeled
-Int 1. At this point, the writer would create Int 0 with one entry pointing
-to Int 1. Further leaf blocks (N+1 and N+2) would be written to a new
-internal node (Int 2). When the file is completed, Int 2 will spill,
-adding its entry into Int 0 as well.
+In this case, we wrote N leaf blocks, which filled up the internal node labeled
+Int 1. At this point, the writer would create Int 0 with one entry pointing to
+Int 1. Further leaf blocks (N+1 and N+2) would be written to a new internal node
+(Int 2). When the file is completed, Int 2 will spill, adding its entry into Int 0
+as well.
 
-Note that this strategy doesn't result in a fully balanced b-tree, but instead
+Note that this strategy doesn't result in a fully balanced B-tree, but instead
 results in a 100% "fill factor" on all nodes in each level except for the last
 one written.
 
-There are two types of indexes:
-
-- Positional indexes: map ordinal position -> data block offset
+An index cblock is encoded similarly for both types of indexes:
 
-These are used to satisfy queries like: "seek to the Nth entry in this file"
-
-- Value-based indexes: reponsible for mapping value -> data block offset
-
-These are only present in files which contain data stored in sorted order
-(e.g key columns). They can satisfy seeks by value.
-
-
-An index block is encoded similarly for both types of indexes:
 ```
-<key> <block offset> <block size>
-<key> <block offset> <block size>
+<key> <cblock offset> <cblock size>
+<key> <cblock offset> <cblock size>
 ...
    key: vint64 for positional, otherwise varint-length-prefixed string
    offset: vint64
-   block size: vint32
+   cblock size: vint32
 
 <offset to first key>   (fixed32)
 <offset to second key>  (fixed32)
 ...
-   These offsets are relative to the start of the block.
+   These offsets are relative to the start of the cblock.
 
 <trailer>
    A IndexBlockTrailerPB protobuf
 <trailer length>
 ```
-The trailer protobuf includes a field which designates whether the block
-is a leaf node or internal node of the B-Tree, allowing a reader to know
-whether the pointer is to another index block or to a data block.
 
-==============================
+Index cblocks are written using a [post-order traversal], and the index cblocks
+may intersperse with the data cblocks.
 
-Checksums:
+The trailer protobuf includes a field which designates whether the cblock is a
+leaf node or internal node of the B-Tree, allowing a reader to know whether the
+pointer is to another index cblock or to a data cblock.
 
-Checksums can be optionally written and verified.
+## DeltaFile
+
+Every DeltaFile in a DiskRowSet is written to a CFile; in fact, a DeltaFile is
+just a wrapper around CFile which specializes it for REDO and UNDO delta data.
+The deltas associated with a row update are grouped into a RowChangeList and
+written to as a binary values to the underlying CFile. Each value is prefixed
+with a DeltaKey, which consists of the ordinal row index (within the
+DiskRowSet), and the timestamp. The CFile includes a value index so that the
+delta belonging to a specific row can be located efficiently.
+
+## BloomFile
 
-When checksums for the header, data, and footer are written in the CFile,
-the incompatible_features bitset in the CFile footer is used. A "checksum"
-bit is set to ensure the reader knows if checksums exist.
+BloomFile is a wrapper around CFile which stores a bloomfilter datastructure.
 
-When reading a CFile the footer should be read first to find if the
-file contains checksums. If the incompatible_features bitset indicates
-checksums exist, the reader can optionally validate them against the
-read data.
\ No newline at end of file
+[LEB128]: https://en.wikipedia.org/wiki/LEB128
+[RLE]: https://en.wikipedia.org/wiki/Run-length_encoding
+[`f82cf6918c`]: https://github.com/apache/kudu/commit/f82cf6918c00dff6aecad5a6b50836b7eb5db508
+[B-tree]: https://en.wikipedia.org/wiki/B-tree
+[bitshuffle]: https://github.com/kiyo-masui/bitshuffle
+[cfile.proto]: ../../src/kudu/cfile/cfile.proto
+[Group Varint]: https://static.googleusercontent.com/media/research.google.com/en//people/jeff/WSDM09-keynote.pdf
+[post-order traversal]: https://en.wikipedia.org/wiki/Tree_traversal#Post-order

http://git-wip-us.apache.org/repos/asf/kudu/blob/cd7166cb/docs/design-docs/tablet.md
----------------------------------------------------------------------
diff --git a/docs/design-docs/tablet.md b/docs/design-docs/tablet.md
index b021d10..d4656be 100644
--- a/docs/design-docs/tablet.md
+++ b/docs/design-docs/tablet.md
@@ -194,7 +194,7 @@ When the MemRowSet fills up, a Flush occurs, which persists the data to disk.
 | DiskRowSet 0 |  | DiskRowSet 1 | .. | DiskRowSet N |
 +-------------+-  +--------------+    +--------------+
 ```
-When the data is flushed, it is stored as a set of CFiles (see src/kudu/cfile/README).
+When the data is flushed, it is stored as a set of CFiles (see cfile.md).
 Each of the rows in the data is addressable by a sequential "rowid", which is
 dense, immutable, and unique within this DiskRowSet. For example, if a given
 DiskRowSet contains 5 rows, then they will be assigned rowid 0 through 4, in
@@ -203,7 +203,7 @@ rows with the same rowids.
 
 Reads may map between primary keys (user-visible) and rowids (internal) using an index
 structure. In the case that the primary key is a simple key, the key structure is
-embedded within the primary key column's cfile. Otherwise, a separate index cfile
+embedded within the primary key column's CFile. Otherwise, a separate index CFile
 stores the encoded compound key and provides a similar function.
 
 NOTE: rowids are not explicitly stored with each row, but rather an implicit