You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@parquet.apache.org by GitBox <gi...@apache.org> on 2021/04/03 17:30:06 UTC

[GitHub] [parquet-format] pitrou commented on a change in pull request #170: WIP: [Docs] Add description of RLE/bit-packed encoder

pitrou commented on a change in pull request #170:
URL: https://github.com/apache/parquet-format/pull/170#discussion_r606686923



##########
File path: rle-bitpacked.md
##########
@@ -0,0 +1,120 @@
+# RLE-Bitpacked hybrid encoder
+
+The RLE-Bitpacked hybrid encoder is a parquet-specific encoder that combines
+two well known encoding strategies, [RLE](https://en.wikipedia.org/wiki/Run-length_encoding)
+and bitpacking. Note that "combine" here means this encoder allows both encodings
+within the same stream, and, during encoding, it can switch between them.
+
+This encoder is only used to encode integer values that may either represent definition levels,
+representation levels or ids of dictionary-encoded pages. Note that this encoder
+supports integers that can be represented in less than 8 bits.
+
+This document uses [LSB](https://en.wikipedia.org/wiki/Bit_numbering#Least_significant_bit)
+to identify bits. In this representation, a byte is represented
+by `[b7 b6 b5 b4 b3 b2 b1 b0]` where `b0` is the first bit.

Review comment:
       "first" isn't descriptive. Do you mean "least significant"?

##########
File path: rle-bitpacked.md
##########
@@ -0,0 +1,120 @@
+# RLE-Bitpacked hybrid encoder
+
+The RLE-Bitpacked hybrid encoder is a parquet-specific encoder that combines
+two well known encoding strategies, [RLE](https://en.wikipedia.org/wiki/Run-length_encoding)
+and bitpacking. Note that "combine" here means this encoder allows both encodings
+within the same stream, and, during encoding, it can switch between them.
+
+This encoder is only used to encode integer values that may either represent definition levels,
+representation levels or ids of dictionary-encoded pages. Note that this encoder
+supports integers that can be represented in less than 8 bits.
+
+This document uses [LSB](https://en.wikipedia.org/wiki/Bit_numbering#Least_significant_bit)
+to identify bits. In this representation, a byte is represented
+by `[b7 b6 b5 b4 b3 b2 b1 b0]` where `b0` is the first bit.
+
+This document uses MUST, SHOULD, etc. according to [RFC-8174](https://tools.ietf.org/html/rfc8174).
+
+## Decoding
+
+Decoding a stream of bytes (denoted as `[a1, a2, a3, ...]`) assumes a specific `bit_width`
+indicating the number of bits necessary to represent the largest encoded integer in the stream.
+
+The `bit_width` MUST be sufficient to represent the largest encoded integer on the
+stream or the result is undefined.
+
+The first 4 bytes of the stream MUST represent a little-endian unsigned integer (`uint32`)
+denoting the length of the rest of the stream. For example, `[4u8, 0, 0, 0]` announces
+that the stream has a total of `4 + 4 = 8` bytes. The first 4 bytes are only used for this purpose.
+
+The remaining bytes are divided in "runs", which MUST be either RLE-encoded or bitpacked-encoded.
+Each "run" MUST be composed by a header of a variable number of bytes and by a body in sequence.
+I.e. `[h11, h12, h13, b11, b12, ...]` where `h11` is the first byte of the header
+of run `1`, `h12` the second byte of the header, `b11` is the first byte of the body of the first run.
+
+The header MUST be a single [ULEB128](https://en.wikipedia.org/wiki/LEB128#Unsigned_LEB128)-encoded `i32`,
+here denoted as `h1`. The bytes needed to decode the ULEB128-encoded constitute the header
+and the remaining bytes constitute the body.
+
+The first bit of the last byte of the header denotes whether the run is bitpacked-encoded
+or RLE-encoded. `h1 & 1 == 1` indicates a bitpacked-encoded run, `h1 & 1 != 1` a RLE-encoded run.
+
+### Decoding RLE-encoded runs
+
+Given a header `h1` and the stream of bytes past the header, the number of repetitions
+is given by `repetitions = h1 >> 1`. The number of bytes in the body, `body_length`,
+MUST be the minimum number of bytes that can hold `bit_width`, `body_length = ceil(bit_width, 8)`.
+The body MUST represent the repeated value in little endian for byte types (e.g. `int32`) and
+[LSB](https://en.wikipedia.org/wiki/Bit_numbering#Least_significant_bit) for boolean types.
+
+### Decoding bitpacked-encoded runs
+
+Given a header `h1` and the stream of bytes past the header, the number of bytes
+in the body, `body_length`, is equal to `body_length = h1 >> 1`.
+The body represents bitpacked-encoded values with exactly `8 / bit_width` items.
+
+Note that for `bit_width = 1`, this encoding corresponds exactly to LSB.
+Thus, in-memory formats that store booleans or validities in `LSB` (e.g. Apache Arrow),

Review comment:
       Is there a missing word? Perhaps "for in-memory formats" ?

##########
File path: rle-bitpacked.md
##########
@@ -0,0 +1,120 @@
+# RLE-Bitpacked hybrid encoder
+
+The RLE-Bitpacked hybrid encoder is a parquet-specific encoder that combines
+two well known encoding strategies, [RLE](https://en.wikipedia.org/wiki/Run-length_encoding)
+and bitpacking. Note that "combine" here means this encoder allows both encodings
+within the same stream, and, during encoding, it can switch between them.
+
+This encoder is only used to encode integer values that may either represent definition levels,
+representation levels or ids of dictionary-encoded pages. Note that this encoder
+supports integers that can be represented in less than 8 bits.

Review comment:
       This sentence is weird. Does it imply that integers larger than 8 bits cannot be represented? Otherwise, what is the point of this mention?

##########
File path: rle-bitpacked.md
##########
@@ -0,0 +1,120 @@
+# RLE-Bitpacked hybrid encoder

Review comment:
       Also, "encoding"

##########
File path: rle-bitpacked.md
##########
@@ -0,0 +1,120 @@
+# RLE-Bitpacked hybrid encoder
+
+The RLE-Bitpacked hybrid encoder is a parquet-specific encoder that combines
+two well known encoding strategies, [RLE](https://en.wikipedia.org/wiki/Run-length_encoding)
+and bitpacking. Note that "combine" here means this encoder allows both encodings
+within the same stream, and, during encoding, it can switch between them.
+
+This encoder is only used to encode integer values that may either represent definition levels,
+representation levels or ids of dictionary-encoded pages. Note that this encoder
+supports integers that can be represented in less than 8 bits.
+
+This document uses [LSB](https://en.wikipedia.org/wiki/Bit_numbering#Least_significant_bit)
+to identify bits. In this representation, a byte is represented
+by `[b7 b6 b5 b4 b3 b2 b1 b0]` where `b0` is the first bit.
+
+This document uses MUST, SHOULD, etc. according to [RFC-8174](https://tools.ietf.org/html/rfc8174).
+
+## Decoding
+
+Decoding a stream of bytes (denoted as `[a1, a2, a3, ...]`) assumes a specific `bit_width`
+indicating the number of bits necessary to represent the largest encoded integer in the stream.
+
+The `bit_width` MUST be sufficient to represent the largest encoded integer on the
+stream or the result is undefined.
+
+The first 4 bytes of the stream MUST represent a little-endian unsigned integer (`uint32`)
+denoting the length of the rest of the stream. For example, `[4u8, 0, 0, 0]` announces
+that the stream has a total of `4 + 4 = 8` bytes. The first 4 bytes are only used for this purpose.

Review comment:
       Why `4 + 4`? Do you mean including the first 4 bytes?
   
   Also, the sentence "The first 4 bytes are only used for this purpose" isn't very informative.

##########
File path: rle-bitpacked.md
##########
@@ -0,0 +1,120 @@
+# RLE-Bitpacked hybrid encoder
+
+The RLE-Bitpacked hybrid encoder is a parquet-specific encoder that combines
+two well known encoding strategies, [RLE](https://en.wikipedia.org/wiki/Run-length_encoding)
+and bitpacking. Note that "combine" here means this encoder allows both encodings
+within the same stream, and, during encoding, it can switch between them.
+
+This encoder is only used to encode integer values that may either represent definition levels,
+representation levels or ids of dictionary-encoded pages. Note that this encoder
+supports integers that can be represented in less than 8 bits.
+
+This document uses [LSB](https://en.wikipedia.org/wiki/Bit_numbering#Least_significant_bit)
+to identify bits. In this representation, a byte is represented
+by `[b7 b6 b5 b4 b3 b2 b1 b0]` where `b0` is the first bit.
+
+This document uses MUST, SHOULD, etc. according to [RFC-8174](https://tools.ietf.org/html/rfc8174).
+
+## Decoding
+
+Decoding a stream of bytes (denoted as `[a1, a2, a3, ...]`) assumes a specific `bit_width`
+indicating the number of bits necessary to represent the largest encoded integer in the stream.
+
+The `bit_width` MUST be sufficient to represent the largest encoded integer on the
+stream or the result is undefined.
+
+The first 4 bytes of the stream MUST represent a little-endian unsigned integer (`uint32`)
+denoting the length of the rest of the stream. For example, `[4u8, 0, 0, 0]` announces

Review comment:
       `4u8` looks weird, is it a Rust notation?

##########
File path: rle-bitpacked.md
##########
@@ -0,0 +1,120 @@
+# RLE-Bitpacked hybrid encoder
+
+The RLE-Bitpacked hybrid encoder is a parquet-specific encoder that combines
+two well known encoding strategies, [RLE](https://en.wikipedia.org/wiki/Run-length_encoding)
+and bitpacking. Note that "combine" here means this encoder allows both encodings
+within the same stream, and, during encoding, it can switch between them.
+
+This encoder is only used to encode integer values that may either represent definition levels,
+representation levels or ids of dictionary-encoded pages. Note that this encoder
+supports integers that can be represented in less than 8 bits.
+
+This document uses [LSB](https://en.wikipedia.org/wiki/Bit_numbering#Least_significant_bit)
+to identify bits. In this representation, a byte is represented
+by `[b7 b6 b5 b4 b3 b2 b1 b0]` where `b0` is the first bit.
+
+This document uses MUST, SHOULD, etc. according to [RFC-8174](https://tools.ietf.org/html/rfc8174).
+
+## Decoding
+
+Decoding a stream of bytes (denoted as `[a1, a2, a3, ...]`) assumes a specific `bit_width`
+indicating the number of bits necessary to represent the largest encoded integer in the stream.
+
+The `bit_width` MUST be sufficient to represent the largest encoded integer on the
+stream or the result is undefined.
+
+The first 4 bytes of the stream MUST represent a little-endian unsigned integer (`uint32`)
+denoting the length of the rest of the stream. For example, `[4u8, 0, 0, 0]` announces
+that the stream has a total of `4 + 4 = 8` bytes. The first 4 bytes are only used for this purpose.
+
+The remaining bytes are divided in "runs", which MUST be either RLE-encoded or bitpacked-encoded.
+Each "run" MUST be composed by a header of a variable number of bytes and by a body in sequence.
+I.e. `[h11, h12, h13, b11, b12, ...]` where `h11` is the first byte of the header
+of run `1`, `h12` the second byte of the header, `b11` is the first byte of the body of the first run.
+
+The header MUST be a single [ULEB128](https://en.wikipedia.org/wiki/LEB128#Unsigned_LEB128)-encoded `i32`,
+here denoted as `h1`. The bytes needed to decode the ULEB128-encoded constitute the header
+and the remaining bytes constitute the body.
+
+The first bit of the last byte of the header denotes whether the run is bitpacked-encoded

Review comment:
       You mean "the least significant bit of the header value"?

##########
File path: rle-bitpacked.md
##########
@@ -0,0 +1,120 @@
+# RLE-Bitpacked hybrid encoder

Review comment:
       "RLE/bit-packed" or "RLE/bit-packing"

##########
File path: rle-bitpacked.md
##########
@@ -0,0 +1,120 @@
+# RLE-Bitpacked hybrid encoder
+
+The RLE-Bitpacked hybrid encoder is a parquet-specific encoder that combines
+two well known encoding strategies, [RLE](https://en.wikipedia.org/wiki/Run-length_encoding)
+and bitpacking. Note that "combine" here means this encoder allows both encodings
+within the same stream, and, during encoding, it can switch between them.
+
+This encoder is only used to encode integer values that may either represent definition levels,
+representation levels or ids of dictionary-encoded pages. Note that this encoder
+supports integers that can be represented in less than 8 bits.
+
+This document uses [LSB](https://en.wikipedia.org/wiki/Bit_numbering#Least_significant_bit)
+to identify bits. In this representation, a byte is represented
+by `[b7 b6 b5 b4 b3 b2 b1 b0]` where `b0` is the first bit.
+
+This document uses MUST, SHOULD, etc. according to [RFC-8174](https://tools.ietf.org/html/rfc8174).
+
+## Decoding
+
+Decoding a stream of bytes (denoted as `[a1, a2, a3, ...]`) assumes a specific `bit_width`
+indicating the number of bits necessary to represent the largest encoded integer in the stream.
+
+The `bit_width` MUST be sufficient to represent the largest encoded integer on the
+stream or the result is undefined.
+
+The first 4 bytes of the stream MUST represent a little-endian unsigned integer (`uint32`)
+denoting the length of the rest of the stream. For example, `[4u8, 0, 0, 0]` announces
+that the stream has a total of `4 + 4 = 8` bytes. The first 4 bytes are only used for this purpose.
+
+The remaining bytes are divided in "runs", which MUST be either RLE-encoded or bitpacked-encoded.
+Each "run" MUST be composed by a header of a variable number of bytes and by a body in sequence.
+I.e. `[h11, h12, h13, b11, b12, ...]` where `h11` is the first byte of the header
+of run `1`, `h12` the second byte of the header, `b11` is the first byte of the body of the first run.
+
+The header MUST be a single [ULEB128](https://en.wikipedia.org/wiki/LEB128#Unsigned_LEB128)-encoded `i32`,
+here denoted as `h1`. The bytes needed to decode the ULEB128-encoded constitute the header
+and the remaining bytes constitute the body.
+
+The first bit of the last byte of the header denotes whether the run is bitpacked-encoded
+or RLE-encoded. `h1 & 1 == 1` indicates a bitpacked-encoded run, `h1 & 1 != 1` a RLE-encoded run.
+
+### Decoding RLE-encoded runs
+
+Given a header `h1` and the stream of bytes past the header, the number of repetitions
+is given by `repetitions = h1 >> 1`. The number of bytes in the body, `body_length`,
+MUST be the minimum number of bytes that can hold `bit_width`, `body_length = ceil(bit_width, 8)`.
+The body MUST represent the repeated value in little endian for byte types (e.g. `int32`) and
+[LSB](https://en.wikipedia.org/wiki/Bit_numbering#Least_significant_bit) for boolean types.
+
+### Decoding bitpacked-encoded runs
+
+Given a header `h1` and the stream of bytes past the header, the number of bytes
+in the body, `body_length`, is equal to `body_length = h1 >> 1`.

Review comment:
       Is this really true? The grammar says "we always bit-pack a multiple of 8 values at a time, so we only store the number of values / 8". So `h1 >> 1` is not the number of bytes, it's the number of values times 8. The number of bytes also depends on the bit width (presumably, `body_length_in_bytes = (h1 >> 1) * bit_width`?).

##########
File path: rle-bitpacked.md
##########
@@ -0,0 +1,120 @@
+# RLE-Bitpacked hybrid encoder
+
+The RLE-Bitpacked hybrid encoder is a parquet-specific encoder that combines
+two well known encoding strategies, [RLE](https://en.wikipedia.org/wiki/Run-length_encoding)
+and bitpacking. Note that "combine" here means this encoder allows both encodings
+within the same stream, and, during encoding, it can switch between them.
+
+This encoder is only used to encode integer values that may either represent definition levels,
+representation levels or ids of dictionary-encoded pages. Note that this encoder
+supports integers that can be represented in less than 8 bits.
+
+This document uses [LSB](https://en.wikipedia.org/wiki/Bit_numbering#Least_significant_bit)
+to identify bits. In this representation, a byte is represented
+by `[b7 b6 b5 b4 b3 b2 b1 b0]` where `b0` is the first bit.
+
+This document uses MUST, SHOULD, etc. according to [RFC-8174](https://tools.ietf.org/html/rfc8174).
+
+## Decoding
+
+Decoding a stream of bytes (denoted as `[a1, a2, a3, ...]`) assumes a specific `bit_width`
+indicating the number of bits necessary to represent the largest encoded integer in the stream.
+
+The `bit_width` MUST be sufficient to represent the largest encoded integer on the
+stream or the result is undefined.
+
+The first 4 bytes of the stream MUST represent a little-endian unsigned integer (`uint32`)
+denoting the length of the rest of the stream. For example, `[4u8, 0, 0, 0]` announces
+that the stream has a total of `4 + 4 = 8` bytes. The first 4 bytes are only used for this purpose.
+
+The remaining bytes are divided in "runs", which MUST be either RLE-encoded or bitpacked-encoded.
+Each "run" MUST be composed by a header of a variable number of bytes and by a body in sequence.
+I.e. `[h11, h12, h13, b11, b12, ...]` where `h11` is the first byte of the header
+of run `1`, `h12` the second byte of the header, `b11` is the first byte of the body of the first run.
+
+The header MUST be a single [ULEB128](https://en.wikipedia.org/wiki/LEB128#Unsigned_LEB128)-encoded `i32`,
+here denoted as `h1`. The bytes needed to decode the ULEB128-encoded constitute the header
+and the remaining bytes constitute the body.
+
+The first bit of the last byte of the header denotes whether the run is bitpacked-encoded
+or RLE-encoded. `h1 & 1 == 1` indicates a bitpacked-encoded run, `h1 & 1 != 1` a RLE-encoded run.
+
+### Decoding RLE-encoded runs
+
+Given a header `h1` and the stream of bytes past the header, the number of repetitions
+is given by `repetitions = h1 >> 1`. The number of bytes in the body, `body_length`,
+MUST be the minimum number of bytes that can hold `bit_width`, `body_length = ceil(bit_width, 8)`.
+The body MUST represent the repeated value in little endian for byte types (e.g. `int32`) and
+[LSB](https://en.wikipedia.org/wiki/Bit_numbering#Least_significant_bit) for boolean types.

Review comment:
       I don't understand what the last part of the sentence says. Booleans don't have a LSB (well, they have one, but it's the MSB as well...).

##########
File path: rle-bitpacked.md
##########
@@ -0,0 +1,120 @@
+# RLE-Bitpacked hybrid encoder
+
+The RLE-Bitpacked hybrid encoder is a parquet-specific encoder that combines
+two well known encoding strategies, [RLE](https://en.wikipedia.org/wiki/Run-length_encoding)
+and bitpacking. Note that "combine" here means this encoder allows both encodings

Review comment:
       "bit-packing"?

##########
File path: rle-bitpacked.md
##########
@@ -0,0 +1,120 @@
+# RLE-Bitpacked hybrid encoder
+
+The RLE-Bitpacked hybrid encoder is a parquet-specific encoder that combines
+two well known encoding strategies, [RLE](https://en.wikipedia.org/wiki/Run-length_encoding)
+and bitpacking. Note that "combine" here means this encoder allows both encodings
+within the same stream, and, during encoding, it can switch between them.
+
+This encoder is only used to encode integer values that may either represent definition levels,
+representation levels or ids of dictionary-encoded pages. Note that this encoder
+supports integers that can be represented in less than 8 bits.
+
+This document uses [LSB](https://en.wikipedia.org/wiki/Bit_numbering#Least_significant_bit)
+to identify bits. In this representation, a byte is represented
+by `[b7 b6 b5 b4 b3 b2 b1 b0]` where `b0` is the first bit.
+
+This document uses MUST, SHOULD, etc. according to [RFC-8174](https://tools.ietf.org/html/rfc8174).
+
+## Decoding
+
+Decoding a stream of bytes (denoted as `[a1, a2, a3, ...]`) assumes a specific `bit_width`
+indicating the number of bits necessary to represent the largest encoded integer in the stream.

Review comment:
       Ok, but how is the `bit_width` determined? Perhaps something like: "The `bit_width` is the width of the underlying physical type, for example 32 when encoding a stream of 32-bit integers".

##########
File path: rle-bitpacked.md
##########
@@ -0,0 +1,120 @@
+# RLE-Bitpacked hybrid encoder
+
+The RLE-Bitpacked hybrid encoder is a parquet-specific encoder that combines
+two well known encoding strategies, [RLE](https://en.wikipedia.org/wiki/Run-length_encoding)
+and bitpacking. Note that "combine" here means this encoder allows both encodings
+within the same stream, and, during encoding, it can switch between them.
+
+This encoder is only used to encode integer values that may either represent definition levels,
+representation levels or ids of dictionary-encoded pages. Note that this encoder
+supports integers that can be represented in less than 8 bits.
+
+This document uses [LSB](https://en.wikipedia.org/wiki/Bit_numbering#Least_significant_bit)
+to identify bits. In this representation, a byte is represented
+by `[b7 b6 b5 b4 b3 b2 b1 b0]` where `b0` is the first bit.
+
+This document uses MUST, SHOULD, etc. according to [RFC-8174](https://tools.ietf.org/html/rfc8174).
+
+## Decoding
+
+Decoding a stream of bytes (denoted as `[a1, a2, a3, ...]`) assumes a specific `bit_width`
+indicating the number of bits necessary to represent the largest encoded integer in the stream.
+
+The `bit_width` MUST be sufficient to represent the largest encoded integer on the
+stream or the result is undefined.
+
+The first 4 bytes of the stream MUST represent a little-endian unsigned integer (`uint32`)
+denoting the length of the rest of the stream. For example, `[4u8, 0, 0, 0]` announces
+that the stream has a total of `4 + 4 = 8` bytes. The first 4 bytes are only used for this purpose.
+
+The remaining bytes are divided in "runs", which MUST be either RLE-encoded or bitpacked-encoded.
+Each "run" MUST be composed by a header of a variable number of bytes and by a body in sequence.
+I.e. `[h11, h12, h13, b11, b12, ...]` where `h11` is the first byte of the header
+of run `1`, `h12` the second byte of the header, `b11` is the first byte of the body of the first run.
+
+The header MUST be a single [ULEB128](https://en.wikipedia.org/wiki/LEB128#Unsigned_LEB128)-encoded `i32`,
+here denoted as `h1`. The bytes needed to decode the ULEB128-encoded constitute the header
+and the remaining bytes constitute the body.
+
+The first bit of the last byte of the header denotes whether the run is bitpacked-encoded
+or RLE-encoded. `h1 & 1 == 1` indicates a bitpacked-encoded run, `h1 & 1 != 1` a RLE-encoded run.
+
+### Decoding RLE-encoded runs
+
+Given a header `h1` and the stream of bytes past the header, the number of repetitions
+is given by `repetitions = h1 >> 1`. The number of bytes in the body, `body_length`,
+MUST be the minimum number of bytes that can hold `bit_width`, `body_length = ceil(bit_width, 8)`.
+The body MUST represent the repeated value in little endian for byte types (e.g. `int32`) and
+[LSB](https://en.wikipedia.org/wiki/Bit_numbering#Least_significant_bit) for boolean types.
+
+### Decoding bitpacked-encoded runs
+
+Given a header `h1` and the stream of bytes past the header, the number of bytes
+in the body, `body_length`, is equal to `body_length = h1 >> 1`.
+The body represents bitpacked-encoded values with exactly `8 / bit_width` items.
+
+Note that for `bit_width = 1`, this encoding corresponds exactly to LSB.

Review comment:
       I don't understand what this means. "LSB" is not an encoding.

##########
File path: rle-bitpacked.md
##########
@@ -0,0 +1,120 @@
+# RLE-Bitpacked hybrid encoder
+
+The RLE-Bitpacked hybrid encoder is a parquet-specific encoder that combines
+two well known encoding strategies, [RLE](https://en.wikipedia.org/wiki/Run-length_encoding)
+and bitpacking. Note that "combine" here means this encoder allows both encodings
+within the same stream, and, during encoding, it can switch between them.
+
+This encoder is only used to encode integer values that may either represent definition levels,
+representation levels or ids of dictionary-encoded pages. Note that this encoder
+supports integers that can be represented in less than 8 bits.
+
+This document uses [LSB](https://en.wikipedia.org/wiki/Bit_numbering#Least_significant_bit)
+to identify bits. In this representation, a byte is represented
+by `[b7 b6 b5 b4 b3 b2 b1 b0]` where `b0` is the first bit.
+
+This document uses MUST, SHOULD, etc. according to [RFC-8174](https://tools.ietf.org/html/rfc8174).
+
+## Decoding
+
+Decoding a stream of bytes (denoted as `[a1, a2, a3, ...]`) assumes a specific `bit_width`
+indicating the number of bits necessary to represent the largest encoded integer in the stream.
+
+The `bit_width` MUST be sufficient to represent the largest encoded integer on the
+stream or the result is undefined.
+
+The first 4 bytes of the stream MUST represent a little-endian unsigned integer (`uint32`)
+denoting the length of the rest of the stream. For example, `[4u8, 0, 0, 0]` announces
+that the stream has a total of `4 + 4 = 8` bytes. The first 4 bytes are only used for this purpose.
+
+The remaining bytes are divided in "runs", which MUST be either RLE-encoded or bitpacked-encoded.
+Each "run" MUST be composed by a header of a variable number of bytes and by a body in sequence.
+I.e. `[h11, h12, h13, b11, b12, ...]` where `h11` is the first byte of the header
+of run `1`, `h12` the second byte of the header, `b11` is the first byte of the body of the first run.
+
+The header MUST be a single [ULEB128](https://en.wikipedia.org/wiki/LEB128#Unsigned_LEB128)-encoded `i32`,

Review comment:
       `i32` means it's a signed integer? Also, above you used the `uint32` notation, so this should be consistent (`int32`?).

##########
File path: rle-bitpacked.md
##########
@@ -0,0 +1,120 @@
+# RLE-Bitpacked hybrid encoder
+
+The RLE-Bitpacked hybrid encoder is a parquet-specific encoder that combines
+two well known encoding strategies, [RLE](https://en.wikipedia.org/wiki/Run-length_encoding)
+and bitpacking. Note that "combine" here means this encoder allows both encodings
+within the same stream, and, during encoding, it can switch between them.
+
+This encoder is only used to encode integer values that may either represent definition levels,
+representation levels or ids of dictionary-encoded pages. Note that this encoder
+supports integers that can be represented in less than 8 bits.
+
+This document uses [LSB](https://en.wikipedia.org/wiki/Bit_numbering#Least_significant_bit)
+to identify bits. In this representation, a byte is represented
+by `[b7 b6 b5 b4 b3 b2 b1 b0]` where `b0` is the first bit.
+
+This document uses MUST, SHOULD, etc. according to [RFC-8174](https://tools.ietf.org/html/rfc8174).
+
+## Decoding
+
+Decoding a stream of bytes (denoted as `[a1, a2, a3, ...]`) assumes a specific `bit_width`
+indicating the number of bits necessary to represent the largest encoded integer in the stream.
+
+The `bit_width` MUST be sufficient to represent the largest encoded integer on the
+stream or the result is undefined.
+
+The first 4 bytes of the stream MUST represent a little-endian unsigned integer (`uint32`)
+denoting the length of the rest of the stream. For example, `[4u8, 0, 0, 0]` announces
+that the stream has a total of `4 + 4 = 8` bytes. The first 4 bytes are only used for this purpose.
+
+The remaining bytes are divided in "runs", which MUST be either RLE-encoded or bitpacked-encoded.
+Each "run" MUST be composed by a header of a variable number of bytes and by a body in sequence.
+I.e. `[h11, h12, h13, b11, b12, ...]` where `h11` is the first byte of the header
+of run `1`, `h12` the second byte of the header, `b11` is the first byte of the body of the first run.
+
+The header MUST be a single [ULEB128](https://en.wikipedia.org/wiki/LEB128#Unsigned_LEB128)-encoded `i32`,
+here denoted as `h1`. The bytes needed to decode the ULEB128-encoded constitute the header
+and the remaining bytes constitute the body.
+
+The first bit of the last byte of the header denotes whether the run is bitpacked-encoded
+or RLE-encoded. `h1 & 1 == 1` indicates a bitpacked-encoded run, `h1 & 1 != 1` a RLE-encoded run.
+
+### Decoding RLE-encoded runs
+
+Given a header `h1` and the stream of bytes past the header, the number of repetitions
+is given by `repetitions = h1 >> 1`. The number of bytes in the body, `body_length`,
+MUST be the minimum number of bytes that can hold `bit_width`, `body_length = ceil(bit_width, 8)`.

Review comment:
       `body_length = ceil(bit_width, 8)` perhaps?

##########
File path: rle-bitpacked.md
##########
@@ -0,0 +1,120 @@
+# RLE-Bitpacked hybrid encoder
+
+The RLE-Bitpacked hybrid encoder is a parquet-specific encoder that combines
+two well known encoding strategies, [RLE](https://en.wikipedia.org/wiki/Run-length_encoding)
+and bitpacking. Note that "combine" here means this encoder allows both encodings
+within the same stream, and, during encoding, it can switch between them.
+
+This encoder is only used to encode integer values that may either represent definition levels,
+representation levels or ids of dictionary-encoded pages. Note that this encoder
+supports integers that can be represented in less than 8 bits.
+
+This document uses [LSB](https://en.wikipedia.org/wiki/Bit_numbering#Least_significant_bit)
+to identify bits. In this representation, a byte is represented
+by `[b7 b6 b5 b4 b3 b2 b1 b0]` where `b0` is the first bit.
+
+This document uses MUST, SHOULD, etc. according to [RFC-8174](https://tools.ietf.org/html/rfc8174).
+
+## Decoding
+
+Decoding a stream of bytes (denoted as `[a1, a2, a3, ...]`) assumes a specific `bit_width`
+indicating the number of bits necessary to represent the largest encoded integer in the stream.
+
+The `bit_width` MUST be sufficient to represent the largest encoded integer on the
+stream or the result is undefined.
+
+The first 4 bytes of the stream MUST represent a little-endian unsigned integer (`uint32`)
+denoting the length of the rest of the stream. For example, `[4u8, 0, 0, 0]` announces
+that the stream has a total of `4 + 4 = 8` bytes. The first 4 bytes are only used for this purpose.
+
+The remaining bytes are divided in "runs", which MUST be either RLE-encoded or bitpacked-encoded.
+Each "run" MUST be composed by a header of a variable number of bytes and by a body in sequence.
+I.e. `[h11, h12, h13, b11, b12, ...]` where `h11` is the first byte of the header
+of run `1`, `h12` the second byte of the header, `b11` is the first byte of the body of the first run.
+
+The header MUST be a single [ULEB128](https://en.wikipedia.org/wiki/LEB128#Unsigned_LEB128)-encoded `i32`,
+here denoted as `h1`. The bytes needed to decode the ULEB128-encoded constitute the header
+and the remaining bytes constitute the body.
+
+The first bit of the last byte of the header denotes whether the run is bitpacked-encoded
+or RLE-encoded. `h1 & 1 == 1` indicates a bitpacked-encoded run, `h1 & 1 != 1` a RLE-encoded run.
+
+### Decoding RLE-encoded runs
+
+Given a header `h1` and the stream of bytes past the header, the number of repetitions
+is given by `repetitions = h1 >> 1`. The number of bytes in the body, `body_length`,
+MUST be the minimum number of bytes that can hold `bit_width`, `body_length = ceil(bit_width, 8)`.
+The body MUST represent the repeated value in little endian for byte types (e.g. `int32`) and
+[LSB](https://en.wikipedia.org/wiki/Bit_numbering#Least_significant_bit) for boolean types.
+
+### Decoding bitpacked-encoded runs
+
+Given a header `h1` and the stream of bytes past the header, the number of bytes
+in the body, `body_length`, is equal to `body_length = h1 >> 1`.
+The body represents bitpacked-encoded values with exactly `8 / bit_width` items.
+
+Note that for `bit_width = 1`, this encoding corresponds exactly to LSB.
+Thus, in-memory formats that store booleans or validities in `LSB` (e.g. Apache Arrow),
+the bodies can be memcopied _as is_ to a LSB buffer.
+
+### Example
+
+Let's now consider the the following byte stream and a `bit_width = 1`,

Review comment:
       Duplicate "the".

##########
File path: rle-bitpacked.md
##########
@@ -0,0 +1,120 @@
+# RLE-Bitpacked hybrid encoder
+
+The RLE-Bitpacked hybrid encoder is a parquet-specific encoder that combines
+two well known encoding strategies, [RLE](https://en.wikipedia.org/wiki/Run-length_encoding)
+and bitpacking. Note that "combine" here means this encoder allows both encodings
+within the same stream, and, during encoding, it can switch between them.
+
+This encoder is only used to encode integer values that may either represent definition levels,
+representation levels or ids of dictionary-encoded pages. Note that this encoder
+supports integers that can be represented in less than 8 bits.
+
+This document uses [LSB](https://en.wikipedia.org/wiki/Bit_numbering#Least_significant_bit)
+to identify bits. In this representation, a byte is represented
+by `[b7 b6 b5 b4 b3 b2 b1 b0]` where `b0` is the first bit.
+
+This document uses MUST, SHOULD, etc. according to [RFC-8174](https://tools.ietf.org/html/rfc8174).
+
+## Decoding
+
+Decoding a stream of bytes (denoted as `[a1, a2, a3, ...]`) assumes a specific `bit_width`
+indicating the number of bits necessary to represent the largest encoded integer in the stream.
+
+The `bit_width` MUST be sufficient to represent the largest encoded integer on the
+stream or the result is undefined.
+
+The first 4 bytes of the stream MUST represent a little-endian unsigned integer (`uint32`)
+denoting the length of the rest of the stream. For example, `[4u8, 0, 0, 0]` announces
+that the stream has a total of `4 + 4 = 8` bytes. The first 4 bytes are only used for this purpose.
+
+The remaining bytes are divided in "runs", which MUST be either RLE-encoded or bitpacked-encoded.
+Each "run" MUST be composed by a header of a variable number of bytes and by a body in sequence.
+I.e. `[h11, h12, h13, b11, b12, ...]` where `h11` is the first byte of the header
+of run `1`, `h12` the second byte of the header, `b11` is the first byte of the body of the first run.
+
+The header MUST be a single [ULEB128](https://en.wikipedia.org/wiki/LEB128#Unsigned_LEB128)-encoded `i32`,
+here denoted as `h1`. The bytes needed to decode the ULEB128-encoded constitute the header
+and the remaining bytes constitute the body.
+
+The first bit of the last byte of the header denotes whether the run is bitpacked-encoded
+or RLE-encoded. `h1 & 1 == 1` indicates a bitpacked-encoded run, `h1 & 1 != 1` a RLE-encoded run.
+
+### Decoding RLE-encoded runs
+
+Given a header `h1` and the stream of bytes past the header, the number of repetitions
+is given by `repetitions = h1 >> 1`. The number of bytes in the body, `body_length`,
+MUST be the minimum number of bytes that can hold `bit_width`, `body_length = ceil(bit_width, 8)`.
+The body MUST represent the repeated value in little endian for byte types (e.g. `int32`) and
+[LSB](https://en.wikipedia.org/wiki/Bit_numbering#Least_significant_bit) for boolean types.
+
+### Decoding bitpacked-encoded runs
+
+Given a header `h1` and the stream of bytes past the header, the number of bytes
+in the body, `body_length`, is equal to `body_length = h1 >> 1`.
+The body represents bitpacked-encoded values with exactly `8 / bit_width` items.
+
+Note that for `bit_width = 1`, this encoding corresponds exactly to LSB.
+Thus, in-memory formats that store booleans or validities in `LSB` (e.g. Apache Arrow),
+the bodies can be memcopied _as is_ to a LSB buffer.

Review comment:
       Either use plural or singular, not a mix thereof. Also, I don't know what a "LSB buffer" is.

##########
File path: rle-bitpacked.md
##########
@@ -0,0 +1,120 @@
+# RLE-Bitpacked hybrid encoder
+
+The RLE-Bitpacked hybrid encoder is a parquet-specific encoder that combines

Review comment:
       "encoding" rather than "encoder". The encoding is implemented by an encoder + decoder.

##########
File path: rle-bitpacked.md
##########
@@ -0,0 +1,120 @@
+# RLE-Bitpacked hybrid encoder
+
+The RLE-Bitpacked hybrid encoder is a parquet-specific encoder that combines
+two well known encoding strategies, [RLE](https://en.wikipedia.org/wiki/Run-length_encoding)
+and bitpacking. Note that "combine" here means this encoder allows both encodings
+within the same stream, and, during encoding, it can switch between them.
+
+This encoder is only used to encode integer values that may either represent definition levels,
+representation levels or ids of dictionary-encoded pages. Note that this encoder
+supports integers that can be represented in less than 8 bits.
+
+This document uses [LSB](https://en.wikipedia.org/wiki/Bit_numbering#Least_significant_bit)
+to identify bits. In this representation, a byte is represented
+by `[b7 b6 b5 b4 b3 b2 b1 b0]` where `b0` is the first bit.
+
+This document uses MUST, SHOULD, etc. according to [RFC-8174](https://tools.ietf.org/html/rfc8174).
+
+## Decoding
+
+Decoding a stream of bytes (denoted as `[a1, a2, a3, ...]`) assumes a specific `bit_width`
+indicating the number of bits necessary to represent the largest encoded integer in the stream.
+
+The `bit_width` MUST be sufficient to represent the largest encoded integer on the
+stream or the result is undefined.
+
+The first 4 bytes of the stream MUST represent a little-endian unsigned integer (`uint32`)
+denoting the length of the rest of the stream. For example, `[4u8, 0, 0, 0]` announces
+that the stream has a total of `4 + 4 = 8` bytes. The first 4 bytes are only used for this purpose.
+
+The remaining bytes are divided in "runs", which MUST be either RLE-encoded or bitpacked-encoded.
+Each "run" MUST be composed by a header of a variable number of bytes and by a body in sequence.
+I.e. `[h11, h12, h13, b11, b12, ...]` where `h11` is the first byte of the header
+of run `1`, `h12` the second byte of the header, `b11` is the first byte of the body of the first run.
+
+The header MUST be a single [ULEB128](https://en.wikipedia.org/wiki/LEB128#Unsigned_LEB128)-encoded `i32`,
+here denoted as `h1`. The bytes needed to decode the ULEB128-encoded constitute the header
+and the remaining bytes constitute the body.
+
+The first bit of the last byte of the header denotes whether the run is bitpacked-encoded
+or RLE-encoded. `h1 & 1 == 1` indicates a bitpacked-encoded run, `h1 & 1 != 1` a RLE-encoded run.
+
+### Decoding RLE-encoded runs
+
+Given a header `h1` and the stream of bytes past the header, the number of repetitions
+is given by `repetitions = h1 >> 1`. The number of bytes in the body, `body_length`,
+MUST be the minimum number of bytes that can hold `bit_width`, `body_length = ceil(bit_width, 8)`.
+The body MUST represent the repeated value in little endian for byte types (e.g. `int32`) and
+[LSB](https://en.wikipedia.org/wiki/Bit_numbering#Least_significant_bit) for boolean types.
+
+### Decoding bitpacked-encoded runs
+
+Given a header `h1` and the stream of bytes past the header, the number of bytes
+in the body, `body_length`, is equal to `body_length = h1 >> 1`.
+The body represents bitpacked-encoded values with exactly `8 / bit_width` items.
+
+Note that for `bit_width = 1`, this encoding corresponds exactly to LSB.
+Thus, in-memory formats that store booleans or validities in `LSB` (e.g. Apache Arrow),
+the bodies can be memcopied _as is_ to a LSB buffer.
+
+### Example
+
+Let's now consider the the following byte stream and a `bit_width = 1`,
+
+```
+[
+    5, 0, 0, 0,
+    0b00000101, 0b11101011, 0b00000010,
+    0b00010000, 0b00000001,
+    0b00000101,
+    0b00000101,
+]
+```
+
+which could appear in a definition level of a non-nested Parquet type.
+
+We use the first 4 bytes to identify relevant length, within the stream,
+of the encoded bytes. This corresponds to `[5, 0, 0, 0]`, which, in little
+endian, corresponds to `5u32`. We can thus assume that all runs of this encoded
+stream are encoded in 
+
+```
+[
+    0b00000101, 0b11101011, 0b00000010,
+    0b00010000, 0b00000001,
+]
+```
+
+The ULEB128-decoding of this stream yields `h1 = 5i32` and length 1, i.e. the
+first byte is enough to represent the header of the first run. Because `5 & 1 == 1`,
+we conclude that the first run is bitpacked-encoded. `body_length = 5i32 >> 1 = 2`,
+which means that the next two bytes form the body. Bit-unpacking these two
+bytes yields (in `i32`), 
+
+```
+[
+    1, 1, 0, 1, 0, 1, 1, 1, 
+    0, 1, 0, 0, 0, 0, 0, 0
+]
+```
+
+(to understand why this is the case, read `0b11101011, 0b00000010` left to right,
+and bits within a byte right to left).
+
+We proceed through the stream, from which `[0b00010000, 0b00000001]` remains,
+and repeat the operation: the ULEB128-decoding yields `h1 = 16i32`. `16 & 1 == 0`
+and thus the next run is RLE-encoded. We compute the number of repetitions
+via `repetitions = h1 >> 1 = 8`. The body size is `body_length = ceil(bit_width, 8) = 1`,
+which is consistent with `length = 5 = 1 + 2 + 1 + 1`. Since the body is
+`0b00000001 = 1`, we conclude that this run is the number 1 repeated 8 times,

Review comment:
       Is it? `h1 >> 1` is the number of repetitions divided by 8, according to the grammar.

##########
File path: rle-bitpacked.md
##########
@@ -0,0 +1,120 @@
+# RLE-Bitpacked hybrid encoder
+
+The RLE-Bitpacked hybrid encoder is a parquet-specific encoder that combines
+two well known encoding strategies, [RLE](https://en.wikipedia.org/wiki/Run-length_encoding)
+and bitpacking. Note that "combine" here means this encoder allows both encodings
+within the same stream, and, during encoding, it can switch between them.
+
+This encoder is only used to encode integer values that may either represent definition levels,
+representation levels or ids of dictionary-encoded pages. Note that this encoder
+supports integers that can be represented in less than 8 bits.
+
+This document uses [LSB](https://en.wikipedia.org/wiki/Bit_numbering#Least_significant_bit)
+to identify bits. In this representation, a byte is represented
+by `[b7 b6 b5 b4 b3 b2 b1 b0]` where `b0` is the first bit.
+
+This document uses MUST, SHOULD, etc. according to [RFC-8174](https://tools.ietf.org/html/rfc8174).
+
+## Decoding
+
+Decoding a stream of bytes (denoted as `[a1, a2, a3, ...]`) assumes a specific `bit_width`
+indicating the number of bits necessary to represent the largest encoded integer in the stream.
+
+The `bit_width` MUST be sufficient to represent the largest encoded integer on the
+stream or the result is undefined.
+
+The first 4 bytes of the stream MUST represent a little-endian unsigned integer (`uint32`)
+denoting the length of the rest of the stream. For example, `[4u8, 0, 0, 0]` announces
+that the stream has a total of `4 + 4 = 8` bytes. The first 4 bytes are only used for this purpose.
+
+The remaining bytes are divided in "runs", which MUST be either RLE-encoded or bitpacked-encoded.
+Each "run" MUST be composed by a header of a variable number of bytes and by a body in sequence.
+I.e. `[h11, h12, h13, b11, b12, ...]` where `h11` is the first byte of the header
+of run `1`, `h12` the second byte of the header, `b11` is the first byte of the body of the first run.
+
+The header MUST be a single [ULEB128](https://en.wikipedia.org/wiki/LEB128#Unsigned_LEB128)-encoded `i32`,
+here denoted as `h1`. The bytes needed to decode the ULEB128-encoded constitute the header
+and the remaining bytes constitute the body.
+
+The first bit of the last byte of the header denotes whether the run is bitpacked-encoded
+or RLE-encoded. `h1 & 1 == 1` indicates a bitpacked-encoded run, `h1 & 1 != 1` a RLE-encoded run.
+
+### Decoding RLE-encoded runs
+
+Given a header `h1` and the stream of bytes past the header, the number of repetitions
+is given by `repetitions = h1 >> 1`. The number of bytes in the body, `body_length`,
+MUST be the minimum number of bytes that can hold `bit_width`, `body_length = ceil(bit_width, 8)`.
+The body MUST represent the repeated value in little endian for byte types (e.g. `int32`) and
+[LSB](https://en.wikipedia.org/wiki/Bit_numbering#Least_significant_bit) for boolean types.
+
+### Decoding bitpacked-encoded runs
+
+Given a header `h1` and the stream of bytes past the header, the number of bytes
+in the body, `body_length`, is equal to `body_length = h1 >> 1`.
+The body represents bitpacked-encoded values with exactly `8 / bit_width` items.
+
+Note that for `bit_width = 1`, this encoding corresponds exactly to LSB.
+Thus, in-memory formats that store booleans or validities in `LSB` (e.g. Apache Arrow),
+the bodies can be memcopied _as is_ to a LSB buffer.
+
+### Example
+
+Let's now consider the the following byte stream and a `bit_width = 1`,
+
+```
+[
+    5, 0, 0, 0,
+    0b00000101, 0b11101011, 0b00000010,
+    0b00010000, 0b00000001,
+    0b00000101,
+    0b00000101,
+]
+```
+
+which could appear in a definition level of a non-nested Parquet type.
+
+We use the first 4 bytes to identify relevant length, within the stream,
+of the encoded bytes. This corresponds to `[5, 0, 0, 0]`, which, in little
+endian, corresponds to `5u32`. We can thus assume that all runs of this encoded
+stream are encoded in 

Review comment:
       So why are there trailing values above?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org