You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@parquet.apache.org by GitBox <gi...@apache.org> on 2022/11/23 09:31:46 UTC

[GitHub] [parquet-format] pitrou commented on a diff in pull request #187: PARQUET-2215: [Format] Document overflow handling in DELTA_BINARY_PACKED

pitrou commented on code in PR #187:
URL: https://github.com/apache/parquet-format/pull/187#discussion_r1030207739


##########
Encodings.md:
##########
@@ -153,52 +153,88 @@ repetition and definition levels.
 ### <a name="DELTAENC"></a>Delta Encoding (DELTA_BINARY_PACKED = 5)
 Supported Types: INT32, INT64
 
-This encoding is adapted from the Binary packing described in ["Decoding billions of integers per second through vectorization"](http://arxiv.org/pdf/1209.2137v5.pdf) by D. Lemire and L. Boytsov.
+This encoding is adapted from the Binary packing described in
+["Decoding billions of integers per second through vectorization"](http://arxiv.org/pdf/1209.2137v5.pdf)
+by D. Lemire and L. Boytsov.
 
-In delta encoding we make use of variable length integers for storing various numbers (not the deltas themselves). For unsigned values, we use ULEB128, which is the unsigned version of LEB128 (https://en.wikipedia.org/wiki/LEB128#Unsigned_LEB128). For signed values, we use zigzag encoding (https://developers.google.com/protocol-buffers/docs/encoding#signed-integers) to map negative values to positive ones and apply ULEB128 on the result.
+In delta encoding we make use of variable length integers for storing various
+numbers (not the deltas themselves). For unsigned values, we use ULEB128,
+which is the unsigned version of LEB128 (https://en.wikipedia.org/wiki/LEB128#Unsigned_LEB128).
+For signed values, we use zigzag encoding (https://developers.google.com/protocol-buffers/docs/encoding#signed-integers)
+to map negative values to positive ones and apply ULEB128 on the result.
 
-Delta encoding consists of a header followed by blocks of delta encoded values binary packed. Each block is made of miniblocks, each of them binary packed with its own bit width.
+Delta encoding consists of a header followed by blocks of delta encoded values
+binary packed. Each block is made of miniblocks, each of them binary packed with its own bit width.
 
 The header is defined as follows:
 ```
 <block size in values> <number of miniblocks in a block> <total value count> <first value>
 ```
  * the block size is a multiple of 128; it is stored as a ULEB128 int
- * the miniblock count per block is a divisor of the block size such that their quotient, the number of values in a miniblock, is a multiple of 32; it is stored as a ULEB128 int
+ * the miniblock count per block is a divisor of the block size such that their
+   quotient, the number of values in a miniblock, is a multiple of 32; it is
+   stored as a ULEB128 int
  * the total value count is stored as a ULEB128 int
  * the first value is stored as a zigzag ULEB128 int
 
 Each block contains
 ```
 <min delta> <list of bitwidths of miniblocks> <miniblocks>
 ```
- * the min delta is a zigzag ULEB128 int (we compute a minimum as we need positive integers for bit packing)
+ * the min delta is a zigzag ULEB128 int (we compute a minimum as we need
+   positive integers for bit packing)
  * the bitwidth of each block is stored as a byte
- * each miniblock is a list of bit packed ints according to the bit width stored at the begining of the block
+ * each miniblock is a list of bit packed ints according to the bit width
+   stored at the begining of the block
 
 To encode a block, we will:
 
-1. Compute the differences between consecutive elements. For the first element in the block, use the last element in the previous block or, in the case of the first block, use the first value of the whole sequence, stored in the header.
+1. Compute the differences between consecutive elements. For the first
+   element in the block, use the last element in the previous block or, in
+   the case of the first block, use the first value of the whole sequence,
+   stored in the header.
 
-2. Compute the frame of reference (the minimum of the deltas in the block). Subtract this min delta from all deltas in the block. This guarantees that all values are non-negative.
+2. Compute the frame of reference (the minimum of the deltas in the block).
+   Subtract this min delta from all deltas in the block. This guarantees that
+   all values are non-negative.
 
-3. Encode the frame of reference (min delta) as a zigzag ULEB128 int followed by the bit widths of the miniblocks and the delta values (minus the min delta) bit packed per miniblock.
+3. Encode the frame of reference (min delta) as a zigzag ULEB128 int followed
+   by the bit widths of the miniblocks and the delta values (minus the min
+   delta) bit-packed per miniblock.
 
-Having multiple blocks allows us to adapt to changes in the data by changing the frame of reference (the min delta) which can result in smaller values after the subtraction which, again, means we can store them with a lower bit width.
+Having multiple blocks allows us to adapt to changes in the data by changing
+the frame of reference (the min delta) which can result in smaller values
+after the subtraction which, again, means we can store them with a lower bit width.
 
-If there are not enough values to fill the last miniblock, we pad the miniblock so that its length is always the number of values in a full miniblock multiplied by the bit width. The values of the padding bits should be zero, but readers must accept paddings consisting of arbitrary bits as well.
+If there are not enough values to fill the last miniblock, we pad the miniblock
+so that its length is always the number of values in a full miniblock multiplied
+by the bit width. The values of the padding bits should be zero, but readers
+must accept paddings consisting of arbitrary bits as well.
 
-If, in the last block, less than ```<number of miniblocks in a block>``` miniblocks are needed to store the values, the bytes storing the bit widths of the unneeded miniblocks are still present, their value should be zero, but readers must accept arbitrary values as well. There are no additional padding bytes for the miniblock bodies though, as if their bit widths were 0 (regardless of the actual byte values). The reader knows when to stop reading by keeping track of the number of values read.
+If, in the last block, less than ```<number of miniblocks in a block>```
+miniblocks are needed to store the values, the bytes storing the bit widths
+of the unneeded miniblocks are still present, their value should be zero,
+but readers must accept arbitrary values as well. There are no additional
+padding bytes for the miniblock bodies though, as if their bit widths were 0
+(regardless of the actual byte values). The reader knows when to stop reading
+by keeping track of the number of values read.
+
+Subtractions in steps 1) and 2) may incur signed arithmetic overflow. Overflow
+should be allowed and handled as wrapping around in 2's complement notation.
+This may require explicit care in some programming languages (for example by
+doing all arithmetic in the unsigned domain).

Review Comment:
   Isn't that implied?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@parquet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org