Posted to commits@parquet.apache.org by ga...@apache.org on 2019/12/03 08:35:00 UTC

[parquet-format] branch master updated: PARQUET-1622: Add BYTE_STREAM_SPLIT encoding (#144)

This is an automated email from the ASF dual-hosted git repository.

gabor pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/parquet-format.git


The following commit(s) were added to refs/heads/master by this push:
     new ee02ef8  PARQUET-1622: Add BYTE_STREAM_SPLIT encoding (#144)
ee02ef8 is described below

commit ee02ef8c8f33bd3d5ed0582ded7e20439e12d933
Author: martinradev <ma...@gmail.com>
AuthorDate: Tue Dec 3 08:34:53 2019 +0000

    PARQUET-1622: Add BYTE_STREAM_SPLIT encoding (#144)
    
    The patch extends the format to add the BYTE_STREAM_SPLIT
    encoding and adds documentation for it.
---
 Encodings.md                   | 24 ++++++++++++++++++++++++
 src/main/thrift/parquet.thrift |  9 +++++++++
 2 files changed, 33 insertions(+)

diff --git a/Encodings.md b/Encodings.md
index 236d8b2..4f56104 100644
--- a/Encodings.md
+++ b/Encodings.md
@@ -261,3 +261,27 @@ For a longer description, see https://en.wikipedia.org/wiki/Incremental_encoding
 
 This is stored as a sequence of delta-encoded prefix lengths (DELTA_BINARY_PACKED), followed by
 the suffixes encoded as delta length byte arrays (DELTA_LENGTH_BYTE_ARRAY).
+
+### Byte Stream Split: (BYTE_STREAM_SPLIT = 9)
+
+Supported Types: FLOAT, DOUBLE
+
+This encoding does not reduce the size of the data but can lead to a significantly better
+compression ratio and speed when a compression algorithm is used afterwards.
+
+This encoding creates K byte-streams of length N, where K is the size in bytes of the data
+type and N is the number of elements in the data sequence.
+The bytes of each value are scattered to the corresponding streams: the 0th byte goes to the
+0th stream, the 1st byte goes to the 1st stream, and so on.
+The streams are concatenated in the following order: 0th stream, 1st stream, etc.
+
+Example:
+Original data is three 32-bit floats and for simplicity we look at their raw representation.
+```
+       Element 0      Element 1      Element 2
+Bytes  AA BB CC DD    00 11 22 33    A3 B4 C5 D6
+```
+After applying the transformation, the data has the following representation:
+```
+Bytes  AA 00 A3 BB 11 B4 CC 22 C5 DD 33 D6
+```
diff --git a/src/main/thrift/parquet.thrift b/src/main/thrift/parquet.thrift
index 68820ca..0c1a8ea 100644
--- a/src/main/thrift/parquet.thrift
+++ b/src/main/thrift/parquet.thrift
@@ -457,6 +457,15 @@ enum Encoding {
   /** Dictionary encoding: the ids are encoded using the RLE encoding
    */
   RLE_DICTIONARY = 8;
+
+  /** Encoding for floating-point data.
+      K byte-streams are created where K is the size in bytes of the data type.
+      The individual bytes of an FP value are scattered to the corresponding stream and
+      the streams are concatenated.
+      This itself does not reduce the size of the data but can lead to better compression
+      afterwards.
+   */
+  BYTE_STREAM_SPLIT = 9;
 }
 
 /**
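
The transformation documented in the Encodings.md hunk above can be sketched in a few lines of Python. This is an illustration only and not part of the commit: the function names `byte_stream_split_encode` and `byte_stream_split_decode` are hypothetical, and the sketch works on the raw bytes of the values, with `width` standing for K (4 for FLOAT, 8 for DOUBLE).

```python
def byte_stream_split_encode(data: bytes, width: int) -> bytes:
    """Scatter byte k of every value into stream k, then concatenate the streams.

    `width` is K, the size in bytes of one value. Hypothetical helper.
    """
    if len(data) % width != 0:
        raise ValueError("data length must be a multiple of the value width")
    # data[k::width] picks byte k of each value, i.e. the k-th stream.
    return b"".join(data[k::width] for k in range(width))


def byte_stream_split_decode(encoded: bytes, width: int) -> bytes:
    """Inverse of the encode step: gather one byte per stream for each value."""
    if len(encoded) % width != 0:
        raise ValueError("encoded length must be a multiple of the value width")
    n = len(encoded) // width  # N, the number of values
    out = bytearray(len(encoded))
    for k in range(width):
        # Stream k occupies bytes [k*n, (k+1)*n) of the encoded buffer.
        out[k::width] = encoded[k * n:(k + 1) * n]
    return bytes(out)


# The three 32-bit values from the Encodings.md example, as raw bytes.
raw = bytes.fromhex("AABBCCDD00112233A3B4C5D6")
encoded = byte_stream_split_encode(raw, 4)
print(encoded.hex().upper())  # AA00A3BB11B4CC22C5DD33D6
```

The encoded buffer has the same length as the input, which matches the note that the encoding does not reduce the data size by itself; any gain comes from the compression step applied afterwards.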