Posted to commits@parquet.apache.org by ga...@apache.org on 2019/12/03 08:35:00 UTC
[parquet-format] branch master updated: PARQUET-1622: Add BYTE_STREAM_SPLIT encoding (#144)
This is an automated email from the ASF dual-hosted git repository.
gabor pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/parquet-format.git
The following commit(s) were added to refs/heads/master by this push:
new ee02ef8 PARQUET-1622: Add BYTE_STREAM_SPLIT encoding (#144)
ee02ef8 is described below
commit ee02ef8c8f33bd3d5ed0582ded7e20439e12d933
Author: martinradev <ma...@gmail.com>
AuthorDate: Tue Dec 3 08:34:53 2019 +0000
PARQUET-1622: Add BYTE_STREAM_SPLIT encoding (#144)
The patch extends the format to add the BYTE_STREAM_SPLIT
encoding and adds documentation for it.
---
Encodings.md | 24 ++++++++++++++++++++++++
src/main/thrift/parquet.thrift | 9 +++++++++
2 files changed, 33 insertions(+)
diff --git a/Encodings.md b/Encodings.md
index 236d8b2..4f56104 100644
--- a/Encodings.md
+++ b/Encodings.md
@@ -261,3 +261,27 @@ For a longer description, see https://en.wikipedia.org/wiki/Incremental_encoding
This is stored as a sequence of delta-encoded prefix lengths (DELTA_BINARY_PACKED), followed by
the suffixes encoded as delta length byte arrays (DELTA_LENGTH_BYTE_ARRAY).
+
+### Byte Stream Split: (BYTE_STREAM_SPLIT = 9)
+
+Supported Types: FLOAT DOUBLE
+
+This encoding does not reduce the size of the data but can lead to a significantly better
+compression ratio and speed when a compression algorithm is used afterwards.
+
+This encoding creates K byte-streams of length N where K is the size in bytes of the data
+type and N is the number of elements in the data sequence.
+The bytes of each value are scattered to the corresponding streams. The 0-th byte goes to the
+0-th stream, the 1-st byte goes to the 1-st stream and so on.
+The streams are concatenated in the following order: 0-th stream, 1-st stream, etc.
+
+Example:
+Original data is three 32-bit floats and for simplicity we look at their raw representation.
+```
+ Element 0 Element 1 Element 2
+Bytes AA BB CC DD 00 11 22 33 A3 B4 C5 D6
+```
+After applying the transformation, the data has the following representation:
+```
+Bytes AA 00 A3 BB 11 B4 CC 22 C5 DD 33 D6
+```
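
Editor's note (not part of the commit): the scatter-and-concatenate step described in the Encodings.md text above can be sketched in a few lines of Python. This is a minimal illustration operating on raw bytes, not the implementation used by any Parquet library; the function names `byte_stream_split_encode` and `byte_stream_split_decode` and the `width` parameter (K, the size in bytes of the data type) are assumptions made for this example.

```python
def byte_stream_split_encode(data: bytes, width: int) -> bytes:
    """Scatter byte k of every value into stream k, then concatenate the streams.

    `width` is K, the size in bytes of one value (4 for FLOAT, 8 for DOUBLE).
    Hypothetical helper for illustration only.
    """
    if width <= 0 or len(data) % width != 0:
        raise ValueError("data length must be a multiple of a positive width")
    # data[k::width] selects byte k of each value, i.e. the k-th stream.
    return b"".join(data[k::width] for k in range(width))


def byte_stream_split_decode(encoded: bytes, width: int) -> bytes:
    """Inverse transform: interleave the K streams back into whole values."""
    if width <= 0 or len(encoded) % width != 0:
        raise ValueError("encoded length must be a multiple of a positive width")
    n = len(encoded) // width  # N, the number of values
    out = bytearray(len(encoded))
    for k in range(width):
        # Stream k occupies bytes [k*n, (k+1)*n) of the encoded buffer.
        out[k::width] = encoded[k * n:(k + 1) * n]
    return bytes(out)


# The three 32-bit values from the example in the commit, as raw bytes.
raw = bytes.fromhex("AABBCCDD00112233A3B4C5D6")
encoded = byte_stream_split_encode(raw, 4)
print(encoded.hex())  # aa00a3bb11b4cc22c5dd33d6
```

The output has the same length as the input, which matches the statement that the encoding does not itself reduce the size of the data; any gain comes from the compressor applied afterwards.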
diff --git a/src/main/thrift/parquet.thrift b/src/main/thrift/parquet.thrift
index 68820ca..0c1a8ea 100644
--- a/src/main/thrift/parquet.thrift
+++ b/src/main/thrift/parquet.thrift
@@ -457,6 +457,15 @@ enum Encoding {
/** Dictionary encoding: the ids are encoded using the RLE encoding
*/
RLE_DICTIONARY = 8;
+
+ /** Encoding for floating-point data.
+ K byte-streams are created where K is the size in bytes of the data type.
+ The individual bytes of an FP value are scattered to the corresponding stream and
+ the streams are concatenated.
+ This itself does not reduce the size of the data but can lead to better compression
+ afterwards.
+ */
+ BYTE_STREAM_SPLIT = 9;
}
/**