Posted to commits@parquet.apache.org by ap...@apache.org on 2021/03/11 19:55:11 UTC

[parquet-format] branch master updated: PARQUET-1996: [Format] Add interoperable LZ4 codec, deprecate existing LZ4 codec (#168)

This is an automated email from the ASF dual-hosted git repository.

apitrou pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/parquet-format.git


The following commit(s) were added to refs/heads/master by this push:
     new 7f06e83  PARQUET-1996: [Format] Add interoperable LZ4 codec, deprecate existing LZ4 codec (#168)
7f06e83 is described below

commit 7f06e838cbd1b7dbd722ff2580b9c2525e37fc46
Author: Antoine Pitrou <an...@python.org>
AuthorDate: Thu Mar 11 20:55:02 2021 +0100

    PARQUET-1996: [Format] Add interoperable LZ4 codec, deprecate existing LZ4 codec (#168)
---
 Compression.md                 | 97 ++++++++++++++++++++++++++++++++++++++++++
 README.md                      |  3 ++
 src/main/thrift/parquet.thrift | 14 +++---
 3 files changed, 108 insertions(+), 6 deletions(-)

diff --git a/Compression.md b/Compression.md
new file mode 100644
index 0000000..43abe8c
--- /dev/null
+++ b/Compression.md
@@ -0,0 +1,97 @@
+<!--
+  - Licensed to the Apache Software Foundation (ASF) under one
+  - or more contributor license agreements.  See the NOTICE file
+  - distributed with this work for additional information
+  - regarding copyright ownership.  The ASF licenses this file
+  - to you under the Apache License, Version 2.0 (the
+  - "License"); you may not use this file except in compliance
+  - with the License.  You may obtain a copy of the License at
+  -
+  -   http://www.apache.org/licenses/LICENSE-2.0
+  -
+  - Unless required by applicable law or agreed to in writing,
+  - software distributed under the License is distributed on an
+  - "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  - KIND, either express or implied.  See the License for the
+  - specific language governing permissions and limitations
+  - under the License.
+  -->
+
+# Parquet compression definitions
+
+This document contains the specification of all supported compression codecs.
+
+## Overview
+
+Parquet allows the data block inside dictionary pages and data pages to
+be compressed for better space efficiency. The Parquet format supports
+several compression codecs covering different areas in the compression ratio /
+processing cost spectrum.
+
+The detailed specifications of compression codecs are maintained externally
+by their respective authors or maintainers, which we reference hereafter.
+
+For all compression codecs except the deprecated `LZ4` codec, the raw data
+of a (data or dictionary) page is fed *as-is* to the underlying compression
+library, without any additional framing or padding.  The information required
+for precise allocation of compressed and decompressed buffers is written
+in the `PageHeader` struct.
+
+## Codecs
+
+### UNCOMPRESSED
+
+No-op codec.  Data is left uncompressed.
+
+### SNAPPY
+
+A codec based on the
+[Snappy compression format](https://github.com/google/snappy/blob/master/format_description.txt).
+If any ambiguity arises when implementing this format, the implementation
+provided by the Google [Snappy library](https://github.com/google/snappy/)
+is authoritative.
+
+### GZIP
+
+A codec based on the GZIP format (not the closely-related "zlib" or "deflate"
+formats) defined by [RFC 1952](https://tools.ietf.org/html/rfc1952).
+If any ambiguity arises when implementing this format, the implementation
+provided by the [zlib compression library](https://zlib.net/) is authoritative.
+
+### LZO
+
+A codec based on or interoperable with the
+[LZO compression library](http://www.oberhumer.com/opensource/lzo/).
+
+### BROTLI
+
+A codec based on the Brotli format defined by
+[RFC 7932](https://tools.ietf.org/html/rfc7932).
+If any ambiguity arises when implementing this format, the implementation
+provided by the [Brotli compression library](https://github.com/google/brotli)
+is authoritative.
+
+### LZ4
+
+A **deprecated** codec loosely based on the LZ4 compression algorithm,
+but with an additional undocumented framing scheme.  The framing is part
+of the original Hadoop compression library and was historically copied
+first in parquet-mr, then emulated with mixed results by parquet-cpp.
+
+It is strongly suggested that implementors of Parquet writers deprecate
+this compression codec in their user-facing APIs, and advise users to
+switch to the newer, interoperable `LZ4_RAW` codec.
+
+### ZSTD
+
+A codec based on the Zstandard format defined by
+[RFC 8478](https://tools.ietf.org/html/rfc8478).  If any ambiguity arises
+when implementing this format, the implementation provided by the
+[Zstandard compression library](https://facebook.github.io/zstd/)
+is authoritative.
+
+### LZ4_RAW
+
+A codec based on the [LZ4 block format](https://github.com/lz4/lz4/blob/dev/doc/lz4_Block_format.md).
+If any ambiguity arises when implementing this format, the implementation
+provided by the [LZ4 compression library](http://www.lz4.org/) is authoritative.
diff --git a/README.md b/README.md
index 3f83790..85ef6d6 100644
--- a/README.md
+++ b/README.md
@@ -183,6 +183,9 @@ page is only the encoded values.
 
 The supported encodings are described in [Encodings.md](https://github.com/apache/parquet-format/blob/master/Encodings.md)
 
+The supported compression codecs are described in
+[Compression.md](https://github.com/apache/parquet-format/blob/master/Compression.md)
+
 ## Column chunks
 Column chunks are composed of pages written back to back.  The pages share a common
 header and readers can skip over pages they are not interested in.  The data for the
diff --git a/src/main/thrift/parquet.thrift b/src/main/thrift/parquet.thrift
index 0e091d7..24088c1 100644
--- a/src/main/thrift/parquet.thrift
+++ b/src/main/thrift/parquet.thrift
@@ -471,19 +471,21 @@ enum Encoding {
 /**
  * Supported compression algorithms.
  *
- * Codecs added in 2.4 can be read by readers based on 2.4 and later.
+ * Codecs added in format version X.Y can be read by readers based on X.Y and later.
  * Codec support may vary between readers based on the format version and
- * libraries available at runtime. Gzip, Snappy, and LZ4 codecs are
- * widely available, while Zstd and Brotli require additional libraries.
+ * libraries available at runtime.
+ *
+ * See Compression.md for a detailed specification of these algorithms.
  */
 enum CompressionCodec {
   UNCOMPRESSED = 0;
   SNAPPY = 1;
   GZIP = 2;
   LZO = 3;
-  BROTLI = 4; // Added in 2.4
-  LZ4 = 5;    // Added in 2.4
-  ZSTD = 6;   // Added in 2.4
+  BROTLI = 4;  // Added in 2.4
+  LZ4 = 5;     // DEPRECATED (Added in 2.4)
+  ZSTD = 6;    // Added in 2.4
+  LZ4_RAW = 7; // Added in 2.9
 }
 
 enum PageType {
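
A note on the GZIP entry that this commit adds to Compression.md: the difference between the GZIP format (RFC 1952) and the closely related "zlib" and "deflate" formats can be seen in the first bytes of the compressed output. The Python sketch below uses only the standard `gzip` and `zlib` modules. The `page` payload, the `looks_like_gzip` helper, and the `page_sizes` dict are hypothetical names introduced for illustration and are not part of parquet-format. The sketch also records the two sizes that, per the Overview text, a writer stores in the `PageHeader` struct. The field names `uncompressed_page_size` and `compressed_page_size` come from parquet.thrift, not from this diff.

```python
import gzip
import zlib

# Hypothetical stand-in for the encoded bytes of one data page.
page = b"parquet page payload " * 64

# GZIP codec: an RFC 1952 member, fed the page bytes as-is with no extra framing.
gzip_bytes = gzip.compress(page)

# The related formats that Compression.md says the GZIP codec is NOT:
# "zlib" (RFC 1950 wrapper) and raw "deflate" (RFC 1951, no wrapper at all).
zlib_bytes = zlib.compress(page)
raw_compressor = zlib.compressobj(wbits=-15)  # negative wbits selects raw deflate
deflate_bytes = raw_compressor.compress(page) + raw_compressor.flush()


def looks_like_gzip(buf: bytes) -> bool:
    """Check for the two-byte RFC 1952 magic number (0x1f 0x8b)."""
    return buf[:2] == b"\x1f\x8b"


# Sizes a writer would record in PageHeader so that a reader can allocate
# the compressed and decompressed buffers precisely.
page_sizes = {
    "uncompressed_page_size": len(page),
    "compressed_page_size": len(gzip_bytes),
}

for name, buf in [("gzip", gzip_bytes), ("zlib", zlib_bytes), ("deflate", deflate_bytes)]:
    print(name, looks_like_gzip(buf), len(buf))
```

Only the `gzip` output carries the RFC 1952 magic number, so a reader that receives zlib-wrapped or raw deflate data under the GZIP codec can reject it early instead of failing inside the decompressor.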