Posted to commits@parquet.apache.org by ap...@apache.org on 2021/03/11 19:55:11 UTC
[parquet-format] branch master updated: PARQUET-1996: [Format] Add
interoperable LZ4 codec, deprecate existing LZ4 codec (#168)
This is an automated email from the ASF dual-hosted git repository.
apitrou pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/parquet-format.git
The following commit(s) were added to refs/heads/master by this push:
new 7f06e83 PARQUET-1996: [Format] Add interoperable LZ4 codec, deprecate existing LZ4 codec (#168)
7f06e83 is described below
commit 7f06e838cbd1b7dbd722ff2580b9c2525e37fc46
Author: Antoine Pitrou <an...@python.org>
AuthorDate: Thu Mar 11 20:55:02 2021 +0100
PARQUET-1996: [Format] Add interoperable LZ4 codec, deprecate existing LZ4 codec (#168)
---
Compression.md | 97 ++++++++++++++++++++++++++++++++++++++++++
README.md | 3 ++
src/main/thrift/parquet.thrift | 14 +++---
3 files changed, 108 insertions(+), 6 deletions(-)
diff --git a/Compression.md b/Compression.md
new file mode 100644
index 0000000..43abe8c
--- /dev/null
+++ b/Compression.md
@@ -0,0 +1,97 @@
+<!--
+ - Licensed to the Apache Software Foundation (ASF) under one
+ - or more contributor license agreements. See the NOTICE file
+ - distributed with this work for additional information
+ - regarding copyright ownership. The ASF licenses this file
+ - to you under the Apache License, Version 2.0 (the
+ - "License"); you may not use this file except in compliance
+ - with the License. You may obtain a copy of the License at
+ -
+ - http://www.apache.org/licenses/LICENSE-2.0
+ -
+ - Unless required by applicable law or agreed to in writing,
+ - software distributed under the License is distributed on an
+ - "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ - KIND, either express or implied. See the License for the
+ - specific language governing permissions and limitations
+ - under the License.
+ -->
+
+# Parquet compression definitions
+
+This document contains the specification of all supported compression codecs.
+
+## Overview
+
+Parquet allows the data block inside dictionary pages and data pages to
+be compressed for better space efficiency. The Parquet format supports
+several compression codecs covering different areas in the compression ratio /
+processing cost spectrum.
+
+The detailed specifications of compression codecs are maintained externally
+by their respective authors or maintainers, which we reference hereafter.
+
+For all compression codecs except the deprecated `LZ4` codec, the raw data
+of a (data or dictionary) page is fed *as-is* to the underlying compression
+library, without any additional framing or padding. The information required
+for precise allocation of compressed and decompressed buffers is written
+in the `PageHeader` struct.
+
+## Codecs
+
+### UNCOMPRESSED
+
+No-op codec. Data is left uncompressed.
+
+### SNAPPY
+
+A codec based on the
+[Snappy compression format](https://github.com/google/snappy/blob/master/format_description.txt).
+If any ambiguity arises when implementing this format, the implementation
+provided by the Google Snappy [library](https://github.com/google/snappy/)
+is authoritative.
+
+### GZIP
+
+A codec based on the GZIP format (not the closely-related "zlib" or "deflate"
+formats) defined by [RFC 1952](https://tools.ietf.org/html/rfc1952).
+If any ambiguity arises when implementing this format, the implementation
+provided by the [zlib compression library](https://zlib.net/) is authoritative.
+
+### LZO
+
+A codec based on or interoperable with the
+[LZO compression library](http://www.oberhumer.com/opensource/lzo/).
+
+### BROTLI
+
+A codec based on the Brotli format defined by
+[RFC 7932](https://tools.ietf.org/html/rfc7932).
+If any ambiguity arises when implementing this format, the implementation
+provided by the [Brotli compression library](https://github.com/google/brotli)
+is authoritative.
+
+### LZ4
+
+A **deprecated** codec loosely based on the LZ4 compression algorithm,
+but with an additional undocumented framing scheme. The framing is part
+of the original Hadoop compression library and was historically copied
+first in parquet-mr, then emulated with mixed results by parquet-cpp.
+
+It is strongly suggested that implementors of Parquet writers deprecate
+this compression codec in their user-facing APIs, and advise users to
+switch to the newer, interoperable `LZ4_RAW` codec.
+
+### ZSTD
+
+A codec based on the Zstandard format defined by
+[RFC 8478](https://tools.ietf.org/html/rfc8478). If any ambiguity arises
+when implementing this format, the implementation provided by the
+[Zstandard compression library](https://facebook.github.io/zstd/)
+is authoritative.
+
+### LZ4_RAW
+
+A codec based on the [LZ4 block format](https://github.com/lz4/lz4/blob/dev/doc/lz4_Block_format.md).
+If any ambiguity arises when implementing this format, the implementation
+provided by the [LZ4 compression library](http://www.lz4.org/) is authoritative.
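Illustrative note (not part of the commit): the new Compression.md text says that page data is fed as-is to the compression library, and that the GZIP codec means the RFC 1952 format rather than the related zlib or deflate formats. The following is a minimal sketch of that distinction using only the Python standard library. The payload and the helper names `is_gzip_framed` and `read_page` are hypothetical, and the `uncompressed_page_size` argument is assumed to carry the size recorded in the `PageHeader` struct; this is not how any particular Parquet implementation is written.

```python
import gzip
import zlib

# Hypothetical page payload; a real page would hold encoded values.
page = b"parquet page bytes " * 64

# GZIP codec: the page is compressed as-is into an RFC 1952 stream,
# with no additional framing or padding around it.
gzip_body = gzip.compress(page)

# Closely related formats that the GZIP codec definition excludes:
# zlib (RFC 1950) and raw deflate (RFC 1951).
zlib_body = zlib.compress(page)
deflater = zlib.compressobj(wbits=-15)  # negative wbits selects raw deflate
raw_deflate_body = deflater.compress(page) + deflater.flush()


def is_gzip_framed(buf: bytes) -> bool:
    """Return True if buf starts with the RFC 1952 magic bytes 0x1f 0x8b."""
    return buf[:2] == b"\x1f\x8b"


def read_page(body: bytes, uncompressed_page_size: int) -> bytes:
    """Decompress a GZIP-coded page and check it against the expected size.

    uncompressed_page_size is assumed to come from the page header.
    """
    out = gzip.decompress(body)
    if len(out) != uncompressed_page_size:
        raise ValueError("decompressed size does not match the page header")
    return out


restored = read_page(gzip_body, len(page))
```

Under these assumptions, only `gzip_body` passes the magic-byte check, which is one quick way for a reader to notice a writer that emitted zlib or raw deflate data under the GZIP codec id.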
diff --git a/README.md b/README.md
index 3f83790..85ef6d6 100644
--- a/README.md
+++ b/README.md
@@ -183,6 +183,9 @@ page is only the encoded values.
The supported encodings are described in [Encodings.md](https://github.com/apache/parquet-format/blob/master/Encodings.md)
+The supported compression codecs are described in
+[Compression.md](https://github.com/apache/parquet-format/blob/master/Compression.md)
+
## Column chunks
Column chunks are composed of pages written back to back. The pages share a common
header and readers can skip over pages they are not interested in. The data for the
diff --git a/src/main/thrift/parquet.thrift b/src/main/thrift/parquet.thrift
index 0e091d7..24088c1 100644
--- a/src/main/thrift/parquet.thrift
+++ b/src/main/thrift/parquet.thrift
@@ -471,19 +471,21 @@ enum Encoding {
/**
* Supported compression algorithms.
*
- * Codecs added in 2.4 can be read by readers based on 2.4 and later.
+ * Codecs added in format version X.Y can be read by readers based on X.Y and later.
* Codec support may vary between readers based on the format version and
- * libraries available at runtime. Gzip, Snappy, and LZ4 codecs are
- * widely available, while Zstd and Brotli require additional libraries.
+ * libraries available at runtime.
+ *
+ * See Compression.md for a detailed specification of these algorithms.
*/
enum CompressionCodec {
UNCOMPRESSED = 0;
SNAPPY = 1;
GZIP = 2;
LZO = 3;
- BROTLI = 4; // Added in 2.4
- LZ4 = 5; // Added in 2.4
- ZSTD = 6; // Added in 2.4
+ BROTLI = 4; // Added in 2.4
+ LZ4 = 5; // DEPRECATED (Added in 2.4)
+ ZSTD = 6; // Added in 2.4
+ LZ4_RAW = 7; // Added in 2.9
}
enum PageType {