Posted to commits@parquet.apache.org by ga...@apache.org on 2021/06/24 11:56:34 UTC
[parquet-format] branch master updated: Document dictionary page position (#177)
This is an automated email from the ASF dual-hosted git repository.
gabor pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/parquet-format.git
The following commit(s) were added to refs/heads/master by this push:
new 43c891a Document dictionary page position (#177)
43c891a is described below
commit 43c891a4494f85e2fe0e56f4ef408bcc60e8da48
Author: Gabor Szadovszky <ga...@apache.org>
AuthorDate: Thu Jun 24 13:56:23 2021 +0200
Document dictionary page position (#177)
---
README.md | 7 +++++++
src/main/thrift/parquet.thrift | 5 +++++
2 files changed, 12 insertions(+)
diff --git a/README.md b/README.md
index ac7c791..f5478c8 100644
--- a/README.md
+++ b/README.md
@@ -191,6 +191,13 @@ header and readers can skip over pages they are not interested in. The data for
page follows the header and can be compressed and/or encoded. The compression and
encoding is specified in the page metadata.
+A column chunk might be partly or completely dictionary encoded. It means that
+dictionary indexes are saved in the data pages instead of the actual values. The
+actual values are stored in the dictionary page. See details in
+[Encodings.md](https://github.com/apache/parquet-format/blob/master/Encodings.md#dictionary-encoding-plain_dictionary--2-and-rle_dictionary--8).
+The dictionary page must be placed at the first position of the column chunk. At
+most one dictionary page can be placed in a column chunk.
+
Additionally, files can contain an optional column index to allow readers to
skip pages more efficiently. See [PageIndex.md](PageIndex.md) for details and
the reasoning behind adding these to the format.
diff --git a/src/main/thrift/parquet.thrift b/src/main/thrift/parquet.thrift
index 05690e4..81a7cf8 100644
--- a/src/main/thrift/parquet.thrift
+++ b/src/main/thrift/parquet.thrift
@@ -534,6 +534,11 @@ struct IndexPageHeader {
// TODO
}
+/**
+ * The dictionary page must be placed at the first position of the column chunk
+ * if it is partly or completely dictionary encoded. At most one dictionary page
+ * can be placed in a column chunk.
+ **/
struct DictionaryPageHeader {
/** Number of values in the dictionary **/
1: required i32 num_values;
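
The placement rule documented in this commit can be illustrated with a minimal Python sketch. It is not the parquet-mr or parquet-format implementation: the `Page` class, the "DICTIONARY" and "DATA" labels, and the three function names are hypothetical, and real pages carry Thrift headers and encoded bytes rather than Python lists. The sketch only covers a completely dictionary-encoded chunk. It shows the dictionary page being written first, the data pages holding indexes instead of values, and a reader-side check that at most one dictionary page exists and that it sits at position 0.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Page:
    kind: str      # hypothetical label: "DICTIONARY" or "DATA"
    payload: list  # dictionary values, or dictionary indexes for data pages


def dictionary_encode(values: list, page_size: int = 4) -> List[Page]:
    """Build a column chunk: one dictionary page first, then index-only data pages."""
    dictionary: list = []
    index_of: dict = {}
    indexes: List[int] = []
    for value in values:
        if value not in index_of:
            index_of[value] = len(dictionary)
            dictionary.append(value)
        indexes.append(index_of[value])

    # The dictionary page goes at the first position of the chunk.
    pages = [Page("DICTIONARY", dictionary)]
    for start in range(0, len(indexes), page_size):
        pages.append(Page("DATA", indexes[start:start + page_size]))
    return pages


def check_dictionary_position(pages: List[Page]) -> None:
    """Reject chunks with more than one dictionary page or a misplaced one."""
    positions = [i for i, page in enumerate(pages) if page.kind == "DICTIONARY"]
    if len(positions) > 1:
        raise ValueError("at most one dictionary page is allowed per column chunk")
    if positions and positions[0] != 0:
        raise ValueError("the dictionary page must be the first page of the column chunk")


def decode(pages: List[Page]) -> list:
    """Resolve the indexes in the data pages against the dictionary page."""
    check_dictionary_position(pages)
    if not pages or pages[0].kind != "DICTIONARY":
        raise ValueError("this sketch only handles dictionary-encoded chunks")
    dictionary = pages[0].payload
    return [dictionary[i] for page in pages[1:] for i in page.payload]


chunk = dictionary_encode(["a", "b", "a", "c", "b", "a"])
print(chunk[0].payload)  # dictionary values, in order of first appearance
print(decode(chunk))     # the original values, recovered through the dictionary
```

Because the dictionary page is always first, a reader in this sketch can load it once and then resolve every later data page without seeking backwards.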