Posted to commits@parquet.apache.org by ga...@apache.org on 2019/03/05 13:26:52 UTC

[parquet-format] branch master updated: PARQUET-1539: Clarify CRC checksum in page header (#126)

This is an automated email from the ASF dual-hosted git repository.

gabor pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/parquet-format.git


The following commit(s) were added to refs/heads/master by this push:
     new db23fe3  PARQUET-1539: Clarify CRC checksum in page header (#126)
db23fe3 is described below

commit db23fe3b7a141ee6b0903af089cbc2bc22a43f97
Author: Boudewijn Braams <36...@users.noreply.github.com>
AuthorDate: Tue Mar 5 14:26:48 2019 +0100

    PARQUET-1539: Clarify CRC checksum in page header (#126)
---
 README.md                      |  4 +++-
 src/main/thrift/parquet.thrift | 25 +++++++++++++++++++++++--
 2 files changed, 26 insertions(+), 3 deletions(-)

diff --git a/README.md b/README.md
index c759be9..01193ae 100644
--- a/README.md
+++ b/README.md
@@ -195,7 +195,9 @@ the reasoning behind adding these to the format.
 
 ## Checksumming
 Data pages can be individually checksummed.  This allows disabling of checksums at the
-HDFS file level, to better support single row lookups.
+HDFS file level, to better support single row lookups. Data page checksums are calculated
+using the standard CRC32 algorithm on the compressed data of a page (not including the
+page header itself).
 
 ## Error recovery
 If the file metadata is corrupt, the file is lost.  If the column metadata is corrupt,
diff --git a/src/main/thrift/parquet.thrift b/src/main/thrift/parquet.thrift
index 7a29b80..4272cc3 100644
--- a/src/main/thrift/parquet.thrift
+++ b/src/main/thrift/parquet.thrift
@@ -604,8 +604,29 @@ struct PageHeader {
   /** Compressed page size in bytes (not including this header) **/
   3: required i32 compressed_page_size
 
-  /** 32bit crc for the data below. This allows for disabling checksumming in HDFS
-   *  if only a few pages needs to be read
+  /** The 32bit CRC for the page, to be calculated as follows:
+   * - Using the standard CRC32 algorithm
+   * - On the data only, i.e. this header should not be included. 'Data'
+   *   hereby refers to the concatenation of the repetition levels, the
+   *   definition levels and the column values, in this exact order.
+   * - On the encoded versions of the repetition levels, definition levels and
+   *   column values
+   * - On the compressed versions of the repetition levels, definition levels
+   *   and column values where possible;
+   *   - For v1 data pages, the repetition levels, definition levels and column
+   *     values are always compressed together. If a compression scheme is
+   *     specified, the CRC shall be calculated on the compressed version of
+   *     this concatenation. If no compression scheme is specified, the CRC
+   *     shall be calculated on the uncompressed version of this concatenation.
+   *   - For v2 data pages, the repetition levels and definition levels are
+   *     handled separately from the data and are never compressed (only
+   *     encoded). If a compression scheme is specified, the CRC shall be
+   *     calculated on the concatenation of the uncompressed repetition levels,
+   *     uncompressed definition levels and the compressed column values.
+   *     If no compression scheme is specified, the CRC shall be calculated on
+   *     the uncompressed concatenation.
+   * If enabled, this allows for disabling checksumming in HDFS if only a few
+   * pages need to be read.
    **/
   4: optional i32 crc
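
Illustrative sketch (not part of the commit): the rules in the new comment could be exercised roughly as follows. It assumes that "standard CRC32" means the CRC-32 variant implemented by zlib/gzip, and that the caller already holds the page bytes exactly as they are written after the page header. The function names page_crc_v1, page_crc_v2 and to_thrift_i32 are hypothetical and do not come from any Parquet library.

```python
import zlib


def page_crc_v1(page_body: bytes) -> int:
    """CRC for a v1 data page (hypothetical helper).

    page_body is the concatenation of repetition levels, definition levels
    and column values as stored in the file: compressed as one unit if a
    codec is specified, otherwise the uncompressed encoded bytes.
    The page header is not included.
    """
    return zlib.crc32(page_body) & 0xFFFFFFFF


def page_crc_v2(rep_levels: bytes, def_levels: bytes, values: bytes) -> int:
    """CRC for a v2 data page (hypothetical helper).

    rep_levels and def_levels are encoded but never compressed; values are
    the column values as stored (compressed if a codec is specified).
    Feeding the three parts in order is equivalent to checksumming their
    concatenation.
    """
    crc = zlib.crc32(rep_levels)
    crc = zlib.crc32(def_levels, crc)
    crc = zlib.crc32(values, crc)
    return crc & 0xFFFFFFFF


def to_thrift_i32(crc: int) -> int:
    """Reinterpret an unsigned 32-bit CRC as a signed value.

    The crc field is declared as i32, which is signed in Thrift, so values
    of 2**31 and above wrap to negative numbers.
    """
    return crc - (1 << 32) if crc >= (1 << 31) else crc
```

A reader would compute the same value over the bytes it read and compare it with the stored field after applying the same signed conversion. The header bytes never enter the calculation in either page version.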