Posted to commits@parquet.apache.org by ga...@apache.org on 2019/03/05 13:26:52 UTC
[parquet-format] branch master updated: PARQUET-1539: Clarify CRC checksum in page header (#126)
This is an automated email from the ASF dual-hosted git repository.
gabor pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/parquet-format.git
The following commit(s) were added to refs/heads/master by this push:
new db23fe3 PARQUET-1539: Clarify CRC checksum in page header (#126)
db23fe3 is described below
commit db23fe3b7a141ee6b0903af089cbc2bc22a43f97
Author: Boudewijn Braams <36...@users.noreply.github.com>
AuthorDate: Tue Mar 5 14:26:48 2019 +0100
PARQUET-1539: Clarify CRC checksum in page header (#126)
---
README.md | 4 +++-
src/main/thrift/parquet.thrift | 25 +++++++++++++++++++++++--
2 files changed, 26 insertions(+), 3 deletions(-)
diff --git a/README.md b/README.md
index c759be9..01193ae 100644
--- a/README.md
+++ b/README.md
@@ -195,7 +195,9 @@ the reasoning behind adding these to the format.
## Checksumming
Data pages can be individually checksummed. This allows disabling of checksums at the
-HDFS file level, to better support single row lookups.
+HDFS file level, to better support single row lookups. Data page checksums are calculated
+using the standard CRC32 algorithm on the compressed data of a page (not including the
+page header itself).
## Error recovery
If the file metadata is corrupt, the file is lost. If the column metadata is corrupt,
diff --git a/src/main/thrift/parquet.thrift b/src/main/thrift/parquet.thrift
index 7a29b80..4272cc3 100644
--- a/src/main/thrift/parquet.thrift
+++ b/src/main/thrift/parquet.thrift
@@ -604,8 +604,29 @@ struct PageHeader {
/** Compressed page size in bytes (not including this header) **/
3: required i32 compressed_page_size
- /** 32bit crc for the data below. This allows for disabling checksumming in HDFS
- * if only a few pages needs to be read
+ /** The 32bit CRC for the page, to be calculated as follows:
+ * - Using the standard CRC32 algorithm
+ * - On the data only, i.e. this header should not be included. 'Data'
+ * hereby refers to the concatenation of the repetition levels, the
+ * definition levels and the column values, in this exact order.
+ * - On the encoded versions of the repetition levels, definition levels and
+ * column values
+ * - On the compressed versions of the repetition levels, definition levels
+ * and column values where possible;
+ * - For v1 data pages, the repetition levels, definition levels and column
+ * values are always compressed together. If a compression scheme is
+ * specified, the CRC shall be calculated on the compressed version of
+ * this concatenation. If no compression scheme is specified, the CRC
+ * shall be calculated on the uncompressed version of this concatenation.
+ * - For v2 data pages, the repetition levels and definition levels are
+ * handled separately from the data and are never compressed (only
+ * encoded). If a compression scheme is specified, the CRC shall be
+ * calculated on the concatenation of the uncompressed repetition levels,
+ * uncompressed definition levels and the compressed column values.
+ * If no compression scheme is specified, the CRC shall be calculated on
+ * the uncompressed concatenation.
+ * If enabled, this allows for disabling checksumming in HDFS if only a few
+ * pages need to be read.
**/
4: optional i32 crc
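
The rules in the new comment can be illustrated with a minimal Python sketch. It is not part of the commit. It assumes that "the standard CRC32 algorithm" is the CRC-32 computed by Python's zlib.crc32, and that the levels and values are already encoded byte strings. The function names and the compress callable are hypothetical helpers chosen for illustration only.

```python
import zlib


def page_crc_v1(rep_levels, def_levels, values, compress=None):
    """CRC for a v1 data page (hypothetical helper).

    Levels and values are concatenated in that order and compressed
    together, so the CRC covers the compressed concatenation. With no
    compression scheme it covers the uncompressed concatenation.
    """
    body = rep_levels + def_levels + values
    if compress is not None:
        body = compress(body)
    return zlib.crc32(body) & 0xFFFFFFFF


def page_crc_v2(rep_levels, def_levels, values, compress=None):
    """CRC for a v2 data page (hypothetical helper).

    Levels are never compressed, only encoded. The CRC covers the
    uncompressed levels followed by the (possibly compressed) values.
    """
    if compress is not None:
        values = compress(values)
    return zlib.crc32(rep_levels + def_levels + values) & 0xFFFFFFFF


def as_thrift_i32(crc):
    """Reinterpret an unsigned 32-bit CRC as a signed 32-bit integer.

    Assumption: the value goes into the 'crc' field, which is declared
    as a Thrift i32 and is therefore signed.
    """
    return crc - (1 << 32) if crc >= (1 << 31) else crc


if __name__ == "__main__":
    rep, dfn, vals = b"\x01\x02", b"\x03\x04", b"column values"
    print(as_thrift_i32(page_crc_v1(rep, dfn, vals, compress=zlib.compress)))
    print(as_thrift_i32(page_crc_v2(rep, dfn, vals, compress=zlib.compress)))
```

In both cases the page header itself is excluded from the checksum. Without a compression scheme the two functions return the same value, because both reduce to the CRC of the plain concatenation.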