Posted to commits@parquet.apache.org by ga...@apache.org on 2023/03/06 11:34:42 UTC

[parquet-testing] branch master updated (c71ce28 -> b2e7cc7)

This is an automated email from the ASF dual-hosted git repository.

gangwu pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/parquet-testing.git


 discard c71ce28  [ARROW][GH-34147] Add test files for dictionary page crc
    omit 2308cbd  Apply suggestions from code review
    omit 872ade3  remove duplicate files
    omit 239b9ca  re-generate v2, because parquet-mr default use v1
    omit a6dab13  tiny fix readme format
    omit 32c5db5  Adding testfile for dict-crc
     new b2e7cc7  Add test files for dictionary page crc

This update added new revisions after undoing existing revisions.
That is to say, some revisions that were in the old version of the
branch are not in the new version.  This situation occurs
when a user --force pushes a change and generates a repository
containing something like this:

 * -- * -- B -- O -- O -- O   (c71ce28)
            \
             N -- N -- N   refs/heads/master (b2e7cc7)

You should already have received notification emails for all of the O
revisions, and so the following emails describe only the N revisions
from the common base, B.

Any revisions marked "omit" are not gone; other references still
refer to them.  Any revisions marked "discard" are gone forever.

The 1 revision listed above as "new" is entirely new to this
repository and will be described in a separate email.  The revisions
listed as "add" were already present in the repository and have only
been added to this reference.


Summary of changes:

[parquet-testing] 01/01: Add test files for dictionary page crc

Posted by ga...@apache.org.

This is an automated email from the ASF dual-hosted git repository.

gangwu pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/parquet-testing.git

commit b2e7cc755159196e3a068c8594f7acbaecfdaaac
Author: mwish <ma...@gmail.com>
AuthorDate: Thu Feb 23 01:06:25 2023 +0800

    Add test files for dictionary page crc
---
 data/README.md                                     |  21 +++++++++++++++++++++
 data/plain-dict-uncompressed-checksum.parquet      | Bin 0 -> 816 bytes
 data/rle-dict-snappy-checksum.parquet              | Bin 0 -> 822 bytes
 .../rle-dict-uncompressed-corrupt-checksum.parquet | Bin 0 -> 814 bytes
 4 files changed, 21 insertions(+)

diff --git a/data/README.md b/data/README.md
index dd25ade..638f0d1 100644
--- a/data/README.md
+++ b/data/README.md
@@ -41,6 +41,9 @@
 | bloom_filter.bin                               | deprecated bloom filter binary with binary header and murmur3 hashing |
 | bloom_filter.xxhash.bin                        | bloom filter binary with thrift header and xxhash hashing    |
 | nan_in_stats.parquet                           | statistics contains NaN in max, from PyArrow 0.8.0. See note below on "NaN in stats".  |
+| rle-dict-snappy-checksum.parquet                 | compressed and dictionary-encoded INT32 and STRING columns in format v2 with a matching CRC |
+| plain-dict-uncompressed-checksum.parquet         | uncompressed and dictionary-encoded INT32 and STRING columns in format v1 with a matching CRC |
+| rle-dict-uncompressed-corrupt-checksum.parquet   | uncompressed and dictionary-encoded INT32 and STRING columns in format v2 with a mismatching CRC |
 
 TODO: Document what each file is in the table above.
 
@@ -111,6 +114,24 @@ The detailed structure for these files is as follows:
   [ Column "b" [ Page 0 [correct crc] | Uncompressed Contents ][ Page 1 [bad crc] | Uncompressed Contents ]]
   ```
 
+The schema for the `*-dict-*-checksum.parquet` test files is:
+* `data/rle-dict-snappy-checksum.parquet`:
+  ```
+  [ Column "long_field" [ Dict Page [correct crc] | Compressed PLAIN Contents ][ Page 0 [correct crc] | Compressed RLE_DICTIONARY Contents ]]
+  [ Column "binary_field" [ Dict Page [correct crc] | Compressed PLAIN Contents ][ Page 0 [correct crc] | Compressed RLE_DICTIONARY Contents ]]
+  ```
+
+* `data/plain-dict-uncompressed-checksum.parquet`:
+  ```
+  [ Column "long_field" [ Dict Page [correct crc] | Uncompressed PLAIN_DICTIONARY(DICT) Contents ][ Page 0 [correct crc] | Uncompressed PLAIN_DICTIONARY Contents ]]
+  [ Column "binary_field" [ Dict Page [correct crc] | Uncompressed PLAIN_DICTIONARY(DICT) Contents ][ Page 0 [correct crc] | Uncompressed PLAIN_DICTIONARY Contents ]]
+  ```
+
+* `data/rle-dict-uncompressed-corrupt-checksum.parquet`:
+  ```
+  [ Column "long_field" [ Dict Page [bad crc] | Uncompressed PLAIN Contents ][ Page 0 [correct crc] | Uncompressed RLE_DICTIONARY Contents ]]
+  [ Column "binary_field" [ Dict Page [bad crc] | Uncompressed PLAIN Contents ][ Page 0 [correct crc] | Uncompressed RLE_DICTIONARY Contents ]]
+  ```
 ## Bloom Filter Files
 
 Bloom filter examples have been generated by parquet-mr.
diff --git a/data/plain-dict-uncompressed-checksum.parquet b/data/plain-dict-uncompressed-checksum.parquet
new file mode 100644
index 0000000..f49f1c4
Binary files /dev/null and b/data/plain-dict-uncompressed-checksum.parquet differ
diff --git a/data/rle-dict-snappy-checksum.parquet b/data/rle-dict-snappy-checksum.parquet
new file mode 100644
index 0000000..4c183d8
Binary files /dev/null and b/data/rle-dict-snappy-checksum.parquet differ
diff --git a/data/rle-dict-uncompressed-corrupt-checksum.parquet b/data/rle-dict-uncompressed-corrupt-checksum.parquet
new file mode 100644
index 0000000..20e23aa
Binary files /dev/null and b/data/rle-dict-uncompressed-corrupt-checksum.parquet differ
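
The README entries in this commit describe dictionary and data pages whose stored CRC either matches or deliberately mismatches the page contents. As a minimal sketch of what a reader would check, assuming the Parquet format's convention that the checksum is a standard CRC32 (the gzip polynomial) over the page bytes as written to disk, excluding the page header: the helper name `page_crc_matches` and the sample bytes below are hypothetical, and extracting the real page body and `crc` field from a file is left out.

```python
import zlib


def page_crc_matches(page_bytes: bytes, stored_crc: int) -> bool:
    """Return True if the CRC32 of the page body equals the stored crc.

    page_bytes: the page body as stored on disk (after the page header);
                for a compressed page this is the compressed data.
    stored_crc: the crc value from the page header. Thrift encodes it as a
                signed 32-bit integer, so it may be negative.
    """
    # Mask both sides to 32 bits so signed and unsigned forms compare equal.
    return (zlib.crc32(page_bytes) & 0xFFFFFFFF) == (stored_crc & 0xFFFFFFFF)


# Hypothetical page body, standing in for bytes read from a test file.
body = b"example dictionary page bytes"
good_crc = zlib.crc32(body)

print(page_crc_matches(body, good_crc))      # matching checksum
print(page_crc_matches(body, good_crc ^ 1))  # corrupted checksum
```

Under these assumptions, a reader with checksum verification enabled would be expected to accept the dictionary pages in the two `[correct crc]` files and to reject the `[bad crc]` dictionary pages in `rle-dict-uncompressed-corrupt-checksum.parquet`.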