Posted to commits@parquet.apache.org by ap...@apache.org on 2023/11/09 16:54:43 UTC
(parquet-testing) branch master updated: PARQUET-758: Add files with Float16 column (#40)
This is an automated email from the ASF dual-hosted git repository.
apitrou pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/parquet-testing.git
The following commit(s) were added to refs/heads/master by this push:
new 506afff PARQUET-758: Add files with Float16 column (#40)
506afff is described below
commit 506afff9b6957ffe10d08470d467867d43e1bb91
Author: Ben Harkins <60...@users.noreply.github.com>
AuthorDate: Thu Nov 9 11:54:37 2023 -0500
PARQUET-758: Add files with Float16 column (#40)
---------
Co-authored-by: Antoine Pitrou <pi...@free.fr>
---
data/README.md | 100 ++++++++++++++++++++++++++++++++-
data/float16_nonzeros_and_nans.parquet | Bin 0 -> 505 bytes
data/float16_zeros_and_nans.parquet | Bin 0 -> 493 bytes
3 files changed, 98 insertions(+), 2 deletions(-)
diff --git a/data/README.md b/data/README.md
index b8534fe..3b6cae7 100644
--- a/data/README.md
+++ b/data/README.md
@@ -45,6 +45,8 @@
| plain-dict-uncompressed-checksum.parquet | uncompressed and dictionary-encoded INT32 and STRING columns in format v1 with a matching CRC |
| rle-dict-uncompressed-corrupt-checksum.parquet | uncompressed and dictionary-encoded INT32 and STRING columns in format v2 with a mismatching CRC |
| large_string_map.brotli.parquet | MAP(STRING, INT32) with a string column chunk of more than 2GB. See [note](#large-string-map) below |
+| float16_nonzeros_and_nans.parquet | Float16 (logical type) column with NaNs and nonzero finite min/max values |
+| float16_zeros_and_nans.parquet | Float16 (logical type) column with NaNs and zeros as min/max values. See [note](#float16-files) below |
TODO: Document what each file is in the table above.
@@ -94,7 +96,7 @@ The schema for the `datapage_v1-*-checksum.parquet` test files is:
message m {
required int32 a;
required int32 b;
-}
+}
```
The detailed structure for these files is as follows:
@@ -182,7 +184,7 @@ metadata = pq.read_metadata("nan_in_stats.parquet")
metadata.row_group(0).column(0)
# <pyarrow._parquet.ColumnChunkMetaData object at 0x7f28539e58f0>
# file_offset: 88
-# file_path:
+# file_path:
# type: DOUBLE
# num_values: 2
# path_in_schema: x
@@ -223,3 +225,97 @@ pq.write_table(tab, "test.parquet", compression='BROTLI')
It is meant to exercise reading of structured data where each value
is smaller than 2GB but the combined uncompressed column chunk size
is greater than 2GB.
+
+## Float16 Files
+
+The files `float16_zeros_and_nans.parquet` and `float16_nonzeros_and_nans.parquet`
+are meant to exercise a variety of test cases regarding `Float16` columns (which
+are represented as 2-byte `FixedLenByteArray`s), including:
+* Basic binary representations of standard values, +/- zeros, and NaN
+* Comparisons between finite values
+* Exclusion of NaNs from statistics min/max
+* Normalizing min/max values when only zeros are present (i.e. `min` is always -0 and `max` is always +0)
+
+The aforementioned files were generated with:
+
+```python
+import pyarrow as pa
+import pyarrow.parquet as pq
+import numpy as np
+
+t1 = pa.Table.from_arrays(
+ [pa.array([None,
+ np.float16(0.0),
+ np.float16(np.nan)], type=pa.float16())],
+ names="x")
+t2 = pa.Table.from_arrays(
+ [pa.array([None,
+ np.float16(1.0),
+ np.float16(-2.0),
+ np.float16(np.nan),
+ np.float16(0.0),
+ np.float16(-1.0),
+ np.float16(-0.0),
+ np.float16(2.0)],
+ type=pa.float16())],
+ names="x")
+
+pq.write_table(t1, "float16_zeros_and_nans.parquet")
+pq.write_table(t2, "float16_nonzeros_and_nans.parquet")
+
+m1 = pq.read_metadata("float16_zeros_and_nans.parquet")
+m2 = pq.read_metadata("float16_nonzeros_and_nans.parquet")
+
+print(m1.row_group(0).column(0))
+print(m2.row_group(0).column(0))
+# <pyarrow._parquet.ColumnChunkMetaData object at 0x7f24d48c4d60>
+# file_offset: 72
+# file_path:
+# physical_type: FIXED_LEN_BYTE_ARRAY
+# num_values: 3
+# path_in_schema: x
+# is_stats_set: True
+# statistics:
+# <pyarrow._parquet.Statistics object at 0x7f24d48c4ea0>
+# has_min_max: True
+# min: b'\x00\x80'
+# max: b'\x00\x00'
+# null_count: 1
+# distinct_count: None
+# num_values: 2
+# physical_type: FIXED_LEN_BYTE_ARRAY
+# logical_type: Float16
+# converted_type (legacy): NONE
+# compression: SNAPPY
+# encodings: ('PLAIN', 'RLE', 'RLE_DICTIONARY')
+# has_dictionary_page: True
+# dictionary_page_offset: 4
+# data_page_offset: 24
+# total_compressed_size: 68
+# total_uncompressed_size: 64
+# <pyarrow._parquet.ColumnChunkMetaData object at 0x7f24d48c4d60>
+# file_offset: 84
+# file_path:
+# physical_type: FIXED_LEN_BYTE_ARRAY
+# num_values: 8
+# path_in_schema: x
+# is_stats_set: True
+# statistics:
+# <pyarrow._parquet.Statistics object at 0x7f24d48c4e50>
+# has_min_max: True
+# min: b'\x00\xc0'
+# max: b'\x00@'
+# null_count: 1
+# distinct_count: None
+# num_values: 7
+# physical_type: FIXED_LEN_BYTE_ARRAY
+# logical_type: Float16
+# converted_type (legacy): NONE
+# compression: SNAPPY
+# encodings: ('PLAIN', 'RLE', 'RLE_DICTIONARY')
+# has_dictionary_page: True
+# dictionary_page_offset: 4
+# data_page_offset: 34
+# total_compressed_size: 80
+# total_uncompressed_size: 76
+```
diff --git a/data/float16_nonzeros_and_nans.parquet b/data/float16_nonzeros_and_nans.parquet
new file mode 100644
index 0000000..eecebde
Binary files /dev/null and b/data/float16_nonzeros_and_nans.parquet differ
diff --git a/data/float16_zeros_and_nans.parquet b/data/float16_zeros_and_nans.parquet
new file mode 100644
index 0000000..61ea6ce
Binary files /dev/null and b/data/float16_zeros_and_nans.parquet differ
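
The `min`/`max` byte strings printed in the README output above are 2-byte little-endian IEEE 754 half-precision values. As a rough illustration (not part of the commit), the sketch below decodes them with the standard-library `struct` module and reproduces the statistics rules the README lists: nulls and NaNs are excluded, and a zero `min` is written as -0 while a zero `max` is written as +0. The helper names `decode_float16`, `encode_float16`, and `float16_min_max` are hypothetical and exist only for this example; real writers may apply these rules differently.

```python
import math
import struct


def decode_float16(raw: bytes) -> float:
    """Decode a 2-byte little-endian half-precision value ('e' format)."""
    (value,) = struct.unpack("<e", raw)
    return value


def encode_float16(value: float) -> bytes:
    """Encode a Python float as 2 little-endian half-precision bytes."""
    return struct.pack("<e", value)


def float16_min_max(values):
    """Hypothetical helper mirroring the rules described in the README.

    Nulls (None) and NaNs are skipped. If the minimum is a zero it is
    stored as -0.0, and if the maximum is a zero it is stored as +0.0.
    Returns None when no comparable value remains.
    """
    kept = [v for v in values if v is not None and not math.isnan(v)]
    if not kept:
        return None
    lo, hi = min(kept), max(kept)
    if lo == 0.0:  # true for both +0.0 and -0.0
        lo = -0.0
    if hi == 0.0:
        hi = 0.0
    return encode_float16(lo), encode_float16(hi)


# Values taken from the generation script in the README.
zeros_and_nans = [None, 0.0, float("nan")]
nonzeros_and_nans = [None, 1.0, -2.0, float("nan"), 0.0, -1.0, -0.0, 2.0]

zeros_stats = float16_min_max(zeros_and_nans)
nonzeros_stats = float16_min_max(nonzeros_and_nans)
```

Under these assumptions, `zeros_stats` and `nonzeros_stats` should match the `min`/`max` bytes shown in the metadata dumps for the two files, and `decode_float16(b'\x00\xc0')` gives `-2.0`.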