Posted to commits@iceberg.apache.org by yy...@apache.org on 2021/04/28 21:33:15 UTC
[iceberg] branch master updated: Core: fix NPE in manifests table for contains_nan column, update spec (#2521)
This is an automated email from the ASF dual-hosted git repository.
yyan pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/iceberg.git
The following commit(s) were added to refs/heads/master by this push:
new e8ca53b Core: fix NPE in manifests table for contains_nan column, update spec (#2521)
e8ca53b is described below
commit e8ca53b9794245fc89ce87cbf6630679c1301298
Author: yyanyy <ya...@amazon.com>
AuthorDate: Wed Apr 28 14:32:59 2021 -0700
Core: fix NPE in manifests table for contains_nan column, update spec (#2521)
---
.../java/org/apache/iceberg/ManifestsTable.java | 2 +-
site/docs/spark-queries.md | 33 ++++++++++++++--------
site/docs/spec.md | 3 +-
3 files changed, 24 insertions(+), 14 deletions(-)
diff --git a/core/src/main/java/org/apache/iceberg/ManifestsTable.java b/core/src/main/java/org/apache/iceberg/ManifestsTable.java
index e7b9222..c75daf6 100644
--- a/core/src/main/java/org/apache/iceberg/ManifestsTable.java
+++ b/core/src/main/java/org/apache/iceberg/ManifestsTable.java
@@ -38,7 +38,7 @@ public class ManifestsTable extends BaseMetadataTable {
Types.NestedField.required(7, "deleted_data_files_count", Types.IntegerType.get()),
Types.NestedField.required(8, "partition_summaries", Types.ListType.ofRequired(9, Types.StructType.of(
Types.NestedField.required(10, "contains_null", Types.BooleanType.get()),
- Types.NestedField.required(11, "contains_nan", Types.BooleanType.get()),
+ Types.NestedField.optional(11, "contains_nan", Types.BooleanType.get()),
Types.NestedField.optional(12, "lower_bound", Types.StringType.get()),
Types.NestedField.optional(13, "upper_bound", Types.StringType.get())
)))
diff --git a/site/docs/spark-queries.md b/site/docs/spark-queries.md
index f7a78b5..687606f 100644
--- a/site/docs/spark-queries.md
+++ b/site/docs/spark-queries.md
@@ -210,13 +210,13 @@ To show a table's data files and each file's metadata, run:
SELECT * FROM prod.db.table.files
```
```text
-+-------------------------------------------------------------------------+-------------+--------------+--------------------+--------------------+------------------+-------------------+-----------------+-----------------+--------------+---------------+
-| file_path | file_format | record_count | file_size_in_bytes | column_sizes | value_counts | null_value_counts | lower_bounds | upper_bounds | key_metadata | split_offsets |
-+-------------------------------------------------------------------------+-------------+--------------+--------------------+--------------------+------------------+-------------------+-----------------+-----------------+--------------+---------------+
-| s3:/.../table/data/00000-3-8d6d60e8-d427-4809-bcf0-f5d45a4aad96.parquet | PARQUET | 1 | 597 | [1 -> 90, 2 -> 62] | [1 -> 1, 2 -> 1] | [1 -> 0, 2 -> 0] | [1 -> , 2 -> c] | [1 -> , 2 -> c] | null | [4] |
-| s3:/.../table/data/00001-4-8d6d60e8-d427-4809-bcf0-f5d45a4aad96.parquet | PARQUET | 1 | 597 | [1 -> 90, 2 -> 62] | [1 -> 1, 2 -> 1] | [1 -> 0, 2 -> 0] | [1 -> , 2 -> b] | [1 -> , 2 -> b] | null | [4] |
-| s3:/.../table/data/00002-5-8d6d60e8-d427-4809-bcf0-f5d45a4aad96.parquet | PARQUET | 1 | 597 | [1 -> 90, 2 -> 62] | [1 -> 1, 2 -> 1] | [1 -> 0, 2 -> 0] | [1 -> , 2 -> a] | [1 -> , 2 -> a] | null | [4] |
-+-------------------------------------------------------------------------+-------------+--------------+--------------------+--------------------+------------------+-------------------+-----------------+-----------------+--------------+---------------+
++-------------------------------------------------------------------------+-------------+--------------+--------------------+--------------------+------------------+-------------------+------------------+-----------------+-----------------+--------------+---------------+
+| file_path | file_format | record_count | file_size_in_bytes | column_sizes | value_counts | null_value_counts | nan_value_counts | lower_bounds | upper_bounds | key_metadata | split_offsets |
++-------------------------------------------------------------------------+-------------+--------------+--------------------+--------------------+------------------+-------------------+------------------+-----------------+-----------------+--------------+---------------+
+| s3:/.../table/data/00000-3-8d6d60e8-d427-4809-bcf0-f5d45a4aad96.parquet | PARQUET | 1 | 597 | [1 -> 90, 2 -> 62] | [1 -> 1, 2 -> 1] | [1 -> 0, 2 -> 0] | [] | [1 -> , 2 -> c] | [1 -> , 2 -> c] | null | [4] |
+| s3:/.../table/data/00001-4-8d6d60e8-d427-4809-bcf0-f5d45a4aad96.parquet | PARQUET | 1 | 597 | [1 -> 90, 2 -> 62] | [1 -> 1, 2 -> 1] | [1 -> 0, 2 -> 0] | [] | [1 -> , 2 -> b] | [1 -> , 2 -> b] | null | [4] |
+| s3:/.../table/data/00002-5-8d6d60e8-d427-4809-bcf0-f5d45a4aad96.parquet | PARQUET | 1 | 597 | [1 -> 90, 2 -> 62] | [1 -> 1, 2 -> 1] | [1 -> 0, 2 -> 0] | [] | [1 -> , 2 -> a] | [1 -> , 2 -> a] | null | [4] |
++-------------------------------------------------------------------------+-------------+--------------+--------------------+--------------------+------------------+-------------------+------------------+-----------------+-----------------+--------------+---------------+
```
### Manifests
@@ -227,13 +227,22 @@ To show a table's file manifests and each file's metadata, run:
SELECT * FROM prod.db.table.manifests
```
```text
-+----------------------------------------------------------------------+--------+-------------------+---------------------+------------------------+---------------------------+--------------------------+---------------------------------+
-| path | length | partition_spec_id | added_snapshot_id | added_data_files_count | existing_data_files_count | deleted_data_files_count | partitions |
-+----------------------------------------------------------------------+--------+-------------------+---------------------+------------------------+---------------------------+--------------------------+---------------------------------+
-| s3://.../table/metadata/45b5290b-ee61-4788-b324-b1e2735c0e10-m0.avro | 4479 | 0 | 6668963634911763636 | 8 | 0 | 0 | [[false,2019-05-13,2019-05-15]] |
-+----------------------------------------------------------------------+--------+-------------------+---------------------+------------------------+---------------------------+--------------------------+---------------------------------+
++----------------------------------------------------------------------+--------+-------------------+---------------------+------------------------+---------------------------+--------------------------+--------------------------------------+
+| path | length | partition_spec_id | added_snapshot_id | added_data_files_count | existing_data_files_count | deleted_data_files_count | partition_summaries |
++----------------------------------------------------------------------+--------+-------------------+---------------------+------------------------+---------------------------+--------------------------+--------------------------------------+
+| s3://.../table/metadata/45b5290b-ee61-4788-b324-b1e2735c0e10-m0.avro | 4479 | 0 | 6668963634911763636 | 8 | 0 | 0 | [[false,null,2019-05-13,2019-05-15]] |
++----------------------------------------------------------------------+--------+-------------------+---------------------+------------------------+---------------------------+--------------------------+--------------------------------------+
```
+Note:
+1. Fields within `partition_summaries` column of the manifests table correspond to `field_summary` structs within [manifest list](./spec.md#manifest-lists), with the following order:
+ - `contains_null`
+ - `contains_nan`
+ - `lower_bound`
+ - `upper_bound`
+2. `contains_nan` could return null, which indicates that this information is not available from files' metadata.
+ This usually occurs when reading from V1 table, where `contains_nan` is not populated.
+
## Inspecting with DataFrames
Metadata tables can be loaded in Spark 2.4 or Spark 3 using the DataFrameReader API:
diff --git a/site/docs/spec.md b/site/docs/spec.md
index 6bcfd37..fa16261 100644
--- a/site/docs/spec.md
+++ b/site/docs/spec.md
@@ -431,6 +431,7 @@ Manifest list files store `manifest_file`, a struct with the following fields:
| v1 | v2 | Field id, name | Type | Description |
| ---------- | ---------- |-------------------------|---------------|-------------|
| _required_ | _required_ | **`509 contains_null`** | `boolean` | Whether the manifest contains at least one partition with a null value for the field |
+| _optional_ | _required_ | **`518 contains_nan`** | `boolean` | Whether the manifest contains at least one partition with a NaN value for the field |
| _optional_ | _optional_ | **`510 lower_bound`** | `bytes` [1] | Lower bound for the non-null, non-NaN values in the partition field, or null if all values are null or NaN [2] |
| _optional_ | _optional_ | **`511 upper_bound`** | `bytes` [1] | Upper bound for the non-null, non-NaN values in the partition field, or null if all values are null or NaN [2] |
@@ -952,7 +953,7 @@ Writing v2 metadata:
* Table metadata now requires field `default-spec-id`.
* Table metadata now requires field `last-partition-id`.
* Table metadata field `partition-spec` is no longer required and may be omitted.
-* Snapshot added required field field `sequence-number`.
+* Snapshot added required field `sequence-number`.
* Snapshot now requires field `manifest-list`.
* Snapshot field `manifests` is no longer allowed.
* Table metadata now requires field `sort-orders`.