Posted to commits@iceberg.apache.org by yy...@apache.org on 2021/04/28 21:33:15 UTC

[iceberg] branch master updated: Core: fix NPE in manifests table for contains_nan column, update spec (#2521)

This is an automated email from the ASF dual-hosted git repository.

yyan pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/iceberg.git


The following commit(s) were added to refs/heads/master by this push:
     new e8ca53b  Core: fix NPE in manifests table for contains_nan column, update spec (#2521)
e8ca53b is described below

commit e8ca53b9794245fc89ce87cbf6630679c1301298
Author: yyanyy <ya...@amazon.com>
AuthorDate: Wed Apr 28 14:32:59 2021 -0700

    Core: fix NPE in manifests table for contains_nan column, update spec (#2521)
---
 .../java/org/apache/iceberg/ManifestsTable.java    |  2 +-
 site/docs/spark-queries.md                         | 33 ++++++++++++++--------
 site/docs/spec.md                                  |  3 +-
 3 files changed, 24 insertions(+), 14 deletions(-)

diff --git a/core/src/main/java/org/apache/iceberg/ManifestsTable.java b/core/src/main/java/org/apache/iceberg/ManifestsTable.java
index e7b9222..c75daf6 100644
--- a/core/src/main/java/org/apache/iceberg/ManifestsTable.java
+++ b/core/src/main/java/org/apache/iceberg/ManifestsTable.java
@@ -38,7 +38,7 @@ public class ManifestsTable extends BaseMetadataTable {
       Types.NestedField.required(7, "deleted_data_files_count", Types.IntegerType.get()),
       Types.NestedField.required(8, "partition_summaries", Types.ListType.ofRequired(9, Types.StructType.of(
           Types.NestedField.required(10, "contains_null", Types.BooleanType.get()),
-          Types.NestedField.required(11, "contains_nan", Types.BooleanType.get()),
+          Types.NestedField.optional(11, "contains_nan", Types.BooleanType.get()),
           Types.NestedField.optional(12, "lower_bound", Types.StringType.get()),
           Types.NestedField.optional(13, "upper_bound", Types.StringType.get())
       )))
diff --git a/site/docs/spark-queries.md b/site/docs/spark-queries.md
index f7a78b5..687606f 100644
--- a/site/docs/spark-queries.md
+++ b/site/docs/spark-queries.md
@@ -210,13 +210,13 @@ To show a table's data files and each file's metadata, run:
 SELECT * FROM prod.db.table.files
 ```
 ```text
-+-------------------------------------------------------------------------+-------------+--------------+--------------------+--------------------+------------------+-------------------+-----------------+-----------------+--------------+---------------+
-| file_path                                                               | file_format | record_count | file_size_in_bytes | column_sizes       | value_counts     | null_value_counts | lower_bounds    | upper_bounds    | key_metadata | split_offsets |
-+-------------------------------------------------------------------------+-------------+--------------+--------------------+--------------------+------------------+-------------------+-----------------+-----------------+--------------+---------------+
-| s3:/.../table/data/00000-3-8d6d60e8-d427-4809-bcf0-f5d45a4aad96.parquet | PARQUET     | 1            | 597                | [1 -> 90, 2 -> 62] | [1 -> 1, 2 -> 1] | [1 -> 0, 2 -> 0]  | [1 -> , 2 -> c] | [1 -> , 2 -> c] | null         | [4]           |
-| s3:/.../table/data/00001-4-8d6d60e8-d427-4809-bcf0-f5d45a4aad96.parquet | PARQUET     | 1            | 597                | [1 -> 90, 2 -> 62] | [1 -> 1, 2 -> 1] | [1 -> 0, 2 -> 0]  | [1 -> , 2 -> b] | [1 -> , 2 -> b] | null         | [4]           |
-| s3:/.../table/data/00002-5-8d6d60e8-d427-4809-bcf0-f5d45a4aad96.parquet | PARQUET     | 1            | 597                | [1 -> 90, 2 -> 62] | [1 -> 1, 2 -> 1] | [1 -> 0, 2 -> 0]  | [1 -> , 2 -> a] | [1 -> , 2 -> a] | null         | [4]           |
-+-------------------------------------------------------------------------+-------------+--------------+--------------------+--------------------+------------------+-------------------+-----------------+-----------------+--------------+---------------+
++-------------------------------------------------------------------------+-------------+--------------+--------------------+--------------------+------------------+-------------------+------------------+-----------------+-----------------+--------------+---------------+
+| file_path                                                               | file_format | record_count | file_size_in_bytes | column_sizes       | value_counts     | null_value_counts | nan_value_counts | lower_bounds    | upper_bounds    | key_metadata | split_offsets |
++-------------------------------------------------------------------------+-------------+--------------+--------------------+--------------------+------------------+-------------------+------------------+-----------------+-----------------+--------------+---------------+
+| s3:/.../table/data/00000-3-8d6d60e8-d427-4809-bcf0-f5d45a4aad96.parquet | PARQUET     | 1            | 597                | [1 -> 90, 2 -> 62] | [1 -> 1, 2 -> 1] | [1 -> 0, 2 -> 0]  | []               | [1 -> , 2 -> c] | [1 -> , 2 -> c] | null         | [4]           |
+| s3:/.../table/data/00001-4-8d6d60e8-d427-4809-bcf0-f5d45a4aad96.parquet | PARQUET     | 1            | 597                | [1 -> 90, 2 -> 62] | [1 -> 1, 2 -> 1] | [1 -> 0, 2 -> 0]  | []               | [1 -> , 2 -> b] | [1 -> , 2 -> b] | null         | [4]           |
+| s3:/.../table/data/00002-5-8d6d60e8-d427-4809-bcf0-f5d45a4aad96.parquet | PARQUET     | 1            | 597                | [1 -> 90, 2 -> 62] | [1 -> 1, 2 -> 1] | [1 -> 0, 2 -> 0]  | []               | [1 -> , 2 -> a] | [1 -> , 2 -> a] | null         | [4]           |
++-------------------------------------------------------------------------+-------------+--------------+--------------------+--------------------+------------------+-------------------+------------------+-----------------+-----------------+--------------+---------------+
 ```
 
 ### Manifests
@@ -227,13 +227,22 @@ To show a table's file manifests and each file's metadata, run:
 SELECT * FROM prod.db.table.manifests
 ```
 ```text
-+----------------------------------------------------------------------+--------+-------------------+---------------------+------------------------+---------------------------+--------------------------+---------------------------------+
-| path                                                                 | length | partition_spec_id | added_snapshot_id   | added_data_files_count | existing_data_files_count | deleted_data_files_count | partitions                      |
-+----------------------------------------------------------------------+--------+-------------------+---------------------+------------------------+---------------------------+--------------------------+---------------------------------+
-| s3://.../table/metadata/45b5290b-ee61-4788-b324-b1e2735c0e10-m0.avro | 4479   | 0                 | 6668963634911763636 | 8                      | 0                         | 0                        | [[false,2019-05-13,2019-05-15]] |
-+----------------------------------------------------------------------+--------+-------------------+---------------------+------------------------+---------------------------+--------------------------+---------------------------------+
++----------------------------------------------------------------------+--------+-------------------+---------------------+------------------------+---------------------------+--------------------------+--------------------------------------+
+| path                                                                 | length | partition_spec_id | added_snapshot_id   | added_data_files_count | existing_data_files_count | deleted_data_files_count | partition_summaries                  |
++----------------------------------------------------------------------+--------+-------------------+---------------------+------------------------+---------------------------+--------------------------+--------------------------------------+
+| s3://.../table/metadata/45b5290b-ee61-4788-b324-b1e2735c0e10-m0.avro | 4479   | 0                 | 6668963634911763636 | 8                      | 0                         | 0                        | [[false,null,2019-05-13,2019-05-15]] |
++----------------------------------------------------------------------+--------+-------------------+---------------------+------------------------+---------------------------+--------------------------+--------------------------------------+
 ```
 
+Note: 
+1. Fields within the `partition_summaries` column of the manifests table correspond to `field_summary` structs within the [manifest list](./spec.md#manifest-lists), in the following order:
+   - `contains_null`
+   - `contains_nan`
+   - `lower_bound`
+   - `upper_bound`
+2. `contains_nan` may be null, which indicates that this information is not available from the files' metadata.
+   This usually occurs when reading from a V1 table, where `contains_nan` is not populated.
+
 ## Inspecting with DataFrames
 
 Metadata tables can be loaded in Spark 2.4 or Spark 3 using the DataFrameReader API:
diff --git a/site/docs/spec.md b/site/docs/spec.md
index 6bcfd37..fa16261 100644
--- a/site/docs/spec.md
+++ b/site/docs/spec.md
@@ -431,6 +431,7 @@ Manifest list files store `manifest_file`, a struct with the following fields:
 | v1         | v2         | Field id, name          | Type          | Description |
 | ---------- | ---------- |-------------------------|---------------|-------------|
 | _required_ | _required_ | **`509 contains_null`** | `boolean`     | Whether the manifest contains at least one partition with a null value for the field |
+| _optional_ | _required_ | **`518 contains_nan`**  | `boolean`     | Whether the manifest contains at least one partition with a NaN value for the field |
 | _optional_ | _optional_ | **`510 lower_bound`**   | `bytes`   [1] | Lower bound for the non-null, non-NaN values in the partition field, or null if all values are null or NaN [2] |
 | _optional_ | _optional_ | **`511 upper_bound`**   | `bytes`   [1] | Upper bound for the non-null, non-NaN values in the partition field, or null if all values are null or NaN [2] |
 
@@ -952,7 +953,7 @@ Writing v2 metadata:
 * Table metadata now requires field `default-spec-id`.
 * Table metadata now requires field `last-partition-id`.
 * Table metadata field `partition-spec` is no longer required and may be omitted.
-* Snapshot added required field field `sequence-number`.
+* Snapshot added required field `sequence-number`.
 * Snapshot now requires field `manifest-list`.
 * Snapshot field `manifests` is no longer allowed.
 * Table metadata now requires field `sort-orders`.
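
The following is a minimal Java sketch of how a consumer of the manifests table might handle the now-optional `contains_nan` value. The class `ContainsNanExample` and the helper `describeContainsNan` are hypothetical names for illustration and are not part of Iceberg. The only behavior taken from the commit is that a null `contains_nan` means the information is unavailable (typically when reading a V1 table), so the value is kept as a nullable `Boolean` and checked before use instead of being unboxed into a primitive `boolean`.

```java
import java.util.Arrays;
import java.util.List;

public class ContainsNanExample {

    // Hypothetical helper: maps the three possible states of contains_nan
    // (true, false, or null when the metadata does not carry it) to a label.
    static String describeContainsNan(Boolean containsNan) {
        if (containsNan == null) {
            // Null means "not available", e.g. a V1 table; it is not the same as false.
            return "unknown";
        }
        return containsNan ? "has NaN" : "no NaN";
    }

    public static void main(String[] args) {
        // Sample values standing in for contains_nan read from partition_summaries.
        List<Boolean> samples = Arrays.asList(Boolean.TRUE, Boolean.FALSE, null);
        for (Boolean value : samples) {
            System.out.println(value + " -> " + describeContainsNan(value));
        }
    }
}
```

Treating null as a third "unknown" state, not as false, avoids wrongly concluding that a partition has no NaN values when the metadata simply does not say.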