Posted to issues@iceberg.apache.org by "ajantha-bhat (via GitHub)" <gi...@apache.org> on 2023/06/06 12:16:22 UTC
[GitHub] [iceberg] ajantha-bhat commented on a diff in pull request #7105: Spec: Add partition stats spec

ajantha-bhat commented on code in PR #7105:
URL: https://github.com/apache/iceberg/pull/7105#discussion_r1219531635


##########
format/spec.md:
##########
@@ -702,6 +703,44 @@ Blob metadata is a struct with the following fields:
 | _optional_ | _optional_ | **`properties`** | `map<string, string>` | Additional properties associated with the statistic. Subset of Blob properties in the Puffin file. |
 
 
+#### Partition statistics
+
+Partition statistics files are the valid files based on [Partition Statistics file format](#partition-statistics-file-format). Partition statistics are informational. A reader can choose to
+ignore partition statistics information. Partition statistics support is not required to read the table correctly. A table can contain
+many partition statistics files associated with different table snapshots.
+A writer can optionally write the partition statistics file during each write operation. If the statistics file is written for the specific snapshot, 
+It must be registered in the table metadata file to be considered as a valid statistics file for the reader.
+
+Partition statistics files metadata within `partition-statistics` table metadata field is a struct with the following fields:
+
+| v1 | v2 | Field name | Type | Description |
+|----|----|------------|------|-------------|
+| _required_ | _required_ | **`snapshot-id`** | `long` | ID of the Iceberg table's snapshot the partition statistics file is associated with. |
+| _required_ | _required_ | **`statistics-file-path`** | `string` | Path of the partition statistics file. See [Partition Statistics file format](#partition-statistics-file-format). |
+| _required_ | _required_ | **`max-data-sequence-number`** | `long` | Maximum data sequence number of the Iceberg table's snapshot the partition statistics was computed from. |
+
+#### Partition Statistics file format
+
+This is a specification for partition statistics files. It is designed to store statistics information
+for every partition value as a row in the **table default format** sorted based on the first partition column from `partition`.
+
+Partition statistics file store the statistics as a struct with the following fields:
+
+| v1 | v2 | Field id, name | Type | Description |
+|----|----|----------------|------|-------------|
+| _required_ | _required_ | **`1 partition`** | `struct<..>` | See [PartitionData](https://github.com/apache/iceberg/blob/master/core/src/main/java/org/apache/iceberg/PartitionData.java) |

Review Comment:
   > The schema of PartitionData is based on specific partition spec
   
   It includes all the fields from all the specs. 
   https://github.com/apache/iceberg/blob/master/core/src/main/java/org/apache/iceberg/PartitionsTable.java#L50
   
   More info on the schema:
   https://github.com/apache/iceberg/blob/cfa090531e955911e792e24f3d14103c69a63c63/core/src/main/java/org/apache/iceberg/Partitioning.java#L245
   
   > And we don't need a spec_id field
   
   In the case of partition evolution, there can be partitions based on both the old and the new spec. In that case, a spec_id field helps identify which spec each partition's stats belong to.
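
To make the point concrete, here is a minimal sketch in plain Python. It is an illustration only, not the Iceberg implementation: `PartitionField`, `unified_fields`, `to_stats_row`, and the field names `ts_day` and `id_bucket` are all hypothetical, and the merge rule (union of fields keyed by field id) is a simplification of what `Partitioning.java` does.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PartitionField:
    """Hypothetical stand-in for a partition field: an id plus a name."""
    field_id: int
    name: str


def unified_fields(specs):
    """Union of the partition fields of all specs, keyed and ordered by field id.

    Simplifying assumption: a field id means the same thing in every spec.
    """
    merged = {}
    for spec_id in sorted(specs):
        for field in specs[spec_id]:
            merged.setdefault(field.field_id, field)
    return [merged[fid] for fid in sorted(merged)]


def to_stats_row(spec_id, values, unified):
    """Project one partition's values onto the unified struct.

    Fields that the partition's own spec does not have are filled with None.
    """
    partition = {f.name: values.get(f.name) for f in unified}
    return {"spec_id": spec_id, "partition": partition}


# Spec 0 partitions by day only; spec 1 (after evolution) adds a bucket field.
specs = {
    0: [PartitionField(1000, "ts_day")],
    1: [PartitionField(1000, "ts_day"), PartitionField(1001, "id_bucket")],
}
unified = unified_fields(specs)

# A partition written under the old spec has no bucket field at all.
old_row = to_stats_row(0, {"ts_day": "2023-06-01"}, unified)
# A partition written under the new spec whose bucket value happens to be null.
new_row = to_stats_row(1, {"ts_day": "2023-06-01", "id_bucket": None}, unified)
```

In this sketch `old_row` and `new_row` end up with identical `partition` structs, so `spec_id` is the only thing that tells them apart.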
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org
