Posted to issues@iceberg.apache.org by "ajantha-bhat (via GitHub)" <gi...@apache.org> on 2023/03/14 16:10:16 UTC

[GitHub] [iceberg] ajantha-bhat opened a new pull request, #7105: Docs: Add partition stats spec

ajantha-bhat opened a new pull request, #7105:
URL: https://github.com/apache/iceberg/pull/7105

   Adding the spec changes for partition stats. 
   
   Based on the discussed design:
   https://docs.google.com/document/d/1vaufuD47kMijz97LxM67X8OX-W2Wq7nmlz3jRo8J5Qk/edit?usp=sharing


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] ajantha-bhat commented on pull request #7105: Spec: Add partition stats spec

Posted by "ajantha-bhat (via GitHub)" <gi...@apache.org>.
ajantha-bhat commented on PR #7105:
URL: https://github.com/apache/iceberg/pull/7105#issuecomment-1471431181

   >  I am still yet to perform that test to check if 1 file is sufficient. I also worry about the cost of incremental updates if we have to replace the entire file.
   
   Yeah. Looking forward to the test results. 
   
   We may end up supporting both copy-on-write (a single output file) and merge-on-read (multiple output files), since it boils down to whether the user wants faster reads or faster writes. Initially, we may support only a single output file. But yes, let us discuss further and conclude based on the test results.
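
The trade-off described above can be pictured with a minimal sketch. It is not the PR's implementation: each stats "file" is modeled as a plain dict from a partition key to a data record count, and the function names `copy_on_write`, `merge_on_read`, and `read_stats` are hypothetical.

```python
from collections import Counter

# Assumption: a partition stats "file" is modeled as a dict that maps a
# partition key to its data record count. Real files carry more columns.

def copy_on_write(existing, delta):
    """Rewrite everything into one merged file: costlier write, cheap read."""
    merged = Counter(existing)
    merged.update(delta)  # adds the delta counts to the existing counts
    return [dict(merged)]  # always a single output file

def merge_on_read(files, delta):
    """Append only the delta as a new file: cheap write, merge deferred to readers."""
    return files + [dict(delta)]

def read_stats(files):
    """Combine every file into one view; trivial when there is only one file."""
    total = Counter()
    for stats_file in files:
        total.update(stats_file)
    return dict(total)
```

Both strategies should yield the same merged view through `read_stats`; they differ only in how many files a writer replaces and how many a reader has to open.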




[GitHub] [iceberg] rdblue commented on a diff in pull request #7105: Spec: Add partition stats spec

Posted by "rdblue (via GitHub)" <gi...@apache.org>.
rdblue commented on code in PR #7105:
URL: https://github.com/apache/iceberg/pull/7105#discussion_r1275215782


##########
format/spec.md:
##########
@@ -702,6 +703,44 @@ Blob metadata is a struct with the following fields:
 | _optional_ | _optional_ | **`properties`** | `map<string, string>` | Additional properties associated with the statistic. Subset of Blob properties in the Puffin file. |
 
 
+#### Partition statistics
+
+Partition statistics files are based on [Partition Statistics file spec](#partition-statistics-file). Partition statistics are informational. A reader can choose to
+ignore partition statistics information. Partition statistics support is not required to read the table correctly. A table can contain
+many partition statistics files associated with different table snapshots.
+A writer can optionally write the partition statistics file during each write operation. If the statistics file is written for the specific snapshot, 
+It must be registered in the table metadata file to be considered as a valid statistics file for the reader.
+
+Partition statistics files metadata within `partition-statistics` table metadata field is a struct with the following fields:
+
+| v1 | v2 | Field name | Type | Description |
+|----|----|------------|------|-------------|
+| _required_ | _required_ | **`snapshot-id`** | `long` | ID of the Iceberg table's snapshot the partition statistics file is associated with. |
+| _required_ | _required_ | **`statistics-file-path`** | `string` | Path of the partition statistics file. See [Partition Statistics file](#partition-statistics-file). |
+| _required_ | _required_ | **`max-data-sequence-number`** | `long` | Maximum data sequence number of the Iceberg table's snapshot the partition statistics was computed from. |
+
+#### Partition Statistics file
+
+Statistics information for every partition value is stored as a row in the **table default format** 
+sorted based on the first partition column from `partition`.
+
+Partition statistics file store the statistics as a struct with the following fields:
+
+| v1 | v2 | Field id, name | Type | Description |
+|----|----|----------------|------|-------------|
+| _required_ | _required_ | **`1 partition`** | `struct<..>` | See [PartitionData](https://github.com/apache/iceberg/blob/master/core/src/main/java/org/apache/iceberg/PartitionData.java) |
+| _required_ | _required_ | **`2 spec_id`** | `int` | Partition spec id |
+| _required_ | _required_ | **`3 data_record_count`** | `long` | Count of records in data files |
+| _required_ | _required_ | **`4 data_file_count`** | `int` | Count of data files |
+| _required_ | _required_ | **`5 data_file_size_in_bytes`** | `long` | Total size of data files in bytes |
+| _optional_ | _optional_ | **`6 position_delete_record_count`** | `long` | Count of records in position delete files |
+| _optional_ | _optional_ | **`7 position_delete_file_count`** | `int` | Count of position delete files |
+| _optional_ | _optional_ | **`8 equality_delete_record_count`** | `long` | Count of records in equality delete files |
+| _optional_ | _optional_ | **`9 equality_delete_file_count`** | `int` | Count of equality delete files |
+
+Each snapshot may create a new partition statistics file. The name of the partition statistics file is as follows:       
+`partition-stats-${snapshotId}.${tableDefaultFormat}`

Review Comment:
   What is the value of having a format here? Are there other places in the spec where we require specific names? I don't think we generally put requirements on these.





[GitHub] [iceberg] ajantha-bhat commented on a diff in pull request #7105: Spec: Add partition stats spec

Posted by "ajantha-bhat (via GitHub)" <gi...@apache.org>.
ajantha-bhat commented on code in PR #7105:
URL: https://github.com/apache/iceberg/pull/7105#discussion_r1238084717


##########
format/spec.md:
##########
@@ -702,6 +703,48 @@ Blob metadata is a struct with the following fields:
 | _optional_ | _optional_ | **`properties`** | `map<string, string>` | Additional properties associated with the statistic. Subset of Blob properties in the Puffin file. |
 
 
+#### Partition statistics
+
+Partition statistics files are based on [Partition Statistics file spec](#partition-statistics-file). Partition statistics are informational. A reader can choose to
+ignore partition statistics information. Partition statistics support is not required to read the table correctly. A table can contain
+many partition statistics files associated with different table snapshots.
+A writer can optionally write the partition statistics file during each write operation. If the statistics file is written for the specific snapshot, 
+It must be registered in the table metadata file to be considered as a valid statistics file for the reader.
+
+Partition statistics files metadata within `partition-statistics` table metadata field is a struct with the following fields:
+
+| v1 | v2 | Field name | Type | Description |
+|----|----|------------|------|-------------|
+| _required_ | _required_ | **`snapshot-id`** | `long` | ID of the Iceberg table's snapshot the partition statistics file is associated with. |
+| _required_ | _required_ | **`statistics-file-path`** | `string` | Path of the partition statistics file. See [Partition Statistics file](#partition-statistics-file). |
+| _required_ | _required_ | **`max-data-sequence-number`** | `long` | Maximum data sequence number of the Iceberg table's snapshot the partition statistics was computed from. |
+
+#### Partition Statistics file
+
+Statistics information for every partition tuple is stored as a row in the **table default format**. 

Review Comment:
   Maybe for the initial implementation, it should be fine to use the table's default format itself.
   
   It can be enhanced later with a table property if the need arises.





[GitHub] [iceberg] RussellSpitzer commented on a diff in pull request #7105: Spec: Add partition stats spec

Posted by "RussellSpitzer (via GitHub)" <gi...@apache.org>.
RussellSpitzer commented on code in PR #7105:
URL: https://github.com/apache/iceberg/pull/7105#discussion_r1147965833


##########
format/partition-statistics-spec.md:
##########
@@ -0,0 +1,43 @@
+---
+title: "Partition Statistics Spec"
+url: partition-statistics-spec
+toc: true
+disableSidebar: true
+---
+<!--
+ - Licensed to the Apache Software Foundation (ASF) under one or more
+ - contributor license agreements.  See the NOTICE file distributed with
+ - this work for additional information regarding copyright ownership.
+ - The ASF licenses this file to You under the Apache License, Version 2.0
+ - (the "License"); you may not use this file except in compliance with
+ - the License.  You may obtain a copy of the License at
+ -
+ -   http://www.apache.org/licenses/LICENSE-2.0
+ -
+ - Unless required by applicable law or agreed to in writing, software
+ - distributed under the License is distributed on an "AS IS" BASIS,
+ - WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ - See the License for the specific language governing permissions and
+ - limitations under the License.
+ -->
+
+# Partition Statistics file format
+
+This is a specification for partition statistics files. It is designed to store statistics information 
+for every partition value as a row in the **table default format** sorted based on the `partition`.
+
+The schema of the file is as follows:
+
+| Field Id | Field Name              | Field Type   | isOptional | Doc                                                                                                                         |
+|----------|-------------------------|--------------|------------|-----------------------------------------------------------------------------------------------------------------------------|
+| 1        | partition               | StructType   | false      | See [PartitionData](https://github.com/apache/iceberg/blob/master/core/src/main/java/org/apache/iceberg/PartitionData.java) |                                                                                  
+| 2        | spec_id                 | IntegerType  | false      | Partition spec id                                                                                                           |
+| 3        | data_record_count       | LongType     | false      | Count of records in data files                                                                                              |                                               
+| 4        | data_file_count         | IntegerType  | false      | Count of data files                                                                                                         |
+| 5        | pos_delete_record_count | LongType     | true       | Count of records in position delete files                                                                                   |                                                          
+| 6        | pos_delete_file_count   | IntegerType  | true       | Count of position delete files                                                                                              |                                   
+| 7        | eq_delete_record_count  | LongType     | true       | Count of records in equality delete files                                                                                   |                                  
+| 8        | eq_delete_file_count    | IntegerType  | true       | Count of equality delete files                                                                                              |                                

Review Comment:
   I'm wondering if we should also have a "last_modified" timestamp here. I think this comes up occasionally.





[GitHub] [iceberg] flyrain commented on a diff in pull request #7105: Docs: Add partition stats spec

Posted by "flyrain (via GitHub)" <gi...@apache.org>.
flyrain commented on code in PR #7105:
URL: https://github.com/apache/iceberg/pull/7105#discussion_r1135890688


##########
format/partition-statistics-spec.md:
##########
@@ -0,0 +1,43 @@
+---
+title: "Partition Statistics Spec"
+url: partition-statistics-spec
+toc: true
+disableSidebar: true
+---
+<!--
+ - Licensed to the Apache Software Foundation (ASF) under one or more
+ - contributor license agreements.  See the NOTICE file distributed with
+ - this work for additional information regarding copyright ownership.
+ - The ASF licenses this file to You under the Apache License, Version 2.0
+ - (the "License"); you may not use this file except in compliance with
+ - the License.  You may obtain a copy of the License at
+ -
+ -   http://www.apache.org/licenses/LICENSE-2.0
+ -
+ - Unless required by applicable law or agreed to in writing, software
+ - distributed under the License is distributed on an "AS IS" BASIS,
+ - WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ - See the License for the specific language governing permissions and
+ - limitations under the License.
+ -->
+
+# Partition Statistics file format
+
+This is a specification for partition statistics files. It is designed to store statistics information 
+for every partition value as a row in the **table default format** sorted based on the `partition`.
+
+The schema of the file is as follows:
+
+| Field Id | Field Name              | Field Type   | isOptional | Doc                                                                                                                         |
+|----------|-------------------------|--------------|------------|-----------------------------------------------------------------------------------------------------------------------------|
+| 1        | partition               | StructType   | false      | See [PartitionData](https://github.com/apache/iceberg/blob/master/core/src/main/java/org/apache/iceberg/PartitionData.java) |                                                                                  
+| 2        | spec_id                 | IntegerType  | false      | partition spec id                                                                                                           |
+| 3        | data_record_count       | LongType     | false      | Count of records in data files                                                                                              |                                               
+| 4        | data_file_count         | IntegerType  | false      | Count of data files                                                                                                         |
+| 5        | pos_delete_record_count | LongType     | false      | Count of records in position delete files                                                                                   |                                                          
+| 6        | pos_delete_file_count   | IntegerType  | false      | Count of position delete files                                                                                              |                                   
+| 7        | eq_delete_record_count  | LongType     | false      | Count of records in equality delete files                                                                                   |                                  
+| 8        | eq_delete_file_count    | IntegerType  | false      | Count of equality delete files                                                                                              |                                

Review Comment:
   Should these be optional since v1 tables do NOT have them?
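
To make the question concrete, here is a rough sketch of a row whose four delete counters are optional. The field names follow the schema table in the hunk above; the `PartitionStatsRow` class itself, and modeling `partition` as a tuple, are illustrative assumptions rather than anything defined by the PR.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass(frozen=True)
class PartitionStatsRow:
    # Required for every table version.
    partition: Tuple          # assumption: partition tuple modeled as a Python tuple
    spec_id: int
    data_record_count: int
    data_file_count: int
    # Optional: a v1 table has no delete files, so these can stay unset.
    pos_delete_record_count: Optional[int] = None
    pos_delete_file_count: Optional[int] = None
    eq_delete_record_count: Optional[int] = None
    eq_delete_file_count: Optional[int] = None

# A v1-style row leaves the delete counters unset.
v1_row = PartitionStatsRow(partition=("2023-03-14",), spec_id=0,
                           data_record_count=100, data_file_count=2)
```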





[GitHub] [iceberg] sgcowell commented on a diff in pull request #7105: Spec: Add partition stats spec

Posted by "sgcowell (via GitHub)" <gi...@apache.org>.
sgcowell commented on code in PR #7105:
URL: https://github.com/apache/iceberg/pull/7105#discussion_r1316573076


##########
format/partition-statistics-spec.md:
##########
@@ -0,0 +1,43 @@
+---
+title: "Partition Statistics Spec"
+url: partition-statistics-spec
+toc: true
+disableSidebar: true
+---
+<!--
+ - Licensed to the Apache Software Foundation (ASF) under one or more
+ - contributor license agreements.  See the NOTICE file distributed with
+ - this work for additional information regarding copyright ownership.
+ - The ASF licenses this file to You under the Apache License, Version 2.0
+ - (the "License"); you may not use this file except in compliance with
+ - the License.  You may obtain a copy of the License at
+ -
+ -   http://www.apache.org/licenses/LICENSE-2.0
+ -
+ - Unless required by applicable law or agreed to in writing, software
+ - distributed under the License is distributed on an "AS IS" BASIS,
+ - WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ - See the License for the specific language governing permissions and
+ - limitations under the License.
+ -->
+
+# Partition Statistics file format
+
+This is a specification for partition statistics files. It is designed to store statistics information 
+for every partition value as a row in the **table default format** sorted based on the `partition`.
+
+The schema of the file is as follows:
+
+| Field Id | Field Name                   | Field Type   | isOptional | Doc                                                                                                                         |
+|----------|------------------------------|--------------|------------|-----------------------------------------------------------------------------------------------------------------------------|
+| 1        | partition                    | StructType   | false      | See [PartitionData](https://github.com/apache/iceberg/blob/master/core/src/main/java/org/apache/iceberg/PartitionData.java) |                                                                                  
+| 2        | spec_id                      | IntegerType  | false      | Partition spec id                                                                                                           |
+| 3        | data_record_count            | LongType     | false      | Count of records in data files                                                                                              |                                               
+| 4        | data_file_count              | IntegerType  | false      | Count of data files                                                                                                         |
+| 5        | position_delete_record_count | LongType     | true       | Count of records in position delete files                                                                                   |                                                          
+| 6        | position_delete_file_count   | IntegerType  | true       | Count of position delete files                                                                                              |                                   
+| 7        | equality_delete_record_count | LongType     | true       | Count of records in equality delete files                                                                                   |                                  
+| 8        | equality_delete_file_count   | IntegerType  | true       | Count of equality delete files                                                                                              |                                

Review Comment:
   I think it would be reasonable to make this an actual record count for the partition even with the presence of row-level deletes.   Writers that produce positional deletes would be able to produce this count directly during writes.  Equality delete writers like Flink wouldn't be able to produce the actual count, in which case they shouldn't write any partition stats for the snapshot.  Either an async job as suggested, or regularly scheduled compaction jobs, could produce the partition stats in that case.  If partition stats are not present, readers can either choose to use partition stats from earlier snapshots - e.g. from the last compaction - or just fall back to estimates based on other sources (full table rowcount, etc.)
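
The reader-side fallback described above could look roughly like the following. This is only a sketch under stated assumptions: `parent_of` and `stats_by_snapshot` are hypothetical in-memory maps standing in for the snapshot log and the registered partition statistics entries, and `pick_partition_stats` is not an Iceberg API.

```python
def pick_partition_stats(snapshot_id, parent_of, stats_by_snapshot):
    """Return (snapshot_id, stats_path) for the nearest ancestor that has stats.

    Walks from the given snapshot back through its parents. Returns None when
    no ancestor has registered stats, in which case the caller would fall back
    to estimates from other sources.
    """
    current = snapshot_id
    seen = set()  # guards against a malformed parent chain that loops
    while current is not None and current not in seen:
        if current in stats_by_snapshot:
            return current, stats_by_snapshot[current]
        seen.add(current)
        current = parent_of.get(current)
    return None
```

A reader that takes stats from an older snapshot this way is accepting stale numbers, so it would presumably treat them as estimates rather than exact counts.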





[GitHub] [iceberg] ajantha-bhat commented on a diff in pull request #7105: Spec: Add partition stats spec

Posted by "ajantha-bhat (via GitHub)" <gi...@apache.org>.
ajantha-bhat commented on code in PR #7105:
URL: https://github.com/apache/iceberg/pull/7105#discussion_r1317408295


##########
format/spec.md:
##########
@@ -702,6 +703,41 @@ Blob metadata is a struct with the following fields:
 | _optional_ | _optional_ | **`properties`** | `map<string, string>` | Additional properties associated with the statistic. Subset of Blob properties in the Puffin file. |
 
 
+#### Partition statistics
+
+Partition statistics files are based on [Partition Statistics file spec](#partition-statistics-file). Partition statistics are informational. A reader can choose to
+ignore partition statistics information. Partition statistics support is not required to read the table correctly.
+Each table snapshot may be associated with at most one partition statistic file and the table can contain many partition statistics files associated with different table snapshots.
+A writer can optionally write the partition statistics file during each write operation. If the statistics file is written for the specific snapshot,
+it must be accurate and must be registered in the table metadata file to be considered as a valid statistics file for the reader.
+
+Partition statistics files metadata within `partition-statistics` table metadata field is a struct with the following fields:
+
+| v1 | v2 | Field name | Type | Description |
+|----|----|------------|------|-------------|
+| _required_ | _required_ | **`snapshot-id`** | `long` | ID of the Iceberg table's snapshot the partition statistics file is associated with. |
+| _required_ | _required_ | **`statistics-file-path`** | `string` | Path of the partition statistics file. See [Partition Statistics file](#partition-statistics-file). |
+| _required_ | _required_ | **`max-data-sequence-number`** | `long` | Maximum data sequence number of the Iceberg table's snapshot the partition statistics was computed from. |

Review Comment:
   I thought the sequence number is used as a quick alternative to the snapshot ID checkpoint when deciding whether to apply deletes.
   
   The same sequence number info is also stored for Puffin files, along with the snapshot ID: https://iceberg.apache.org/spec/#table-statistics
   
   But I guess your point is that we can fetch this info from the snapshot at any time, so why store it again?
   
   @rdblue suggested this. I think he can add more info.





[GitHub] [iceberg] RussellSpitzer commented on a diff in pull request #7105: Spec: Add partition stats spec

Posted by "RussellSpitzer (via GitHub)" <gi...@apache.org>.
RussellSpitzer commented on code in PR #7105:
URL: https://github.com/apache/iceberg/pull/7105#discussion_r1317696525


##########
format/partition-statistics-spec.md:
##########
@@ -0,0 +1,43 @@
+---
+title: "Partition Statistics Spec"
+url: partition-statistics-spec
+toc: true
+disableSidebar: true
+---
+<!--
+ - Licensed to the Apache Software Foundation (ASF) under one or more
+ - contributor license agreements.  See the NOTICE file distributed with
+ - this work for additional information regarding copyright ownership.
+ - The ASF licenses this file to You under the Apache License, Version 2.0
+ - (the "License"); you may not use this file except in compliance with
+ - the License.  You may obtain a copy of the License at
+ -
+ -   http://www.apache.org/licenses/LICENSE-2.0
+ -
+ - Unless required by applicable law or agreed to in writing, software
+ - distributed under the License is distributed on an "AS IS" BASIS,
+ - WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ - See the License for the specific language governing permissions and
+ - limitations under the License.
+ -->
+
+# Partition Statistics file format
+
+This is a specification for partition statistics files. It is designed to store statistics information 
+for every partition value as a row in the **table default format** sorted based on the `partition`.
+
+The schema of the file is as follows:
+
+| Field Id | Field Name                   | Field Type   | isOptional | Doc                                                                                                                         |
+|----------|------------------------------|--------------|------------|-----------------------------------------------------------------------------------------------------------------------------|
+| 1        | partition                    | StructType   | false      | See [PartitionData](https://github.com/apache/iceberg/blob/master/core/src/main/java/org/apache/iceberg/PartitionData.java) |                                                                                  
+| 2        | spec_id                      | IntegerType  | false      | Partition spec id                                                                                                           |
+| 3        | data_record_count            | LongType     | false      | Count of records in data files                                                                                              |                                               
+| 4        | data_file_count              | IntegerType  | false      | Count of data files                                                                                                         |
+| 5        | position_delete_record_count | LongType     | true       | Count of records in position delete files                                                                                   |                                                          
+| 6        | position_delete_file_count   | IntegerType  | true       | Count of position delete files                                                                                              |                                   
+| 7        | equality_delete_record_count | LongType     | true       | Count of records in equality delete files                                                                                   |                                  
+| 8        | equality_delete_file_count   | IntegerType  | true       | Count of equality delete files                                                                                              |                                

Review Comment:
   We would need to read all of the valid positional delete files to know the number of unique deletes for a data file.
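
A small sketch of why this is so: the same `(file_path, pos)` pair can appear in more than one position delete file, so summing delete record counts from metadata may overcount. The example models each delete file as a list of `(file_path, pos)` tuples; the function name and the sample file names are hypothetical.

```python
def unique_deleted_positions(delete_files, data_file_path):
    """Count distinct deleted row positions for one data file.

    Every delete file has to be scanned, because a position may be
    recorded in several of them.
    """
    positions = set()
    for delete_file in delete_files:
        for file_path, pos in delete_file:
            if file_path == data_file_path:
                positions.add(pos)
    return len(positions)

# Two delete files that both mark position 3 of a.parquet as deleted.
delete_files = [
    [("a.parquet", 0), ("a.parquet", 3)],
    [("a.parquet", 3), ("b.parquet", 1)],
]
# Summing delete records without deduplication counts position 3 twice.
raw_count = sum(1 for f in delete_files for path, _ in f if path == "a.parquet")
unique_count = unique_deleted_positions(delete_files, "a.parquet")
```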





[GitHub] [iceberg] sgcowell commented on a diff in pull request #7105: Spec: Add partition stats spec

Posted by "sgcowell (via GitHub)" <gi...@apache.org>.
sgcowell commented on code in PR #7105:
URL: https://github.com/apache/iceberg/pull/7105#discussion_r1317719144


##########
format/partition-statistics-spec.md:
##########
@@ -0,0 +1,43 @@
+---
+title: "Partition Statistics Spec"
+url: partition-statistics-spec
+toc: true
+disableSidebar: true
+---
+<!--
+ - Licensed to the Apache Software Foundation (ASF) under one or more
+ - contributor license agreements.  See the NOTICE file distributed with
+ - this work for additional information regarding copyright ownership.
+ - The ASF licenses this file to You under the Apache License, Version 2.0
+ - (the "License"); you may not use this file except in compliance with
+ - the License.  You may obtain a copy of the License at
+ -
+ -   http://www.apache.org/licenses/LICENSE-2.0
+ -
+ - Unless required by applicable law or agreed to in writing, software
+ - distributed under the License is distributed on an "AS IS" BASIS,
+ - WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ - See the License for the specific language governing permissions and
+ - limitations under the License.
+ -->
+
+# Partition Statistics file format
+
+This is a specification for partition statistics files. It is designed to store statistics information 
+for every partition value as a row in the **table default format** sorted based on the `partition`.
+
+The schema of the file is as follows:
+
+| Field Id | Field Name                   | Field Type   | isOptional | Doc                                                                                                                         |
+|----------|------------------------------|--------------|------------|-----------------------------------------------------------------------------------------------------------------------------|
+| 1        | partition                    | StructType   | false      | See [PartitionData](https://github.com/apache/iceberg/blob/master/core/src/main/java/org/apache/iceberg/PartitionData.java) |                                                                                  
+| 2        | spec_id                      | IntegerType  | false      | Partition spec id                                                                                                           |
+| 3        | data_record_count            | LongType     | false      | Count of records in data files                                                                                              |                                               
+| 4        | data_file_count              | IntegerType  | false      | Count of data files                                                                                                         |
+| 5        | position_delete_record_count | LongType     | true       | Count of records in position delete files                                                                                   |                                                          
+| 6        | position_delete_file_count   | IntegerType  | true       | Count of position delete files                                                                                              |                                   
+| 7        | equality_delete_record_count | LongType     | true       | Count of records in equality delete files                                                                                   |                                  
+| 8        | equality_delete_file_count   | IntegerType  | true       | Count of equality delete files                                                                                              |                                

Review Comment:
   @ajantha-bhat I think file count should remain just the data file count. Mixing the two reduces the value of the count, in my opinion. Knowing delete file counts isn't that useful in my experience: there aren't many choices the planner can make about how delete files are handled, even if there are a lot of them.





[GitHub] [iceberg] flyrain commented on a diff in pull request #7105: Spec: Add partition stats spec

Posted by "flyrain (via GitHub)" <gi...@apache.org>.
flyrain commented on code in PR #7105:
URL: https://github.com/apache/iceberg/pull/7105#discussion_r1332135411


##########
format/spec.md:
##########
@@ -702,6 +703,58 @@ Blob metadata is a struct with the following fields:
 | _optional_ | _optional_ | **`properties`** | `map<string, string>` | Additional properties associated with the statistic. Subset of Blob properties in the Puffin file. |
 
 
+#### Partition statistics
+
+Partition statistics files are based on [Partition Statistics file spec](#partition-statistics-file). 
+Partition statistics are not required for reading or planning and readers may ignore them.
+Each table snapshot may be associated with at most one partition statistic file.
+A writer can optionally write the partition statistics file during each write operation, and 
+it must be registered in the table metadata file to be considered as a valid statistics file for the reader.
+
+`partition-statistics` field of table metadata is an optional list of struct with the following fields:
+
+| v1 | v2 | Field name | Type | Description |
+|----|----|------------|------|-------------|
+| _required_ | _required_ | **`snapshot-id`** | `long` | ID of the Iceberg table's snapshot the partition statistics file is associated with. |
+| _required_ | _required_ | **`statistics-file-path`** | `string` | Path of the partition statistics file. See [Partition Statistics file](#partition-statistics-file). |
+| _required_ | _required_ | **`file-size-in-bytes`** | `long` | Size of the partition statistics file. |
+
+#### Partition Statistics file
+
+Statistics information for each unique partition tuple is stored as a row in the default data file format of the table (for example, Parquet or ORC).
+These rows must be sorted (in ascending manner with NULL FIRST) by `partition` field to optimize filtering rows while scanning.
+
+The schema of the partition statistics file is as follows:
+
+| v1 | v2 | Field id, name | Type | Description |
+|----|----|----------------|------|-------------|
+| _required_ | _required_ | **`1 partition`** | `struct<..>` | Partition data tuple, schema based on the unified partition type considering all specs in a table |
+| _required_ | _required_ | **`2 spec_id`** | `int` | Partition spec id |
+| _required_ | _required_ | **`3 data_record_count`** | `long` | Count of records in data files |
+| _required_ | _required_ | **`4 data_file_count`** | `int` | Count of data files |

Review Comment:
   Not a native speaker, so I searched around. It seems `file count` and `record count` are the right way to go.
   
   > The reason "file count" is the correct phrase is because it follows the standard rules of English grammar for compound nouns. When you have a compound noun made up of two nouns, like "file" and "count," the first noun (in this case, "file") acts as an adjective describing the second noun (in this case, "count").
   
   > So, "file count" means the count of files, or in other words, it specifies what kind of count you are referring to – a count of files. This is a common construction in English, where the first noun helps specify or describe the second noun, and it's the reason "file count" is used rather than "files count."





[GitHub] [iceberg] jbonofre commented on a diff in pull request #7105: Spec: Add partition stats spec

Posted by "jbonofre (via GitHub)" <gi...@apache.org>.
jbonofre commented on code in PR #7105:
URL: https://github.com/apache/iceberg/pull/7105#discussion_r1332043990


##########
format/spec.md:
##########
@@ -702,6 +703,58 @@ Blob metadata is a struct with the following fields:
 | _optional_ | _optional_ | **`properties`** | `map<string, string>` | Additional properties associated with the statistic. Subset of Blob properties in the Puffin file. |
 
 
+#### Partition statistics
+
+Partition statistics files are based on [Partition Statistics file spec](#partition-statistics-file). 
+Partition statistics are not required for reading or planning and readers may ignore them.
+Each table snapshot may be associated with at most one partition statistic file.
+A writer can optionally write the partition statistics file during each write operation, and 
+it must be registered in the table metadata file to be considered as a valid statistics file for the reader.
+
+`partition-statistics` field of table metadata is an optional list of struct with the following fields:
+
+| v1 | v2 | Field name | Type | Description |
+|----|----|------------|------|-------------|
+| _required_ | _required_ | **`snapshot-id`** | `long` | ID of the Iceberg table's snapshot the partition statistics file is associated with. |
+| _required_ | _required_ | **`statistics-file-path`** | `string` | Path of the partition statistics file. See [Partition Statistics file](#partition-statistics-file). |
+| _required_ | _required_ | **`file-size-in-bytes`** | `long` | Size of the partition statistics file. |
+
+#### Partition Statistics file
+
+Statistics information for each unique partition tuple is stored as a row in the default data file format of the table (for example, Parquet or ORC).
+These rows must be sorted (in ascending manner with NULL FIRST) by `partition` field to optimize filtering rows while scanning.
+
+The schema of the partition statistics file is as follows:
+
+| v1 | v2 | Field id, name | Type | Description |
+|----|----|----------------|------|-------------|
+| _required_ | _required_ | **`1 partition`** | `struct<..>` | Partition data tuple, schema based on the unified partition type considering all specs in a table |
+| _required_ | _required_ | **`2 spec_id`** | `int` | Partition spec id |

Review Comment:
   Personally, I think it's good this way: clearer.





[GitHub] [iceberg] rdblue commented on a diff in pull request #7105: Spec: Add partition stats spec

Posted by "rdblue (via GitHub)" <gi...@apache.org>.
rdblue commented on code in PR #7105:
URL: https://github.com/apache/iceberg/pull/7105#discussion_r1224846060


##########
format/spec.md:
##########
@@ -702,6 +703,44 @@ Blob metadata is a struct with the following fields:
 | _optional_ | _optional_ | **`properties`** | `map<string, string>` | Additional properties associated with the statistic. Subset of Blob properties in the Puffin file. |
 
 
+#### Partition statistics
+
+Partition statistics files are based on [Partition Statistics file spec](#partition-statistics-file). Partition statistics are informational. A reader can choose to
+ignore partition statistics information. Partition statistics support is not required to read the table correctly. A table can contain
+many partition statistics files associated with different table snapshots.
+A writer can optionally write the partition statistics file during each write operation. If the statistics file is written for the specific snapshot, 
+It must be registered in the table metadata file to be considered as a valid statistics file for the reader.
+
+Partition statistics files metadata within `partition-statistics` table metadata field is a struct with the following fields:
+
+| v1 | v2 | Field name | Type | Description |
+|----|----|------------|------|-------------|
+| _required_ | _required_ | **`snapshot-id`** | `long` | ID of the Iceberg table's snapshot the partition statistics file is associated with. |
+| _required_ | _required_ | **`statistics-file-path`** | `string` | Path of the partition statistics file. See [Partition Statistics file](#partition-statistics-file). |
+| _required_ | _required_ | **`max-data-sequence-number`** | `long` | Maximum data sequence number of the Iceberg table's snapshot the partition statistics was computed from. |
+
+#### Partition Statistics file
+
+Statistics information for every partition value is stored as a row in the **table default format** 
+sorted based on the first partition column from `partition`.
+
+Partition statistics file store the statistics as a struct with the following fields:
+
+| v1 | v2 | Field id, name | Type | Description |
+|----|----|----------------|------|-------------|
+| _required_ | _required_ | **`1 partition`** | `struct<..>` | See [PartitionData](https://github.com/apache/iceberg/blob/master/core/src/main/java/org/apache/iceberg/PartitionData.java) |

Review Comment:
   This shouldn't link to code. Why not use the same description as in the manifest file definition?





[GitHub] [iceberg] rdblue commented on a diff in pull request #7105: Spec: Add partition stats spec

Posted by "rdblue (via GitHub)" <gi...@apache.org>.
rdblue commented on code in PR #7105:
URL: https://github.com/apache/iceberg/pull/7105#discussion_r1209511961


##########
format/partition-statistics-spec.md:
##########
@@ -0,0 +1,43 @@
+---
+title: "Partition Statistics Spec"
+url: partition-statistics-spec
+toc: true
+disableSidebar: true
+---
+<!--
+ - Licensed to the Apache Software Foundation (ASF) under one or more
+ - contributor license agreements.  See the NOTICE file distributed with
+ - this work for additional information regarding copyright ownership.
+ - The ASF licenses this file to You under the Apache License, Version 2.0
+ - (the "License"); you may not use this file except in compliance with
+ - the License.  You may obtain a copy of the License at
+ -
+ -   http://www.apache.org/licenses/LICENSE-2.0
+ -
+ - Unless required by applicable law or agreed to in writing, software
+ - distributed under the License is distributed on an "AS IS" BASIS,
+ - WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ - See the License for the specific language governing permissions and
+ - limitations under the License.
+ -->
+
+# Partition Statistics file format
+
+This is a specification for partition statistics files. It is designed to store statistics information 
+for every partition value as a row in the **table default format** sorted based on the `partition`.
+
+The schema of the file is as follows:
+
+| Field Id | Field Name                   | Field Type   | isOptional | Doc                                                                                                                         |
+|----------|------------------------------|--------------|------------|-----------------------------------------------------------------------------------------------------------------------------|
+| 1        | partition                    | StructType   | false      | See [PartitionData](https://github.com/apache/iceberg/blob/master/core/src/main/java/org/apache/iceberg/PartitionData.java) |                                                                                  
+| 2        | spec_id                      | IntegerType  | false      | Partition spec id                                                                                                           |
+| 3        | data_record_count            | LongType     | false      | Count of records in data files                                                                                              |                                               

Review Comment:
   What about file size rollups as well?
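   As a rough sketch of what such a rollup could look like: the snippet below aggregates record count, file count, and total file size per partition. The `DataFileEntry` and `PartitionStats` classes, the `total_data_file_size_in_bytes` name, and keying by `(partition, spec_id)` are illustrative assumptions, not part of the spec text above.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, Tuple


@dataclass(frozen=True)
class DataFileEntry:
    """Hypothetical stand-in for one live data file in a snapshot."""
    partition: Tuple
    spec_id: int
    record_count: int
    file_size_in_bytes: int


@dataclass
class PartitionStats:
    """Hypothetical per-partition row, including a size rollup."""
    data_record_count: int = 0
    data_file_count: int = 0
    total_data_file_size_in_bytes: int = 0


def roll_up(entries: Iterable[DataFileEntry]) -> Dict[Tuple[Tuple, int], PartitionStats]:
    """Aggregate data file entries into one stats row per (partition, spec_id)."""
    stats: Dict[Tuple[Tuple, int], PartitionStats] = defaultdict(PartitionStats)
    for entry in entries:
        row = stats[(entry.partition, entry.spec_id)]
        row.data_record_count += entry.record_count
        row.data_file_count += 1
        row.total_data_file_size_in_bytes += entry.file_size_in_bytes
    return dict(stats)


# Made-up sample entries: two files in one partition, one file in another.
entries = [
    DataFileEntry(("2023-03-14",), 0, 100, 4096),
    DataFileEntry(("2023-03-14",), 0, 50, 2048),
    DataFileEntry(("2023-03-15",), 0, 10, 512),
]
result = roll_up(entries)
```

   The size rollup costs one extra addition per data file, since the file size is read alongside the record count.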





[GitHub] [iceberg] dramaticlly commented on a diff in pull request #7105: Spec: Add partition stats spec

Posted by "dramaticlly (via GitHub)" <gi...@apache.org>.
dramaticlly commented on code in PR #7105:
URL: https://github.com/apache/iceberg/pull/7105#discussion_r1200997350


##########
format/partition-statistics-spec.md:
##########
@@ -0,0 +1,43 @@
+---
+title: "Partition Statistics Spec"
+url: partition-statistics-spec
+toc: true
+disableSidebar: true
+---
+<!--
+ - Licensed to the Apache Software Foundation (ASF) under one or more
+ - contributor license agreements.  See the NOTICE file distributed with
+ - this work for additional information regarding copyright ownership.
+ - The ASF licenses this file to You under the Apache License, Version 2.0
+ - (the "License"); you may not use this file except in compliance with
+ - the License.  You may obtain a copy of the License at
+ -
+ -   http://www.apache.org/licenses/LICENSE-2.0
+ -
+ - Unless required by applicable law or agreed to in writing, software
+ - distributed under the License is distributed on an "AS IS" BASIS,
+ - WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ - See the License for the specific language governing permissions and
+ - limitations under the License.
+ -->
+
+# Partition Statistics file format
+
+This is a specification for partition statistics files. It is designed to store statistics information 
+for every partition value as a row in the **table default format** sorted based on the `partition`.
+
+The schema of the file is as follows:
+
+| Field Id | Field Name              | Field Type   | isOptional | Doc                                                                                                                         |
+|----------|-------------------------|--------------|------------|-----------------------------------------------------------------------------------------------------------------------------|
+| 1        | partition               | StructType   | false      | See [PartitionData](https://github.com/apache/iceberg/blob/master/core/src/main/java/org/apache/iceberg/PartitionData.java) |                                                                                  
+| 2        | spec_id                 | IntegerType  | false      | Partition spec id                                                                                                           |
+| 3        | data_record_count       | LongType     | false      | Count of records in data files                                                                                              |                                               
+| 4        | data_file_count         | IntegerType  | false      | Count of data files                                                                                                         |
+| 5        | pos_delete_record_count | LongType     | true       | Count of records in position delete files                                                                                   |                                                          
+| 6        | pos_delete_file_count   | IntegerType  | true       | Count of position delete files                                                                                              |                                   
+| 7        | eq_delete_record_count  | LongType     | true       | Count of records in equality delete files                                                                                   |                                  
+| 8        | eq_delete_file_count    | IntegerType  | true       | Count of equality delete files                                                                                              |                                

Review Comment:
   FYI: my PR #7581 tried to provide the last modified timestamp and the corresponding snapshotId for the partition metadata table.





[GitHub] [iceberg] jbonofre commented on a diff in pull request #7105: Spec: Add partition stats spec

Posted by "jbonofre (via GitHub)" <gi...@apache.org>.
jbonofre commented on code in PR #7105:
URL: https://github.com/apache/iceberg/pull/7105#discussion_r1324111754


##########
format/spec.md:
##########
@@ -702,6 +703,47 @@ Blob metadata is a struct with the following fields:
 | _optional_ | _optional_ | **`properties`** | `map<string, string>` | Additional properties associated with the statistic. Subset of Blob properties in the Puffin file. |
 
 
+#### Partition statistics
+
+Partition statistics files are based on [Partition Statistics file spec](#partition-statistics-file). 
+Partition statistics are not required for reading or planning and readers may ignore them.
+Each table snapshot may be associated with at most one partition statistic file.
+A writer can optionally write the partition statistics file during each write operation, and 
+it must be registered in the table metadata file to be considered as a valid statistics file for the reader.
+
+`partition-statistics` field of table metadata is an optional list of struct with the following fields:
+
+| v1 | v2 | Field name | Type | Description |
+|----|----|------------|------|-------------|
+| _required_ | _required_ | **`snapshot-id`** | `long` | ID of the Iceberg table's snapshot the partition statistics file is associated with. |
+| _required_ | _required_ | **`statistics-file-path`** | `string` | Path of the partition statistics file. See [Partition Statistics file](#partition-statistics-file). |
+
+#### Partition Statistics file
+
+Statistics information for each unique partition tuple is stored as a row in the default data file format of the table (for example, Parquet or ORC).
+These rows must be sorted (in ascending manner with NULL FIRST) by `partition` field to optimize filtering rows while scanning.
+
+The schema of the partition statistics file is as follows:
+
+| v1 | v2 | Field id, name | Type | Description |
+|----|----|----------------|------|-------------|
+| _required_ | _required_ | **`1 partition`** | `struct<..>` | Partition data tuple, schema based on the unified partition type considering all specs in a table |
+| _required_ | _required_ | **`2 spec_id`** | `int` | Partition spec id |
+| _required_ | _required_ | **`3 data_record_count`** | `long` | Count of records in data files |
+| _required_ | _required_ | **`4 data_file_count`** | `int` | Count of data files |
+| _required_ | _required_ | **`5 total_data_file_size_in_bytes`** | `long` | Total size of data files in bytes |
+| _optional_ | _optional_ | **`6 position_delete_record_count`** | `long` | Count of records in position delete files |
+| _optional_ | _optional_ | **`7 position_delete_file_count`** | `int` | Count of position delete files |
+| _optional_ | _optional_ | **`8 equality_delete_record_count`** | `long` | Count of records in equality delete files |
+| _optional_ | _optional_ | **`9 equality_delete_file_count`** | `int` | Count of equality delete files |
+| _optional_ | _optional_ | **`10 total_record_count`** | `long` | Accurate count of records in a partition after applying the delete files if any |
+| _optional_ | _optional_ | **`11 last_updated_at`** | `long` | Timestamp in milliseconds from the unix epoch when the partition was last updated |
+| _optional_ | _optional_ | **`12 last_updated_snapshot_id`** | `long` | ID of snapshot that last updated this partition |
+
+Note that partition data tuple's schema is based on the partition spec output using partition field ids for the struct field ids.
+The unified partition type is a struct containing all fields that have ever been a part of any spec in the table. 

Review Comment:
   @ajantha-bhat thanks for the update!





[GitHub] [iceberg] ajantha-bhat commented on a diff in pull request #7105: Spec: Add partition stats spec

Posted by "ajantha-bhat (via GitHub)" <gi...@apache.org>.
ajantha-bhat commented on code in PR #7105:
URL: https://github.com/apache/iceberg/pull/7105#discussion_r1331888975


##########
format/spec.md:
##########
@@ -702,6 +703,58 @@ Blob metadata is a struct with the following fields:
 | _optional_ | _optional_ | **`properties`** | `map<string, string>` | Additional properties associated with the statistic. Subset of Blob properties in the Puffin file. |
 
 
+#### Partition statistics
+
+Partition statistics files are based on [Partition Statistics file spec](#partition-statistics-file). 
+Partition statistics are not required for reading or planning and readers may ignore them.
+Each table snapshot may be associated with at most one partition statistic file.
+A writer can optionally write the partition statistics file during each write operation, and 
+it must be registered in the table metadata file to be considered as a valid statistics file for the reader.
+
+`partition-statistics` field of table metadata is an optional list of struct with the following fields:
+
+| v1 | v2 | Field name | Type | Description |
+|----|----|------------|------|-------------|
+| _required_ | _required_ | **`snapshot-id`** | `long` | ID of the Iceberg table's snapshot the partition statistics file is associated with. |
+| _required_ | _required_ | **`statistics-file-path`** | `string` | Path of the partition statistics file. See [Partition Statistics file](#partition-statistics-file). |

Review Comment:
   Since it is a single file, I thought adding the `file` keyword would give more clarity. A general `statistics-path` may give the impression that it is a folder or multiple files.





[GitHub] [iceberg] RussellSpitzer commented on a diff in pull request #7105: Spec: Add partition stats spec

Posted by "RussellSpitzer (via GitHub)" <gi...@apache.org>.
RussellSpitzer commented on code in PR #7105:
URL: https://github.com/apache/iceberg/pull/7105#discussion_r1317418618


##########
format/spec.md:
##########
@@ -702,6 +703,41 @@ Blob metadata is a struct with the following fields:
 | _optional_ | _optional_ | **`properties`** | `map<string, string>` | Additional properties associated with the statistic. Subset of Blob properties in the Puffin file. |
 
 
+#### Partition statistics
+
+Partition statistics files are based on [Partition Statistics file spec](#partition-statistics-file). Partition statistics are informational. A reader can choose to
+ignore partition statistics information. Partition statistics support is not required to read the table correctly.
+Each table snapshot may be associated with at most one partition statistic file and the table can contain many partition statistics files associated with different table snapshots.
+A writer can optionally write the partition statistics file during each write operation. If the statistics file is written for the specific snapshot,
+it must be accurate and must be registered in the table metadata file to be considered as a valid statistics file for the reader.
+
+Partition statistics files metadata within `partition-statistics` table metadata field is a struct with the following fields:
+
+| v1 | v2 | Field name | Type | Description |
+|----|----|------------|------|-------------|
+| _required_ | _required_ | **`snapshot-id`** | `long` | ID of the Iceberg table's snapshot the partition statistics file is associated with. |
+| _required_ | _required_ | **`statistics-file-path`** | `string` | Path of the partition statistics file. See [Partition Statistics file](#partition-statistics-file). |
+| _required_ | _required_ | **`max-data-sequence-number`** | `long` | Maximum data sequence number of the Iceberg table's snapshot the partition statistics was computed from. |
+
+#### Partition Statistics file
+
+Statistics information for every partition tuple is stored as a row in the **table default format**.
+These rows are sorted (in ascending manner with NULL FIRST) based on all partition columns from `partition` in the same order
+to optimize filtering rows while scanning.
+Each unique partition tuple must have exactly one corresponding row, ensuring that statistics for all partitions are present.
+
+A partition statistics file stores statistics as a struct with the following fields:
+
+| v1 | v2 | Field id, name | Type | Description |
+|----|----|----------------|------|-------------|
+| _required_ | _required_ | **`1 partition`** | `struct<..>` | Partition data tuple, schema based on the partition spec output using partition field ids for the struct field ids |

Review Comment:
   As I noted above, the partition spec output needs to be well defined here. Is it for the associated snapshot, or is it the full unified partition spec?





[GitHub] [iceberg] RussellSpitzer commented on a diff in pull request #7105: Spec: Add partition stats spec

Posted by "RussellSpitzer (via GitHub)" <gi...@apache.org>.
RussellSpitzer commented on code in PR #7105:
URL: https://github.com/apache/iceberg/pull/7105#discussion_r1317416584


##########
format/spec.md:
##########
@@ -702,6 +703,41 @@ Blob metadata is a struct with the following fields:
 | _optional_ | _optional_ | **`properties`** | `map<string, string>` | Additional properties associated with the statistic. Subset of Blob properties in the Puffin file. |
 
 
+#### Partition statistics
+
+Partition statistics files are based on [Partition Statistics file spec](#partition-statistics-file). Partition statistics are informational. A reader can choose to
+ignore partition statistics information. Partition statistics support is not required to read the table correctly.
+Each table snapshot may be associated with at most one partition statistic file and the table can contain many partition statistics files associated with different table snapshots.
+A writer can optionally write the partition statistics file during each write operation. If the statistics file is written for the specific snapshot,
+it must be accurate and must be registered in the table metadata file to be considered as a valid statistics file for the reader.
+
+Partition statistics files metadata within `partition-statistics` table metadata field is a struct with the following fields:
+
+| v1 | v2 | Field name | Type | Description |
+|----|----|------------|------|-------------|
+| _required_ | _required_ | **`snapshot-id`** | `long` | ID of the Iceberg table's snapshot the partition statistics file is associated with. |
+| _required_ | _required_ | **`statistics-file-path`** | `string` | Path of the partition statistics file. See [Partition Statistics file](#partition-statistics-file). |
+| _required_ | _required_ | **`max-data-sequence-number`** | `long` | Maximum data sequence number of the Iceberg table's snapshot the partition statistics was computed from. |
+
+#### Partition Statistics file
+
+Statistics information for every partition tuple is stored as a row in the **table default format**.
+These rows are sorted (in ascending manner with NULL FIRST) based on all partition columns from `partition` in the same order

Review Comment:
   The partition spec here probably needs to be specified. Are all statistics stored based on the current partition spec, or based on the unified spec as of the current snapshot?
   
   For example, say I have a table with 2 specs:
   1. Identity A
   2. Identity B
   
   And the current spec is "B".
   
   Are we storing tuples
   (A, B)
   
   or are we storing only
   (B)
   
   in the statistics file?
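
A minimal Python sketch of the two alternatives raised in this question. The spec ids, the partition field ids (1000, 1001), and the helper names `unified_fields` and `current_fields` are assumptions made for illustration; they are not taken from the PR or from the Iceberg API.

```python
# Hypothetical table with two partition specs, as in the example above.
# Each spec is a list of (partition_field_id, source_column) pairs.
specs = {
    0: [(1000, "A")],  # spec 0: identity(A)
    1: [(1001, "B")],  # spec 1: identity(B)
}
current_spec_id = 1


def unified_fields(all_specs):
    """Union of partition fields across all specs, ordered by field id."""
    seen = {}
    for fields in all_specs.values():
        for field_id, name in fields:
            seen.setdefault(field_id, name)
    return [seen[field_id] for field_id in sorted(seen)]


def current_fields(all_specs, spec_id):
    """Partition fields of a single spec only."""
    return [name for _, name in all_specs[spec_id]]


unified = unified_fields(specs)                   # option 1: tuples of (A, B)
current = current_fields(specs, current_spec_id)  # option 2: tuples of (B)
```

Under these assumptions, the first option keys every row by both fields, while the second keys rows by the current spec's field only.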





[GitHub] [iceberg] RussellSpitzer commented on a diff in pull request #7105: Spec: Add partition stats spec

Posted by "RussellSpitzer (via GitHub)" <gi...@apache.org>.
RussellSpitzer commented on code in PR #7105:
URL: https://github.com/apache/iceberg/pull/7105#discussion_r1317398880


##########
format/spec.md:
##########
@@ -702,6 +703,41 @@ Blob metadata is a struct with the following fields:
 | _optional_ | _optional_ | **`properties`** | `map<string, string>` | Additional properties associated with the statistic. Subset of Blob properties in the Puffin file. |
 
 
+#### Partition statistics
+
+Partition statistics files are based on [Partition Statistics file spec](#partition-statistics-file). Partition statistics are informational. A reader can choose to
+ignore partition statistics information. Partition statistics support is not required to read the table correctly.
+Each table snapshot may be associated with at most one partition statistic file and the table can contain many partition statistics files associated with different table snapshots.
+A writer can optionally write the partition statistics file during each write operation. If the statistics file is written for the specific snapshot,
+it must be accurate and must be registered in the table metadata file to be considered as a valid statistics file for the reader.
+
+Partition statistics files metadata within `partition-statistics` table metadata field is a struct with the following fields:
+
+| v1 | v2 | Field name | Type | Description |
+|----|----|------------|------|-------------|
+| _required_ | _required_ | **`snapshot-id`** | `long` | ID of the Iceberg table's snapshot the partition statistics file is associated with. |
+| _required_ | _required_ | **`statistics-file-path`** | `string` | Path of the partition statistics file. See [Partition Statistics file](#partition-statistics-file). |
+| _required_ | _required_ | **`max-data-sequence-number`** | `long` | Maximum data sequence number of the Iceberg table's snapshot the partition statistics was computed from. |

Review Comment:
   Why do we need this here? Shouldn't it be stored with the Snapshot? 





[GitHub] [iceberg] ajantha-bhat commented on a diff in pull request #7105: Spec: Add partition stats spec

Posted by "ajantha-bhat (via GitHub)" <gi...@apache.org>.
ajantha-bhat commented on code in PR #7105:
URL: https://github.com/apache/iceberg/pull/7105#discussion_r1317132467


##########
format/spec.md:
##########
@@ -702,6 +703,51 @@ Blob metadata is a struct with the following fields:
 | _optional_ | _optional_ | **`properties`** | `map<string, string>` | Additional properties associated with the statistic. Subset of Blob properties in the Puffin file. |
 
 
+#### Partition statistics
+
+Partition statistics files are based on [Partition Statistics file spec](#partition-statistics-file). Partition statistics are informational. A reader can choose to
+ignore partition statistics information. Partition statistics support is not required to read the table correctly. A table can contain
+many partition statistics files associated with different table snapshots.
+A writer can optionally write the partition statistics file during each write operation. If the statistics file is written for the specific snapshot, 
+it must be accurate and must be registered in the table metadata file to be considered as a valid statistics file for the reader.
+
+Partition statistics files metadata within `partition-statistics` table metadata field is a struct with the following fields:
+
+| v1 | v2 | Field name | Type | Description |
+|----|----|------------|------|-------------|
+| _required_ | _required_ | **`snapshot-id`** | `long` | ID of the Iceberg table's snapshot the partition statistics file is associated with. |
+| _required_ | _required_ | **`statistics-file-path`** | `string` | Path of the partition statistics file. See [Partition Statistics file](#partition-statistics-file). |
+| _required_ | _required_ | **`max-data-sequence-number`** | `long` | Maximum data sequence number of the Iceberg table's snapshot the partition statistics was computed from. |
+
+#### Partition Statistics file
+
+Statistics information for every partition tuple is stored as a row in the **table default format**. 
+These rows are sorted based on the first partition column from `partition`. 
+Each unique partition tuple must have exactly one corresponding row, ensuring all partition tuples are present.
+
+Partition statistics file store the statistics as a struct with the following fields:
+
+| v1 | v2 | Field id, name | Type | Description |
+|----|----|----------------|------|-------------|
+| _required_ | _required_ | **`1 partition`** | `struct<..>` | Partition data tuple, schema based on the partition spec output using partition field ids for the struct field ids |
+| _required_ | _required_ | **`2 spec_id`** | `int` | Partition spec id |
+| _required_ | _required_ | **`3 data_record_count`** | `long` | Count of records in data files |
+| _required_ | _required_ | **`4 data_file_count`** | `int` | Count of data files |
+| _required_ | _required_ | **`5 total_data_file_size_in_bytes`** | `long` | Total size of data files in bytes |
+| _optional_ | _optional_ | **`6 position_delete_record_count`** | `long` | Count of records in position delete files |

Review Comment:
   If we share the fields with the partitions table, we need to populate them in both cases; otherwise it will be confusing.
   
   I have let go of the idea of keeping the same schema for now.
   I have kept only the accurate count, as Ryan suggested.





[GitHub] [iceberg] RussellSpitzer commented on a diff in pull request #7105: Spec: Add partition stats spec

Posted by "RussellSpitzer (via GitHub)" <gi...@apache.org>.
RussellSpitzer commented on code in PR #7105:
URL: https://github.com/apache/iceberg/pull/7105#discussion_r1317466386


##########
format/spec.md:
##########
@@ -702,6 +703,41 @@ Blob metadata is a struct with the following fields:
 | _optional_ | _optional_ | **`properties`** | `map<string, string>` | Additional properties associated with the statistic. Subset of Blob properties in the Puffin file. |
 
 
+#### Partition statistics
+
+Partition statistics files are based on [Partition Statistics file spec](#partition-statistics-file). Partition statistics are informational. A reader can choose to
+ignore partition statistics information. Partition statistics support is not required to read the table correctly.
+Each table snapshot may be associated with at most one partition statistic file and the table can contain many partition statistics files associated with different table snapshots.
+A writer can optionally write the partition statistics file during each write operation. If the statistics file is written for the specific snapshot,
+it must be accurate and must be registered in the table metadata file to be considered as a valid statistics file for the reader.
+
+Partition statistics files metadata within `partition-statistics` table metadata field is a struct with the following fields:
+
+| v1 | v2 | Field name | Type | Description |
+|----|----|------------|------|-------------|
+| _required_ | _required_ | **`snapshot-id`** | `long` | ID of the Iceberg table's snapshot the partition statistics file is associated with. |
+| _required_ | _required_ | **`statistics-file-path`** | `string` | Path of the partition statistics file. See [Partition Statistics file](#partition-statistics-file). |
+| _required_ | _required_ | **`max-data-sequence-number`** | `long` | Maximum data sequence number of the Iceberg table's snapshot the partition statistics was computed from. |
+
+#### Partition Statistics file
+
+Statistics information for every partition tuple is stored as a row in the **table default format**.
+These rows are sorted (in ascending manner with NULL FIRST) based on all partition columns from `partition` in the same order
+to optimize filtering rows while scanning.
+Each unique partition tuple must have exactly one corresponding row, ensuring that statistics for all partitions are present.
+
+A partition statistics file stores statistics as a struct with the following fields:

Review Comment:
   The table default format is not defined anywhere else in this doc.





Re: [PR] Spec: Add partition stats spec [iceberg]

Posted by "jbonofre (via GitHub)" <gi...@apache.org>.
jbonofre commented on PR #7105:
URL: https://github.com/apache/iceberg/pull/7105#issuecomment-1789154241

   @aokolnychyi @ajantha-bhat thanks, guys! Good one!
   
   As the implementation doesn't seem that complex, I suggest creating the PR including the doc. We can start the discussion directly in the PR if needed.
   
   Thanks again!




Re: [PR] Spec: Add partition stats spec [iceberg]

Posted by "aokolnychyi (via GitHub)" <gi...@apache.org>.
aokolnychyi commented on code in PR #7105:
URL: https://github.com/apache/iceberg/pull/7105#discussion_r1353203145


##########
format/spec.md:
##########
@@ -702,6 +703,58 @@ Blob metadata is a struct with the following fields:
 | _optional_ | _optional_ | **`properties`** | `map<string, string>` | Additional properties associated with the statistic. Subset of Blob properties in the Puffin file. |
 
 
+#### Partition statistics
+
+Partition statistics files are based on [Partition Statistics file spec](#partition-statistics-file). 
+Partition statistics are not required for reading or planning and readers may ignore them.
+Each table snapshot may be associated with at most one partition statistic file.
+A writer can optionally write the partition statistics file during each write operation, and 
+it must be registered in the table metadata file to be considered as a valid statistics file for the reader.
+
+`partition-statistics` field of table metadata is an optional list of struct with the following fields:
+
+| v1 | v2 | Field name | Type | Description |
+|----|----|------------|------|-------------|
+| _required_ | _required_ | **`snapshot-id`** | `long` | ID of the Iceberg table's snapshot the partition statistics file is associated with. |
+| _required_ | _required_ | **`statistics-file-path`** | `string` | Path of the partition statistics file. See [Partition Statistics file](#partition-statistics-file). |
+| _required_ | _required_ | **`file-size-in-bytes`** | `long` | Size of the partition statistics file. |
+
+#### Partition Statistics file
+
+Statistics information for each unique partition tuple is stored as a row in the default data file format of the table (for example, Parquet or ORC).
+These rows must be sorted (in ascending manner with NULL FIRST) by `partition` field to optimize filtering rows while scanning.
+
+The schema of the partition statistics file is as follows:
+
+| v1 | v2 | Field id, name | Type | Description |
+|----|----|----------------|------|-------------|
+| _required_ | _required_ | **`1 partition`** | `struct<..>` | Partition data tuple, schema based on the unified partition type considering all specs in a table |
+| _required_ | _required_ | **`2 spec_id`** | `int` | Partition spec id |
+| _required_ | _required_ | **`3 data_record_count`** | `long` | Count of records in data files |
+| _required_ | _required_ | **`4 data_file_count`** | `int` | Count of data files |
+| _required_ | _required_ | **`5 total_data_file_size_in_bytes`** | `long` | Total size of data files in bytes |
+| _optional_ | _optional_ | **`6 position_delete_record_count`** | `long` | Count of records in position delete files |
+| _optional_ | _optional_ | **`7 position_delete_file_count`** | `int` | Count of position delete files |
+| _optional_ | _optional_ | **`8 equality_delete_record_count`** | `long` | Count of records in equality delete files |
+| _optional_ | _optional_ | **`9 equality_delete_file_count`** | `int` | Count of equality delete files |
+| _optional_ | _optional_ | **`10 total_record_count`** | `long` | Accurate count of records in a partition after applying the delete files if any |

Review Comment:
   This makes sense to me.
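
The sort rule in the quoted spec text (rows ascending by `partition`, NULL first) could be sketched in Python as follows. The partition values and counts are made up, and `nulls_first_key` is a hypothetical helper; this is an illustration of the ordering only, not the Iceberg implementation.

```python
# Illustrative rows only: each pairs a partition tuple with a few of the
# fields from the quoted schema. All values are invented for the example.
rows = [
    {"partition": ("b", 2), "spec_id": 0, "data_record_count": 10, "data_file_count": 1},
    {"partition": (None, 5), "spec_id": 0, "data_record_count": 7, "data_file_count": 2},
    {"partition": ("a", None), "spec_id": 0, "data_record_count": 3, "data_file_count": 1},
    {"partition": ("a", 1), "spec_id": 0, "data_record_count": 4, "data_file_count": 1},
]


def nulls_first_key(row):
    """Sort key: ascending per partition field, with None ordered before any value."""
    # The boolean flag is compared first, so None is never compared to a real value.
    return tuple((value is not None, value) for value in row["partition"])


sorted_rows = sorted(rows, key=nulls_first_key)
ordered_partitions = [row["partition"] for row in sorted_rows]
```

Keeping the rows in this order is what would let a reader skip ranges of partitions while scanning, which is the motivation the spec text gives for the sort.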





Re: [PR] Spec: Add partition stats spec [iceberg]

Posted by "ajantha-bhat (via GitHub)" <gi...@apache.org>.
ajantha-bhat commented on code in PR #7105:
URL: https://github.com/apache/iceberg/pull/7105#discussion_r1352483805


##########
format/spec.md:
##########
@@ -702,6 +703,58 @@ Blob metadata is a struct with the following fields:
 | _optional_ | _optional_ | **`properties`** | `map<string, string>` | Additional properties associated with the statistic. Subset of Blob properties in the Puffin file. |
 
 
+#### Partition statistics
+
+Partition statistics files are based on [Partition Statistics file spec](#partition-statistics-file). 
+Partition statistics are not required for reading or planning and readers may ignore them.
+Each table snapshot may be associated with at most one partition statistic file.
+A writer can optionally write the partition statistics file during each write operation, and 

Review Comment:
   Added, with some rewording.





Re: [PR] Spec: Add partition stats spec [iceberg]

Posted by "aokolnychyi (via GitHub)" <gi...@apache.org>.
aokolnychyi merged PR #7105:
URL: https://github.com/apache/iceberg/pull/7105




Re: [PR] Spec: Add partition stats spec [iceberg]

Posted by "ajantha-bhat (via GitHub)" <gi...@apache.org>.
ajantha-bhat commented on PR #7105:
URL: https://github.com/apache/iceberg/pull/7105#issuecomment-1829654384

   I am focusing only on the async write first (Spark action and procedure), as discussed last time.
   
   I see that it has to be done in 4 PRs:
   
   a) Add a util to read and write partition stats
   https://github.com/apache/iceberg/pull/9170
   
   b) Track the partition stats file from TableMetadata
   https://github.com/apache/iceberg/pull/8502
   
   c) A Spark action to compute and write the partition stats and register the file in the table metadata
   
   d) A call procedure wrapper for the Spark action
   
   The first two PRs are ready for review. I am working on the last two this week.
   
   Please review and help with merging.
   cc: @aokolnychyi, @RussellSpitzer, @flyrain




[GitHub] [iceberg] rdblue commented on a diff in pull request #7105: Spec: Add partition stats spec

Posted by "rdblue (via GitHub)" <gi...@apache.org>.
rdblue commented on code in PR #7105:
URL: https://github.com/apache/iceberg/pull/7105#discussion_r1275213902


##########
format/spec.md:
##########
@@ -702,6 +703,51 @@ Blob metadata is a struct with the following fields:
 | _optional_ | _optional_ | **`properties`** | `map<string, string>` | Additional properties associated with the statistic. Subset of Blob properties in the Puffin file. |
 
 
+#### Partition statistics
+
+Partition statistics files are based on [Partition Statistics file spec](#partition-statistics-file). Partition statistics are informational. A reader can choose to
+ignore partition statistics information. Partition statistics support is not required to read the table correctly. A table can contain
+many partition statistics files associated with different table snapshots.
+A writer can optionally write the partition statistics file during each write operation. If the statistics file is written for the specific snapshot, 
+it must be accurate and must be registered in the table metadata file to be considered as a valid statistics file for the reader.
+
+Partition statistics files metadata within `partition-statistics` table metadata field is a struct with the following fields:
+
+| v1 | v2 | Field name | Type | Description |
+|----|----|------------|------|-------------|
+| _required_ | _required_ | **`snapshot-id`** | `long` | ID of the Iceberg table's snapshot the partition statistics file is associated with. |
+| _required_ | _required_ | **`statistics-file-path`** | `string` | Path of the partition statistics file. See [Partition Statistics file](#partition-statistics-file). |
+| _required_ | _required_ | **`max-data-sequence-number`** | `long` | Maximum data sequence number of the Iceberg table's snapshot the partition statistics was computed from. |
+
+#### Partition Statistics file
+
+Statistics information for every partition tuple is stored as a row in the **table default format**. 
+These rows are sorted based on the first partition column from `partition`. 
+Each unique partition tuple must have exactly one corresponding row, ensuring all partition tuples are present.
+
+Partition statistics file store the statistics as a struct with the following fields:
+
+| v1 | v2 | Field id, name | Type | Description |
+|----|----|----------------|------|-------------|
+| _required_ | _required_ | **`1 partition`** | `struct<..>` | Partition data tuple, schema based on the partition spec output using partition field ids for the struct field ids |
+| _required_ | _required_ | **`2 spec_id`** | `int` | Partition spec id |
+| _required_ | _required_ | **`3 data_record_count`** | `long` | Count of records in data files |
+| _required_ | _required_ | **`4 data_file_count`** | `int` | Count of data files |
+| _required_ | _required_ | **`5 total_data_file_size_in_bytes`** | `long` | Total size of data files in bytes |
+| _optional_ | _optional_ | **`6 position_delete_record_count`** | `long` | Count of records in position delete files |
+| _optional_ | _optional_ | **`7 position_delete_file_count`** | `int` | Count of position delete files |
+| _optional_ | _optional_ | **`8 equality_delete_record_count`** | `long` | Count of records in equality delete files |
+| _optional_ | _optional_ | **`9 equality_delete_file_count`** | `int` | Count of equality delete files |
+| _optional_ | _optional_ | **`9 equality_delete_file_count`** | `int` | Count of equality delete files |

Review Comment:
   Duplicate.





[GitHub] [iceberg] ajantha-bhat commented on a diff in pull request #7105: Spec: Add partition stats spec

Posted by "ajantha-bhat (via GitHub)" <gi...@apache.org>.
ajantha-bhat commented on code in PR #7105:
URL: https://github.com/apache/iceberg/pull/7105#discussion_r1275253104


##########
format/spec.md:
##########
@@ -702,6 +703,44 @@ Blob metadata is a struct with the following fields:
 | _optional_ | _optional_ | **`properties`** | `map<string, string>` | Additional properties associated with the statistic. Subset of Blob properties in the Puffin file. |
 
 
+#### Partition statistics
+
+Partition statistics files are based on [Partition Statistics file spec](#partition-statistics-file). Partition statistics are informational. A reader can choose to
+ignore partition statistics information. Partition statistics support is not required to read the table correctly. A table can contain
+many partition statistics files associated with different table snapshots.
+A writer can optionally write the partition statistics file during each write operation. If the statistics file is written for the specific snapshot, 
+It must be registered in the table metadata file to be considered as a valid statistics file for the reader.
+
+Partition statistics files metadata within `partition-statistics` table metadata field is a struct with the following fields:
+
+| v1 | v2 | Field name | Type | Description |
+|----|----|------------|------|-------------|
+| _required_ | _required_ | **`snapshot-id`** | `long` | ID of the Iceberg table's snapshot the partition statistics file is associated with. |
+| _required_ | _required_ | **`statistics-file-path`** | `string` | Path of the partition statistics file. See [Partition Statistics file](#partition-statistics-file). |
+| _required_ | _required_ | **`max-data-sequence-number`** | `long` | Maximum data sequence number of the Iceberg table's snapshot the partition statistics was computed from. |
+
+#### Partition Statistics file
+
+Statistics information for every partition value is stored as a row in the **table default format** 
+sorted based on the first partition column from `partition`.
+
+Partition statistics file store the statistics as a struct with the following fields:
+
+| v1 | v2 | Field id, name | Type | Description |
+|----|----|----------------|------|-------------|
+| _required_ | _required_ | **`1 partition`** | `struct<..>` | See [PartitionData](https://github.com/apache/iceberg/blob/master/core/src/main/java/org/apache/iceberg/PartitionData.java) |
+| _required_ | _required_ | **`2 spec_id`** | `int` | Partition spec id |
+| _required_ | _required_ | **`3 data_record_count`** | `long` | Count of records in data files |
+| _required_ | _required_ | **`4 data_file_count`** | `int` | Count of data files |
+| _required_ | _required_ | **`5 data_file_size_in_bytes`** | `long` | Total size of data files in bytes |
+| _optional_ | _optional_ | **`6 position_delete_record_count`** | `long` | Count of records in position delete files |
+| _optional_ | _optional_ | **`7 position_delete_file_count`** | `int` | Count of position delete files |
+| _optional_ | _optional_ | **`8 equality_delete_record_count`** | `long` | Count of records in equality delete files |
+| _optional_ | _optional_ | **`9 equality_delete_file_count`** | `int` | Count of equality delete files |
+
+Each snapshot may create a new partition statistics file. The name of the partition statistics file is as follows:       
+`partition-stats-${snapshotId}.${tableDefaultFormat}`

Review Comment:
   Removed.
   
   Since we mention table metadata file names for filesystem tables and metastore tables, I thought it was OK to mention this one as well.
   But yes, it is not really needed here.





[GitHub] [iceberg] ajantha-bhat commented on a diff in pull request #7105: Spec: Add partition stats spec

Posted by "ajantha-bhat (via GitHub)" <gi...@apache.org>.
ajantha-bhat commented on code in PR #7105:
URL: https://github.com/apache/iceberg/pull/7105#discussion_r1331887869


##########
format/spec.md:
##########
@@ -702,6 +703,58 @@ Blob metadata is a struct with the following fields:
 | _optional_ | _optional_ | **`properties`** | `map<string, string>` | Additional properties associated with the statistic. Subset of Blob properties in the Puffin file. |
 
 
+#### Partition statistics
+
+Partition statistics files are based on [Partition Statistics file spec](#partition-statistics-file). 
+Partition statistics are not required for reading or planning and readers may ignore them.
+Each table snapshot may be associated with at most one partition statistic file.
+A writer can optionally write the partition statistics file during each write operation, and 
+it must be registered in the table metadata file to be considered as a valid statistics file for the reader.
+
+`partition-statistics` field of table metadata is an optional list of struct with the following fields:
+
+| v1 | v2 | Field name | Type | Description |

Review Comment:
   Since nothing is concrete on that side yet, I didn't add it.
   If required, we can add these later as optional fields.





[GitHub] [iceberg] szehon-ho commented on pull request #7105: Spec: Add partition stats spec

Posted by "szehon-ho (via GitHub)" <gi...@apache.org>.
szehon-ho commented on PR #7105:
URL: https://github.com/apache/iceberg/pull/7105#issuecomment-1709898544

   @ajantha-bhat I think that makes sense. The discussion is here for reference: https://github.com/apache/iceberg/pull/7105#discussion_r1318005645. Let's see if there is consensus.




[GitHub] [iceberg] szehon-ho commented on a diff in pull request #7105: Spec: Add partition stats spec

Posted by "szehon-ho (via GitHub)" <gi...@apache.org>.
szehon-ho commented on code in PR #7105:
URL: https://github.com/apache/iceberg/pull/7105#discussion_r1319945486


##########
format/spec.md:
##########
@@ -702,6 +703,49 @@ Blob metadata is a struct with the following fields:
 | _optional_ | _optional_ | **`properties`** | `map<string, string>` | Additional properties associated with the statistic. Subset of Blob properties in the Puffin file. |
 
 
+#### Partition statistics
+
+Partition statistics files are based on [Partition Statistics file spec](#partition-statistics-file). 
+Partition statistics are not required for reading or planning and readers may ignore them.
+Each table snapshot may be associated with at most one partition statistic file.
+A writer can optionally write the partition statistics file during each write operation. If the statistics file is written for the specific snapshot,
+it must be registered in the table metadata file to be considered as a valid statistics file for the reader.
+
+Partition statistics files metadata within `partition-statistics` table metadata field is a struct with the following fields:
+
+| v1 | v2 | Field name | Type | Description |
+|----|----|------------|------|-------------|
+| _required_ | _required_ | **`snapshot-id`** | `long` | ID of the Iceberg table's snapshot the partition statistics file is associated with. |
+| _required_ | _required_ | **`statistics-file-path`** | `string` | Path of the partition statistics file. See [Partition Statistics file](#partition-statistics-file). |
+| _required_ | _required_ | **`max-data-sequence-number`** | `long` | Maximum data sequence number of the Iceberg table's snapshot the partition statistics was computed from. |
+
+#### Partition Statistics file
+
+Statistics information for each unique partition tuple is stored as a row in the default data file format of the table (for example, Parquet or ORC).
+These rows are sorted (in ascending manner with NULL FIRST) based on all partition columns from `partition` in the same order

Review Comment:
   Nit: can we simplify to just
   
   `These rows must be sorted (in ascending manner with NULL FIRST) by partition to optimize...` ?
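
The ordering suggested above (ascending, nulls first, by the `partition` tuple) could be sketched in Python roughly as follows. The row layout and the `null_first_key` helper are assumptions made for illustration; they are not part of the spec or of any Iceberg API.

```python
from typing import Any, Optional, Tuple


def null_first_key(partition: Tuple[Optional[Any], ...]) -> tuple:
    """Sort key: compare field by field, ascending, with None before any value."""
    # (0, 0) stands in for null so that None is never compared with a real value.
    return tuple((0, 0) if value is None else (1, value) for value in partition)


# Hypothetical rows; only the partition tuple matters for the ordering.
rows = [
    {"partition": ("2023-03-14", 2), "data_file_count": 4},
    {"partition": (None, 1), "data_file_count": 1},
    {"partition": ("2023-03-13", None), "data_file_count": 2},
    {"partition": ("2023-03-13", 5), "data_file_count": 3},
]

sorted_rows = sorted(rows, key=lambda row: null_first_key(row["partition"]))
sorted_partitions = [row["partition"] for row in sorted_rows]
```

With this key, a null in an earlier partition column sorts ahead of every non-null value in that column, and ties fall through to the next column.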



##########
format/spec.md:
##########
@@ -702,6 +703,49 @@ Blob metadata is a struct with the following fields:
 | _optional_ | _optional_ | **`properties`** | `map<string, string>` | Additional properties associated with the statistic. Subset of Blob properties in the Puffin file. |
 
 
+#### Partition statistics
+
+Partition statistics files are based on [Partition Statistics file spec](#partition-statistics-file). 
+Partition statistics are not required for reading or planning and readers may ignore them.
+Each table snapshot may be associated with at most one partition statistic file.
+A writer can optionally write the partition statistics file during each write operation. If the statistics file is written for the specific snapshot,
+it must be registered in the table metadata file to be considered as a valid statistics file for the reader.
+
+Partition statistics files metadata within `partition-statistics` table metadata field is a struct with the following fields:

Review Comment:
   Nit: this sentence does not make much sense. Would this suffice?
   
   `Partition statistics files contain a struct 'partition-statistics' with the following fields`



##########
format/spec.md:
##########
@@ -702,6 +703,49 @@ Blob metadata is a struct with the following fields:
 | _optional_ | _optional_ | **`properties`** | `map<string, string>` | Additional properties associated with the statistic. Subset of Blob properties in the Puffin file. |
 
 
+#### Partition statistics
+
+Partition statistics files are based on [Partition Statistics file spec](#partition-statistics-file). 
+Partition statistics are not required for reading or planning and readers may ignore them.
+Each table snapshot may be associated with at most one partition statistic file.
+A writer can optionally write the partition statistics file during each write operation. If the statistics file is written for the specific snapshot,

Review Comment:
   Nit: I am not sure these two sentences add much value. This is the case for any file reference in Iceberg, isn't it?
   
     
   ```A writer can optionally write the partition statistics file during each write operation. If the statistics file is written for the specific snapshot, it must be registered in the table metadata file to be considered as a valid statistics file for the reader.```
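
For context, the registration rule quoted above means a reader only ever finds a partition statistics file through the `partition-statistics` list in table metadata. A minimal sketch of that lookup, assuming the metadata has already been parsed into a Python dict, is shown below. The paths, snapshot ids, and the `partition_stats_path` helper are hypothetical, and only the two fields common to every revision of the diff (`snapshot-id`, `statistics-file-path`) are shown.

```python
from typing import Optional

# Hypothetical excerpt of parsed table metadata; paths and ids are made up.
table_metadata = {
    "current-snapshot-id": 2,
    "partition-statistics": [
        {
            "snapshot-id": 1,
            "statistics-file-path": "s3://bucket/tbl/metadata/partition-stats-1.parquet",
        },
        {
            "snapshot-id": 2,
            "statistics-file-path": "s3://bucket/tbl/metadata/partition-stats-2.parquet",
        },
    ],
}


def partition_stats_path(metadata: dict, snapshot_id: int) -> Optional[str]:
    """Return the registered statistics file path for a snapshot, or None."""
    # The field is optional, and readers may ignore it entirely.
    for entry in metadata.get("partition-statistics") or []:
        if entry["snapshot-id"] == snapshot_id:
            return entry["statistics-file-path"]
    return None


current_path = partition_stats_path(table_metadata, table_metadata["current-snapshot-id"])
```

A file that exists in storage but is not listed here is simply invisible to this lookup, which is the behaviour the two sentences describe.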





[GitHub] [iceberg] ajantha-bhat commented on a diff in pull request #7105: Spec: Add partition stats spec

Posted by "ajantha-bhat (via GitHub)" <gi...@apache.org>.
ajantha-bhat commented on code in PR #7105:
URL: https://github.com/apache/iceberg/pull/7105#discussion_r1319802967


##########
format/spec.md:
##########
@@ -702,6 +703,41 @@ Blob metadata is a struct with the following fields:
 | _optional_ | _optional_ | **`properties`** | `map<string, string>` | Additional properties associated with the statistic. Subset of Blob properties in the Puffin file. |
 
 
+#### Partition statistics
+
+Partition statistics files are based on [Partition Statistics file spec](#partition-statistics-file). Partition statistics are informational. A reader can choose to
+ignore partition statistics information. Partition statistics support is not required to read the table correctly.
+Each table snapshot may be associated with at most one partition statistic file and the table can contain many partition statistics files associated with different table snapshots.
+A writer can optionally write the partition statistics file during each write operation. If the statistics file is written for the specific snapshot,
+it must be accurate and must be registered in the table metadata file to be considered as a valid statistics file for the reader.
+
+Partition statistics files metadata within `partition-statistics` table metadata field is a struct with the following fields:
+
+| v1 | v2 | Field name | Type | Description |
+|----|----|------------|------|-------------|
+| _required_ | _required_ | **`snapshot-id`** | `long` | ID of the Iceberg table's snapshot the partition statistics file is associated with. |
+| _required_ | _required_ | **`statistics-file-path`** | `string` | Path of the partition statistics file. See [Partition Statistics file](#partition-statistics-file). |
+| _required_ | _required_ | **`max-data-sequence-number`** | `long` | Maximum data sequence number of the Iceberg table's snapshot the partition statistics was computed from. |
+
+#### Partition Statistics file
+
+Statistics information for every partition tuple is stored as a row in the **table default format**.
+These rows are sorted (in ascending manner with NULL FIRST) based on all partition columns from `partition` in the same order
+to optimize filtering rows while scanning.
+Each unique partition tuple must have exactly one corresponding row, ensuring that statistics for all partitions are present.
+
+A partition statistics file stores statistics as a struct with the following fields:

Review Comment:
   OK. Changed to:
   
   `Statistics information for each unique partition tuple is stored as a row in the default data file format of the table (for example, Parquet or ORC).`





[GitHub] [iceberg] ajantha-bhat commented on a diff in pull request #7105: Spec: Add partition stats spec

Posted by "ajantha-bhat (via GitHub)" <gi...@apache.org>.
ajantha-bhat commented on code in PR #7105:
URL: https://github.com/apache/iceberg/pull/7105#discussion_r1332566163


##########
format/spec.md:
##########
@@ -702,6 +703,58 @@ Blob metadata is a struct with the following fields:
 | _optional_ | _optional_ | **`properties`** | `map<string, string>` | Additional properties associated with the statistic. Subset of Blob properties in the Puffin file. |
 
 
+#### Partition statistics
+
+Partition statistics files are based on [Partition Statistics file spec](#partition-statistics-file). 
+Partition statistics are not required for reading or planning and readers may ignore them.
+Each table snapshot may be associated with at most one partition statistic file.
+A writer can optionally write the partition statistics file during each write operation, and 
+it must be registered in the table metadata file to be considered as a valid statistics file for the reader.
+
+`partition-statistics` field of table metadata is an optional list of struct with the following fields:
+
+| v1 | v2 | Field name | Type | Description |
+|----|----|------------|------|-------------|
+| _required_ | _required_ | **`snapshot-id`** | `long` | ID of the Iceberg table's snapshot the partition statistics file is associated with. |
+| _required_ | _required_ | **`statistics-file-path`** | `string` | Path of the partition statistics file. See [Partition Statistics file](#partition-statistics-file). |
+| _required_ | _required_ | **`file-size-in-bytes`** | `long` | Size of the partition statistics file. |
+
+#### Partition Statistics file
+
+Statistics information for each unique partition tuple is stored as a row in the default data file format of the table (for example, Parquet or ORC).
+These rows must be sorted (in ascending manner with NULL FIRST) by `partition` field to optimize filtering rows while scanning.
+
+The schema of the partition statistics file is as follows:
+
+| v1 | v2 | Field id, name | Type | Description |
+|----|----|----------------|------|-------------|
+| _required_ | _required_ | **`1 partition`** | `struct<..>` | Partition data tuple, schema based on the unified partition type considering all specs in a table |
+| _required_ | _required_ | **`2 spec_id`** | `int` | Partition spec id |

Review Comment:
   I already have a POC PR that reads and writes these files:
   https://github.com/apache/iceberg/pull/8488/
   
   I think a unified tuple will be good for updates. If we keep a spec-specific tuple, the stats of the same partition (after spec evolution) will be spread across multiple buckets, which is hard for the reader to handle. The existing partitions metadata table also uses the same unified partition tuple.
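
To illustrate the point about buckets, here is a rough Python sketch of coercing spec-specific partition tuples into one unified layout and aggregating on it. Everything in it is an assumption for illustration: the spec definitions, the field ids, the `unified_fields` and `coerce` helpers, and the choice to key the aggregation on the unified tuple alone. It is not the POC's code or an Iceberg API.

```python
from collections import defaultdict

# Hypothetical partition specs: spec id -> ordered (field_id, name) pairs.
# Spec 1 adds `region`; spec 2 drops it again.
SPECS = {
    0: [(1000, "event_day")],
    1: [(1000, "event_day"), (1001, "region")],
    2: [(1000, "event_day")],
}


def unified_fields(specs):
    """Union of the partition fields of all specs, ordered by field id."""
    fields = {}
    for spec_fields in specs.values():
        for field_id, name in spec_fields:
            fields.setdefault(field_id, name)
    return sorted(fields.items())


def coerce(spec_id, values, specs, unified):
    """Place a spec-specific tuple into the unified layout; absent fields become None."""
    by_id = {field_id: value for (field_id, _), value in zip(specs[spec_id], values)}
    return tuple(by_id.get(field_id) for field_id, _ in unified)


UNIFIED = unified_fields(SPECS)

# Hypothetical data files: (spec id, spec-specific partition tuple, record count).
files = [
    (0, ("2023-03-14",), 10),
    (1, ("2023-03-14", "eu"), 5),
    (2, ("2023-03-14",), 7),
]

record_counts = defaultdict(int)
for spec_id, values, count in files:
    record_counts[coerce(spec_id, values, SPECS, UNIFIED)] += count
```

Here the files written under specs 0 and 2 land in the same row, `("2023-03-14", None)`, instead of two spec-specific buckets that a reader would have to merge.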





[GitHub] [iceberg] ajantha-bhat commented on a diff in pull request #7105: Spec: Add partition stats spec

Posted by "ajantha-bhat (via GitHub)" <gi...@apache.org>.
ajantha-bhat commented on code in PR #7105:
URL: https://github.com/apache/iceberg/pull/7105#discussion_r1247798404


##########
format/partition-statistics-spec.md:
##########
@@ -0,0 +1,43 @@
+---
+title: "Partition Statistics Spec"
+url: partition-statistics-spec
+toc: true
+disableSidebar: true
+---
+<!--
+ - Licensed to the Apache Software Foundation (ASF) under one or more
+ - contributor license agreements.  See the NOTICE file distributed with
+ - this work for additional information regarding copyright ownership.
+ - The ASF licenses this file to You under the Apache License, Version 2.0
+ - (the "License"); you may not use this file except in compliance with
+ - the License.  You may obtain a copy of the License at
+ -
+ -   http://www.apache.org/licenses/LICENSE-2.0
+ -
+ - Unless required by applicable law or agreed to in writing, software
+ - distributed under the License is distributed on an "AS IS" BASIS,
+ - WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ - See the License for the specific language governing permissions and
+ - limitations under the License.
+ -->
+
+# Partition Statistics file format
+
+This is a specification for partition statistics files. It is designed to store statistics information 
+for every partition value as a row in the **table default format** sorted based on the `partition`.
+
+The schema of the file is as follows:
+
+| Field Id | Field Name              | Field Type   | isOptional | Doc                                                                                                                         |
+|----------|-------------------------|--------------|------------|-----------------------------------------------------------------------------------------------------------------------------|
+| 1        | partition               | StructType   | false      | See [PartitionData](https://github.com/apache/iceberg/blob/master/core/src/main/java/org/apache/iceberg/PartitionData.java) |                                                                                  
+| 2        | spec_id                 | IntegerType  | false      | Partition spec id                                                                                                           |
+| 3        | data_record_count       | LongType     | false      | Count of records in data files                                                                                              |                                               
+| 4        | data_file_count         | IntegerType  | false      | Count of data files                                                                                                         |
+| 5        | pos_delete_record_count | LongType     | true       | Count of records in position delete files                                                                                   |                                                          
+| 6        | pos_delete_file_count   | IntegerType  | true       | Count of position delete files                                                                                              |                                   
+| 7        | eq_delete_record_count  | LongType     | true       | Count of records in equality delete files                                                                                   |                                  
+| 8        | eq_delete_file_count    | IntegerType  | true       | Count of equality delete files                                                                                              |                                

Review Comment:
   Added `last_updated_ms` and `last_updated_snapshot_id`, similar to the fields in the partitions metadata table.





[GitHub] [iceberg] rdblue commented on pull request #7105: Spec: Add partition stats spec

Posted by "rdblue (via GitHub)" <gi...@apache.org>.
rdblue commented on PR #7105:
URL: https://github.com/apache/iceberg/pull/7105#issuecomment-1557556981

   @ajantha-bhat, I think it's safe to move forward with the next steps and assume we'll go with one file.




[GitHub] [iceberg] ajantha-bhat commented on pull request #7105: Spec: Add partition stats spec

Posted by "ajantha-bhat (via GitHub)" <gi...@apache.org>.
ajantha-bhat commented on PR #7105:
URL: https://github.com/apache/iceberg/pull/7105#issuecomment-1600573739

   Do we need to conclude anything else for this PR, or can the rest be developed incrementally after it?




[GitHub] [iceberg] ajantha-bhat commented on a diff in pull request #7105: Spec: Add partition stats spec

Posted by "ajantha-bhat (via GitHub)" <gi...@apache.org>.
ajantha-bhat commented on code in PR #7105:
URL: https://github.com/apache/iceberg/pull/7105#discussion_r1317130106


##########
format/partition-statistics-spec.md:
##########
@@ -0,0 +1,43 @@
+---
+title: "Partition Statistics Spec"
+url: partition-statistics-spec
+toc: true
+disableSidebar: true
+---
+<!--
+ - Licensed to the Apache Software Foundation (ASF) under one or more
+ - contributor license agreements.  See the NOTICE file distributed with
+ - this work for additional information regarding copyright ownership.
+ - The ASF licenses this file to You under the Apache License, Version 2.0
+ - (the "License"); you may not use this file except in compliance with
+ - the License.  You may obtain a copy of the License at
+ -
+ -   http://www.apache.org/licenses/LICENSE-2.0
+ -
+ - Unless required by applicable law or agreed to in writing, software
+ - distributed under the License is distributed on an "AS IS" BASIS,
+ - WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ - See the License for the specific language governing permissions and
+ - limitations under the License.
+ -->
+
+# Partition Statistics file format
+
+This is a specification for partition statistics files. It is designed to store statistics information 
+for every partition value as a row in the **table default format** sorted based on the `partition`.
+
+The schema of the file is as follows:
+
+| Field Id | Field Name                   | Field Type   | isOptional | Doc                                                                                                                         |
+|----------|------------------------------|--------------|------------|-----------------------------------------------------------------------------------------------------------------------------|
+| 1        | partition                    | StructType   | false      | See [PartitionData](https://github.com/apache/iceberg/blob/master/core/src/main/java/org/apache/iceberg/PartitionData.java) |                                                                                  
+| 2        | spec_id                      | IntegerType  | false      | Partition spec id                                                                                                           |
+| 3        | data_record_count            | LongType     | false      | Count of records in data files                                                                                              |                                               
+| 4        | data_file_count              | IntegerType  | false      | Count of data files                                                                                                         |
+| 5        | position_delete_record_count | LongType     | true       | Count of records in position delete files                                                                                   |                                                          
+| 6        | position_delete_file_count   | IntegerType  | true       | Count of position delete files                                                                                              |                                   
+| 7        | equality_delete_record_count | LongType     | true       | Count of records in equality delete files                                                                                   |                                  
+| 8        | equality_delete_file_count   | IntegerType  | true       | Count of equality delete files                                                                                              |                                

Review Comment:
   Done. Only the accurate counts are kept now; the delete file stats have been removed.





[GitHub] [iceberg] RussellSpitzer commented on a diff in pull request #7105: Spec: Add partition stats spec

Posted by "RussellSpitzer (via GitHub)" <gi...@apache.org>.
RussellSpitzer commented on code in PR #7105:
URL: https://github.com/apache/iceberg/pull/7105#discussion_r1317466386


##########
format/spec.md:
##########
@@ -702,6 +703,41 @@ Blob metadata is a struct with the following fields:
 | _optional_ | _optional_ | **`properties`** | `map<string, string>` | Additional properties associated with the statistic. Subset of Blob properties in the Puffin file. |
 
 
+#### Partition statistics
+
+Partition statistics files are based on [Partition Statistics file spec](#partition-statistics-file). Partition statistics are informational. A reader can choose to
+ignore partition statistics information. Partition statistics support is not required to read the table correctly.
+Each table snapshot may be associated with at most one partition statistic file and the table can contain many partition statistics files associated with different table snapshots.
+A writer can optionally write the partition statistics file during each write operation. If the statistics file is written for the specific snapshot,
+it must be accurate and must be registered in the table metadata file to be considered as a valid statistics file for the reader.
+
+Partition statistics files metadata within `partition-statistics` table metadata field is a struct with the following fields:
+
+| v1 | v2 | Field name | Type | Description |
+|----|----|------------|------|-------------|
+| _required_ | _required_ | **`snapshot-id`** | `long` | ID of the Iceberg table's snapshot the partition statistics file is associated with. |
+| _required_ | _required_ | **`statistics-file-path`** | `string` | Path of the partition statistics file. See [Partition Statistics file](#partition-statistics-file). |
+| _required_ | _required_ | **`max-data-sequence-number`** | `long` | Maximum data sequence number of the Iceberg table's snapshot the partition statistics was computed from. |
+
+#### Partition Statistics file
+
+Statistics information for every partition tuple is stored as a row in the **table default format**.
+These rows are sorted (in ascending manner with NULL FIRST) based on all partition columns from `partition` in the same order
+to optimize filtering rows while scanning.
+Each unique partition tuple must have exactly one corresponding row, ensuring that statistics for all partitions are present.
+
+A partition statistics file stores statistics as a struct with the following fields:

Review Comment:
   The table default format is not defined anywhere else in this doc. It would also be confusing, since we have multiple things called "format" in this document and we are specifically talking about a file format here.





[GitHub] [iceberg] RussellSpitzer commented on a diff in pull request #7105: Spec: Add partition stats spec

Posted by "RussellSpitzer (via GitHub)" <gi...@apache.org>.
RussellSpitzer commented on code in PR #7105:
URL: https://github.com/apache/iceberg/pull/7105#discussion_r1317486727


##########
format/spec.md:
##########
@@ -702,6 +703,41 @@ Blob metadata is a struct with the following fields:
 | _optional_ | _optional_ | **`properties`** | `map<string, string>` | Additional properties associated with the statistic. Subset of Blob properties in the Puffin file. |
 
 
+#### Partition statistics
+
+Partition statistics files are based on [Partition Statistics file spec](#partition-statistics-file). Partition statistics are informational. A reader can choose to
+ignore partition statistics information. Partition statistics support is not required to read the table correctly.
+Each table snapshot may be associated with at most one partition statistic file and the table can contain many partition statistics files associated with different table snapshots.
+A writer can optionally write the partition statistics file during each write operation. If the statistics file is written for the specific snapshot,
+it must be accurate and must be registered in the table metadata file to be considered as a valid statistics file for the reader.
+
+Partition statistics files metadata within `partition-statistics` table metadata field is a struct with the following fields:
+
+| v1 | v2 | Field name | Type | Description |
+|----|----|------------|------|-------------|
+| _required_ | _required_ | **`snapshot-id`** | `long` | ID of the Iceberg table's snapshot the partition statistics file is associated with. |
+| _required_ | _required_ | **`statistics-file-path`** | `string` | Path of the partition statistics file. See [Partition Statistics file](#partition-statistics-file). |
+| _required_ | _required_ | **`max-data-sequence-number`** | `long` | Maximum data sequence number of the Iceberg table's snapshot the partition statistics was computed from. |
+
+#### Partition Statistics file
+
+Statistics information for every partition tuple is stored as a row in the **table default format**.
+These rows are sorted (in ascending manner with NULL FIRST) based on all partition columns from `partition` in the same order

Review Comment:
   This needs to be in the spec then. I think we probably also need to define in the spec how the unified partition expression is constructed. How else will we know the tuple schema?





[GitHub] [iceberg] szehon-ho commented on a diff in pull request #7105: Spec: Add partition stats spec

Posted by "szehon-ho (via GitHub)" <gi...@apache.org>.
szehon-ho commented on code in PR #7105:
URL: https://github.com/apache/iceberg/pull/7105#discussion_r1318005645


##########
format/partition-statistics-spec.md:
##########
@@ -0,0 +1,43 @@
+---
+title: "Partition Statistics Spec"
+url: partition-statistics-spec
+toc: true
+disableSidebar: true
+---
+<!--
+ - Licensed to the Apache Software Foundation (ASF) under one or more
+ - contributor license agreements.  See the NOTICE file distributed with
+ - this work for additional information regarding copyright ownership.
+ - The ASF licenses this file to You under the Apache License, Version 2.0
+ - (the "License"); you may not use this file except in compliance with
+ - the License.  You may obtain a copy of the License at
+ -
+ -   http://www.apache.org/licenses/LICENSE-2.0
+ -
+ - Unless required by applicable law or agreed to in writing, software
+ - distributed under the License is distributed on an "AS IS" BASIS,
+ - WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ - See the License for the specific language governing permissions and
+ - limitations under the License.
+ -->
+
+# Partition Statistics file format
+
+This is a specification for partition statistics files. It is designed to store statistics information 
+for every partition value as a row in the **table default format** sorted based on the `partition`.
+
+The schema of the file is as follows:
+
+| Field Id | Field Name                   | Field Type   | isOptional | Doc                                                                                                                         |
+|----------|------------------------------|--------------|------------|-----------------------------------------------------------------------------------------------------------------------------|
+| 1        | partition                    | StructType   | false      | See [PartitionData](https://github.com/apache/iceberg/blob/master/core/src/main/java/org/apache/iceberg/PartitionData.java) |                                                                                  
+| 2        | spec_id                      | IntegerType  | false      | Partition spec id                                                                                                           |
+| 3        | data_record_count            | LongType     | false      | Count of records in data files                                                                                              |                                               
+| 4        | data_file_count              | IntegerType  | false      | Count of data files                                                                                                         |
+| 5        | position_delete_record_count | LongType     | true       | Count of records in position delete files                                                                                   |                                                          
+| 6        | position_delete_file_count   | IntegerType  | true       | Count of position delete files                                                                                              |                                   
+| 7        | equality_delete_record_count | LongType     | true       | Count of records in equality delete files                                                                                   |                                  
+| 8        | equality_delete_file_count   | IntegerType  | true       | Count of equality delete files                                                                                              |                                

Review Comment:
   > Partitions metadata table is also an async call w.r.t write. But we still don't apply the actual delete files.
   @szehon-ho: Do you have any suggestions?
   
   I am still not very convinced that the Partitions metadata table should have that column, as it would really reduce the performance of that table. It's a metadata table, so it's sync in the sense that it needs to be calculated when the user queries it.
   
   I slightly prefer having more stats rather than fewer, i.e. total_record_count, data_file_record_count, eq_delete_record_count, pos_delete_record_count. Internally we can definitely make use of the delete record count in planning, for example to decide whether to cache deletes. Users can also use them, as @flyrain mentioned. It is true that we can already access all file record counts via manifest entries, so maybe an argument can be made against having them here. But as they are cheap to get, I don't see a strong reason to hide them from the user.
   
   The total_record_count seems the most valuable but expensive to compute. I also agree with @flyrain that the total count may not be available in the writer. For example, with row-level operation writers, I am not sure how the writer would know how many total data records are in the partition after the write.
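
To make the "cheap to get" counters concrete, here is a minimal sketch that derives the per-partition data and delete counters from manifest-entry-like records. The tuple layout, the `ENTRIES` sample, and the `aggregate` helper are hypothetical simplifications, not Iceberg APIs. The output field names follow the schema quoted in the diff above.

```python
from collections import defaultdict

# Hypothetical, simplified stand-ins for manifest entries:
# (partition tuple, content type, record count).
ENTRIES = [
    (("2023-03-14",), "data", 100),
    (("2023-03-14",), "data", 50),
    (("2023-03-14",), "position_deletes", 7),
    (("2023-03-15",), "data", 20),
    (("2023-03-15",), "equality_deletes", 3),
]

# Map each content type to the prefix of the counter fields in the quoted schema.
PREFIX = {
    "data": "data",
    "position_deletes": "position_delete",
    "equality_deletes": "equality_delete",
}


def aggregate(entries):
    # One pass over the entries; every file adds to a record counter and a file counter.
    stats = defaultdict(lambda: defaultdict(int))
    for partition, content, record_count in entries:
        prefix = PREFIX[content]
        stats[partition][f"{prefix}_record_count"] += record_count
        stats[partition][f"{prefix}_file_count"] += 1
    return {partition: dict(counters) for partition, counters in stats.items()}


per_partition = aggregate(ENTRIES)
print(per_partition)
```

A net `total_record_count` is deliberately absent from this sketch. An equality delete record can match zero or many data rows, so the net count cannot be obtained by subtracting the delete counters, which is consistent with the comment's point that it is the expensive one to compute.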
   
   





Re: [PR] Spec: Add partition stats spec [iceberg]

Posted by "aokolnychyi (via GitHub)" <gi...@apache.org>.
aokolnychyi commented on PR #7105:
URL: https://github.com/apache/iceberg/pull/7105#issuecomment-1758132685

   Sorry, we didn't get to discussing this during the sync. Shall we do a separate sync to talk about this?




[GitHub] [iceberg] jackye1995 commented on a diff in pull request #7105: Spec: Add partition stats spec

Posted by "jackye1995 (via GitHub)" <gi...@apache.org>.
jackye1995 commented on code in PR #7105:
URL: https://github.com/apache/iceberg/pull/7105#discussion_r1136024456


##########
format/partition-statistics-spec.md:
##########
@@ -0,0 +1,43 @@
+---
+title: "Partition Statistics Spec"
+url: partition-statistics-spec
+toc: true
+disableSidebar: true
+---
+<!--
+ - Licensed to the Apache Software Foundation (ASF) under one or more
+ - contributor license agreements.  See the NOTICE file distributed with
+ - this work for additional information regarding copyright ownership.
+ - The ASF licenses this file to You under the Apache License, Version 2.0
+ - (the "License"); you may not use this file except in compliance with
+ - the License.  You may obtain a copy of the License at
+ -
+ -   http://www.apache.org/licenses/LICENSE-2.0
+ -
+ - Unless required by applicable law or agreed to in writing, software
+ - distributed under the License is distributed on an "AS IS" BASIS,
+ - WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ - See the License for the specific language governing permissions and
+ - limitations under the License.
+ -->
+
+# Partition Statistics file format
+
+This is a specification for partition statistics files. It is designed to store statistics information 
+for every partition value as a row in the **table default format** sorted based on the `partition`.
+
+The schema of the file is as follows:
+
+| Field Id | Field Name              | Field Type   | isOptional | Doc                                                                                                                         |
+|----------|-------------------------|--------------|------------|-----------------------------------------------------------------------------------------------------------------------------|
+| 1        | partition               | StructType   | false      | See [PartitionData](https://github.com/apache/iceberg/blob/master/core/src/main/java/org/apache/iceberg/PartitionData.java) |                                                                                  
+| 2        | spec_id                 | IntegerType  | false      | partition spec id                                                                                                           |
+| 3        | data_record_count       | LongType     | false      | Count of records in data files                                                                                              |                                               
+| 4        | data_file_count         | IntegerType  | false      | Count of data files                                                                                                         |
+| 5        | pos_delete_record_count | LongType     | false      | Count of records in position delete files                                                                                   |                                                          
+| 6        | pos_delete_file_count   | IntegerType  | false      | Count of position delete files                                                                                              |                                   
+| 7        | eq_delete_record_count  | LongType     | false      | Count of records in equality delete files                                                                                   |                                  
+| 8        | eq_delete_file_count    | IntegerType  | false      | Count of equality delete files                                                                                              |                                

Review Comment:
   How do we draw the line between what is optional and what is not?
   
   I see you have the following phases in the design:
   
   > Phase2 implementation can be about adding more fields to the partition stats like latest sequence number for the partition, file_size_in_bytes and other metrics from the data_file spec (https://iceberg.apache.org/spec/#manifests) to support special use cases for the users.
   > 
   > Phase3 implementation can be for writing partition level puffin files and adding the puffin file location to the schema of partition stats. 
   
   Do we still plan to introduce Puffin files to this? Based on the discussion in the doc, it seems to have scalability concerns, which I agree with. So if we want to add at least NDV and bloom filters, it feels like those need to be optional stats directly in the partition stats file.





[GitHub] [iceberg] ajantha-bhat commented on a diff in pull request #7105: Spec: Add partition stats spec

Posted by "ajantha-bhat (via GitHub)" <gi...@apache.org>.
ajantha-bhat commented on code in PR #7105:
URL: https://github.com/apache/iceberg/pull/7105#discussion_r1136560951


##########
format/spec.md:
##########
@@ -702,6 +703,21 @@ Blob metadata is a struct with the following fields:
 | _optional_ | _optional_ | **`properties`** | `map<string, string>` | Additional properties associated with the statistic. Subset of Blob properties in the Puffin file. |
 
 
+#### Partition statistics
+
+Partition statistics files are the valid files based on [Partition statistics spec](../partition-statistics-spec). Partition statistics are informational. A reader can choose to
+ignore partition statistics information. Partition statistics support is not required to read the table correctly. A table can contain
+many partition statistics files associated with different table snapshots.
+
+Partition statistics files metadata within `partition-statistics` table metadata field is a struct with the following fields:
+
+| v1 | v2 | Field name                 | Type     | Description                                                                                           |
+|----|----|----------------------------|----------|-------------------------------------------------------------------------------------------------------|
+| _required_ | _required_ | **`snapshot-id`**          | `long`   | ID of the Iceberg table's snapshot the partition statistics file is associated with.                  |
+| _required_ | _required_ | **`statistics-file-path`** | `string` | Path of the partition statistics file. See [Partition statistics spec](../partition-statistics-spec). |
+| _required_ | _required_ | **`sequence-number`**      | `long`   | Sequence number of the Iceberg table's snapshot the partition statistics was computed from.           |

Review Comment:
   I might still have some gaps in understanding sequence numbers fully.
   
   But I see that Puffin also saves it, so I followed the same approach.
   https://github.com/apache/iceberg/blob/master/format/spec.md#table-statistics 
   
   Do you have any suggestions for this? 
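
For readers following the field discussion, here is a minimal sketch of what an entry under the `partition-statistics` table metadata field might look like and how a reader could look it up by snapshot. The snapshot IDs, the file paths, the assumption that the field holds a list of structs, and the `partition_stats_for` helper are all hypothetical. The field names are taken from the diff hunk quoted above (other revisions quoted in this thread name the last field `max-data-sequence-number`).

```python
import json

# Hypothetical table metadata fragment; IDs and paths are made up for illustration.
TABLE_METADATA = json.loads("""
{
  "partition-statistics": [
    {
      "snapshot-id": 1001,
      "statistics-file-path": "s3://bucket/tbl/metadata/partition-stats-1001.parquet",
      "sequence-number": 4
    },
    {
      "snapshot-id": 1002,
      "statistics-file-path": "s3://bucket/tbl/metadata/partition-stats-1002.parquet",
      "sequence-number": 5
    }
  ]
}
""")


def partition_stats_for(metadata, snapshot_id):
    # Return the registered entry for the snapshot, or None if no file is registered.
    entries = metadata.get("partition-statistics", [])
    matches = [entry for entry in entries if entry["snapshot-id"] == snapshot_id]
    if len(matches) > 1:
        # Other revisions of the diff allow at most one file per snapshot.
        raise ValueError(f"multiple partition statistics files for snapshot {snapshot_id}")
    return matches[0] if matches else None


entry = partition_stats_for(TABLE_METADATA, 1002)
print(entry["statistics-file-path"] if entry else "no partition statistics registered")
```

Returning `None` for an unregistered snapshot mirrors the statement in the diff that partition statistics are informational and a reader may ignore them.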





[GitHub] [iceberg] ajantha-bhat commented on a diff in pull request #7105: Spec: Add partition stats spec

Posted by "ajantha-bhat (via GitHub)" <gi...@apache.org>.
ajantha-bhat commented on code in PR #7105:
URL: https://github.com/apache/iceberg/pull/7105#discussion_r1223178592


##########
format/partition-statistics-spec.md:
##########
@@ -0,0 +1,43 @@
+---
+title: "Partition Statistics Spec"
+url: partition-statistics-spec
+toc: true
+disableSidebar: true
+---
+<!--
+ - Licensed to the Apache Software Foundation (ASF) under one or more
+ - contributor license agreements.  See the NOTICE file distributed with
+ - this work for additional information regarding copyright ownership.
+ - The ASF licenses this file to You under the Apache License, Version 2.0
+ - (the "License"); you may not use this file except in compliance with
+ - the License.  You may obtain a copy of the License at
+ -
+ -   http://www.apache.org/licenses/LICENSE-2.0
+ -
+ - Unless required by applicable law or agreed to in writing, software
+ - distributed under the License is distributed on an "AS IS" BASIS,
+ - WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ - See the License for the specific language governing permissions and
+ - limitations under the License.
+ -->
+
+# Partition Statistics file format
+
+This is a specification for partition statistics files. It is designed to store statistics information 
+for every partition value as a row in the **table default format** sorted based on the `partition`.
+
+The schema of the file is as follows:
+
+| Field Id | Field Name                   | Field Type   | isOptional | Doc                                                                                                                         |
+|----------|------------------------------|--------------|------------|-----------------------------------------------------------------------------------------------------------------------------|
+| 1        | partition                    | StructType   | false      | See [PartitionData](https://github.com/apache/iceberg/blob/master/core/src/main/java/org/apache/iceberg/PartitionData.java) |                                                                                  
+| 2        | spec_id                      | IntegerType  | false      | Partition spec id                                                                                                           |
+| 3        | data_record_count            | LongType     | false      | Count of records in data files                                                                                              |                                               
+| 4        | data_file_count              | IntegerType  | false      | Count of data files                                                                                                         |
+| 5        | position_delete_record_count | LongType     | true       | Count of records in position delete files                                                                                   |                                                          
+| 6        | position_delete_file_count   | IntegerType  | true       | Count of position delete files                                                                                              |                                   
+| 7        | equality_delete_record_count | LongType     | true       | Count of records in equality delete files                                                                                   |                                  
+| 8        | equality_delete_file_count   | IntegerType  | true       | Count of equality delete files                                                                                              |                                

Review Comment:
   If we apply the deletes, maybe the equality_delete and position_delete fields can be empty? If we don't apply the deletes (for example, if in the future we compute stats synchronously during the write), they can hold the actual delete stats. So the current schema can still hold good.
   
   Partitions metadata table is also an async call w.r.t write. But we still don't apply the actual delete files. 
   @szehon-ho: Do you have any suggestions?  
   





[GitHub] [iceberg] rdblue commented on a diff in pull request #7105: Spec: Add partition stats spec

Posted by "rdblue (via GitHub)" <gi...@apache.org>.
rdblue commented on code in PR #7105:
URL: https://github.com/apache/iceberg/pull/7105#discussion_r1224846377


##########
format/spec.md:
##########
@@ -702,6 +703,44 @@ Blob metadata is a struct with the following fields:
 | _optional_ | _optional_ | **`properties`** | `map<string, string>` | Additional properties associated with the statistic. Subset of Blob properties in the Puffin file. |
 
 
+#### Partition statistics
+
+Partition statistics files are based on [Partition Statistics file spec](#partition-statistics-file). Partition statistics are informational. A reader can choose to
+ignore partition statistics information. Partition statistics support is not required to read the table correctly. A table can contain
+many partition statistics files associated with different table snapshots.
+A writer can optionally write the partition statistics file during each write operation. If the statistics file is written for the specific snapshot, 
+It must be registered in the table metadata file to be considered as a valid statistics file for the reader.
+
+Partition statistics files metadata within `partition-statistics` table metadata field is a struct with the following fields:
+
+| v1 | v2 | Field name | Type | Description |
+|----|----|------------|------|-------------|
+| _required_ | _required_ | **`snapshot-id`** | `long` | ID of the Iceberg table's snapshot the partition statistics file is associated with. |
+| _required_ | _required_ | **`statistics-file-path`** | `string` | Path of the partition statistics file. See [Partition Statistics file](#partition-statistics-file). |
+| _required_ | _required_ | **`max-data-sequence-number`** | `long` | Maximum data sequence number of the Iceberg table's snapshot the partition statistics was computed from. |
+
+#### Partition Statistics file
+
+Statistics information for every partition value is stored as a row in the **table default format** 
+sorted based on the first partition column from `partition`.

Review Comment:
   I think this should be more clear. There should be one and only one row per unique partition tuple. All partition tuples must be present, etc.
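
The requirement stated here (exactly one row per unique partition tuple, with every partition tuple present) can be sketched as a simple check. The row layout and the `check_partition_rows` helper are hypothetical; how the expected set of partition tuples is obtained (for example, from the snapshot's manifests) is outside this sketch.

```python
from collections import Counter


def check_partition_rows(stats_rows, expected_partitions):
    # Count how many rows the statistics file holds for each partition tuple.
    counts = Counter(row["partition"] for row in stats_rows)
    # A tuple with more than one row violates "one and only one row".
    duplicates = {partition for partition, n in counts.items() if n > 1}
    # A tuple with no row violates "all partition tuples must be present".
    missing = set(expected_partitions) - set(counts)
    return duplicates, missing


rows = [
    {"partition": ("2023-03-14",), "data_record_count": 150},
    {"partition": ("2023-03-15",), "data_record_count": 20},
    {"partition": ("2023-03-15",), "data_record_count": 5},
]
expected = {("2023-03-14",), ("2023-03-15",), ("2023-03-16",)}

duplicates, missing = check_partition_rows(rows, expected)
print("duplicates:", duplicates)
print("missing:", missing)
```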





[GitHub] [iceberg] ajantha-bhat commented on a diff in pull request #7105: Spec: Add partition stats spec

Posted by "ajantha-bhat (via GitHub)" <gi...@apache.org>.
ajantha-bhat commented on code in PR #7105:
URL: https://github.com/apache/iceberg/pull/7105#discussion_r1136556742


##########
format/partition-statistics-spec.md:
##########
@@ -0,0 +1,43 @@
+---
+title: "Partition Statistics Spec"
+url: partition-statistics-spec
+toc: true
+disableSidebar: true
+---
+<!--
+ - Licensed to the Apache Software Foundation (ASF) under one or more
+ - contributor license agreements.  See the NOTICE file distributed with
+ - this work for additional information regarding copyright ownership.
+ - The ASF licenses this file to You under the Apache License, Version 2.0
+ - (the "License"); you may not use this file except in compliance with
+ - the License.  You may obtain a copy of the License at
+ -
+ -   http://www.apache.org/licenses/LICENSE-2.0
+ -
+ - Unless required by applicable law or agreed to in writing, software
+ - distributed under the License is distributed on an "AS IS" BASIS,
+ - WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ - See the License for the specific language governing permissions and
+ - limitations under the License.
+ -->
+
+# Partition Statistics file format
+
+This is a specification for partition statistics files. It is designed to store statistics information 
+for every partition value as a row in the **table default format** sorted based on the `partition`.
+
+The schema of the file is as follows:
+
+| Field Id | Field Name              | Field Type   | isOptional | Doc                                                                                                                         |
+|----------|-------------------------|--------------|------------|-----------------------------------------------------------------------------------------------------------------------------|
+| 1        | partition               | StructType   | false      | See [PartitionData](https://github.com/apache/iceberg/blob/master/core/src/main/java/org/apache/iceberg/PartitionData.java) |                                                                                  
+| 2        | spec_id                 | IntegerType  | false      | Partition spec id                                                                                                           |
+| 3        | data_record_count       | LongType     | false      | Count of records in data files                                                                                              |                                               
+| 4        | data_file_count         | IntegerType  | false      | Count of data files                                                                                                         |

Review Comment:
   There was also an ask for the `latest sequence number` per partition, the `last modified time` per partition along with `file size`. 
   Also, I remember some ask for lower bounds and upper bounds for other columns per partition. 
   
   These are not must-haves (but good to have). So, the plan is to brainstorm on these in phase 2 implementation as mentioned in the design doc/iceberg sync. 
   
   >  SELECT sum(file_size_in_bytes) FROM db.table.files GROUP By partition
   
   The plan is to compute these incrementally per snapshot. But yes, the manifest's data file entries already have this, and we just need to sum it. Let us discuss this further in phase 2.
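
As a rough illustration of computing a per-partition total incrementally rather than re-running the quoted `GROUP BY` query over all files, the sketch below folds one snapshot's added and removed files into the previous totals. The `apply_snapshot_delta` helper and the `(partition, file_size_in_bytes)` pairs are hypothetical; this is not the design that was agreed for phase 2.

```python
def apply_snapshot_delta(previous, added_files, removed_files=()):
    # previous maps partition tuple -> total file size in bytes as of the parent snapshot.
    updated = dict(previous)
    for partition, size in added_files:
        updated[partition] = updated.get(partition, 0) + size
    for partition, size in removed_files:
        updated[partition] = updated.get(partition, 0) - size
    return updated


previous_totals = {("2023-03-14",): 100}
added = [(("2023-03-14",), 50), (("2023-03-15",), 70)]
removed = [(("2023-03-14",), 30)]

new_totals = apply_snapshot_delta(previous_totals, added, removed)
print(new_totals)
```

The function copies the previous totals instead of mutating them, so the statistics for the parent snapshot stay available alongside the new ones.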





[GitHub] [iceberg] jackye1995 commented on pull request #7105: Spec: Add partition stats spec

Posted by "jackye1995 (via GitHub)" <gi...@apache.org>.
jackye1995 commented on PR #7105:
URL: https://github.com/apache/iceberg/pull/7105#issuecomment-1483148663

   @rdblue @danielcweeks any thoughts about the discussion above?




[GitHub] [iceberg] ajantha-bhat commented on pull request #7105: Spec: Add partition stats spec

Posted by "ajantha-bhat (via GitHub)" <gi...@apache.org>.
ajantha-bhat commented on PR #7105:
URL: https://github.com/apache/iceberg/pull/7105#issuecomment-1557569210

   > @ajantha-bhat, I think it's safe to move forward with the next steps and assume we'll go with one file.
   
   Thanks for the reply. I think the PR is ready then. 
   I will do the other PRs as per the above plan. 




[GitHub] [iceberg] rdblue commented on a diff in pull request #7105: Spec: Add partition stats spec

Posted by "rdblue (via GitHub)" <gi...@apache.org>.
rdblue commented on code in PR #7105:
URL: https://github.com/apache/iceberg/pull/7105#discussion_r1209509285


##########
format/spec.md:
##########
@@ -702,6 +703,21 @@ Blob metadata is a struct with the following fields:
 | _optional_ | _optional_ | **`properties`** | `map<string, string>` | Additional properties associated with the statistic. Subset of Blob properties in the Puffin file. |
 
 
+#### Partition statistics
+
+Partition statistics files are the valid files based on [Partition statistics spec](../partition-statistics-spec). Partition statistics are informational. A reader can choose to
+ignore partition statistics information. Partition statistics support is not required to read the table correctly. A table can contain
+many partition statistics files associated with different table snapshots.

Review Comment:
   It's fine to say that partition stats files are informational and it's okay not to read them. But what we really need is a statement of the specific _requirements_ for partition stats files. When producing them, what must be true for them to be reliable information for engines?





[GitHub] [iceberg] ajantha-bhat commented on a diff in pull request #7105: Spec: Add partition stats spec

Posted by "ajantha-bhat (via GitHub)" <gi...@apache.org>.
ajantha-bhat commented on code in PR #7105:
URL: https://github.com/apache/iceberg/pull/7105#discussion_r1275248491


##########
format/spec.md:
##########
@@ -702,6 +703,51 @@ Blob metadata is a struct with the following fields:
 | _optional_ | _optional_ | **`properties`** | `map<string, string>` | Additional properties associated with the statistic. Subset of Blob properties in the Puffin file. |
 
 
+#### Partition statistics
+
+Partition statistics files are based on [Partition Statistics file spec](#partition-statistics-file). Partition statistics are informational. A reader can choose to
+ignore partition statistics information. Partition statistics support is not required to read the table correctly. A table can contain
+many partition statistics files associated with different table snapshots.
+A writer can optionally write the partition statistics file during each write operation. If the statistics file is written for the specific snapshot, 
+it must be accurate and must be registered in the table metadata file to be considered as a valid statistics file for the reader.
+
+Partition statistics files metadata within `partition-statistics` table metadata field is a struct with the following fields:
+
+| v1 | v2 | Field name | Type | Description |
+|----|----|------------|------|-------------|
+| _required_ | _required_ | **`snapshot-id`** | `long` | ID of the Iceberg table's snapshot the partition statistics file is associated with. |
+| _required_ | _required_ | **`statistics-file-path`** | `string` | Path of the partition statistics file. See [Partition Statistics file](#partition-statistics-file). |
+| _required_ | _required_ | **`max-data-sequence-number`** | `long` | Maximum data sequence number of the Iceberg table's snapshot the partition statistics was computed from. |
+
+#### Partition Statistics file
+
+Statistics information for every partition tuple is stored as a row in the **table default format**. 
+These rows are sorted based on the first partition column from `partition`. 
+Each unique partition tuple must have exactly one corresponding row, ensuring all partition tuples are present.
+
+Partition statistics file store the statistics as a struct with the following fields:
+
+| v1 | v2 | Field id, name | Type | Description |
+|----|----|----------------|------|-------------|
+| _required_ | _required_ | **`1 partition`** | `struct<..>` | Partition data tuple, schema based on the partition spec output using partition field ids for the struct field ids |

Review Comment:
   Yes. The description already covers it: `using partition field ids for the struct field ids`





[GitHub] [iceberg] rdblue commented on a diff in pull request #7105: Spec: Add partition stats spec

Posted by "rdblue (via GitHub)" <gi...@apache.org>.
rdblue commented on code in PR #7105:
URL: https://github.com/apache/iceberg/pull/7105#discussion_r1275217006


##########
format/spec.md:
##########
@@ -702,6 +703,51 @@ Blob metadata is a struct with the following fields:
 | _optional_ | _optional_ | **`properties`** | `map<string, string>` | Additional properties associated with the statistic. Subset of Blob properties in the Puffin file. |
 
 
+#### Partition statistics
+
+Partition statistics files are based on [Partition Statistics file spec](#partition-statistics-file). Partition statistics are informational. A reader can choose to
+ignore partition statistics information. Partition statistics support is not required to read the table correctly. A table can contain
+many partition statistics files associated with different table snapshots.
+A writer can optionally write the partition statistics file during each write operation. If the statistics file is written for the specific snapshot, 
+it must be accurate and must be registered in the table metadata file to be considered as a valid statistics file for the reader.
+
+Partition statistics files metadata within `partition-statistics` table metadata field is a struct with the following fields:
+
+| v1 | v2 | Field name | Type | Description |
+|----|----|------------|------|-------------|
+| _required_ | _required_ | **`snapshot-id`** | `long` | ID of the Iceberg table's snapshot the partition statistics file is associated with. |
+| _required_ | _required_ | **`statistics-file-path`** | `string` | Path of the partition statistics file. See [Partition Statistics file](#partition-statistics-file). |
+| _required_ | _required_ | **`max-data-sequence-number`** | `long` | Maximum data sequence number of the Iceberg table's snapshot the partition statistics was computed from. |
+
+#### Partition Statistics file
+
+Statistics information for every partition tuple is stored as a row in the **table default format**. 
+These rows are sorted based on the first partition column from `partition`. 
+Each unique partition tuple must have exactly one corresponding row, ensuring all partition tuples are present.
+
+Partition statistics file store the statistics as a struct with the following fields:
+
+| v1 | v2 | Field id, name | Type | Description |
+|----|----|----------------|------|-------------|
+| _required_ | _required_ | **`1 partition`** | `struct<..>` | Partition data tuple, schema based on the partition spec output using partition field ids for the struct field ids |
+| _required_ | _required_ | **`2 spec_id`** | `int` | Partition spec id |
+| _required_ | _required_ | **`3 data_record_count`** | `long` | Count of records in data files |
+| _required_ | _required_ | **`4 data_file_count`** | `int` | Count of data files |
+| _required_ | _required_ | **`5 total_data_file_size_in_bytes`** | `long` | Total size of data files in bytes |
+| _optional_ | _optional_ | **`6 position_delete_record_count`** | `long` | Count of records in position delete files |

Review Comment:
   I think I commented elsewhere, but we need to decide whether the record count must be accurate or whether it should reflect only the data files. I'm inclined to make it accurate.





[GitHub] [iceberg] rdblue commented on a diff in pull request #7105: Spec: Add partition stats spec

Posted by "rdblue (via GitHub)" <gi...@apache.org>.
rdblue commented on code in PR #7105:
URL: https://github.com/apache/iceberg/pull/7105#discussion_r1275212755


##########
format/spec.md:
##########
@@ -702,6 +703,51 @@ Blob metadata is a struct with the following fields:
 | _optional_ | _optional_ | **`properties`** | `map<string, string>` | Additional properties associated with the statistic. Subset of Blob properties in the Puffin file. |
 
 
+#### Partition statistics
+
+Partition statistics files are based on [Partition Statistics file spec](#partition-statistics-file). Partition statistics are informational. A reader can choose to
+ignore partition statistics information. Partition statistics support is not required to read the table correctly. A table can contain
+many partition statistics files associated with different table snapshots.
+A writer can optionally write the partition statistics file during each write operation. If the statistics file is written for the specific snapshot, 
+it must be accurate and must be registered in the table metadata file to be considered as a valid statistics file for the reader.
+
+Partition statistics files metadata within `partition-statistics` table metadata field is a struct with the following fields:
+
+| v1 | v2 | Field name | Type | Description |
+|----|----|------------|------|-------------|
+| _required_ | _required_ | **`snapshot-id`** | `long` | ID of the Iceberg table's snapshot the partition statistics file is associated with. |
+| _required_ | _required_ | **`statistics-file-path`** | `string` | Path of the partition statistics file. See [Partition Statistics file](#partition-statistics-file). |
+| _required_ | _required_ | **`max-data-sequence-number`** | `long` | Maximum data sequence number of the Iceberg table's snapshot the partition statistics was computed from. |
+
+#### Partition Statistics file
+
+Statistics information for every partition tuple is stored as a row in the **table default format**. 
+These rows are sorted based on the first partition column from `partition`. 

Review Comment:
   How does sort work, specifically? Have we defined what it means to sort values or to sort a partition tuple in other places?
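For illustration only, here is one possible reading of "ascending with NULL FIRST" for a partition tuple: compare field by field, left to right, with a null sorting before any non-null value. This is a sketch, not the spec's definition. The names `nulls_first_key` and `sort_partitions` are hypothetical, and partition tuples are modeled as plain Python tuples.

```python
from typing import Optional, Tuple

# Assumption: a partition tuple is a plain tuple of comparable values or None.
PartitionTuple = Tuple[Optional[object], ...]


def nulls_first_key(partition: PartitionTuple):
    """Build a sort key in which None sorts before every non-null value."""
    # (0, 0) compares lower than (1, value), so nulls come first;
    # non-null values in the same field compare by their natural order.
    return tuple((0, 0) if v is None else (1, v) for v in partition)


def sort_partitions(rows):
    """Sort partition tuples ascending, field by field, nulls first."""
    return sorted(rows, key=nulls_first_key)


if __name__ == "__main__":
    rows = [(2, "b"), (None, "a"), (1, None), (1, "a")]
    for row in sort_partitions(rows):
        print(row)
```

Whether the comparison should be lexicographic over all partition fields, and how each type compares, is exactly what the question above asks the spec to pin down.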





[GitHub] [iceberg] aokolnychyi commented on a diff in pull request #7105: Spec: Add partition stats spec

Posted by "aokolnychyi (via GitHub)" <gi...@apache.org>.
aokolnychyi commented on code in PR #7105:
URL: https://github.com/apache/iceberg/pull/7105#discussion_r1331866313


##########
format/spec.md:
##########
@@ -702,6 +703,58 @@ Blob metadata is a struct with the following fields:
 | _optional_ | _optional_ | **`properties`** | `map<string, string>` | Additional properties associated with the statistic. Subset of Blob properties in the Puffin file. |
 
 
+#### Partition statistics
+
+Partition statistics files are based on [Partition Statistics file spec](#partition-statistics-file). 
+Partition statistics are not required for reading or planning and readers may ignore them.
+Each table snapshot may be associated with at most one partition statistic file.
+A writer can optionally write the partition statistics file during each write operation, and 
+it must be registered in the table metadata file to be considered as a valid statistics file for the reader.
+
+`partition-statistics` field of table metadata is an optional list of struct with the following fields:
+
+| v1 | v2 | Field name | Type | Description |
+|----|----|------------|------|-------------|
+| _required_ | _required_ | **`snapshot-id`** | `long` | ID of the Iceberg table's snapshot the partition statistics file is associated with. |
+| _required_ | _required_ | **`statistics-file-path`** | `string` | Path of the partition statistics file. See [Partition Statistics file](#partition-statistics-file). |
+| _required_ | _required_ | **`file-size-in-bytes`** | `long` | Size of the partition statistics file. |
+
+#### Partition Statistics file
+
+Statistics information for each unique partition tuple is stored as a row in the default data file format of the table (for example, Parquet or ORC).
+These rows must be sorted (in ascending manner with NULL FIRST) by `partition` field to optimize filtering rows while scanning.
+
+The schema of the partition statistics file is as follows:
+
+| v1 | v2 | Field id, name | Type | Description |
+|----|----|----------------|------|-------------|
+| _required_ | _required_ | **`1 partition`** | `struct<..>` | Partition data tuple, schema based on the unified partition type considering all specs in a table |
+| _required_ | _required_ | **`2 spec_id`** | `int` | Partition spec id |
+| _required_ | _required_ | **`3 data_record_count`** | `long` | Count of records in data files |
+| _required_ | _required_ | **`4 data_file_count`** | `int` | Count of data files |

Review Comment:
   We are a bit inconsistent throughout the code with naming: `data_files_count` vs `data_file_count`. In quite a few cases, we use plural forms (as in the Action API and the spec for manifest lists).
   
   Have we discussed the preferred naming to follow in the future?
   
   <img width="826" alt="image" src="https://github.com/apache/iceberg/assets/6235869/e07c7fdf-12e9-4e7e-9c70-f84f0b3b8d97">
   





[GitHub] [iceberg] aokolnychyi commented on a diff in pull request #7105: Spec: Add partition stats spec

Posted by "aokolnychyi (via GitHub)" <gi...@apache.org>.
aokolnychyi commented on code in PR #7105:
URL: https://github.com/apache/iceberg/pull/7105#discussion_r1331941327


##########
format/spec.md:
##########
@@ -702,6 +703,58 @@ Blob metadata is a struct with the following fields:
 | _optional_ | _optional_ | **`properties`** | `map<string, string>` | Additional properties associated with the statistic. Subset of Blob properties in the Puffin file. |
 
 
+#### Partition statistics
+
+Partition statistics files are based on [Partition Statistics file spec](#partition-statistics-file). 
+Partition statistics are not required for reading or planning and readers may ignore them.
+Each table snapshot may be associated with at most one partition statistic file.
+A writer can optionally write the partition statistics file during each write operation, and 
+it must be registered in the table metadata file to be considered as a valid statistics file for the reader.
+
+`partition-statistics` field of table metadata is an optional list of struct with the following fields:
+
+| v1 | v2 | Field name | Type | Description |
+|----|----|------------|------|-------------|
+| _required_ | _required_ | **`snapshot-id`** | `long` | ID of the Iceberg table's snapshot the partition statistics file is associated with. |
+| _required_ | _required_ | **`statistics-file-path`** | `string` | Path of the partition statistics file. See [Partition Statistics file](#partition-statistics-file). |
+| _required_ | _required_ | **`file-size-in-bytes`** | `long` | Size of the partition statistics file. |
+
+#### Partition Statistics file
+
+Statistics information for each unique partition tuple is stored as a row in the default data file format of the table (for example, Parquet or ORC).
+These rows must be sorted (in ascending manner with NULL FIRST) by `partition` field to optimize filtering rows while scanning.
+
+The schema of the partition statistics file is as follows:
+
+| v1 | v2 | Field id, name | Type | Description |
+|----|----|----------------|------|-------------|
+| _required_ | _required_ | **`1 partition`** | `struct<..>` | Partition data tuple, schema based on the unified partition type considering all specs in a table |
+| _required_ | _required_ | **`2 spec_id`** | `int` | Partition spec id |

Review Comment:
   I assume a single file would cover all specs?





[GitHub] [iceberg] szehon-ho commented on a diff in pull request #7105: Spec: Add partition stats spec

Posted by "szehon-ho (via GitHub)" <gi...@apache.org>.
szehon-ho commented on code in PR #7105:
URL: https://github.com/apache/iceberg/pull/7105#discussion_r1316588192


##########
format/spec.md:
##########
@@ -702,6 +703,45 @@ Blob metadata is a struct with the following fields:
 | _optional_ | _optional_ | **`properties`** | `map<string, string>` | Additional properties associated with the statistic. Subset of Blob properties in the Puffin file. |
 
 
+#### Partition statistics
+
+Partition statistics files are based on [Partition Statistics file spec](#partition-statistics-file). Partition statistics are informational. A reader can choose to
+ignore partition statistics information. Partition statistics support is not required to read the table correctly. A table can contain

Review Comment:
   If we add it, maybe  `Each table snapshot may be associated with at most one partition statistic file`?
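A minimal sketch of how the "at most one partition statistics file per snapshot" rule could be checked, assuming the `partition-statistics` table metadata field is modeled as a list of dicts keyed by `snapshot-id`. The function name `duplicate_snapshot_ids` and the sample paths are hypothetical.

```python
from collections import Counter


def duplicate_snapshot_ids(partition_statistics):
    """Return snapshot ids that appear in more than one entry (should be empty)."""
    counts = Counter(entry["snapshot-id"] for entry in partition_statistics)
    return sorted(sid for sid, n in counts.items() if n > 1)


if __name__ == "__main__":
    # Hypothetical metadata entries; the paths are made up for the example.
    entries = [
        {"snapshot-id": 1, "statistics-file-path": "s3://bucket/stats-1.parquet"},
        {"snapshot-id": 2, "statistics-file-path": "s3://bucket/stats-2.parquet"},
        {"snapshot-id": 2, "statistics-file-path": "s3://bucket/stats-2b.parquet"},
    ]
    print(duplicate_snapshot_ids(entries))
```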





[GitHub] [iceberg] ajantha-bhat commented on pull request #7105: Spec: Add partition stats spec

Posted by "ajantha-bhat (via GitHub)" <gi...@apache.org>.
ajantha-bhat commented on PR #7105:
URL: https://github.com/apache/iceberg/pull/7105#issuecomment-1709488967

   I like the strong case by @flyrain (https://github.com/apache/iceberg/pull/7105#discussion_r1317716670) and @szehon-ho (https://github.com/apache/iceberg/pull/7105#discussion_r1318005645) to retain delete stats. Very good use cases. So, I am adding back the delete stats. (Yeah it is too much back and forth on this 🤦 ). 
   
   I hope @rdblue also agrees this time on keeping delete stats. 
   
   Now, about the accurate record count: we can add a new field, `total_record_count`, initialized to -1 and filled only during the async call (via the CALL procedure). The sync call (writers writing the stats) can skip computing it because, as we discussed, writers may not have this information, and computing it may require reading all positional and equality delete files, which is expensive. 
   
   Let me know what everyone thinks on this.   
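The proposal above could be sketched roughly as follows. This is an illustration under assumptions, not the actual implementation: the class `PartitionStats`, the constant `UNKNOWN_TOTAL`, and the helper `fill_total_async` are hypothetical names, and only a few of the stats fields are modeled.

```python
from dataclasses import dataclass, replace

# Sentinel meaning "accurate total not computed yet", as proposed above.
UNKNOWN_TOTAL = -1


@dataclass(frozen=True)
class PartitionStats:
    data_record_count: int
    position_delete_record_count: int = 0
    # Writers on the sync path leave this at -1; an async job fills it in.
    total_record_count: int = UNKNOWN_TOTAL

    def has_accurate_total(self) -> bool:
        return self.total_record_count != UNKNOWN_TOTAL


def fill_total_async(stats: PartitionStats, live_record_count: int) -> PartitionStats:
    """Return a copy with the accurate count set.

    Assumption: the caller has already applied position and equality deletes
    to obtain live_record_count, which is the expensive part.
    """
    return replace(stats, total_record_count=live_record_count)


if __name__ == "__main__":
    written = PartitionStats(data_record_count=100, position_delete_record_count=10)
    print(written.has_accurate_total())
    refreshed = fill_total_async(written, live_record_count=90)
    print(refreshed.total_record_count)
```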




[GitHub] [iceberg] RussellSpitzer commented on a diff in pull request #7105: Spec: Add partition stats spec

Posted by "RussellSpitzer (via GitHub)" <gi...@apache.org>.
RussellSpitzer commented on code in PR #7105:
URL: https://github.com/apache/iceberg/pull/7105#discussion_r1317422658


##########
format/spec.md:
##########
@@ -702,6 +703,41 @@ Blob metadata is a struct with the following fields:
 | _optional_ | _optional_ | **`properties`** | `map<string, string>` | Additional properties associated with the statistic. Subset of Blob properties in the Puffin file. |
 
 
+#### Partition statistics
+
+Partition statistics files are based on [Partition Statistics file spec](#partition-statistics-file). Partition statistics are informational. A reader can choose to
+ignore partition statistics information. Partition statistics support is not required to read the table correctly.
+Each table snapshot may be associated with at most one partition statistic file and the table can contain many partition statistics files associated with different table snapshots.
+A writer can optionally write the partition statistics file during each write operation. If the statistics file is written for the specific snapshot,
+it must be accurate and must be registered in the table metadata file to be considered as a valid statistics file for the reader.
+
+Partition statistics files metadata within `partition-statistics` table metadata field is a struct with the following fields:
+
+| v1 | v2 | Field name | Type | Description |
+|----|----|------------|------|-------------|
+| _required_ | _required_ | **`snapshot-id`** | `long` | ID of the Iceberg table's snapshot the partition statistics file is associated with. |
+| _required_ | _required_ | **`statistics-file-path`** | `string` | Path of the partition statistics file. See [Partition Statistics file](#partition-statistics-file). |
+| _required_ | _required_ | **`max-data-sequence-number`** | `long` | Maximum data sequence number of the Iceberg table's snapshot the partition statistics was computed from. |
+
+#### Partition Statistics file
+
+Statistics information for every partition tuple is stored as a row in the **table default format**.
+These rows are sorted (in ascending manner with NULL FIRST) based on all partition columns from `partition` in the same order
+to optimize filtering rows while scanning.
+Each unique partition tuple must have exactly one corresponding row, ensuring that statistics for all partitions are present.
+
+A partition statistics file stores statistics as a struct with the following fields:
+
+| v1 | v2 | Field id, name | Type | Description |
+|----|----|----------------|------|-------------|
+| _required_ | _required_ | **`1 partition`** | `struct<..>` | Partition data tuple, schema based on the partition spec output using partition field ids for the struct field ids |
+| _required_ | _required_ | **`2 spec_id`** | `int` | Partition spec id |

Review Comment:
   Seeing this, I assume we are writing one entry for every unique partition tuple in the table. We should probably state that explicitly above.





[GitHub] [iceberg] RussellSpitzer commented on a diff in pull request #7105: Spec: Add partition stats spec

Posted by "RussellSpitzer (via GitHub)" <gi...@apache.org>.
RussellSpitzer commented on code in PR #7105:
URL: https://github.com/apache/iceberg/pull/7105#discussion_r1317486727


##########
format/spec.md:
##########
@@ -702,6 +703,41 @@ Blob metadata is a struct with the following fields:
 | _optional_ | _optional_ | **`properties`** | `map<string, string>` | Additional properties associated with the statistic. Subset of Blob properties in the Puffin file. |
 
 
+#### Partition statistics
+
+Partition statistics files are based on [Partition Statistics file spec](#partition-statistics-file). Partition statistics are informational. A reader can choose to
+ignore partition statistics information. Partition statistics support is not required to read the table correctly.
+Each table snapshot may be associated with at most one partition statistic file and the table can contain many partition statistics files associated with different table snapshots.
+A writer can optionally write the partition statistics file during each write operation. If the statistics file is written for the specific snapshot,
+it must be accurate and must be registered in the table metadata file to be considered as a valid statistics file for the reader.
+
+Partition statistics files metadata within `partition-statistics` table metadata field is a struct with the following fields:
+
+| v1 | v2 | Field name | Type | Description |
+|----|----|------------|------|-------------|
+| _required_ | _required_ | **`snapshot-id`** | `long` | ID of the Iceberg table's snapshot the partition statistics file is associated with. |
+| _required_ | _required_ | **`statistics-file-path`** | `string` | Path of the partition statistics file. See [Partition Statistics file](#partition-statistics-file). |
+| _required_ | _required_ | **`max-data-sequence-number`** | `long` | Maximum data sequence number of the Iceberg table's snapshot the partition statistics was computed from. |
+
+#### Partition Statistics file
+
+Statistics information for every partition tuple is stored as a row in the **table default format**.
+These rows are sorted (in ascending manner with NULL FIRST) based on all partition columns from `partition` in the same order

Review Comment:
   This needs to be in the spec then





[GitHub] [iceberg] ajantha-bhat commented on a diff in pull request #7105: Spec: Add partition stats spec

Posted by "ajantha-bhat (via GitHub)" <gi...@apache.org>.
ajantha-bhat commented on code in PR #7105:
URL: https://github.com/apache/iceberg/pull/7105#discussion_r1320008836


##########
format/spec.md:
##########
@@ -702,6 +703,49 @@ Blob metadata is a struct with the following fields:
 | _optional_ | _optional_ | **`properties`** | `map<string, string>` | Additional properties associated with the statistic. Subset of Blob properties in the Puffin file. |
 
 
+#### Partition statistics
+
+Partition statistics files are based on [Partition Statistics file spec](#partition-statistics-file). 
+Partition statistics are not required for reading or planning and readers may ignore them.
+Each table snapshot may be associated with at most one partition statistic file.
+A writer can optionally write the partition statistics file during each write operation. If the statistics file is written for the specific snapshot,
+it must be registered in the table metadata file to be considered as a valid statistics file for the reader.
+
+Partition statistics files metadata within `partition-statistics` table metadata field is a struct with the following fields:

Review Comment:
   I copied this from the Puffin statistics file statements a few lines above. 
   
   Changed it to:
   
   `partition-statistics` field of table metadata is an optional list of struct with the following fields:
   



##########
format/spec.md:
##########
@@ -702,6 +703,49 @@ Blob metadata is a struct with the following fields:
 | _optional_ | _optional_ | **`properties`** | `map<string, string>` | Additional properties associated with the statistic. Subset of Blob properties in the Puffin file. |
 
 
+#### Partition statistics
+
+Partition statistics files are based on [Partition Statistics file spec](#partition-statistics-file). 
+Partition statistics are not required for reading or planning and readers may ignore them.
+Each table snapshot may be associated with at most one partition statistic file.
+A writer can optionally write the partition statistics file during each write operation. If the statistics file is written for the specific snapshot,

Review Comment:
   I have shortened it a bit. Even though it seems implicit, it links back to how the file is tracked and when it is valid. I remember getting a comment asking to add this statement. 





[GitHub] [iceberg] flyrain commented on a diff in pull request #7105: Docs: Add partition stats spec

Posted by "flyrain (via GitHub)" <gi...@apache.org>.
flyrain commented on code in PR #7105:
URL: https://github.com/apache/iceberg/pull/7105#discussion_r1135989290


##########
format/partition-statistics-spec.md:
##########
@@ -0,0 +1,43 @@
+---
+title: "Partition Statistics Spec"
+url: partition-statistics-spec
+toc: true
+disableSidebar: true
+---
+<!--
+ - Licensed to the Apache Software Foundation (ASF) under one or more
+ - contributor license agreements.  See the NOTICE file distributed with
+ - this work for additional information regarding copyright ownership.
+ - The ASF licenses this file to You under the Apache License, Version 2.0
+ - (the "License"); you may not use this file except in compliance with
+ - the License.  You may obtain a copy of the License at
+ -
+ -   http://www.apache.org/licenses/LICENSE-2.0
+ -
+ - Unless required by applicable law or agreed to in writing, software
+ - distributed under the License is distributed on an "AS IS" BASIS,
+ - WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ - See the License for the specific language governing permissions and
+ - limitations under the License.
+ -->
+
+# Partition Statistics file format
+
+This is a specification for partition statistics files. It is designed to store statistics information 
+for every partition value as a row in the **table default format** sorted based on the `partition`.
+
+The schema of the file is as follows:
+
+| Field Id | Field Name              | Field Type   | isOptional | Doc                                                                                                                         |
+|----------|-------------------------|--------------|------------|-----------------------------------------------------------------------------------------------------------------------------|
+| 1        | partition               | StructType   | false      | See [PartitionData](https://github.com/apache/iceberg/blob/master/core/src/main/java/org/apache/iceberg/PartitionData.java) |                                                                                  
+| 2        | spec_id                 | IntegerType  | false      | Partition spec id                                                                                                           |
+| 3        | data_record_count       | LongType     | false      | Count of records in data files                                                                                              |                                               
+| 4        | data_file_count         | IntegerType  | false      | Count of data files                                                                                                         |

Review Comment:
   Are we going to add file size? Although it is highly correlated with the record count, it is still quite useful for understanding the storage size of a partition. Besides, it is easy to get; a query like this would work:
   ```
   SELECT partition, sum(file_size_in_bytes) FROM db.table.files GROUP BY partition
   ```
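The same aggregation can be sketched outside SQL. This is a minimal illustration, assuming each file entry is a dict with a hashable `partition` value and a `file_size_in_bytes` integer, mirroring the columns used in the query above. The function name `total_size_by_partition` is hypothetical.

```python
from collections import defaultdict


def total_size_by_partition(files):
    """Sum file_size_in_bytes per partition over an iterable of file entries."""
    totals = defaultdict(int)
    for f in files:
        totals[f["partition"]] += f["file_size_in_bytes"]
    return dict(totals)


if __name__ == "__main__":
    # Made-up entries; partitions are modeled as tuples so they are hashable.
    files = [
        {"partition": ("2023-03-01",), "file_size_in_bytes": 100},
        {"partition": ("2023-03-02",), "file_size_in_bytes": 50},
        {"partition": ("2023-03-01",), "file_size_in_bytes": 25},
    ]
    print(total_size_by_partition(files))
```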





[GitHub] [iceberg] ajantha-bhat commented on pull request #7105: Spec: Add partition stats spec

Posted by "ajantha-bhat (via GitHub)" <gi...@apache.org>.
ajantha-bhat commented on PR #7105:
URL: https://github.com/apache/iceberg/pull/7105#issuecomment-1540274093

   Hi @aokolnychyi and @rdblue, 
   
   I just created a single Parquet file with a million sorted, unique timestamps (partition values) and the same schema as the expected partition stats. The file size turned out to be 12.6MB; an unsorted random file was around 30MB.
   
   https://gist.github.com/ajantha-bhat/3401f2a29ddfa6b2b42f9168461ce98b 
   
   I agree that rewriting the whole file is expensive once the data grows. But creating multiple small files for incremental updates will not keep the whole data sorted and may slow the lookup of partition values during queries. 
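To make the lookup concern concrete, here is a rough sketch that models each stats file as a sorted in-memory list of partition keys. It does not read Parquet and is not a measurement. The names `lookup` and `lookup_multi` are hypothetical. With one globally sorted file, a lookup is a single binary search; with several incremental files, each file has to be probed.

```python
from bisect import bisect_left


def lookup(sorted_keys, key):
    """Binary search for key in one sorted list of partition keys."""
    i = bisect_left(sorted_keys, key)
    return i < len(sorted_keys) and sorted_keys[i] == key


def lookup_multi(files, key):
    """Probe every incremental file; each is sorted only within itself."""
    return any(lookup(keys, key) for keys in files)


if __name__ == "__main__":
    single_file = [1, 3, 5, 7]
    incremental_files = [[1, 7], [3], [5]]
    print(lookup(single_file, 5))
    print(lookup_multi(incremental_files, 5))
```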




[GitHub] [iceberg] ajantha-bhat commented on a diff in pull request #7105: Spec: Add partition stats spec

Posted by "ajantha-bhat (via GitHub)" <gi...@apache.org>.
ajantha-bhat commented on code in PR #7105:
URL: https://github.com/apache/iceberg/pull/7105#discussion_r1238082542


##########
format/spec.md:
##########
@@ -702,6 +703,48 @@ Blob metadata is a struct with the following fields:
 | _optional_ | _optional_ | **`properties`** | `map<string, string>` | Additional properties associated with the statistic. Subset of Blob properties in the Puffin file. |
 
 
+#### Partition statistics
+
+Partition statistics files are based on [Partition Statistics file spec](#partition-statistics-file). Partition statistics are informational. A reader can choose to
+ignore partition statistics information. Partition statistics support is not required to read the table correctly. A table can contain
+many partition statistics files associated with different table snapshots.
+A writer can optionally write the partition statistics file during each write operation. If the statistics file is written for the specific snapshot, 
+It must be registered in the table metadata file to be considered as a valid statistics file for the reader.
+
+Partition statistics files metadata within `partition-statistics` table metadata field is a struct with the following fields:
+
+| v1 | v2 | Field name | Type | Description |
+|----|----|------------|------|-------------|
+| _required_ | _required_ | **`snapshot-id`** | `long` | ID of the Iceberg table's snapshot the partition statistics file is associated with. |
+| _required_ | _required_ | **`statistics-file-path`** | `string` | Path of the partition statistics file. See [Partition Statistics file](#partition-statistics-file). |
+| _required_ | _required_ | **`max-data-sequence-number`** | `long` | Maximum data sequence number of the Iceberg table's snapshot the partition statistics was computed from. |
+
+#### Partition Statistics file
+
+Statistics information for every partition tuple is stored as a row in the **table default format**. 
+These rows are sorted based on the first partition column from `partition`. 
+Each unique partition tuple must have exactly one corresponding row, ensuring all partition tuples are present.
+
+Partition statistics file store the statistics as a struct with the following fields:
+
+| v1 | v2 | Field id, name | Type | Description |
+|----|----|----------------|------|-------------|
+| _required_ | _required_ | **`1 partition`** | `struct<..>` | Partition data tuple, schema based on the partition spec output using partition field ids for the struct field ids |
+| _required_ | _required_ | **`2 spec_id`** | `int` | Partition spec id |
+| _required_ | _required_ | **`3 data_record_count`** | `long` | Count of records in data files |
+| _required_ | _required_ | **`4 data_file_count`** | `int` | Count of data files |
+| _required_ | _required_ | **`5 data_file_size_in_bytes`** | `long` | Total size of data files in bytes |
+| _optional_ | _optional_ | **`6 position_delete_record_count`** | `long` | Count of records in position delete files |
+| _optional_ | _optional_ | **`7 position_delete_file_count`** | `int` | Count of position delete files |
+| _optional_ | _optional_ | **`8 equality_delete_record_count`** | `long` | Count of records in equality delete files |
+| _optional_ | _optional_ | **`9 equality_delete_file_count`** | `int` | Count of equality delete files |
+
+Each snapshot may create a new partition statistics file. The name of the partition statistics file is as follows:       

Review Comment:
   I used 'may' because writing the partition stats is optional for writers.





[GitHub] [iceberg] rdblue commented on a diff in pull request #7105: Spec: Add partition stats spec

Posted by "rdblue (via GitHub)" <gi...@apache.org>.
rdblue commented on code in PR #7105:
URL: https://github.com/apache/iceberg/pull/7105#discussion_r1209508378


##########
format/partition-statistics-spec.md:
##########
@@ -0,0 +1,43 @@
+---
+title: "Partition Statistics Spec"
+url: partition-statistics-spec
+toc: true
+disableSidebar: true
+---
+<!--
+ - Licensed to the Apache Software Foundation (ASF) under one or more
+ - contributor license agreements.  See the NOTICE file distributed with
+ - this work for additional information regarding copyright ownership.
+ - The ASF licenses this file to You under the Apache License, Version 2.0
+ - (the "License"); you may not use this file except in compliance with
+ - the License.  You may obtain a copy of the License at
+ -
+ -   http://www.apache.org/licenses/LICENSE-2.0
+ -
+ - Unless required by applicable law or agreed to in writing, software
+ - distributed under the License is distributed on an "AS IS" BASIS,
+ - WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ - See the License for the specific language governing permissions and
+ - limitations under the License.
+ -->
+
+# Partition Statistics file format
+
+This is a specification for partition statistics files. It is designed to store statistics information 
+for every partition value as a row in the **table default format** sorted based on the `partition`.
+
+The schema of the file is as follows:
+
+| Field Id | Field Name                   | Field Type   | isOptional | Doc                                                                                                                         |
+|----------|------------------------------|--------------|------------|-----------------------------------------------------------------------------------------------------------------------------|
+| 1        | partition                    | StructType   | false      | See [PartitionData](https://github.com/apache/iceberg/blob/master/core/src/main/java/org/apache/iceberg/PartitionData.java) |                                                                                  

Review Comment:
   I recommend following the same structure that the Iceberg spec uses. Those use named types, like `manifest_file` and then define those structs. Here's a snippet:
   
   > Manifest list files store `manifest_file`, a struct with the following fields





[GitHub] [iceberg] ajantha-bhat commented on pull request #7105: Spec: Add partition stats spec

Posted by "ajantha-bhat (via GitHub)" <gi...@apache.org>.
ajantha-bhat commented on PR #7105:
URL: https://github.com/apache/iceberg/pull/7105#issuecomment-1614601062

   I think the PR is ready. I have also added `total_data_file_size_in_bytes`, `last_updated_ms`, and `last_updated_snapshot_id` to the schema, since many people were asking for them.
   
   Please help me review and merge it. This PR has been open for quite some time now.
   
   cc: @szehon-ho, @RussellSpitzer, @rdblue, @jackye1995    




[GitHub] [iceberg] ajantha-bhat commented on pull request #7105: Spec: Add partition stats spec

Posted by "ajantha-bhat (via GitHub)" <gi...@apache.org>.
ajantha-bhat commented on PR #7105:
URL: https://github.com/apache/iceberg/pull/7105#issuecomment-1711585128

   @RussellSpitzer, @flyrain, @szehon-ho, @rdblue: I have addressed the new suggestions. Please approve the PR if it is ok or comment more if we need further changes. Thanks.    




[GitHub] [iceberg] ajantha-bhat commented on a diff in pull request #7105: Spec: Add partition stats spec

Posted by "ajantha-bhat (via GitHub)" <gi...@apache.org>.
ajantha-bhat commented on code in PR #7105:
URL: https://github.com/apache/iceberg/pull/7105#discussion_r1320009356


##########
format/spec.md:
##########
@@ -702,6 +703,49 @@ Blob metadata is a struct with the following fields:
 | _optional_ | _optional_ | **`properties`** | `map<string, string>` | Additional properties associated with the statistic. Subset of Blob properties in the Puffin file. |
 
 
+#### Partition statistics
+
+Partition statistics files are based on [Partition Statistics file spec](#partition-statistics-file). 
+Partition statistics are not required for reading or planning and readers may ignore them.
+Each table snapshot may be associated with at most one partition statistic file.
+A writer can optionally write the partition statistics file during each write operation. If the statistics file is written for the specific snapshot,
+it must be registered in the table metadata file to be considered as a valid statistics file for the reader.
+
+Partition statistics files metadata within `partition-statistics` table metadata field is a struct with the following fields:
+
+| v1 | v2 | Field name | Type | Description |
+|----|----|------------|------|-------------|
+| _required_ | _required_ | **`snapshot-id`** | `long` | ID of the Iceberg table's snapshot the partition statistics file is associated with. |
+| _required_ | _required_ | **`statistics-file-path`** | `string` | Path of the partition statistics file. See [Partition Statistics file](#partition-statistics-file). |
+| _required_ | _required_ | **`max-data-sequence-number`** | `long` | Maximum data sequence number of the Iceberg table's snapshot the partition statistics was computed from. |
+
+#### Partition Statistics file
+
+Statistics information for each unique partition tuple is stored as a row in the default data file format of the table (for example, Parquet or ORC).
+These rows are sorted (in ascending manner with NULL FIRST) based on all partition columns from `partition` in the same order

Review Comment:
   Done.
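
The ordering in the quoted text (ascending, nulls first, over all partition columns in order) can be sketched as a sort key. This is a minimal illustration that assumes partition tuples are plain Python tuples with `None` for null; `nulls_first_key` is a hypothetical helper, not an Iceberg API.

```python
def nulls_first_key(partition):
    """Build a sort key where, per column, None sorts before any value.

    Each column becomes (is_not_null, value). False sorts before True, so
    nulls come first, and the placeholder 0 is only compared with other nulls.
    """
    return tuple(
        (value is not None, value if value is not None else 0)
        for value in partition
    )


# Made-up partition tuples with two partition columns.
rows = [("b", 2), (None, 1), ("a", None), ("a", 3)]
ordered = sorted(rows, key=nulls_first_key)
```

Because the key covers every column in order, ties on the first column are broken by the second, which matches sorting on all partition columns rather than only the first.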





[GitHub] [iceberg] flyrain commented on a diff in pull request #7105: Spec: Add partition stats spec

Posted by "flyrain (via GitHub)" <gi...@apache.org>.
flyrain commented on code in PR #7105:
URL: https://github.com/apache/iceberg/pull/7105#discussion_r1322057077


##########
format/spec.md:
##########
@@ -702,6 +703,41 @@ Blob metadata is a struct with the following fields:
 | _optional_ | _optional_ | **`properties`** | `map<string, string>` | Additional properties associated with the statistic. Subset of Blob properties in the Puffin file. |
 
 
+#### Partition statistics
+
+Partition statistics files are based on [Partition Statistics file spec](#partition-statistics-file). Partition statistics are informational. A reader can choose to
+ignore partition statistics information. Partition statistics support is not required to read the table correctly.
+Each table snapshot may be associated with at most one partition statistic file and the table can contain many partition statistics files associated with different table snapshots.
+A writer can optionally write the partition statistics file during each write operation. If the statistics file is written for the specific snapshot,
+it must be accurate and must be registered in the table metadata file to be considered as a valid statistics file for the reader.
+
+Partition statistics files metadata within `partition-statistics` table metadata field is a struct with the following fields:
+
+| v1 | v2 | Field name | Type | Description |
+|----|----|------------|------|-------------|
+| _required_ | _required_ | **`snapshot-id`** | `long` | ID of the Iceberg table's snapshot the partition statistics file is associated with. |
+| _required_ | _required_ | **`statistics-file-path`** | `string` | Path of the partition statistics file. See [Partition Statistics file](#partition-statistics-file). |
+| _required_ | _required_ | **`max-data-sequence-number`** | `long` | Maximum data sequence number of the Iceberg table's snapshot the partition statistics was computed from. |

Review Comment:
   How about removing `max-data-sequence-number` temporarily, so that we can move on this PR? We can get a sequence number from a snapshot without any issue. And we can always add the `max-data-sequence-number` back if necessary.
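
As an illustration of getting the sequence number from the snapshot instead of storing it again, here is a minimal sketch over table metadata loaded as a dictionary. The `snapshots`, `snapshot-id`, and `sequence-number` keys follow the table metadata JSON; the helper name, the sample IDs, and the file path are hypothetical.

```python
def sequence_number_for_stats(metadata, stats_entry):
    """Find the sequence number of the snapshot a stats entry points to."""
    for snapshot in metadata["snapshots"]:
        if snapshot["snapshot-id"] == stats_entry["snapshot-id"]:
            return snapshot["sequence-number"]
    return None


# Made-up metadata with two snapshots and one partition statistics entry.
metadata = {
    "snapshots": [
        {"snapshot-id": 101, "sequence-number": 1},
        {"snapshot-id": 102, "sequence-number": 2},
    ],
    "partition-statistics": [
        {"snapshot-id": 102, "statistics-file-path": "metadata/example-stats.parquet"},
    ],
}
entry = metadata["partition-statistics"][0]
seq = sequence_number_for_stats(metadata, entry)
```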





[GitHub] [iceberg] ajantha-bhat commented on a diff in pull request #7105: Spec: Add partition stats spec

Posted by "ajantha-bhat (via GitHub)" <gi...@apache.org>.
ajantha-bhat commented on code in PR #7105:
URL: https://github.com/apache/iceberg/pull/7105#discussion_r1322272027


##########
format/spec.md:
##########
@@ -702,6 +703,41 @@ Blob metadata is a struct with the following fields:
 | _optional_ | _optional_ | **`properties`** | `map<string, string>` | Additional properties associated with the statistic. Subset of Blob properties in the Puffin file. |
 
 
+#### Partition statistics
+
+Partition statistics files are based on [Partition Statistics file spec](#partition-statistics-file). Partition statistics are informational. A reader can choose to
+ignore partition statistics information. Partition statistics support is not required to read the table correctly.
+Each table snapshot may be associated with at most one partition statistic file and the table can contain many partition statistics files associated with different table snapshots.
+A writer can optionally write the partition statistics file during each write operation. If the statistics file is written for the specific snapshot,
+it must be accurate and must be registered in the table metadata file to be considered as a valid statistics file for the reader.
+
+Partition statistics files metadata within `partition-statistics` table metadata field is a struct with the following fields:
+
+| v1 | v2 | Field name | Type | Description |
+|----|----|------------|------|-------------|
+| _required_ | _required_ | **`snapshot-id`** | `long` | ID of the Iceberg table's snapshot the partition statistics file is associated with. |
+| _required_ | _required_ | **`statistics-file-path`** | `string` | Path of the partition statistics file. See [Partition Statistics file](#partition-statistics-file). |
+| _required_ | _required_ | **`max-data-sequence-number`** | `long` | Maximum data sequence number of the Iceberg table's snapshot the partition statistics was computed from. |

Review Comment:
   Done. Removed. 





[GitHub] [iceberg] ajantha-bhat commented on a diff in pull request #7105: Spec: Add partition stats spec

Posted by "ajantha-bhat (via GitHub)" <gi...@apache.org>.
ajantha-bhat commented on code in PR #7105:
URL: https://github.com/apache/iceberg/pull/7105#discussion_r1324102467


##########
format/spec.md:
##########
@@ -702,6 +703,47 @@ Blob metadata is a struct with the following fields:
 | _optional_ | _optional_ | **`properties`** | `map<string, string>` | Additional properties associated with the statistic. Subset of Blob properties in the Puffin file. |
 
 
+#### Partition statistics
+
+Partition statistics files are based on [Partition Statistics file spec](#partition-statistics-file). 
+Partition statistics are not required for reading or planning and readers may ignore them.
+Each table snapshot may be associated with at most one partition statistic file.
+A writer can optionally write the partition statistics file during each write operation, and 
+it must be registered in the table metadata file to be considered as a valid statistics file for the reader.
+
+`partition-statistics` field of table metadata is an optional list of struct with the following fields:
+
+| v1 | v2 | Field name | Type | Description |
+|----|----|------------|------|-------------|
+| _required_ | _required_ | **`snapshot-id`** | `long` | ID of the Iceberg table's snapshot the partition statistics file is associated with. |
+| _required_ | _required_ | **`statistics-file-path`** | `string` | Path of the partition statistics file. See [Partition Statistics file](#partition-statistics-file). |
+
+#### Partition Statistics file
+
+Statistics information for each unique partition tuple is stored as a row in the default data file format of the table (for example, Parquet or ORC).
+These rows must be sorted (in ascending manner with NULL FIRST) by `partition` field to optimize filtering rows while scanning.
+
+The schema of the partition statistics file is as follows:
+
+| v1 | v2 | Field id, name | Type | Description |
+|----|----|----------------|------|-------------|
+| _required_ | _required_ | **`1 partition`** | `struct<..>` | Partition data tuple, schema based on the unified partition type considering all specs in a table |
+| _required_ | _required_ | **`2 spec_id`** | `int` | Partition spec id |
+| _required_ | _required_ | **`3 data_record_count`** | `long` | Count of records in data files |
+| _required_ | _required_ | **`4 data_file_count`** | `int` | Count of data files |
+| _required_ | _required_ | **`5 total_data_file_size_in_bytes`** | `long` | Total size of data files in bytes |
+| _optional_ | _optional_ | **`6 position_delete_record_count`** | `long` | Count of records in position delete files |
+| _optional_ | _optional_ | **`7 position_delete_file_count`** | `int` | Count of position delete files |
+| _optional_ | _optional_ | **`8 equality_delete_record_count`** | `long` | Count of records in equality delete files |
+| _optional_ | _optional_ | **`9 equality_delete_file_count`** | `int` | Count of equality delete files |
+| _optional_ | _optional_ | **`10 total_record_count`** | `long` | Accurate count of records in a partition after applying the delete files if any |
+| _optional_ | _optional_ | **`11 last_updated_at`** | `long` | Timestamp in milliseconds from the unix epoch when the partition was last updated |
+| _optional_ | _optional_ | **`12 last_updated_snapshot_id`** | `long` | ID of snapshot that last updated this partition |
+
+Note that partition data tuple's schema is based on the partition spec output using partition field ids for the struct field ids.
+The unified partition type is a struct containing all fields that have ever been a part of any spec in the table. 

Review Comment:
   Elaborated the description and added examples.
   
   During implementation I will use the existing `Partitioning#partitionType()` for this.
   But the spec should not refer to code, so I have excluded that detail.
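
The idea of a unified partition type, a struct containing every field that has ever been part of any spec, can be sketched as below. This is not the `Partitioning#partitionType()` implementation; the `PartitionField` tuple, the sample specs, and the choice to order the result by field id are assumptions made for illustration.

```python
from typing import NamedTuple


class PartitionField(NamedTuple):
    """Hypothetical, simplified partition field."""
    field_id: int
    name: str
    type: str


def unified_partition_type(specs):
    """Collect every field that appears in any spec, keyed by field id.

    Ordering by field id is an assumption of this sketch.
    """
    fields = {}
    for spec in specs:
        for field in spec:
            fields.setdefault(field.field_id, field)
    return [fields[field_id] for field_id in sorted(fields)]


# Made-up spec evolution: a field is added, then an older field is dropped.
day = PartitionField(1000, "ts_day", "date")
bucket = PartitionField(1001, "id_bucket", "int")
specs = [[day], [day, bucket], [bucket]]
unified = unified_partition_type(specs)
```

Under this sketch, a row written with the last spec would carry a null `ts_day`, since that field exists in the unified struct but not in that spec.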





Re: [PR] Spec: Add partition stats spec [iceberg]

Posted by "jbonofre (via GitHub)" <gi...@apache.org>.
jbonofre commented on PR #7105:
URL: https://github.com/apache/iceberg/pull/7105#issuecomment-1758136834

   @aokolnychyi sure, no problem to have a specific meeting about that. 




Re: [PR] Spec: Add partition stats spec [iceberg]

Posted by "ajantha-bhat (via GitHub)" <gi...@apache.org>.
ajantha-bhat commented on code in PR #7105:
URL: https://github.com/apache/iceberg/pull/7105#discussion_r1352490767


##########
format/spec.md:
##########
@@ -702,6 +703,58 @@ Blob metadata is a struct with the following fields:
 | _optional_ | _optional_ | **`properties`** | `map<string, string>` | Additional properties associated with the statistic. Subset of Blob properties in the Puffin file. |
 
 
+#### Partition statistics
+
+Partition statistics files are based on [Partition Statistics file spec](#partition-statistics-file). 
+Partition statistics are not required for reading or planning and readers may ignore them.
+Each table snapshot may be associated with at most one partition statistic file.
+A writer can optionally write the partition statistics file during each write operation, and 
+it must be registered in the table metadata file to be considered as a valid statistics file for the reader.
+
+`partition-statistics` field of table metadata is an optional list of struct with the following fields:
+
+| v1 | v2 | Field name | Type | Description |
+|----|----|------------|------|-------------|
+| _required_ | _required_ | **`snapshot-id`** | `long` | ID of the Iceberg table's snapshot the partition statistics file is associated with. |
+| _required_ | _required_ | **`statistics-file-path`** | `string` | Path of the partition statistics file. See [Partition Statistics file](#partition-statistics-file). |
+| _required_ | _required_ | **`file-size-in-bytes`** | `long` | Size of the partition statistics file. |
+
+#### Partition Statistics file
+
+Statistics information for each unique partition tuple is stored as a row in the default data file format of the table (for example, Parquet or ORC).
+These rows must be sorted (in ascending manner with NULL FIRST) by `partition` field to optimize filtering rows while scanning.
+
+The schema of the partition statistics file is as follows:
+
+| v1 | v2 | Field id, name | Type | Description |
+|----|----|----------------|------|-------------|
+| _required_ | _required_ | **`1 partition`** | `struct<..>` | Partition data tuple, schema based on the unified partition type considering all specs in a table |
+| _required_ | _required_ | **`2 spec_id`** | `int` | Partition spec id |

Review Comment:
   We can gauge the community's interest in synchronous writes.
   Some users might be interested.
   
   I agree that we should start with the async implementation to keep things simple.





Re: [PR] Spec: Add partition stats spec [iceberg]

Posted by "aokolnychyi (via GitHub)" <gi...@apache.org>.
aokolnychyi commented on code in PR #7105:
URL: https://github.com/apache/iceberg/pull/7105#discussion_r1351072100


##########
format/spec.md:
##########
@@ -702,6 +703,58 @@ Blob metadata is a struct with the following fields:
 | _optional_ | _optional_ | **`properties`** | `map<string, string>` | Additional properties associated with the statistic. Subset of Blob properties in the Puffin file. |
 
 
+#### Partition statistics
+
+Partition statistics files are based on [Partition Statistics file spec](#partition-statistics-file). 
+Partition statistics are not required for reading or planning and readers may ignore them.
+Each table snapshot may be associated with at most one partition statistic file.
+A writer can optionally write the partition statistics file during each write operation, and 
+it must be registered in the table metadata file to be considered as a valid statistics file for the reader.
+
+`partition-statistics` field of table metadata is an optional list of struct with the following fields:
+
+| v1 | v2 | Field name | Type | Description |
+|----|----|------------|------|-------------|
+| _required_ | _required_ | **`snapshot-id`** | `long` | ID of the Iceberg table's snapshot the partition statistics file is associated with. |
+| _required_ | _required_ | **`statistics-file-path`** | `string` | Path of the partition statistics file. See [Partition Statistics file](#partition-statistics-file). |
+| _required_ | _required_ | **`file-size-in-bytes`** | `long` | Size of the partition statistics file. |
+
+#### Partition Statistics file
+
+Statistics information for each unique partition tuple is stored as a row in the default data file format of the table (for example, Parquet or ORC).
+These rows must be sorted (in ascending manner with NULL FIRST) by `partition` field to optimize filtering rows while scanning.
+
+The schema of the partition statistics file is as follows:
+
+| v1 | v2 | Field id, name | Type | Description |
+|----|----|----------------|------|-------------|
+| _required_ | _required_ | **`1 partition`** | `struct<..>` | Partition data tuple, schema based on the unified partition type considering all specs in a table |
+| _required_ | _required_ | **`2 spec_id`** | `int` | Partition spec id |

Review Comment:
   I took a look at #8488 and I am not sure how I feel about generating these files during commits (in fact, during each commit attempt). I'd personally start by adding an API and core logic to add these files on demand, and then implement an action that actually produces them (either incrementally or from scratch). Updating these stats for larger tables in each commit attempt will cause issues. In the action, we can do this more efficiently. We can also call this action immediately after writes, but at least it will not be part of the commit. It would also drastically reduce the number of changes in #8488, since we would not have to touch every operation type.
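
To make the "incrementally or from scratch" distinction concrete, here is a minimal sketch of an on-demand computation. It is an assumption-laden illustration, not the proposed action: it only handles added data files (no deletes or removed files), and `PartitionStats` and `apply_added_files` are hypothetical names covering a subset of the schema fields.

```python
from dataclasses import dataclass, replace


@dataclass
class PartitionStats:
    """Subset of the partition statistics row discussed in this PR."""
    data_record_count: int = 0
    data_file_count: int = 0
    total_data_file_size_in_bytes: int = 0


def apply_added_files(previous, added_files):
    """Return stats after adding data files; `previous` is left unmodified.

    `added_files` holds (partition, record_count, file_size_in_bytes) tuples.
    """
    updated = {partition: replace(stats) for partition, stats in previous.items()}
    for partition, record_count, size in added_files:
        stats = updated.setdefault(partition, PartitionStats())
        stats.data_record_count += record_count
        stats.data_file_count += 1
        stats.total_data_file_size_in_bytes += size
    return updated


# Made-up files added by two successive commits.
first_commit = [(("a",), 10, 100), (("b",), 5, 50)]
second_commit = [(("a",), 3, 30)]

# From scratch: aggregate every file. Incremental: reuse the earlier result.
from_scratch = apply_added_files({}, first_commit + second_commit)
incremental = apply_added_files(apply_added_files({}, first_commit), second_commit)
```

Both paths yield the same stats here; the incremental path only reads the files added since the previous statistics were computed.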





Re: [PR] Spec: Add partition stats spec [iceberg]

Posted by "aokolnychyi (via GitHub)" <gi...@apache.org>.
aokolnychyi commented on code in PR #7105:
URL: https://github.com/apache/iceberg/pull/7105#discussion_r1351007189


##########
format/spec.md:
##########
@@ -702,6 +703,58 @@ Blob metadata is a struct with the following fields:
 | _optional_ | _optional_ | **`properties`** | `map<string, string>` | Additional properties associated with the statistic. Subset of Blob properties in the Puffin file. |
 
 
+#### Partition statistics
+
+Partition statistics files are based on [Partition Statistics file spec](#partition-statistics-file). 
+Partition statistics are not required for reading or planning and readers may ignore them.
+Each table snapshot may be associated with at most one partition statistic file.
+A writer can optionally write the partition statistics file during each write operation, and 
+it must be registered in the table metadata file to be considered as a valid statistics file for the reader.
+
+`partition-statistics` field of table metadata is an optional list of struct with the following fields:
+
+| v1 | v2 | Field name | Type | Description |
+|----|----|------------|------|-------------|
+| _required_ | _required_ | **`snapshot-id`** | `long` | ID of the Iceberg table's snapshot the partition statistics file is associated with. |
+| _required_ | _required_ | **`statistics-file-path`** | `string` | Path of the partition statistics file. See [Partition Statistics file](#partition-statistics-file). |
+| _required_ | _required_ | **`file-size-in-bytes`** | `long` | Size of the partition statistics file. |
+
+#### Partition Statistics file

Review Comment:
   Question: Shall `file` be `File` since it is a header?



##########
format/spec.md:
##########
@@ -702,6 +703,58 @@ Blob metadata is a struct with the following fields:
 | _optional_ | _optional_ | **`properties`** | `map<string, string>` | Additional properties associated with the statistic. Subset of Blob properties in the Puffin file. |
 
 
+#### Partition statistics
+
+Partition statistics files are based on [Partition Statistics file spec](#partition-statistics-file). 

Review Comment:
   Question: Is there a particular reason to use capital letters for `Partition Statistics`? It seems inconsistent with other places.



##########
format/spec.md:
##########
@@ -702,6 +703,58 @@ Blob metadata is a struct with the following fields:
 | _optional_ | _optional_ | **`properties`** | `map<string, string>` | Additional properties associated with the statistic. Subset of Blob properties in the Puffin file. |
 
 
+#### Partition statistics
+
+Partition statistics files are based on [Partition Statistics file spec](#partition-statistics-file). 
+Partition statistics are not required for reading or planning and readers may ignore them.
+Each table snapshot may be associated with at most one partition statistic file.
+A writer can optionally write the partition statistics file during each write operation, and 
+it must be registered in the table metadata file to be considered as a valid statistics file for the reader.
+
+`partition-statistics` field of table metadata is an optional list of struct with the following fields:
+
+| v1 | v2 | Field name | Type | Description |
+|----|----|------------|------|-------------|
+| _required_ | _required_ | **`snapshot-id`** | `long` | ID of the Iceberg table's snapshot the partition statistics file is associated with. |
+| _required_ | _required_ | **`statistics-file-path`** | `string` | Path of the partition statistics file. See [Partition Statistics file](#partition-statistics-file). |

Review Comment:
   I see your point, but isn't it also a path to a single file in the case of table stats? I would align the naming to be consistent if it indicates a path to a single file in both cases.
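
For readers following the naming discussion, here is a minimal sketch of what entries in the `partition-statistics` table metadata field could look like, using the three field names from the quoted table. The paths, IDs, sizes, and the `validate_partition_statistics` helper are hypothetical and are not part of the PR or of any Iceberg API. The helper checks only the two rules stated in the quoted text: all three fields are required, and a snapshot may be associated with at most one partition statistics file.

```python
from collections import Counter

# Hypothetical entries; field names come from the quoted spec table,
# all values are made up for illustration.
partition_statistics = [
    {
        "snapshot-id": 1001,
        "statistics-file-path": "s3://bucket/tbl/metadata/partition-stats-1001.parquet",
        "file-size-in-bytes": 4096,
    },
    {
        "snapshot-id": 1002,
        "statistics-file-path": "s3://bucket/tbl/metadata/partition-stats-1002.parquet",
        "file-size-in-bytes": 5120,
    },
]

REQUIRED_FIELDS = {"snapshot-id", "statistics-file-path", "file-size-in-bytes"}


def validate_partition_statistics(entries):
    """Return a list of problems; an empty list means the entries look well-formed."""
    problems = []
    for i, entry in enumerate(entries):
        missing = REQUIRED_FIELDS - entry.keys()
        if missing:
            problems.append(f"entry {i} is missing {sorted(missing)}")
    # Each snapshot may be associated with at most one partition statistics file.
    counts = Counter(e["snapshot-id"] for e in entries if "snapshot-id" in e)
    for snapshot_id, n in counts.items():
        if n > 1:
            problems.append(f"snapshot {snapshot_id} has {n} partition statistics files")
    return problems


if __name__ == "__main__":
    print(validate_partition_statistics(partition_statistics))
```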



##########
format/spec.md:
##########
@@ -702,6 +703,58 @@ Blob metadata is a struct with the following fields:
 | _optional_ | _optional_ | **`properties`** | `map<string, string>` | Additional properties associated with the statistic. Subset of Blob properties in the Puffin file. |
 
 
+#### Partition statistics
+
+Partition statistics files are based on [Partition Statistics file spec](#partition-statistics-file). 
+Partition statistics are not required for reading or planning and readers may ignore them.
+Each table snapshot may be associated with at most one partition statistic file.
+A writer can optionally write the partition statistics file during each write operation, and 

Review Comment:
   Shall we add a note that it can also be computed on demand rather than in each write?



##########
format/spec.md:
##########
@@ -702,6 +703,58 @@ Blob metadata is a struct with the following fields:
 | _optional_ | _optional_ | **`properties`** | `map<string, string>` | Additional properties associated with the statistic. Subset of Blob properties in the Puffin file. |
 
 
+#### Partition statistics
+
+Partition statistics files are based on [Partition Statistics file spec](#partition-statistics-file). 
+Partition statistics are not required for reading or planning and readers may ignore them.
+Each table snapshot may be associated with at most one partition statistic file.
+A writer can optionally write the partition statistics file during each write operation, and 
+it must be registered in the table metadata file to be considered as a valid statistics file for the reader.
+
+`partition-statistics` field of table metadata is an optional list of struct with the following fields:
+
+| v1 | v2 | Field name | Type | Description |
+|----|----|------------|------|-------------|
+| _required_ | _required_ | **`snapshot-id`** | `long` | ID of the Iceberg table's snapshot the partition statistics file is associated with. |
+| _required_ | _required_ | **`statistics-file-path`** | `string` | Path of the partition statistics file. See [Partition Statistics file](#partition-statistics-file). |
+| _required_ | _required_ | **`file-size-in-bytes`** | `long` | Size of the partition statistics file. |
+
+#### Partition Statistics file
+
+Statistics information for each unique partition tuple is stored as a row in the default data file format of the table (for example, Parquet or ORC).
+These rows must be sorted (in ascending manner with NULL FIRST) by `partition` field to optimize filtering rows while scanning.
+
+The schema of the partition statistics file is as follows:
+
+| v1 | v2 | Field id, name | Type | Description |
+|----|----|----------------|------|-------------|
+| _required_ | _required_ | **`1 partition`** | `struct<..>` | Partition data tuple, schema based on the unified partition type considering all specs in a table |
+| _required_ | _required_ | **`2 spec_id`** | `int` | Partition spec id |
+| _required_ | _required_ | **`3 data_record_count`** | `long` | Count of records in data files |
+| _required_ | _required_ | **`4 data_file_count`** | `int` | Count of data files |
+| _required_ | _required_ | **`5 total_data_file_size_in_bytes`** | `long` | Total size of data files in bytes |
+| _optional_ | _optional_ | **`6 position_delete_record_count`** | `long` | Count of records in position delete files |
+| _optional_ | _optional_ | **`7 position_delete_file_count`** | `int` | Count of position delete files |
+| _optional_ | _optional_ | **`8 equality_delete_record_count`** | `long` | Count of records in equality delete files |
+| _optional_ | _optional_ | **`9 equality_delete_file_count`** | `int` | Count of equality delete files |
+| _optional_ | _optional_ | **`10 total_record_count`** | `long` | Accurate count of records in a partition after applying the delete files if any |

Review Comment:
   Am I right that this would only be possible to compute by reading data and applying deletes? If so, are we planning to make this optional and not populate it by default?
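
To illustrate the concern, the sketch below shows how little can be said about the live record count from the per-partition counters alone. It rests on two stated assumptions: each position delete record removes at most one row (possibly none, for example if the same position is recorded twice), and each equality delete record may match any number of rows, including zero. The function name `live_record_count_bounds` is hypothetical.

```python
def live_record_count_bounds(
    data_record_count,
    position_delete_record_count=None,
    equality_delete_record_count=None,
):
    """Return (lower, upper) bounds on the live record count of a partition.

    Only counters from the partition statistics row are used; no data is read.
    A missing (None) optional counter is treated as zero.
    """
    pos = position_delete_record_count or 0
    eq = equality_delete_record_count or 0

    # No delete can add rows, so the data record count is always an upper bound.
    upper = data_record_count

    if eq > 0:
        # Assumption: one equality delete record can match any number of rows,
        # so without reading data nothing better than zero can be guaranteed.
        lower = 0
    else:
        # Assumption: each position delete record removes at most one row.
        lower = max(data_record_count - pos, 0)
    return lower, upper


if __name__ == "__main__":
    print(live_record_count_bounds(100, position_delete_record_count=10))
    print(live_record_count_bounds(100, 10, equality_delete_record_count=2))
```

Under these assumptions the bounds collapse to an exact value only when the partition has no delete files, which is why an accurate `total_record_count` generally requires reading data and applying deletes.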



##########
format/spec.md:
##########
@@ -702,6 +703,58 @@ Blob metadata is a struct with the following fields:
 | _optional_ | _optional_ | **`properties`** | `map<string, string>` | Additional properties associated with the statistic. Subset of Blob properties in the Puffin file. |
 
 
+#### Partition statistics
+
+Partition statistics files are based on [Partition Statistics file spec](#partition-statistics-file). 
+Partition statistics are not required for reading or planning and readers may ignore them.
+Each table snapshot may be associated with at most one partition statistic file.
+A writer can optionally write the partition statistics file during each write operation, and 
+it must be registered in the table metadata file to be considered as a valid statistics file for the reader.
+
+`partition-statistics` field of table metadata is an optional list of struct with the following fields:
+
+| v1 | v2 | Field name | Type | Description |

Review Comment:
   This seems reasonable, we can add it later.



##########
format/spec.md:
##########
@@ -702,6 +703,58 @@ Blob metadata is a struct with the following fields:
 | _optional_ | _optional_ | **`properties`** | `map<string, string>` | Additional properties associated with the statistic. Subset of Blob properties in the Puffin file. |
 
 
+#### Partition statistics
+
+Partition statistics files are based on [Partition Statistics file spec](#partition-statistics-file). 
+Partition statistics are not required for reading or planning and readers may ignore them.
+Each table snapshot may be associated with at most one partition statistic file.
+A writer can optionally write the partition statistics file during each write operation, and 
+it must be registered in the table metadata file to be considered as a valid statistics file for the reader.
+
+`partition-statistics` field of table metadata is an optional list of struct with the following fields:
+
+| v1 | v2 | Field name | Type | Description |
+|----|----|------------|------|-------------|
+| _required_ | _required_ | **`snapshot-id`** | `long` | ID of the Iceberg table's snapshot the partition statistics file is associated with. |
+| _required_ | _required_ | **`statistics-file-path`** | `string` | Path of the partition statistics file. See [Partition Statistics file](#partition-statistics-file). |
+| _required_ | _required_ | **`file-size-in-bytes`** | `long` | Size of the partition statistics file. |
+
+#### Partition Statistics file
+
+Statistics information for each unique partition tuple is stored as a row in the default data file format of the table (for example, Parquet or ORC).

Review Comment:
   I am not sure it is a good idea for the spec to make this assumption. I can see this being configurable, so that even tables that store Avro data could use Parquet or ORC for partition stats. Why can't we just default this in the implementation?





Re: [PR] Spec: Add partition stats spec [iceberg]

Posted by "aokolnychyi (via GitHub)" <gi...@apache.org>.
aokolnychyi commented on code in PR #7105:
URL: https://github.com/apache/iceberg/pull/7105#discussion_r1351072100


##########
format/spec.md:
##########
@@ -702,6 +703,58 @@ Blob metadata is a struct with the following fields:
 | _optional_ | _optional_ | **`properties`** | `map<string, string>` | Additional properties associated with the statistic. Subset of Blob properties in the Puffin file. |
 
 
+#### Partition statistics
+
+Partition statistics files are based on [Partition Statistics file spec](#partition-statistics-file). 
+Partition statistics are not required for reading or planning and readers may ignore them.
+Each table snapshot may be associated with at most one partition statistic file.
+A writer can optionally write the partition statistics file during each write operation, and 
+it must be registered in the table metadata file to be considered as a valid statistics file for the reader.
+
+`partition-statistics` field of table metadata is an optional list of struct with the following fields:
+
+| v1 | v2 | Field name | Type | Description |
+|----|----|------------|------|-------------|
+| _required_ | _required_ | **`snapshot-id`** | `long` | ID of the Iceberg table's snapshot the partition statistics file is associated with. |
+| _required_ | _required_ | **`statistics-file-path`** | `string` | Path of the partition statistics file. See [Partition Statistics file](#partition-statistics-file). |
+| _required_ | _required_ | **`file-size-in-bytes`** | `long` | Size of the partition statistics file. |
+
+#### Partition Statistics file
+
+Statistics information for each unique partition tuple is stored as a row in the default data file format of the table (for example, Parquet or ORC).
+These rows must be sorted (in ascending manner with NULL FIRST) by `partition` field to optimize filtering rows while scanning.
+
+The schema of the partition statistics file is as follows:
+
+| v1 | v2 | Field id, name | Type | Description |
+|----|----|----------------|------|-------------|
+| _required_ | _required_ | **`1 partition`** | `struct<..>` | Partition data tuple, schema based on the unified partition type considering all specs in a table |
+| _required_ | _required_ | **`2 spec_id`** | `int` | Partition spec id |

Review Comment:
   I took a look at #8488 and I am not sure how I feel about generating these files during commits (in fact, during each commit attempt). I'd personally start by adding the API and core logic to be able to add these files on demand, and implement an action to actually produce these files (either incrementally or from scratch). Updating these stats for larger tables in each commit attempt will cause issues. In the action, we can do this more efficiently. We can also call this action immediately after writes, but at least it will not be part of the commit. It would also drastically reduce the amount of changes in #8488.
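
A rough sketch of the "from scratch" on-demand computation described above, outside of any commit path. It is not the logic in #8488 or an Iceberg API: `FileEntry` is a hypothetical, simplified stand-in for a live manifest entry, and grouping by `(partition, spec_id)` is an assumption. The output rows use the field names from the quoted schema and are sorted by `partition` in ascending order with nulls first, as the quoted spec text requires. The optional delete counters stay `None` until a matching delete file is seen.

```python
from dataclasses import dataclass


@dataclass
class FileEntry:
    """Hypothetical, simplified stand-in for a live manifest entry."""
    partition: tuple          # partition tuple; individual values may be None
    spec_id: int
    content: str              # "data", "position_deletes", or "equality_deletes"
    record_count: int
    file_size_in_bytes: int


def nulls_first_key(partition):
    # (False, 0) sorts before (True, value), so None values come first.
    # Assumes all tuples have the same length and comparable types per position.
    return tuple((v is not None, v if v is not None else 0) for v in partition)


def compute_partition_stats(entries):
    """Aggregate file entries into one stats row per (partition, spec_id)."""
    rows = {}
    for e in entries:
        row = rows.setdefault((e.partition, e.spec_id), {
            "partition": e.partition,
            "spec_id": e.spec_id,
            "data_record_count": 0,
            "data_file_count": 0,
            "total_data_file_size_in_bytes": 0,
            "position_delete_record_count": None,
            "position_delete_file_count": None,
            "equality_delete_record_count": None,
            "equality_delete_file_count": None,
        })
        if e.content == "data":
            row["data_record_count"] += e.record_count
            row["data_file_count"] += 1
            row["total_data_file_size_in_bytes"] += e.file_size_in_bytes
        elif e.content in ("position_deletes", "equality_deletes"):
            prefix = "position" if e.content == "position_deletes" else "equality"
            rec, cnt = f"{prefix}_delete_record_count", f"{prefix}_delete_file_count"
            row[rec] = (row[rec] or 0) + e.record_count
            row[cnt] = (row[cnt] or 0) + 1
        else:
            raise ValueError(f"unknown content type: {e.content}")
    # Ascending by partition with nulls first; spec_id breaks ties.
    return sorted(
        rows.values(),
        key=lambda r: (nulls_first_key(r["partition"]), r["spec_id"]),
    )


if __name__ == "__main__":
    sample = [
        FileEntry(("b",), 0, "data", 10, 100),
        FileEntry((None,), 0, "data", 5, 50),
        FileEntry(("a",), 0, "data", 7, 70),
        FileEntry(("a",), 0, "position_deletes", 2, 20),
    ]
    for row in compute_partition_stats(sample):
        print(row)
```

An incremental variant would start from the previous snapshot's rows and apply only the entries added or removed since then; that is not shown here.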





[GitHub] [iceberg] ajantha-bhat commented on a diff in pull request #7105: Spec: Add partition stats spec

Posted by "ajantha-bhat (via GitHub)" <gi...@apache.org>.
ajantha-bhat commented on code in PR #7105:
URL: https://github.com/apache/iceberg/pull/7105#discussion_r1275253774


##########
format/spec.md:
##########
@@ -702,6 +703,51 @@ Blob metadata is a struct with the following fields:
 | _optional_ | _optional_ | **`properties`** | `map<string, string>` | Additional properties associated with the statistic. Subset of Blob properties in the Puffin file. |
 
 
+#### Partition statistics
+
+Partition statistics files are based on [Partition Statistics file spec](#partition-statistics-file). Partition statistics are informational. A reader can choose to
+ignore partition statistics information. Partition statistics support is not required to read the table correctly. A table can contain
+many partition statistics files associated with different table snapshots.
+A writer can optionally write the partition statistics file during each write operation. If the statistics file is written for the specific snapshot, 
+it must be accurate and must be registered in the table metadata file to be considered as a valid statistics file for the reader.
+
+Partition statistics files metadata within `partition-statistics` table metadata field is a struct with the following fields:
+
+| v1 | v2 | Field name | Type | Description |
+|----|----|------------|------|-------------|
+| _required_ | _required_ | **`snapshot-id`** | `long` | ID of the Iceberg table's snapshot the partition statistics file is associated with. |
+| _required_ | _required_ | **`statistics-file-path`** | `string` | Path of the partition statistics file. See [Partition Statistics file](#partition-statistics-file). |
+| _required_ | _required_ | **`max-data-sequence-number`** | `long` | Maximum data sequence number of the Iceberg table's snapshot the partition statistics was computed from. |
+
+#### Partition Statistics file
+
+Statistics information for every partition tuple is stored as a row in the **table default format**. 
+These rows are sorted based on the first partition column from `partition`. 
+Each unique partition tuple must have exactly one corresponding row, ensuring all partition tuples are present.
+
+Partition statistics file store the statistics as a struct with the following fields:
+
+| v1 | v2 | Field id, name | Type | Description |
+|----|----|----------------|------|-------------|
+| _required_ | _required_ | **`1 partition`** | `struct<..>` | Partition data tuple, schema based on the partition spec output using partition field ids for the struct field ids |
+| _required_ | _required_ | **`2 spec_id`** | `int` | Partition spec id |
+| _required_ | _required_ | **`3 data_record_count`** | `long` | Count of records in data files |
+| _required_ | _required_ | **`4 data_file_count`** | `int` | Count of data files |
+| _required_ | _required_ | **`5 total_data_file_size_in_bytes`** | `long` | Total size of data files in bytes |
+| _optional_ | _optional_ | **`6 position_delete_record_count`** | `long` | Count of records in position delete files |

Review Comment:
   Linking that thread: https://github.com/apache/iceberg/pull/7105#discussion_r1209511756
   
   We need a conclusion.





[GitHub] [iceberg] ajantha-bhat commented on a diff in pull request #7105: Spec: Add partition stats spec

Posted by "ajantha-bhat (via GitHub)" <gi...@apache.org>.
ajantha-bhat commented on code in PR #7105:
URL: https://github.com/apache/iceberg/pull/7105#discussion_r1201614523


##########
format/partition-statistics-spec.md:
##########
@@ -0,0 +1,43 @@
+---
+title: "Partition Statistics Spec"
+url: partition-statistics-spec
+toc: true
+disableSidebar: true
+---
+<!--
+ - Licensed to the Apache Software Foundation (ASF) under one or more
+ - contributor license agreements.  See the NOTICE file distributed with
+ - this work for additional information regarding copyright ownership.
+ - The ASF licenses this file to You under the Apache License, Version 2.0
+ - (the "License"); you may not use this file except in compliance with
+ - the License.  You may obtain a copy of the License at
+ -
+ -   http://www.apache.org/licenses/LICENSE-2.0
+ -
+ - Unless required by applicable law or agreed to in writing, software
+ - distributed under the License is distributed on an "AS IS" BASIS,
+ - WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ - See the License for the specific language governing permissions and
+ - limitations under the License.
+ -->
+
+# Partition Statistics file format
+
+This is a specification for partition statistics files. It is designed to store statistics information 
+for every partition value as a row in the **table default format** sorted based on the `partition`.
+
+The schema of the file is as follows:
+
+| Field Id | Field Name              | Field Type   | isOptional | Doc                                                                                                                         |
+|----------|-------------------------|--------------|------------|-----------------------------------------------------------------------------------------------------------------------------|
+| 1        | partition               | StructType   | false      | See [PartitionData](https://github.com/apache/iceberg/blob/master/core/src/main/java/org/apache/iceberg/PartitionData.java) |                                                                                  
+| 2        | spec_id                 | IntegerType  | false      | Partition spec id                                                                                                           |
+| 3        | data_record_count       | LongType     | false      | Count of records in data files                                                                                              |                                               
+| 4        | data_file_count         | IntegerType  | false      | Count of data files                                                                                                         |
+| 5        | pos_delete_record_count | LongType     | true       | Count of records in position delete files                                                                                   |                                                          
+| 6        | pos_delete_file_count   | IntegerType  | true       | Count of position delete files                                                                                              |                                   
+| 7        | eq_delete_record_count  | LongType     | true       | Count of records in equality delete files                                                                                   |                                  
+| 8        | eq_delete_file_count    | IntegerType  | true       | Count of equality delete files                                                                                              |                                

Review Comment:
   I really wanted to handle these in phase 2, as mentioned in the design doc. If there is popular demand, I can add them in this PR after #7581 is merged.





[GitHub] [iceberg] ajantha-bhat commented on a diff in pull request #7105: Spec: Add partition stats spec

Posted by "ajantha-bhat (via GitHub)" <gi...@apache.org>.
ajantha-bhat commented on code in PR #7105:
URL: https://github.com/apache/iceberg/pull/7105#discussion_r1209752277


##########
format/partition-statistics-spec.md:
##########
@@ -0,0 +1,43 @@
+---
+title: "Partition Statistics Spec"
+url: partition-statistics-spec
+toc: true
+disableSidebar: true
+---
+<!--
+ - Licensed to the Apache Software Foundation (ASF) under one or more
+ - contributor license agreements.  See the NOTICE file distributed with
+ - this work for additional information regarding copyright ownership.
+ - The ASF licenses this file to You under the Apache License, Version 2.0
+ - (the "License"); you may not use this file except in compliance with
+ - the License.  You may obtain a copy of the License at
+ -
+ -   http://www.apache.org/licenses/LICENSE-2.0
+ -
+ - Unless required by applicable law or agreed to in writing, software
+ - distributed under the License is distributed on an "AS IS" BASIS,
+ - WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ - See the License for the specific language governing permissions and
+ - limitations under the License.
+ -->
+
+# Partition Statistics file format
+
+This is a specification for partition statistics files. It is designed to store statistics information 
+for every partition value as a row in the **table default format** sorted based on the `partition`.
+
+The schema of the file is as follows:
+
+| Field Id | Field Name                   | Field Type   | isOptional | Doc                                                                                                                         |
+|----------|------------------------------|--------------|------------|-----------------------------------------------------------------------------------------------------------------------------|
+| 1        | partition                    | StructType   | false      | See [PartitionData](https://github.com/apache/iceberg/blob/master/core/src/main/java/org/apache/iceberg/PartitionData.java) |                                                                                  
+| 2        | spec_id                      | IntegerType  | false      | Partition spec id                                                                                                           |
+| 3        | data_record_count            | LongType     | false      | Count of records in data files                                                                                              |                                               

Review Comment:
   Do you think the file size of delete files would also be useful, or are just the data files enough?





[GitHub] [iceberg] ajantha-bhat commented on pull request #7105: Spec: Add partition stats spec

Posted by "ajantha-bhat (via GitHub)" <gi...@apache.org>.
ajantha-bhat commented on PR #7105:
URL: https://github.com/apache/iceberg/pull/7105#issuecomment-1482268557

   > This makes me wonder if we should really use a storage for this. Technically it sounds like we should use a key-value store for (partition, spec-id) -> (statistics values) instead of a file-based storage.
   > Maybe we should build an interface for it and just make storage a specific implementation, instead of saying it has to be backed by a file or multiple files, because it has to be inefficient in read or write that way. Just a thought.
   
   During the design phase, I did evaluate storing it in a KV store (like a RocksDB or HBase index). It is also captured in the design document.
   But we concluded that we don't want to bring external dependencies into Iceberg core functions, and we also cannot use the current clean-up logic with a KV store. Hence, a file was chosen.




[GitHub] [iceberg] ajantha-bhat commented on pull request #7105: Spec: Add partition stats spec

Posted by "ajantha-bhat (via GitHub)" <gi...@apache.org>.
ajantha-bhat commented on PR #7105:
URL: https://github.com/apache/iceberg/pull/7105#issuecomment-1493696292

   > I am still yet to perform that test to check if 1 file is sufficient. I also worry about the cost of incremental updates if we have to replace the entire file.
   
   Hi @aokolnychyi, any update on this?




[GitHub] [iceberg] ajantha-bhat commented on a diff in pull request #7105: Spec: Add partition stats spec

Posted by "ajantha-bhat (via GitHub)" <gi...@apache.org>.
ajantha-bhat commented on code in PR #7105:
URL: https://github.com/apache/iceberg/pull/7105#discussion_r1155503638


##########
format/partition-statistics-spec.md:
##########
@@ -0,0 +1,43 @@
+---
+title: "Partition Statistics Spec"
+url: partition-statistics-spec
+toc: true
+disableSidebar: true
+---
+<!--
+ - Licensed to the Apache Software Foundation (ASF) under one or more
+ - contributor license agreements.  See the NOTICE file distributed with
+ - this work for additional information regarding copyright ownership.
+ - The ASF licenses this file to You under the Apache License, Version 2.0
+ - (the "License"); you may not use this file except in compliance with
+ - the License.  You may obtain a copy of the License at
+ -
+ -   http://www.apache.org/licenses/LICENSE-2.0
+ -
+ - Unless required by applicable law or agreed to in writing, software
+ - distributed under the License is distributed on an "AS IS" BASIS,
+ - WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ - See the License for the specific language governing permissions and
+ - limitations under the License.
+ -->
+
+# Partition Statistics file format
+
+This is a specification for partition statistics files. It is designed to store statistics information 
+for every partition value as a row in the **table default format** sorted based on the `partition`.
+
+The schema of the file is as follows:
+
+| Field Id | Field Name              | Field Type   | isOptional | Doc                                                                                                                         |
+|----------|-------------------------|--------------|------------|-----------------------------------------------------------------------------------------------------------------------------|
+| 1        | partition               | StructType   | false      | See [PartitionData](https://github.com/apache/iceberg/blob/master/core/src/main/java/org/apache/iceberg/PartitionData.java) |                                                                                  
+| 2        | spec_id                 | IntegerType  | false      | Partition spec id                                                                                                           |
+| 3        | data_record_count       | LongType     | false      | Count of records in data files                                                                                              |                                               
+| 4        | data_file_count         | IntegerType  | false      | Count of data files                                                                                                         |
+| 5        | pos_delete_record_count | LongType     | true       | Count of records in position delete files                                                                                   |                                                          
+| 6        | pos_delete_file_count   | IntegerType  | true       | Count of position delete files                                                                                              |                                   
+| 7        | eq_delete_record_count  | LongType     | true       | Count of records in equality delete files                                                                                   |                                  
+| 8        | eq_delete_file_count    | IntegerType  | true       | Count of equality delete files                                                                                              |                                

Review Comment:
   https://github.com/apache/iceberg/pull/7105#discussion_r1136556742





[GitHub] [iceberg] jackye1995 commented on a diff in pull request #7105: Spec: Add partition stats spec

Posted by "jackye1995 (via GitHub)" <gi...@apache.org>.
jackye1995 commented on code in PR #7105:
URL: https://github.com/apache/iceberg/pull/7105#discussion_r1137402901


##########
format/spec.md:
##########
@@ -702,6 +703,21 @@ Blob metadata is a struct with the following fields:
 | _optional_ | _optional_ | **`properties`** | `map<string, string>` | Additional properties associated with the statistic. Subset of Blob properties in the Puffin file. |
 
 
+#### Partition statistics
+
+Partition statistics files are the valid files based on [Partition statistics spec](../partition-statistics-spec). Partition statistics are informational. A reader can choose to
+ignore partition statistics information. Partition statistics support is not required to read the table correctly. A table can contain
+many partition statistics files associated with different table snapshots.
+
+Partition statistics files metadata within `partition-statistics` table metadata field is a struct with the following fields:
+
+| v1 | v2 | Field name                 | Type     | Description                                                                                           |
+|----|----|----------------------------|----------|-------------------------------------------------------------------------------------------------------|
+| _required_ | _required_ | **`snapshot-id`**          | `long`   | ID of the Iceberg table's snapshot the partition statistics file is associated with.                  |
+| _required_ | _required_ | **`statistics-file-path`** | `string` | Path of the partition statistics file. See [Partition statistics spec](../partition-statistics-spec). |
+| _required_ | _required_ | **`sequence-number`**      | `long`   | Sequence number of the Iceberg table's snapshot the partition statistics was computed from.           |

Review Comment:
   I thought about it a bit more last night. It should be fine for this use case: the same sequence number means the data might have been replaced, but the logical content remains the same, so we should be good here.





[GitHub] [iceberg] jackye1995 commented on a diff in pull request #7105: Docs: Add partition stats spec

Posted by "jackye1995 (via GitHub)" <gi...@apache.org>.
jackye1995 commented on code in PR #7105:
URL: https://github.com/apache/iceberg/pull/7105#discussion_r1136015638


##########
format/partition-statistics-spec.md:
##########
@@ -0,0 +1,43 @@
+---
+title: "Partition Statistics Spec"
+url: partition-statistics-spec
+toc: true
+disableSidebar: true
+---
+<!--
+ - Licensed to the Apache Software Foundation (ASF) under one or more
+ - contributor license agreements.  See the NOTICE file distributed with
+ - this work for additional information regarding copyright ownership.
+ - The ASF licenses this file to You under the Apache License, Version 2.0
+ - (the "License"); you may not use this file except in compliance with
+ - the License.  You may obtain a copy of the License at
+ -
+ -   http://www.apache.org/licenses/LICENSE-2.0
+ -
+ - Unless required by applicable law or agreed to in writing, software
+ - distributed under the License is distributed on an "AS IS" BASIS,
+ - WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ - See the License for the specific language governing permissions and
+ - limitations under the License.
+ -->
+
+# Partition Statistics file format
+
+This is a specification for partition statistics files. It is designed to store statistics information 
+for every partition value as a row in the **table default format** sorted based on the `partition`.
+
+The schema of the file is as follows:
+
+| Field Id | Field Name              | Field Type   | isOptional | Doc                                                                                                                         |
+|----------|-------------------------|--------------|------------|-----------------------------------------------------------------------------------------------------------------------------|
+| 1        | partition               | StructType   | false      | See [PartitionData](https://github.com/apache/iceberg/blob/master/core/src/main/java/org/apache/iceberg/PartitionData.java) |                                                                                  
+| 2        | spec_id                 | IntegerType  | false      | Partition spec id                                                                                                           |
+| 3        | data_record_count       | LongType     | false      | Count of records in data files                                                                                              |                                               
+| 4        | data_file_count         | IntegerType  | false      | Count of data files                                                                                                         |

Review Comment:
   +1, also for delete files





[GitHub] [iceberg] ajantha-bhat commented on a diff in pull request #7105: Spec: Add partition stats spec

Posted by "ajantha-bhat (via GitHub)" <gi...@apache.org>.
ajantha-bhat commented on code in PR #7105:
URL: https://github.com/apache/iceberg/pull/7105#discussion_r1238093503


##########
format/spec.md:
##########
@@ -702,6 +703,21 @@ Blob metadata is a struct with the following fields:
 | _optional_ | _optional_ | **`properties`** | `map<string, string>` | Additional properties associated with the statistic. Subset of Blob properties in the Puffin file. |
 
 
+#### Partition statistics
+
+Partition statistics files are the valid files based on [Partition statistics spec](../partition-statistics-spec). Partition statistics are informational. A reader can choose to
+ignore partition statistics information. Partition statistics support is not required to read the table correctly. A table can contain
+many partition statistics files associated with different table snapshots.

Review Comment:
   I have added it now.
   
   I initially left it out, thinking it was implicit.





[GitHub] [iceberg] ajantha-bhat commented on a diff in pull request #7105: Spec: Add partition stats spec

Posted by "ajantha-bhat (via GitHub)" <gi...@apache.org>.
ajantha-bhat commented on code in PR #7105:
URL: https://github.com/apache/iceberg/pull/7105#discussion_r1223171392


##########
format/spec.md:
##########
@@ -702,6 +703,44 @@ Blob metadata is a struct with the following fields:
 | _optional_ | _optional_ | **`properties`** | `map<string, string>` | Additional properties associated with the statistic. Subset of Blob properties in the Puffin file. |
 
 
+#### Partition statistics
+
+Partition statistics files are the valid files based on [Partition Statistics file format](#partition-statistics-file-format). Partition statistics are informational. A reader can choose to

Review Comment:
   Yeah, it is not a file format. Updated.





[GitHub] [iceberg] ajantha-bhat commented on a diff in pull request #7105: Spec: Add partition stats spec

Posted by "ajantha-bhat (via GitHub)" <gi...@apache.org>.
ajantha-bhat commented on code in PR #7105:
URL: https://github.com/apache/iceberg/pull/7105#discussion_r1219531635


##########
format/spec.md:
##########
@@ -702,6 +703,44 @@ Blob metadata is a struct with the following fields:
 | _optional_ | _optional_ | **`properties`** | `map<string, string>` | Additional properties associated with the statistic. Subset of Blob properties in the Puffin file. |
 
 
+#### Partition statistics
+
+Partition statistics files are the valid files based on [Partition Statistics file format](#partition-statistics-file-format). Partition statistics are informational. A reader can choose to
+ignore partition statistics information. Partition statistics support is not required to read the table correctly. A table can contain
+many partition statistics files associated with different table snapshots.
+A writer can optionally write the partition statistics file during each write operation. If the statistics file is written for the specific snapshot, 
+It must be registered in the table metadata file to be considered as a valid statistics file for the reader.
+
+Partition statistics files metadata within `partition-statistics` table metadata field is a struct with the following fields:
+
+| v1 | v2 | Field name | Type | Description |
+|----|----|------------|------|-------------|
+| _required_ | _required_ | **`snapshot-id`** | `long` | ID of the Iceberg table's snapshot the partition statistics file is associated with. |
+| _required_ | _required_ | **`statistics-file-path`** | `string` | Path of the partition statistics file. See [Partition Statistics file format](#partition-statistics-file-format). |
+| _required_ | _required_ | **`max-data-sequence-number`** | `long` | Maximum data sequence number of the Iceberg table's snapshot the partition statistics was computed from. |
+
+#### Partition Statistics file format
+
+This is a specification for partition statistics files. It is designed to store statistics information
+for every partition value as a row in the **table default format** sorted based on the first partition column from `partition`.
+
+Partition statistics file store the statistics as a struct with the following fields:
+
+| v1 | v2 | Field id, name | Type | Description |
+|----|----|----------------|------|-------------|
+| _required_ | _required_ | **`1 partition`** | `struct<..>` | See [PartitionData](https://github.com/apache/iceberg/blob/master/core/src/main/java/org/apache/iceberg/PartitionData.java) |

Review Comment:
   > The schema of PartitionData is based on specific partition spec
   
   It includes all the fields from all the specs.
   https://github.com/apache/iceberg/blob/master/core/src/main/java/org/apache/iceberg/PartitionsTable.java#L50
   
   More info on the schema:
   https://github.com/apache/iceberg/blob/cfa090531e955911e792e24f3d14103c69a63c63/core/src/main/java/org/apache/iceberg/Partitioning.java#L245
   
   > And we don't need a spec_id field
   
   In the case of partition evolution, there can be partitions based on both the old and the new spec. In that case, the spec_id field helps to identify which spec a given partition's stats belong to.
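
As an illustration of the unified partition type and the role of `spec_id`, here is a minimal Python sketch. It is not Iceberg code: `PartitionField`, `unified_fields`, and `to_unified_tuple` are hypothetical names, the field IDs and values are made up, and the sketch assumes partition field IDs stay stable across spec evolution.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class PartitionField:
    field_id: int  # assumption: stable across spec evolution
    name: str


def unified_fields(specs: Dict[int, List[PartitionField]]) -> List[PartitionField]:
    """Union of the partition fields of all specs, ordered by field ID."""
    by_id: Dict[int, PartitionField] = {}
    for fields in specs.values():
        for field in fields:
            by_id.setdefault(field.field_id, field)
    return [by_id[i] for i in sorted(by_id)]


def to_unified_tuple(
    spec_id: int,
    values: List[Optional[str]],
    specs: Dict[int, List[PartitionField]],
    unified: List[PartitionField],
) -> Tuple[Optional[str], ...]:
    """Place a spec-specific tuple into the unified layout; absent fields become None."""
    present = {f.field_id: v for f, v in zip(specs[spec_id], values)}
    return tuple(present.get(f.field_id) for f in unified)


# Hypothetical table: spec 0 partitions by day, spec 1 adds a region field.
specs = {
    0: [PartitionField(1000, "event_day")],
    1: [PartitionField(1000, "event_day"), PartitionField(1001, "region")],
}
unified = unified_fields(specs)
old_row = to_unified_tuple(0, ["2023-03-14"], specs, unified)
new_row = to_unified_tuple(1, ["2023-03-14", "eu"], specs, unified)
```

In this sketch the row written under spec 0 carries `None` for `region`. The tuple alone does not say whether that field was null or simply not part of the spec, which is one way to see why recording `spec_id` next to the tuple can be useful.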
   





Re: [PR] Spec: Add partition stats spec [iceberg]

Posted by "ajantha-bhat (via GitHub)" <gi...@apache.org>.
ajantha-bhat commented on code in PR #7105:
URL: https://github.com/apache/iceberg/pull/7105#discussion_r1352528356


##########
format/spec.md:
##########
@@ -702,6 +703,58 @@ Blob metadata is a struct with the following fields:
 | _optional_ | _optional_ | **`properties`** | `map<string, string>` | Additional properties associated with the statistic. Subset of Blob properties in the Puffin file. |
 
 
+#### Partition statistics
+
+Partition statistics files are based on [Partition Statistics file spec](#partition-statistics-file). 
+Partition statistics are not required for reading or planning and readers may ignore them.
+Each table snapshot may be associated with at most one partition statistic file.
+A writer can optionally write the partition statistics file during each write operation, and 
+it must be registered in the table metadata file to be considered as a valid statistics file for the reader.
+
+`partition-statistics` field of table metadata is an optional list of struct with the following fields:
+
+| v1 | v2 | Field name | Type | Description |
+|----|----|------------|------|-------------|
+| _required_ | _required_ | **`snapshot-id`** | `long` | ID of the Iceberg table's snapshot the partition statistics file is associated with. |
+| _required_ | _required_ | **`statistics-file-path`** | `string` | Path of the partition statistics file. See [Partition Statistics file](#partition-statistics-file). |
+| _required_ | _required_ | **`file-size-in-bytes`** | `long` | Size of the partition statistics file. |
+
+#### Partition Statistics file
+
+Statistics information for each unique partition tuple is stored as a row in the default data file format of the table (for example, Parquet or ORC).
+These rows must be sorted (in ascending manner with NULL FIRST) by `partition` field to optimize filtering rows while scanning.
+
+The schema of the partition statistics file is as follows:
+
+| v1 | v2 | Field id, name | Type | Description |
+|----|----|----------------|------|-------------|
+| _required_ | _required_ | **`1 partition`** | `struct<..>` | Partition data tuple, schema based on the unified partition type considering all specs in a table |
+| _required_ | _required_ | **`2 spec_id`** | `int` | Partition spec id |

Review Comment:
   Also, Trino currently writes Puffin files both synchronously and asynchronously. Dremio is also interested in sync stats.





[GitHub] [iceberg] ajantha-bhat commented on pull request #7105: Spec: Add partition stats spec

Posted by "ajantha-bhat (via GitHub)" <gi...@apache.org>.
ajantha-bhat commented on PR #7105:
URL: https://github.com/apache/iceberg/pull/7105#issuecomment-1708158955

   @rdblue, @szehon-ho, @flyrain, @RussellSpitzer, @jackye1995: The PR is ready. I finally let go of the idea of sharing a schema and class with the partitions metadata table and addressed all the comments.




Re: [PR] Spec: Add partition stats spec [iceberg]

Posted by "ajantha-bhat (via GitHub)" <gi...@apache.org>.
ajantha-bhat commented on PR #7105:
URL: https://github.com/apache/iceberg/pull/7105#issuecomment-1758143237

   > Sorry, we didn't get to discussing this during the sync. Shall we do a separate sync to talk about this?
   
   This is not the first time this has happened. Partition stats has been on the agenda for the past few community syncs, but we have failed to cover it due to time restrictions, and it also looks like we don't follow the order in which topics were added.
   
   Anyway, I have now started discussing this on the mailing list. If people still feel a sync is required, I am happy to arrange one.




[GitHub] [iceberg] rdblue commented on a diff in pull request #7105: Spec: Add partition stats spec

Posted by "rdblue (via GitHub)" <gi...@apache.org>.
rdblue commented on code in PR #7105:
URL: https://github.com/apache/iceberg/pull/7105#discussion_r1275214680


##########
format/spec.md:
##########
@@ -702,6 +703,51 @@ Blob metadata is a struct with the following fields:
 | _optional_ | _optional_ | **`properties`** | `map<string, string>` | Additional properties associated with the statistic. Subset of Blob properties in the Puffin file. |
 
 
+#### Partition statistics
+
+Partition statistics files are based on [Partition Statistics file spec](#partition-statistics-file). Partition statistics are informational. A reader can choose to
+ignore partition statistics information. Partition statistics support is not required to read the table correctly. A table can contain
+many partition statistics files associated with different table snapshots.
+A writer can optionally write the partition statistics file during each write operation. If the statistics file is written for the specific snapshot, 
+it must be accurate and must be registered in the table metadata file to be considered as a valid statistics file for the reader.
+
+Partition statistics files metadata within `partition-statistics` table metadata field is a struct with the following fields:
+
+| v1 | v2 | Field name | Type | Description |
+|----|----|------------|------|-------------|
+| _required_ | _required_ | **`snapshot-id`** | `long` | ID of the Iceberg table's snapshot the partition statistics file is associated with. |
+| _required_ | _required_ | **`statistics-file-path`** | `string` | Path of the partition statistics file. See [Partition Statistics file](#partition-statistics-file). |
+| _required_ | _required_ | **`max-data-sequence-number`** | `long` | Maximum data sequence number of the Iceberg table's snapshot the partition statistics was computed from. |
+
+#### Partition Statistics file
+
+Statistics information for every partition tuple is stored as a row in the **table default format**. 
+These rows are sorted based on the first partition column from `partition`. 
+Each unique partition tuple must have exactly one corresponding row, ensuring all partition tuples are present.
+
+Partition statistics file store the statistics as a struct with the following fields:
+
+| v1 | v2 | Field id, name | Type | Description |
+|----|----|----------------|------|-------------|
+| _required_ | _required_ | **`1 partition`** | `struct<..>` | Partition data tuple, schema based on the partition spec output using partition field ids for the struct field ids |
+| _required_ | _required_ | **`2 spec_id`** | `int` | Partition spec id |
+| _required_ | _required_ | **`3 data_record_count`** | `long` | Count of records in data files |
+| _required_ | _required_ | **`4 data_file_count`** | `int` | Count of data files |
+| _required_ | _required_ | **`5 total_data_file_size_in_bytes`** | `long` | Total size of data files in bytes |
+| _optional_ | _optional_ | **`6 position_delete_record_count`** | `long` | Count of records in position delete files |
+| _optional_ | _optional_ | **`7 position_delete_file_count`** | `int` | Count of position delete files |
+| _optional_ | _optional_ | **`8 equality_delete_record_count`** | `long` | Count of records in equality delete files |
+| _optional_ | _optional_ | **`9 equality_delete_file_count`** | `int` | Count of equality delete files |
+| _optional_ | _optional_ | **`9 equality_delete_file_count`** | `int` | Count of equality delete files |
+| _optional_ | _optional_ | **`10 last_updated_at`** | `timestamptz` | Commit time of snapshot that last updated this partition |
+| _optional_ | _optional_ | **`11 last_updated_snapshot_id`** | `long` | Id of snapshot that last updated this partition |

Review Comment:
   ID should be all caps in documentation.





[GitHub] [iceberg] ajantha-bhat commented on a diff in pull request #7105: Spec: Add partition stats spec

Posted by "ajantha-bhat (via GitHub)" <gi...@apache.org>.
ajantha-bhat commented on code in PR #7105:
URL: https://github.com/apache/iceberg/pull/7105#discussion_r1319801698


##########
format/spec.md:
##########
@@ -702,6 +703,41 @@ Blob metadata is a struct with the following fields:
 | _optional_ | _optional_ | **`properties`** | `map<string, string>` | Additional properties associated with the statistic. Subset of Blob properties in the Puffin file. |
 
 
+#### Partition statistics
+
+Partition statistics files are based on [Partition Statistics file spec](#partition-statistics-file). Partition statistics are informational. A reader can choose to

Review Comment:
   Simplified





[GitHub] [iceberg] ajantha-bhat commented on a diff in pull request #7105: Spec: Add partition stats spec

Posted by "ajantha-bhat (via GitHub)" <gi...@apache.org>.
ajantha-bhat commented on code in PR #7105:
URL: https://github.com/apache/iceberg/pull/7105#discussion_r1332566163


##########
format/spec.md:
##########
@@ -702,6 +703,58 @@ Blob metadata is a struct with the following fields:
 | _optional_ | _optional_ | **`properties`** | `map<string, string>` | Additional properties associated with the statistic. Subset of Blob properties in the Puffin file. |
 
 
+#### Partition statistics
+
+Partition statistics files are based on [Partition Statistics file spec](#partition-statistics-file). 
+Partition statistics are not required for reading or planning and readers may ignore them.
+Each table snapshot may be associated with at most one partition statistic file.
+A writer can optionally write the partition statistics file during each write operation, and 
+it must be registered in the table metadata file to be considered as a valid statistics file for the reader.
+
+`partition-statistics` field of table metadata is an optional list of struct with the following fields:
+
+| v1 | v2 | Field name | Type | Description |
+|----|----|------------|------|-------------|
+| _required_ | _required_ | **`snapshot-id`** | `long` | ID of the Iceberg table's snapshot the partition statistics file is associated with. |
+| _required_ | _required_ | **`statistics-file-path`** | `string` | Path of the partition statistics file. See [Partition Statistics file](#partition-statistics-file). |
+| _required_ | _required_ | **`file-size-in-bytes`** | `long` | Size of the partition statistics file. |
+
+#### Partition Statistics file
+
+Statistics information for each unique partition tuple is stored as a row in the default data file format of the table (for example, Parquet or ORC).
+These rows must be sorted (in ascending manner with NULL FIRST) by `partition` field to optimize filtering rows while scanning.
+
+The schema of the partition statistics file is as follows:
+
+| v1 | v2 | Field id, name | Type | Description |
+|----|----|----------------|------|-------------|
+| _required_ | _required_ | **`1 partition`** | `struct<..>` | Partition data tuple, schema based on the unified partition type considering all specs in a table |
+| _required_ | _required_ | **`2 spec_id`** | `int` | Partition spec id |

Review Comment:
   I already have a POC PR that reads and writes these files.
   https://github.com/apache/iceberg/pull/8488/
   
   I think a unified tuple will be good for updates. If we keep a spec-specific tuple, the values of the same partition (after spec evolution) will be distributed across multiple buckets, which is hard for the reader. The existing partitions metadata table also uses the same unified partition tuple.
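
A rough Python sketch of why a single key type may simplify updates: with unified tuples, per-file counters can be summed into one map regardless of which spec wrote the file, and the resulting rows can then be sorted ascending with nulls first, as the draft text quoted above requires. The function names, field names, and sample values are illustrative assumptions, not the POC's API.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Optional, Tuple

PartitionTuple = Tuple[Optional[str], ...]


def aggregate(files: Iterable[Tuple[PartitionTuple, int]]) -> Dict[PartitionTuple, Dict[str, int]]:
    """Sum per-file counters into one row per unified partition tuple."""
    rows: Dict[PartitionTuple, Dict[str, int]] = defaultdict(
        lambda: {"data_record_count": 0, "data_file_count": 0}
    )
    for unified_tuple, record_count in files:
        row = rows[unified_tuple]
        row["data_record_count"] += record_count
        row["data_file_count"] += 1
    return dict(rows)


def nulls_first_key(partition: PartitionTuple):
    """Sort key: ascending, with None ordered before any non-null value."""
    return tuple((value is not None, value) for value in partition)


# (unified partition tuple, record count) for three hypothetical data files.
files = [
    (("2023-03-14", "eu"), 10),
    (("2023-03-14", None), 100),
    (("2023-03-14", None), 50),
]
stats = aggregate(files)
ordered: List[PartitionTuple] = sorted(stats, key=nulls_first_key)
```

An incremental update under these assumptions would only need to aggregate the files added by the new snapshot and merge those rows into the previous ones by the same key.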





[GitHub] [iceberg] aokolnychyi commented on a diff in pull request #7105: Spec: Add partition stats spec

Posted by "aokolnychyi (via GitHub)" <gi...@apache.org>.
aokolnychyi commented on code in PR #7105:
URL: https://github.com/apache/iceberg/pull/7105#discussion_r1332171774


##########
format/spec.md:
##########
@@ -702,6 +703,58 @@ Blob metadata is a struct with the following fields:
 | _optional_ | _optional_ | **`properties`** | `map<string, string>` | Additional properties associated with the statistic. Subset of Blob properties in the Puffin file. |
 
 
+#### Partition statistics
+
+Partition statistics files are based on [Partition Statistics file spec](#partition-statistics-file). 
+Partition statistics are not required for reading or planning and readers may ignore them.
+Each table snapshot may be associated with at most one partition statistic file.
+A writer can optionally write the partition statistics file during each write operation, and 
+it must be registered in the table metadata file to be considered as a valid statistics file for the reader.
+
+`partition-statistics` field of table metadata is an optional list of struct with the following fields:
+
+| v1 | v2 | Field name | Type | Description |
+|----|----|------------|------|-------------|
+| _required_ | _required_ | **`snapshot-id`** | `long` | ID of the Iceberg table's snapshot the partition statistics file is associated with. |
+| _required_ | _required_ | **`statistics-file-path`** | `string` | Path of the partition statistics file. See [Partition Statistics file](#partition-statistics-file). |
+| _required_ | _required_ | **`file-size-in-bytes`** | `long` | Size of the partition statistics file. |
+
+#### Partition Statistics file
+
+Statistics information for each unique partition tuple is stored as a row in the default data file format of the table (for example, Parquet or ORC).
+These rows must be sorted (in ascending manner with NULL FIRST) by `partition` field to optimize filtering rows while scanning.
+
+The schema of the partition statistics file is as follows:
+
+| v1 | v2 | Field id, name | Type | Description |
+|----|----|----------------|------|-------------|
+| _required_ | _required_ | **`1 partition`** | `struct<..>` | Partition data tuple, schema based on the unified partition type considering all specs in a table |
+| _required_ | _required_ | **`2 spec_id`** | `int` | Partition spec id |

Review Comment:
   I think the spec ID is required to reconstruct the actual partition tuple, if needed. The main question is whether it is easier to work with a unified tuple or a spec-specific tuple. If most use cases need a spec-specific tuple and would require a projection, nothing prevents us from having a file per spec and annotating each partition stats file with a spec ID instead of persisting it for each record. 
   
   Can we think through our initial plans for writing and reading these files? Doesn't have to be very elaborate at this point.





[GitHub] [iceberg] ajantha-bhat commented on a diff in pull request #7105: Spec: Add partition stats spec

Posted by "ajantha-bhat (via GitHub)" <gi...@apache.org>.
ajantha-bhat commented on code in PR #7105:
URL: https://github.com/apache/iceberg/pull/7105#discussion_r1324807607


##########
format/spec.md:
##########
@@ -702,6 +703,58 @@ Blob metadata is a struct with the following fields:
 | _optional_ | _optional_ | **`properties`** | `map<string, string>` | Additional properties associated with the statistic. Subset of Blob properties in the Puffin file. |
 
 
+#### Partition statistics
+
+Partition statistics files are based on [Partition Statistics file spec](#partition-statistics-file). 
+Partition statistics are not required for reading or planning and readers may ignore them.
+Each table snapshot may be associated with at most one partition statistic file.
+A writer can optionally write the partition statistics file during each write operation, and 
+it must be registered in the table metadata file to be considered as a valid statistics file for the reader.
+
+`partition-statistics` field of table metadata is an optional list of struct with the following fields:
+
+| v1 | v2 | Field name | Type | Description |
+|----|----|------------|------|-------------|
+| _required_ | _required_ | **`snapshot-id`** | `long` | ID of the Iceberg table's snapshot the partition statistics file is associated with. |
+| _required_ | _required_ | **`statistics-file-path`** | `string` | Path of the partition statistics file. See [Partition Statistics file](#partition-statistics-file). |
+| _required_ | _required_ | **`file-size-in-bytes`** | `long`   | 	Size of the partition statistics file. |

Review Comment:
   I also added this field today. Knowing the file size up front avoids one extra I/O call to look it up before seeking to the Parquet footer. The Puffin statistics file metadata also has this field.
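
To make the point about the saved I/O concrete, here is a minimal sketch (not code from the PR). It relies on the Parquet layout, where the file ends with a 4-byte little-endian footer length followed by the `PAR1` magic. The helper names `read_footer_with_known_size` and `make_fake_parquet` are hypothetical, and the bytes below are a stand-in rather than a real Parquet file.

```python
import io
import struct

PARQUET_MAGIC = b"PAR1"


def read_footer_with_known_size(stream, file_size):
    """Return the raw footer bytes, given a size taken from table metadata."""
    # The last 8 bytes are: 4-byte little-endian footer length + 4-byte magic.
    # With file_size already known, no separate size lookup is needed.
    stream.seek(file_size - 8)
    tail = stream.read(8)
    if tail[4:] != PARQUET_MAGIC:
        raise ValueError("missing Parquet magic at end of file")
    footer_len = struct.unpack("<I", tail[:4])[0]
    stream.seek(file_size - 8 - footer_len)
    return stream.read(footer_len)


def make_fake_parquet(footer):
    # Illustration only: mimics the trailing layout, not a valid Parquet file.
    body = PARQUET_MAGIC + b"row-group-bytes" + footer
    return body + struct.pack("<I", len(footer)) + PARQUET_MAGIC
```

Without `file-size-in-bytes` in the metadata, a reader on an object store would typically need an extra request to learn the object size before it could compute the `file_size - 8` offset.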





[GitHub] [iceberg] RussellSpitzer commented on a diff in pull request #7105: Spec: Add partition stats spec

Posted by "RussellSpitzer (via GitHub)" <gi...@apache.org>.
RussellSpitzer commented on code in PR #7105:
URL: https://github.com/apache/iceberg/pull/7105#discussion_r1323521415


##########
format/spec.md:
##########
@@ -702,6 +703,47 @@ Blob metadata is a struct with the following fields:
 | _optional_ | _optional_ | **`properties`** | `map<string, string>` | Additional properties associated with the statistic. Subset of Blob properties in the Puffin file. |
 
 
+#### Partition statistics
+
+Partition statistics files are based on [Partition Statistics file spec](#partition-statistics-file). 
+Partition statistics are not required for reading or planning and readers may ignore them.
+Each table snapshot may be associated with at most one partition statistic file.
+A writer can optionally write the partition statistics file during each write operation, and 
+it must be registered in the table metadata file to be considered as a valid statistics file for the reader.
+
+`partition-statistics` field of table metadata is an optional list of struct with the following fields:
+
+| v1 | v2 | Field name | Type | Description |
+|----|----|------------|------|-------------|
+| _required_ | _required_ | **`snapshot-id`** | `long` | ID of the Iceberg table's snapshot the partition statistics file is associated with. |
+| _required_ | _required_ | **`statistics-file-path`** | `string` | Path of the partition statistics file. See [Partition Statistics file](#partition-statistics-file). |
+
+#### Partition Statistics file
+
+Statistics information for each unique partition tuple is stored as a row in the default data file format of the table (for example, Parquet or ORC).
+These rows must be sorted (in ascending manner with NULL FIRST) by `partition` field to optimize filtering rows while scanning.
+
+The schema of the partition statistics file is as follows:
+
+| v1 | v2 | Field id, name | Type | Description |
+|----|----|----------------|------|-------------|
+| _required_ | _required_ | **`1 partition`** | `struct<..>` | Partition data tuple, schema based on the unified partition type considering all specs in a table |
+| _required_ | _required_ | **`2 spec_id`** | `int` | Partition spec id |
+| _required_ | _required_ | **`3 data_record_count`** | `long` | Count of records in data files |
+| _required_ | _required_ | **`4 data_file_count`** | `int` | Count of data files |
+| _required_ | _required_ | **`5 total_data_file_size_in_bytes`** | `long` | Total size of data files in bytes |
+| _optional_ | _optional_ | **`6 position_delete_record_count`** | `long` | Count of records in position delete files |
+| _optional_ | _optional_ | **`7 position_delete_file_count`** | `int` | Count of position delete files |
+| _optional_ | _optional_ | **`8 equality_delete_record_count`** | `long` | Count of records in equality delete files |
+| _optional_ | _optional_ | **`9 equality_delete_file_count`** | `int` | Count of equality delete files |
+| _optional_ | _optional_ | **`10 total_record_count`** | `long` | Accurate count of records in a partition after applying the delete files if any |
+| _optional_ | _optional_ | **`11 last_updated_at`** | `long` | Timestamp in milliseconds from the unix epoch when the partition was last updated |
+| _optional_ | _optional_ | **`12 last_updated_snapshot_id`** | `long` | ID of snapshot that last updated this partition |
+
+Note that partition data tuple's schema is based on the partition spec output using partition field ids for the struct field ids.
+The unified partition type is a struct containing all fields that have ever been a part of any spec in the table. 

Review Comment:
   This is underspecified, IMO. What does the struct look like, does field order matter, etc.?
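
For readers following the thread, one possible reading of "unified partition type" can be sketched as follows. This is an assumption for illustration, not the wording the spec settled on: the struct is the union of partition fields across all specs, keyed by partition field id, and ordered by field id. The function name and the tuple layout are hypothetical.

```python
def unified_partition_fields(specs):
    """Union the partition fields of all specs.

    specs: dict mapping spec_id -> list of (field_id, name, type_name) tuples.
    Returns a list of (field_id, name, type_name) for the unified struct.
    """
    fields = {}
    # Visit newer specs first so the most recent name for a field id wins.
    for spec_id in sorted(specs, reverse=True):
        for field_id, name, type_name in specs[spec_id]:
            fields.setdefault(field_id, (field_id, name, type_name))
    # Assumption: the struct fields are ordered by partition field id.
    return [fields[field_id] for field_id in sorted(fields)]
```

Under this reading, a row written with an older spec would carry nulls for fields that its spec never defined, which is exactly the kind of detail the question above asks the spec to state.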
   





[GitHub] [iceberg] aokolnychyi commented on a diff in pull request #7105: Spec: Add partition stats spec

Posted by "aokolnychyi (via GitHub)" <gi...@apache.org>.
aokolnychyi commented on code in PR #7105:
URL: https://github.com/apache/iceberg/pull/7105#discussion_r1337643158


##########
format/spec.md:
##########
@@ -702,6 +703,58 @@ Blob metadata is a struct with the following fields:
 | _optional_ | _optional_ | **`properties`** | `map<string, string>` | Additional properties associated with the statistic. Subset of Blob properties in the Puffin file. |
 
 
+#### Partition statistics
+
+Partition statistics files are based on [Partition Statistics file spec](#partition-statistics-file). 
+Partition statistics are not required for reading or planning and readers may ignore them.
+Each table snapshot may be associated with at most one partition statistic file.
+A writer can optionally write the partition statistics file during each write operation, and 
+it must be registered in the table metadata file to be considered as a valid statistics file for the reader.
+
+`partition-statistics` field of table metadata is an optional list of struct with the following fields:
+
+| v1 | v2 | Field name | Type | Description |
+|----|----|------------|------|-------------|
+| _required_ | _required_ | **`snapshot-id`** | `long` | ID of the Iceberg table's snapshot the partition statistics file is associated with. |
+| _required_ | _required_ | **`statistics-file-path`** | `string` | Path of the partition statistics file. See [Partition Statistics file](#partition-statistics-file). |
+| _required_ | _required_ | **`file-size-in-bytes`** | `long` | Size of the partition statistics file. |
+
+#### Partition Statistics file
+
+Statistics information for each unique partition tuple is stored as a row in the default data file format of the table (for example, Parquet or ORC).
+These rows must be sorted (in ascending manner with NULL FIRST) by `partition` field to optimize filtering rows while scanning.
+
+The schema of the partition statistics file is as follows:
+
+| v1 | v2 | Field id, name | Type | Description |
+|----|----|----------------|------|-------------|
+| _required_ | _required_ | **`1 partition`** | `struct<..>` | Partition data tuple, schema based on the unified partition type considering all specs in a table |
+| _required_ | _required_ | **`2 spec_id`** | `int` | Partition spec id |

Review Comment:
   I'll take a look at the PoC PR as soon as 1.4 is out (within the next few days).





[GitHub] [iceberg] ajantha-bhat commented on a diff in pull request #7105: Spec: Add partition stats spec

Posted by "ajantha-bhat (via GitHub)" <gi...@apache.org>.
ajantha-bhat commented on code in PR #7105:
URL: https://github.com/apache/iceberg/pull/7105#discussion_r1210054437


##########
format/partition-statistics-spec.md:
##########
@@ -0,0 +1,43 @@
+---
+title: "Partition Statistics Spec"
+url: partition-statistics-spec
+toc: true
+disableSidebar: true
+---
+<!--
+ - Licensed to the Apache Software Foundation (ASF) under one or more
+ - contributor license agreements.  See the NOTICE file distributed with
+ - this work for additional information regarding copyright ownership.
+ - The ASF licenses this file to You under the Apache License, Version 2.0
+ - (the "License"); you may not use this file except in compliance with
+ - the License.  You may obtain a copy of the License at
+ -
+ -   http://www.apache.org/licenses/LICENSE-2.0
+ -
+ - Unless required by applicable law or agreed to in writing, software
+ - distributed under the License is distributed on an "AS IS" BASIS,
+ - WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ - See the License for the specific language governing permissions and
+ - limitations under the License.
+ -->
+
+# Partition Statistics file format
+
+This is a specification for partition statistics files. It is designed to store statistics information 
+for every partition value as a row in the **table default format** sorted based on the `partition`.
+
+The schema of the file is as follows:
+
+| Field Id | Field Name              | Field Type   | isOptional | Doc                                                                                                                         |
+|----------|-------------------------|--------------|------------|-----------------------------------------------------------------------------------------------------------------------------|
+| 1        | partition               | StructType   | false      | See [PartitionData](https://github.com/apache/iceberg/blob/master/core/src/main/java/org/apache/iceberg/PartitionData.java) |                                                                                  
+| 2        | spec_id                 | IntegerType  | false      | Partition spec id                                                                                                           |
+| 3        | data_record_count       | LongType     | false      | Count of records in data files                                                                                              |                                               
+| 4        | data_file_count         | IntegerType  | false      | Count of data files                                                                                                         |

Review Comment:
   added `file_size_in_bytes`





[GitHub] [iceberg] ajantha-bhat commented on a diff in pull request #7105: Spec: Add partition stats spec

Posted by "ajantha-bhat (via GitHub)" <gi...@apache.org>.
ajantha-bhat commented on code in PR #7105:
URL: https://github.com/apache/iceberg/pull/7105#discussion_r1209641245


##########
format/partition-statistics-spec.md:
##########
@@ -0,0 +1,43 @@
+---
+title: "Partition Statistics Spec"
+url: partition-statistics-spec
+toc: true
+disableSidebar: true
+---
+<!--
+ - Licensed to the Apache Software Foundation (ASF) under one or more
+ - contributor license agreements.  See the NOTICE file distributed with
+ - this work for additional information regarding copyright ownership.
+ - The ASF licenses this file to You under the Apache License, Version 2.0
+ - (the "License"); you may not use this file except in compliance with
+ - the License.  You may obtain a copy of the License at
+ -
+ -   http://www.apache.org/licenses/LICENSE-2.0
+ -
+ - Unless required by applicable law or agreed to in writing, software
+ - distributed under the License is distributed on an "AS IS" BASIS,
+ - WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ - See the License for the specific language governing permissions and
+ - limitations under the License.
+ -->
+
+# Partition Statistics file format

Review Comment:
   OK. Puffin has a separate doc, so I added this as a separate doc as well.
   
   I will move it into the Iceberg spec itself.





[GitHub] [iceberg] rdblue commented on a diff in pull request #7105: Spec: Add partition stats spec

Posted by "rdblue (via GitHub)" <gi...@apache.org>.
rdblue commented on code in PR #7105:
URL: https://github.com/apache/iceberg/pull/7105#discussion_r1209511756


##########
format/partition-statistics-spec.md:
##########
@@ -0,0 +1,43 @@
+---
+title: "Partition Statistics Spec"
+url: partition-statistics-spec
+toc: true
+disableSidebar: true
+---
+<!--
+ - Licensed to the Apache Software Foundation (ASF) under one or more
+ - contributor license agreements.  See the NOTICE file distributed with
+ - this work for additional information regarding copyright ownership.
+ - The ASF licenses this file to You under the Apache License, Version 2.0
+ - (the "License"); you may not use this file except in compliance with
+ - the License.  You may obtain a copy of the License at
+ -
+ -   http://www.apache.org/licenses/LICENSE-2.0
+ -
+ - Unless required by applicable law or agreed to in writing, software
+ - distributed under the License is distributed on an "AS IS" BASIS,
+ - WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ - See the License for the specific language governing permissions and
+ - limitations under the License.
+ -->
+
+# Partition Statistics file format
+
+This is a specification for partition statistics files. It is designed to store statistics information 
+for every partition value as a row in the **table default format** sorted based on the `partition`.
+
+The schema of the file is as follows:
+
+| Field Id | Field Name                   | Field Type   | isOptional | Doc                                                                                                                         |
+|----------|------------------------------|--------------|------------|-----------------------------------------------------------------------------------------------------------------------------|
+| 1        | partition                    | StructType   | false      | See [PartitionData](https://github.com/apache/iceberg/blob/master/core/src/main/java/org/apache/iceberg/PartitionData.java) |                                                                                  
+| 2        | spec_id                      | IntegerType  | false      | Partition spec id                                                                                                           |
+| 3        | data_record_count            | LongType     | false      | Count of records in data files                                                                                              |                                               
+| 4        | data_file_count              | IntegerType  | false      | Count of data files                                                                                                         |
+| 5        | position_delete_record_count | LongType     | true       | Count of records in position delete files                                                                                   |                                                          
+| 6        | position_delete_file_count   | IntegerType  | true       | Count of position delete files                                                                                              |                                   
+| 7        | equality_delete_record_count | LongType     | true       | Count of records in equality delete files                                                                                   |                                  
+| 8        | equality_delete_file_count   | IntegerType  | true       | Count of equality delete files                                                                                              |                                

Review Comment:
   Why include equality/position delete counts rather than requiring an accurate record count? I don't have an opinion yet either way, but we could require accurate counts here.
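
The trade-off in this question can be sketched with a small, hypothetical helper (not from the PR). It assumes that each position delete record removes at most one row and may repeat an already-deleted position, and that an equality delete record may match any number of rows, including none. Under those assumptions the delete counts alone only bound the live record count.

```python
def live_record_count_bounds(data_records, position_deletes=0, equality_deletes=0):
    """Return (lower, upper) bounds on live rows in one partition.

    Assumptions for this sketch: a position delete record removes at most one
    row; an equality delete record may match zero or many rows.
    """
    upper = data_records
    if equality_deletes > 0:
        # Nothing can be concluded without actually applying the deletes.
        lower = 0
    else:
        lower = max(0, data_records - position_deletes)
    return lower, upper
```

An accurate count, by contrast, needs the deletes to be applied when the statistics are computed, which costs more on the write side.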





[GitHub] [iceberg] rdblue commented on a diff in pull request #7105: Spec: Add partition stats spec

Posted by "rdblue (via GitHub)" <gi...@apache.org>.
rdblue commented on code in PR #7105:
URL: https://github.com/apache/iceberg/pull/7105#discussion_r1209512352


##########
format/partition-statistics-spec.md:
##########
@@ -0,0 +1,43 @@
+---
+title: "Partition Statistics Spec"
+url: partition-statistics-spec
+toc: true
+disableSidebar: true
+---
+<!--
+ - Licensed to the Apache Software Foundation (ASF) under one or more
+ - contributor license agreements.  See the NOTICE file distributed with
+ - this work for additional information regarding copyright ownership.
+ - The ASF licenses this file to You under the Apache License, Version 2.0
+ - (the "License"); you may not use this file except in compliance with
+ - the License.  You may obtain a copy of the License at
+ -
+ -   http://www.apache.org/licenses/LICENSE-2.0
+ -
+ - Unless required by applicable law or agreed to in writing, software
+ - distributed under the License is distributed on an "AS IS" BASIS,
+ - WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ - See the License for the specific language governing permissions and
+ - limitations under the License.
+ -->
+
+# Partition Statistics file format
+
+This is a specification for partition statistics files. It is designed to store statistics information 
+for every partition value as a row in the **table default format** sorted based on the `partition`.
+
+The schema of the file is as follows:
+
+| Field Id | Field Name                   | Field Type   | isOptional | Doc                                                                                                                         |
+|----------|------------------------------|--------------|------------|-----------------------------------------------------------------------------------------------------------------------------|
+| 1        | partition                    | StructType   | false      | See [PartitionData](https://github.com/apache/iceberg/blob/master/core/src/main/java/org/apache/iceberg/PartitionData.java) |                                                                                  

Review Comment:
   The fields will also need to have sequence numbers assigned.





[GitHub] [iceberg] zhongyujiang commented on a diff in pull request #7105: Spec: Add partition stats spec

Posted by "zhongyujiang (via GitHub)" <gi...@apache.org>.
zhongyujiang commented on code in PR #7105:
URL: https://github.com/apache/iceberg/pull/7105#discussion_r1220695681


##########
format/spec.md:
##########
@@ -702,6 +703,44 @@ Blob metadata is a struct with the following fields:
 | _optional_ | _optional_ | **`properties`** | `map<string, string>` | Additional properties associated with the statistic. Subset of Blob properties in the Puffin file. |
 
 
+#### Partition statistics
+
+Partition statistics files are the valid files based on [Partition Statistics file format](#partition-statistics-file-format). Partition statistics are informational. A reader can choose to
+ignore partition statistics information. Partition statistics support is not required to read the table correctly. A table can contain
+many partition statistics files associated with different table snapshots.
+A writer can optionally write the partition statistics file during each write operation. If the statistics file is written for the specific snapshot, 
+It must be registered in the table metadata file to be considered as a valid statistics file for the reader.
+
+Partition statistics files metadata within `partition-statistics` table metadata field is a struct with the following fields:
+
+| v1 | v2 | Field name | Type | Description |
+|----|----|------------|------|-------------|
+| _required_ | _required_ | **`snapshot-id`** | `long` | ID of the Iceberg table's snapshot the partition statistics file is associated with. |
+| _required_ | _required_ | **`statistics-file-path`** | `string` | Path of the partition statistics file. See [Partition Statistics file format](#partition-statistics-file-format). |
+| _required_ | _required_ | **`max-data-sequence-number`** | `long` | Maximum data sequence number of the Iceberg table's snapshot the partition statistics was computed from. |
+
+#### Partition Statistics file format
+
+This is a specification for partition statistics files. It is designed to store statistics information
+for every partition value as a row in the **table default format** sorted based on the first partition column from `partition`.
+
+Partition statistics file store the statistics as a struct with the following fields:
+
+| v1 | v2 | Field id, name | Type | Description |
+|----|----|----------------|------|-------------|
+| _required_ | _required_ | **`1 partition`** | `struct<..>` | See [PartitionData](https://github.com/apache/iceberg/blob/master/core/src/main/java/org/apache/iceberg/PartitionData.java) |

Review Comment:
   Thanks for explaining. I previously thought this was the same as `PartitionData` in `DataFile`.





[GitHub] [iceberg] jackye1995 commented on a diff in pull request #7105: Spec: Add partition stats spec

Posted by "jackye1995 (via GitHub)" <gi...@apache.org>.
jackye1995 commented on code in PR #7105:
URL: https://github.com/apache/iceberg/pull/7105#discussion_r1237749434


##########
format/spec.md:
##########
@@ -702,6 +703,48 @@ Blob metadata is a struct with the following fields:
 | _optional_ | _optional_ | **`properties`** | `map<string, string>` | Additional properties associated with the statistic. Subset of Blob properties in the Puffin file. |
 
 
+#### Partition statistics
+
+Partition statistics files are based on [Partition Statistics file spec](#partition-statistics-file). Partition statistics are informational. A reader can choose to
+ignore partition statistics information. Partition statistics support is not required to read the table correctly. A table can contain
+many partition statistics files associated with different table snapshots.
+A writer can optionally write the partition statistics file during each write operation. If the statistics file is written for the specific snapshot, 
+It must be registered in the table metadata file to be considered as a valid statistics file for the reader.
+
+Partition statistics files metadata within `partition-statistics` table metadata field is a struct with the following fields:
+
+| v1 | v2 | Field name | Type | Description |
+|----|----|------------|------|-------------|
+| _required_ | _required_ | **`snapshot-id`** | `long` | ID of the Iceberg table's snapshot the partition statistics file is associated with. |
+| _required_ | _required_ | **`statistics-file-path`** | `string` | Path of the partition statistics file. See [Partition Statistics file](#partition-statistics-file). |
+| _required_ | _required_ | **`max-data-sequence-number`** | `long` | Maximum data sequence number of the Iceberg table's snapshot the partition statistics was computed from. |
+
+#### Partition Statistics file
+
+Statistics information for every partition tuple is stored as a row in the **table default format**. 
+These rows are sorted based on the first partition column from `partition`. 
+Each unique partition tuple must have exactly one corresponding row, ensuring all partition tuples are present.
+
+Partition statistics file store the statistics as a struct with the following fields:
+
+| v1 | v2 | Field id, name | Type | Description |
+|----|----|----------------|------|-------------|
+| _required_ | _required_ | **`1 partition`** | `struct<..>` | Partition data tuple, schema based on the partition spec output using partition field ids for the struct field ids |
+| _required_ | _required_ | **`2 spec_id`** | `int` | Partition spec id |
+| _required_ | _required_ | **`3 data_record_count`** | `long` | Count of records in data files |
+| _required_ | _required_ | **`4 data_file_count`** | `int` | Count of data files |
+| _required_ | _required_ | **`5 data_file_size_in_bytes`** | `long` | Total size of data files in bytes |
+| _optional_ | _optional_ | **`6 position_delete_record_count`** | `long` | Count of records in position delete files |
+| _optional_ | _optional_ | **`7 position_delete_file_count`** | `int` | Count of position delete files |
+| _optional_ | _optional_ | **`8 equality_delete_record_count`** | `long` | Count of records in equality delete files |
+| _optional_ | _optional_ | **`9 equality_delete_file_count`** | `int` | Count of equality delete files |
+
+Each snapshot may create a new partition statistics file. The name of the partition statistics file is as follows:       

Review Comment:
   may? Or must?



##########
format/spec.md:
##########
@@ -702,6 +703,21 @@ Blob metadata is a struct with the following fields:
 | _optional_ | _optional_ | **`properties`** | `map<string, string>` | Additional properties associated with the statistic. Subset of Blob properties in the Puffin file. |
 
 
+#### Partition statistics
+
+Partition statistics files are the valid files based on [Partition statistics spec](../partition-statistics-spec). Partition statistics are informational. A reader can choose to
+ignore partition statistics information. Partition statistics support is not required to read the table correctly. A table can contain
+many partition statistics files associated with different table snapshots.

Review Comment:
   I think what we need to add is that the stats must be accurate if provided, otherwise it will be hard to depend on this information.



##########
format/spec.md:
##########
@@ -702,6 +703,48 @@ Blob metadata is a struct with the following fields:
 | _optional_ | _optional_ | **`properties`** | `map<string, string>` | Additional properties associated with the statistic. Subset of Blob properties in the Puffin file. |
 
 
+#### Partition statistics
+
+Partition statistics files are based on [Partition Statistics file spec](#partition-statistics-file). Partition statistics are informational. A reader can choose to
+ignore partition statistics information. Partition statistics support is not required to read the table correctly. A table can contain
+many partition statistics files associated with different table snapshots.
+A writer can optionally write the partition statistics file during each write operation. If the statistics file is written for the specific snapshot, 
+It must be registered in the table metadata file to be considered as a valid statistics file for the reader.
+
+Partition statistics files metadata within `partition-statistics` table metadata field is a struct with the following fields:
+
+| v1 | v2 | Field name | Type | Description |
+|----|----|------------|------|-------------|
+| _required_ | _required_ | **`snapshot-id`** | `long` | ID of the Iceberg table's snapshot the partition statistics file is associated with. |
+| _required_ | _required_ | **`statistics-file-path`** | `string` | Path of the partition statistics file. See [Partition Statistics file](#partition-statistics-file). |
+| _required_ | _required_ | **`max-data-sequence-number`** | `long` | Maximum data sequence number of the Iceberg table's snapshot the partition statistics was computed from. |
+
+#### Partition Statistics file
+
+Statistics information for every partition tuple is stored as a row in the **table default format**. 

Review Comment:
   Why do we want to bind the stats file format to the table's default format? Wouldn't it be better to have a table property to control that?





[GitHub] [iceberg] ajantha-bhat commented on a diff in pull request #7105: Spec: Add partition stats spec

Posted by "ajantha-bhat (via GitHub)" <gi...@apache.org>.
ajantha-bhat commented on code in PR #7105:
URL: https://github.com/apache/iceberg/pull/7105#discussion_r1319799186


##########
format/spec.md:
##########
@@ -702,6 +703,41 @@ Blob metadata is a struct with the following fields:
 | _optional_ | _optional_ | **`properties`** | `map<string, string>` | Additional properties associated with the statistic. Subset of Blob properties in the Puffin file. |
 
 
+#### Partition statistics
+
+Partition statistics files are based on [Partition Statistics file spec](#partition-statistics-file). Partition statistics are informational. A reader can choose to
+ignore partition statistics information. Partition statistics support is not required to read the table correctly.
+Each table snapshot may be associated with at most one partition statistic file and the table can contain many partition statistics files associated with different table snapshots.
+A writer can optionally write the partition statistics file during each write operation. If the statistics file is written for the specific snapshot,
+it must be accurate and must be registered in the table metadata file to be considered as a valid statistics file for the reader.
+
+Partition statistics files metadata within `partition-statistics` table metadata field is a struct with the following fields:
+
+| v1 | v2 | Field name | Type | Description |
+|----|----|------------|------|-------------|
+| _required_ | _required_ | **`snapshot-id`** | `long` | ID of the Iceberg table's snapshot the partition statistics file is associated with. |
+| _required_ | _required_ | **`statistics-file-path`** | `string` | Path of the partition statistics file. See [Partition Statistics file](#partition-statistics-file). |
+| _required_ | _required_ | **`max-data-sequence-number`** | `long` | Maximum data sequence number of the Iceberg table's snapshot the partition statistics was computed from. |
+
+#### Partition Statistics file
+
+Statistics information for every partition tuple is stored as a row in the **table default format**.
+These rows are sorted (in ascending manner with NULL FIRST) based on all partition columns from `partition` in the same order

Review Comment:
   Added a detailed note.





[GitHub] [iceberg] flyrain commented on pull request #7105: Spec: Add partition stats spec

Posted by "flyrain (via GitHub)" <gi...@apache.org>.
flyrain commented on PR #7105:
URL: https://github.com/apache/iceberg/pull/7105#issuecomment-1726253342

   Hi @aokolnychyi, @rdblue, can you take a look at the spec change? If it looks good, we'll move ahead with the merge.




[GitHub] [iceberg] ajantha-bhat commented on a diff in pull request #7105: Spec: Add partition stats spec

Posted by "ajantha-bhat (via GitHub)" <gi...@apache.org>.
ajantha-bhat commented on code in PR #7105:
URL: https://github.com/apache/iceberg/pull/7105#discussion_r1331897463


##########
format/spec.md:
##########
@@ -702,6 +703,58 @@ Blob metadata is a struct with the following fields:
 | _optional_ | _optional_ | **`properties`** | `map<string, string>` | Additional properties associated with the statistic. Subset of Blob properties in the Puffin file. |
 
 
+#### Partition statistics
+
+Partition statistics files are based on [Partition Statistics file spec](#partition-statistics-file). 
+Partition statistics are not required for reading or planning and readers may ignore them.
+Each table snapshot may be associated with at most one partition statistic file.
+A writer can optionally write the partition statistics file during each write operation, and 
+it must be registered in the table metadata file to be considered as a valid statistics file for the reader.
+
+`partition-statistics` field of table metadata is an optional list of struct with the following fields:
+
+| v1 | v2 | Field name | Type | Description |
+|----|----|------------|------|-------------|
+| _required_ | _required_ | **`snapshot-id`** | `long` | ID of the Iceberg table's snapshot the partition statistics file is associated with. |
+| _required_ | _required_ | **`statistics-file-path`** | `string` | Path of the partition statistics file. See [Partition Statistics file](#partition-statistics-file). |
+| _required_ | _required_ | **`file-size-in-bytes`** | `long` | Size of the partition statistics file. |
+
+#### Partition Statistics file
+
+Statistics information for each unique partition tuple is stored as a row in the default data file format of the table (for example, Parquet or ORC).
+These rows must be sorted (in ascending manner with NULL FIRST) by `partition` field to optimize filtering rows while scanning.
+
+The schema of the partition statistics file is as follows:
+
+| v1 | v2 | Field id, name | Type | Description |
+|----|----|----------------|------|-------------|
+| _required_ | _required_ | **`1 partition`** | `struct<..>` | Partition data tuple, schema based on the unified partition type considering all specs in a table |
+| _required_ | _required_ | **`2 spec_id`** | `int` | Partition spec id |
+| _required_ | _required_ | **`3 data_record_count`** | `long` | Count of records in data files |
+| _required_ | _required_ | **`4 data_file_count`** | `int` | Count of data files |

Review Comment:
   Yes. It applies to `data_record_count` and the other fields as well.
   Agreed that we need to standardise this.





[GitHub] [iceberg] jbonofre commented on a diff in pull request #7105: Spec: Add partition stats spec

Posted by "jbonofre (via GitHub)" <gi...@apache.org>.
jbonofre commented on code in PR #7105:
URL: https://github.com/apache/iceberg/pull/7105#discussion_r1324098073


##########
format/spec.md:
##########
@@ -702,6 +703,47 @@ Blob metadata is a struct with the following fields:
 | _optional_ | _optional_ | **`properties`** | `map<string, string>` | Additional properties associated with the statistic. Subset of Blob properties in the Puffin file. |
 
 
+#### Partition statistics
+
+Partition statistics files are based on [Partition Statistics file spec](#partition-statistics-file). 
+Partition statistics are not required for reading or planning and readers may ignore them.
+Each table snapshot may be associated with at most one partition statistic file.
+A writer can optionally write the partition statistics file during each write operation, and 
+it must be registered in the table metadata file to be considered as a valid statistics file for the reader.
+
+`partition-statistics` field of table metadata is an optional list of struct with the following fields:
+
+| v1 | v2 | Field name | Type | Description |
+|----|----|------------|------|-------------|
+| _required_ | _required_ | **`snapshot-id`** | `long` | ID of the Iceberg table's snapshot the partition statistics file is associated with. |
+| _required_ | _required_ | **`statistics-file-path`** | `string` | Path of the partition statistics file. See [Partition Statistics file](#partition-statistics-file). |
+
+#### Partition Statistics file
+
+Statistics information for each unique partition tuple is stored as a row in the default data file format of the table (for example, Parquet or ORC).
+These rows must be sorted (in ascending manner with NULL FIRST) by `partition` field to optimize filtering rows while scanning.
+
+The schema of the partition statistics file is as follows:
+
+| v1 | v2 | Field id, name | Type | Description |
+|----|----|----------------|------|-------------|
+| _required_ | _required_ | **`1 partition`** | `struct<..>` | Partition data tuple, schema based on the unified partition type considering all specs in a table |
+| _required_ | _required_ | **`2 spec_id`** | `int` | Partition spec id |
+| _required_ | _required_ | **`3 data_record_count`** | `long` | Count of records in data files |
+| _required_ | _required_ | **`4 data_file_count`** | `int` | Count of data files |
+| _required_ | _required_ | **`5 total_data_file_size_in_bytes`** | `long` | Total size of data files in bytes |
+| _optional_ | _optional_ | **`6 position_delete_record_count`** | `long` | Count of records in position delete files |
+| _optional_ | _optional_ | **`7 position_delete_file_count`** | `int` | Count of position delete files |
+| _optional_ | _optional_ | **`8 equality_delete_record_count`** | `long` | Count of records in equality delete files |
+| _optional_ | _optional_ | **`9 equality_delete_file_count`** | `int` | Count of equality delete files |
+| _optional_ | _optional_ | **`10 total_record_count`** | `long` | Accurate count of records in a partition after applying the delete files if any |
+| _optional_ | _optional_ | **`11 last_updated_at`** | `long` | Timestamp in milliseconds from the unix epoch when the partition was last updated |
+| _optional_ | _optional_ | **`12 last_updated_snapshot_id`** | `long` | ID of snapshot that last updated this partition |
+
+Note that partition data tuple's schema is based on the partition spec output using partition field ids for the struct field ids.
+The unified partition type is a struct containing all fields that have ever been a part of any spec in the table. 

Review Comment:
   @RussellSpitzer you mean at the documentation level (in `spec.md`), right?





[GitHub] [iceberg] ajantha-bhat commented on pull request #7105: Spec: Add partition stats spec

Posted by "ajantha-bhat (via GitHub)" <gi...@apache.org>.
ajantha-bhat commented on PR #7105:
URL: https://github.com/apache/iceberg/pull/7105#issuecomment-1482319205

   > I guess there is the same argument also for things like manifest list, today rolling up manifest list is a bottleneck for write operations, and some kind of design backed by a key-value store could solve that bottleneck. Maybe we should think about that and try to solve these cases together? Just like we have FileIO that works really well with object storage semantics, we can have something like VersionedListStore that works well with any mutable but versioned list.
   
   Yes, I totally agree that storing metadata in files can be a bottleneck (this applies to stats as well). I think catalogs are now maturing enough to store metadata in a DB (REST, and maybe Nessie in the future). But migrating from one catalog to another needs special handling in these cases.
   
   We should separately discuss this and handle it together for all the metadata. 




[GitHub] [iceberg] szehon-ho commented on a diff in pull request #7105: Spec: Add partition stats spec

Posted by "szehon-ho (via GitHub)" <gi...@apache.org>.
szehon-ho commented on code in PR #7105:
URL: https://github.com/apache/iceberg/pull/7105#discussion_r1316589084


##########
format/spec.md:
##########
@@ -702,6 +703,45 @@ Blob metadata is a struct with the following fields:
 | _optional_ | _optional_ | **`properties`** | `map<string, string>` | Additional properties associated with the statistic. Subset of Blob properties in the Puffin file. |
 
 
+#### Partition statistics
+
+Partition statistics files are based on [Partition Statistics file spec](#partition-statistics-file). Partition statistics are informational. A reader can choose to
+ignore partition statistics information. Partition statistics support is not required to read the table correctly. A table can contain
+many partition statistics files associated with different table snapshots.
+A writer can optionally write the partition statistics file during each write operation. If the statistics file is written for the specific snapshot,
+it must be accurate and must be registered in the table metadata file to be considered as a valid statistics file for the reader.
+
+Partition statistics files metadata within `partition-statistics` table metadata field is a struct with the following fields:
+
+| v1 | v2 | Field name | Type | Description |
+|----|----|------------|------|-------------|
+| _required_ | _required_ | **`snapshot-id`** | `long` | ID of the Iceberg table's snapshot the partition statistics file is associated with. |
+| _required_ | _required_ | **`statistics-file-path`** | `string` | Path of the partition statistics file. See [Partition Statistics file](#partition-statistics-file). |
+| _required_ | _required_ | **`max-data-sequence-number`** | `long` | Maximum data sequence number of the Iceberg table's snapshot the partition statistics was computed from. |
+
+#### Partition Statistics file
+
+Statistics information for every partition tuple is stored as a row in the **table default format**.
+These rows are sorted (in ascending manner with NULL FIRST) based on the first partition column from `partition`
+to optimize filtering rows while scanning.
+Each unique partition tuple must have exactly one corresponding row, ensuring all partition tuples are present.

Review Comment:
   Not sure I understand this sentence. Did we mean: 'ensuring all partition tuples are present.' => 'ensuring that statistics for all partitions are present'?



##########
format/spec.md:
##########
@@ -702,6 +703,45 @@ Blob metadata is a struct with the following fields:
 | _optional_ | _optional_ | **`properties`** | `map<string, string>` | Additional properties associated with the statistic. Subset of Blob properties in the Puffin file. |
 
 
+#### Partition statistics
+
+Partition statistics files are based on [Partition Statistics file spec](#partition-statistics-file). Partition statistics are informational. A reader can choose to
+ignore partition statistics information. Partition statistics support is not required to read the table correctly. A table can contain
+many partition statistics files associated with different table snapshots.
+A writer can optionally write the partition statistics file during each write operation. If the statistics file is written for the specific snapshot,
+it must be accurate and must be registered in the table metadata file to be considered as a valid statistics file for the reader.
+
+Partition statistics files metadata within `partition-statistics` table metadata field is a struct with the following fields:
+
+| v1 | v2 | Field name | Type | Description |
+|----|----|------------|------|-------------|
+| _required_ | _required_ | **`snapshot-id`** | `long` | ID of the Iceberg table's snapshot the partition statistics file is associated with. |
+| _required_ | _required_ | **`statistics-file-path`** | `string` | Path of the partition statistics file. See [Partition Statistics file](#partition-statistics-file). |
+| _required_ | _required_ | **`max-data-sequence-number`** | `long` | Maximum data sequence number of the Iceberg table's snapshot the partition statistics was computed from. |
+
+#### Partition Statistics file
+
+Statistics information for every partition tuple is stored as a row in the **table default format**.
+These rows are sorted (in ascending manner with NULL FIRST) based on the first partition column from `partition`

Review Comment:
   Did we consider sorting by the entire partition struct? (Why sort only by the first column?)
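
To make the question concrete, here is a minimal sketch of ordering rows by the whole partition tuple, ascending with nulls first on every field. It assumes partition tuples are modelled as plain Python tuples whose non-null values in the same position are mutually comparable; `nulls_first_key` and `sort_partitions` are hypothetical helper names and not part of Iceberg.

```python
from typing import List, Optional, Tuple

# Hypothetical stand-in for a partition data tuple; None models a null value.
PartitionTuple = Tuple[Optional[object], ...]


def nulls_first_key(partition: PartitionTuple) -> tuple:
    """Build a sort key that orders every field ascending with nulls first.

    Each field becomes (is_not_null, value). False sorts before True, so a
    null field always precedes a non-null one, and the placeholder 0 is never
    compared against a real value.
    """
    return tuple(
        (value is not None, value if value is not None else 0)
        for value in partition
    )


def sort_partitions(partitions: List[PartitionTuple]) -> List[PartitionTuple]:
    # Sort by all partition fields, in field order, not only by the first one.
    return sorted(partitions, key=nulls_first_key)


rows = [("b", 2), (None, 5), ("a", None), ("a", 1)]
print(sort_partitions(rows))
```

Sorting on the first field alone would leave `("a", None)` and `("a", 1)` in arbitrary relative order; using the full tuple makes the order deterministic.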



##########
format/spec.md:
##########
@@ -702,6 +703,51 @@ Blob metadata is a struct with the following fields:
 | _optional_ | _optional_ | **`properties`** | `map<string, string>` | Additional properties associated with the statistic. Subset of Blob properties in the Puffin file. |
 
 
+#### Partition statistics
+
+Partition statistics files are based on [Partition Statistics file spec](#partition-statistics-file). Partition statistics are informational. A reader can choose to
+ignore partition statistics information. Partition statistics support is not required to read the table correctly. A table can contain
+many partition statistics files associated with different table snapshots.
+A writer can optionally write the partition statistics file during each write operation. If the statistics file is written for the specific snapshot, 
+it must be accurate and must be registered in the table metadata file to be considered as a valid statistics file for the reader.
+
+Partition statistics files metadata within `partition-statistics` table metadata field is a struct with the following fields:
+
+| v1 | v2 | Field name | Type | Description |
+|----|----|------------|------|-------------|
+| _required_ | _required_ | **`snapshot-id`** | `long` | ID of the Iceberg table's snapshot the partition statistics file is associated with. |
+| _required_ | _required_ | **`statistics-file-path`** | `string` | Path of the partition statistics file. See [Partition Statistics file](#partition-statistics-file). |
+| _required_ | _required_ | **`max-data-sequence-number`** | `long` | Maximum data sequence number of the Iceberg table's snapshot the partition statistics was computed from. |
+
+#### Partition Statistics file
+
+Statistics information for every partition tuple is stored as a row in the **table default format**. 
+These rows are sorted based on the first partition column from `partition`. 
+Each unique partition tuple must have exactly one corresponding row, ensuring all partition tuples are present.
+
+Partition statistics file store the statistics as a struct with the following fields:
+
+| v1 | v2 | Field id, name | Type | Description |
+|----|----|----------------|------|-------------|
+| _required_ | _required_ | **`1 partition`** | `struct<..>` | Partition data tuple, schema based on the partition spec output using partition field ids for the struct field ids |
+| _required_ | _required_ | **`2 spec_id`** | `int` | Partition spec id |
+| _required_ | _required_ | **`3 data_record_count`** | `long` | Count of records in data files |
+| _required_ | _required_ | **`4 data_file_count`** | `int` | Count of data files |
+| _required_ | _required_ | **`5 total_data_file_size_in_bytes`** | `long` | Total size of data files in bytes |
+| _optional_ | _optional_ | **`6 position_delete_record_count`** | `long` | Count of records in position delete files |

Review Comment:
   Can we keep the existing fields to match the partitions table, and then add a `record_count` field that holds the precomputed accurate value?



##########
format/spec.md:
##########
@@ -702,6 +703,45 @@ Blob metadata is a struct with the following fields:
 | _optional_ | _optional_ | **`properties`** | `map<string, string>` | Additional properties associated with the statistic. Subset of Blob properties in the Puffin file. |
 
 
+#### Partition statistics
+
+Partition statistics files are based on [Partition Statistics file spec](#partition-statistics-file). Partition statistics are informational. A reader can choose to
+ignore partition statistics information. Partition statistics support is not required to read the table correctly. A table can contain
+many partition statistics files associated with different table snapshots.
+A writer can optionally write the partition statistics file during each write operation. If the statistics file is written for the specific snapshot,
+it must be accurate and must be registered in the table metadata file to be considered as a valid statistics file for the reader.
+
+Partition statistics files metadata within `partition-statistics` table metadata field is a struct with the following fields:
+
+| v1 | v2 | Field name | Type | Description |
+|----|----|------------|------|-------------|
+| _required_ | _required_ | **`snapshot-id`** | `long` | ID of the Iceberg table's snapshot the partition statistics file is associated with. |
+| _required_ | _required_ | **`statistics-file-path`** | `string` | Path of the partition statistics file. See [Partition Statistics file](#partition-statistics-file). |
+| _required_ | _required_ | **`max-data-sequence-number`** | `long` | Maximum data sequence number of the Iceberg table's snapshot the partition statistics was computed from. |
+
+#### Partition Statistics file
+
+Statistics information for every partition tuple is stored as a row in the **table default format**.
+These rows are sorted (in ascending manner with NULL FIRST) based on the first partition column from `partition`
+to optimize filtering rows while scanning.
+Each unique partition tuple must have exactly one corresponding row, ensuring all partition tuples are present.
+
+Partition statistics file store the statistics as a struct with the following fields:

Review Comment:
   Grammar: "Partition statistics files store statistics" or "A partition statistics file stores statistics..."



##########
format/spec.md:
##########
@@ -702,6 +703,45 @@ Blob metadata is a struct with the following fields:
 | _optional_ | _optional_ | **`properties`** | `map<string, string>` | Additional properties associated with the statistic. Subset of Blob properties in the Puffin file. |
 
 
+#### Partition statistics
+
+Partition statistics files are based on [Partition Statistics file spec](#partition-statistics-file). Partition statistics are informational. A reader can choose to
+ignore partition statistics information. Partition statistics support is not required to read the table correctly. A table can contain

Review Comment:
   If we add it:  `Each table snapshot may be associated with at most one partition statistic file`?
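
As an illustration of what that rule would imply for readers, here is a small sketch that indexes `partition-statistics` entries by snapshot and rejects a second file for the same snapshot. The field names `snapshot-id` and `statistics-file-path` come from the table in the diff; the function name, the use of plain dicts, and the example paths are assumptions made for illustration only.

```python
from typing import Dict, Iterable, Mapping


def index_partition_stats(entries: Iterable[Mapping[str, object]]) -> Dict[int, str]:
    """Map each snapshot id to its partition statistics file path.

    Raises ValueError if two entries point at the same snapshot, since a
    snapshot may be associated with at most one partition statistics file.
    """
    by_snapshot: Dict[int, str] = {}
    for entry in entries:
        snapshot_id = int(entry["snapshot-id"])
        path = str(entry["statistics-file-path"])
        if snapshot_id in by_snapshot:
            raise ValueError(
                f"snapshot {snapshot_id} already has a partition statistics file"
            )
        by_snapshot[snapshot_id] = path
    return by_snapshot


# Hypothetical metadata entries; the paths are made up for the example.
entries = [
    {"snapshot-id": 1, "statistics-file-path": "metadata/stats-1.parquet"},
    {"snapshot-id": 2, "statistics-file-path": "metadata/stats-2.parquet"},
]
print(index_partition_stats(entries))
```

A table can still hold many statistics files under this rule, as long as each one is tied to a different snapshot.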





[GitHub] [iceberg] ajantha-bhat commented on a diff in pull request #7105: Spec: Add partition stats spec

Posted by "ajantha-bhat (via GitHub)" <gi...@apache.org>.
ajantha-bhat commented on code in PR #7105:
URL: https://github.com/apache/iceberg/pull/7105#discussion_r1317131352


##########
format/spec.md:
##########
@@ -702,6 +703,51 @@ Blob metadata is a struct with the following fields:
 | _optional_ | _optional_ | **`properties`** | `map<string, string>` | Additional properties associated with the statistic. Subset of Blob properties in the Puffin file. |
 
 
+#### Partition statistics
+
+Partition statistics files are based on [Partition Statistics file spec](#partition-statistics-file). Partition statistics are informational. A reader can choose to
+ignore partition statistics information. Partition statistics support is not required to read the table correctly. A table can contain
+many partition statistics files associated with different table snapshots.
+A writer can optionally write the partition statistics file during each write operation. If the statistics file is written for the specific snapshot, 
+it must be accurate and must be registered in the table metadata file to be considered as a valid statistics file for the reader.
+
+Partition statistics files metadata within `partition-statistics` table metadata field is a struct with the following fields:
+
+| v1 | v2 | Field name | Type | Description |
+|----|----|------------|------|-------------|
+| _required_ | _required_ | **`snapshot-id`** | `long` | ID of the Iceberg table's snapshot the partition statistics file is associated with. |
+| _required_ | _required_ | **`statistics-file-path`** | `string` | Path of the partition statistics file. See [Partition Statistics file](#partition-statistics-file). |
+| _required_ | _required_ | **`max-data-sequence-number`** | `long` | Maximum data sequence number of the Iceberg table's snapshot the partition statistics was computed from. |
+
+#### Partition Statistics file
+
+Statistics information for every partition tuple is stored as a row in the **table default format**. 
+These rows are sorted based on the first partition column from `partition`. 
+Each unique partition tuple must have exactly one corresponding row, ensuring all partition tuples are present.
+
+Partition statistics file store the statistics as a struct with the following fields:
+
+| v1 | v2 | Field id, name | Type | Description |
+|----|----|----------------|------|-------------|
+| _required_ | _required_ | **`1 partition`** | `struct<..>` | Partition data tuple, schema based on the partition spec output using partition field ids for the struct field ids |
+| _required_ | _required_ | **`2 spec_id`** | `int` | Partition spec id |
+| _required_ | _required_ | **`3 data_record_count`** | `long` | Count of records in data files |
+| _required_ | _required_ | **`4 data_file_count`** | `int` | Count of data files |
+| _required_ | _required_ | **`5 total_data_file_size_in_bytes`** | `long` | Total size of data files in bytes |
+| _optional_ | _optional_ | **`6 position_delete_record_count`** | `long` | Count of records in position delete files |
+| _optional_ | _optional_ | **`7 position_delete_file_count`** | `int` | Count of position delete files |
+| _optional_ | _optional_ | **`8 equality_delete_record_count`** | `long` | Count of records in equality delete files |
+| _optional_ | _optional_ | **`9 equality_delete_file_count`** | `int` | Count of equality delete files |
+| _optional_ | _optional_ | **`9 equality_delete_file_count`** | `int` | Count of equality delete files |
+| _optional_ | _optional_ | **`10 last_updated_at`** | `timestamptz` | Commit time of snapshot that last updated this partition |

Review Comment:
   I let go of the idea of making the schema the same as the partitions table.
   Changed it to `long`, which is a timestamp from the unix epoch.
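
For readers consuming the `long` value, a minimal conversion sketch follows. It assumes the value is in milliseconds from the unix epoch, as a later revision of the schema table describes `last_updated_at`; the helper names are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Reference point for the conversion: the unix epoch in UTC.
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def from_epoch_millis(millis: int) -> datetime:
    """Convert milliseconds since the unix epoch to a timezone-aware datetime."""
    # timedelta arithmetic avoids the float rounding of dividing by 1000.
    return EPOCH + timedelta(milliseconds=millis)


def to_epoch_millis(moment: datetime) -> int:
    """Convert a timezone-aware datetime back to whole milliseconds since the epoch."""
    return (moment - EPOCH) // timedelta(milliseconds=1)


last_updated_at = 1_694_000_000_000  # example value, not taken from a real table
print(from_epoch_millis(last_updated_at).isoformat())
```

Storing a plain `long` keeps the file schema independent of how each file format encodes timestamp types, at the cost of this conversion on read.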





[GitHub] [iceberg] RussellSpitzer commented on a diff in pull request #7105: Spec: Add partition stats spec

Posted by "RussellSpitzer (via GitHub)" <gi...@apache.org>.
RussellSpitzer commented on code in PR #7105:
URL: https://github.com/apache/iceberg/pull/7105#discussion_r1317406916


##########
format/spec.md:
##########
@@ -702,6 +703,41 @@ Blob metadata is a struct with the following fields:
 | _optional_ | _optional_ | **`properties`** | `map<string, string>` | Additional properties associated with the statistic. Subset of Blob properties in the Puffin file. |
 
 
+#### Partition statistics
+
+Partition statistics files are based on [Partition Statistics file spec](#partition-statistics-file). Partition statistics are informational. A reader can choose to
+ignore partition statistics information. Partition statistics support is not required to read the table correctly.
+Each table snapshot may be associated with at most one partition statistic file and the table can contain many partition statistics files associated with different table snapshots.
+A writer can optionally write the partition statistics file during each write operation. If the statistics file is written for the specific snapshot,
+it must be accurate and must be registered in the table metadata file to be considered as a valid statistics file for the reader.
+
+Partition statistics files metadata within `partition-statistics` table metadata field is a struct with the following fields:
+
+| v1 | v2 | Field name | Type | Description |
+|----|----|------------|------|-------------|
+| _required_ | _required_ | **`snapshot-id`** | `long` | ID of the Iceberg table's snapshot the partition statistics file is associated with. |
+| _required_ | _required_ | **`statistics-file-path`** | `string` | Path of the partition statistics file. See [Partition Statistics file](#partition-statistics-file). |
+| _required_ | _required_ | **`max-data-sequence-number`** | `long` | Maximum data sequence number of the Iceberg table's snapshot the partition statistics was computed from. |
+
+#### Partition Statistics file
+
+Statistics information for every partition tuple is stored as a row in the **table default format**.
+These rows are sorted (in ascending manner with NULL FIRST) based on all partition columns from `partition` in the same order
+to optimize filtering rows while scanning.
+Each unique partition tuple must have exactly one corresponding row, ensuring that statistics for all partitions are present.
+
+A partition statistics file stores statistics as a struct with the following fields:

Review Comment:
   Are we requiring that these are Parquet files?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] ajantha-bhat commented on a diff in pull request #7105: Spec: Add partition stats spec

Posted by "ajantha-bhat (via GitHub)" <gi...@apache.org>.
ajantha-bhat commented on code in PR #7105:
URL: https://github.com/apache/iceberg/pull/7105#discussion_r1275251533


##########
format/spec.md:
##########
@@ -702,6 +703,51 @@ Blob metadata is a struct with the following fields:
 | _optional_ | _optional_ | **`properties`** | `map<string, string>` | Additional properties associated with the statistic. Subset of Blob properties in the Puffin file. |
 
 
+#### Partition statistics
+
+Partition statistics files are based on [Partition Statistics file spec](#partition-statistics-file). Partition statistics are informational. A reader can choose to
+ignore partition statistics information. Partition statistics support is not required to read the table correctly. A table can contain
+many partition statistics files associated with different table snapshots.
+A writer can optionally write the partition statistics file during each write operation. If the statistics file is written for the specific snapshot, 
+it must be accurate and must be registered in the table metadata file to be considered as a valid statistics file for the reader.
+
+Partition statistics files metadata within `partition-statistics` table metadata field is a struct with the following fields:
+
+| v1 | v2 | Field name | Type | Description |
+|----|----|------------|------|-------------|
+| _required_ | _required_ | **`snapshot-id`** | `long` | ID of the Iceberg table's snapshot the partition statistics file is associated with. |
+| _required_ | _required_ | **`statistics-file-path`** | `string` | Path of the partition statistics file. See [Partition Statistics file](#partition-statistics-file). |
+| _required_ | _required_ | **`max-data-sequence-number`** | `long` | Maximum data sequence number of the Iceberg table's snapshot the partition statistics was computed from. |
+
+#### Partition Statistics file
+
+Statistics information for every partition tuple is stored as a row in the **table default format**. 
+These rows are sorted based on the first partition column from `partition`. 

Review Comment:
   Added more detail.
   
   For delete files, the spec already states that the file must be sorted by `file_path` and `pos`, so I followed that wording here.
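
   For illustration, the ordering described in the revised spec text (ascending, NULLs first, over all partition columns in order) can be sketched in Python as follows. This is not code from the PR: the `sort_partition_rows` helper and the dict-based row shape are assumptions, and values within one partition column are assumed to be mutually comparable.

```python
from typing import Any, Dict, List, Optional, Tuple

PartitionTuple = Tuple[Optional[Any], ...]


def nulls_first_key(partition: PartitionTuple) -> tuple:
    # Map each column to (is_not_null, value). False sorts before True,
    # so None lands first; the placeholder 0 is never compared against
    # a real value because the flags already differ in that case.
    return tuple(
        (value is not None, value if value is not None else 0)
        for value in partition
    )


def sort_partition_rows(rows: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    # Hypothetical helper: orders rows by every partition column, left to right.
    return sorted(rows, key=lambda row: nulls_first_key(row["partition"]))


rows = [
    {"partition": (1, 3)},
    {"partition": (None, 2)},
    {"partition": (1, None)},
    {"partition": (None, None)},
]
ordered = [row["partition"] for row in sort_partition_rows(rows)]
```

   Sorting on the full tuple rather than only the first column keeps rows with the same leading value in a deterministic order.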





[GitHub] [iceberg] ajantha-bhat commented on a diff in pull request #7105: Spec: Add partition stats spec

Posted by "ajantha-bhat (via GitHub)" <gi...@apache.org>.
ajantha-bhat commented on code in PR #7105:
URL: https://github.com/apache/iceberg/pull/7105#discussion_r1137424817


##########
format/spec.md:
##########
@@ -702,6 +703,21 @@ Blob metadata is a struct with the following fields:
 | _optional_ | _optional_ | **`properties`** | `map<string, string>` | Additional properties associated with the statistic. Subset of Blob properties in the Puffin file. |
 
 
+#### Partition statistics
+
+Partition statistics files are the valid files based on [Partition statistics spec](../partition-statistics-spec). Partition statistics are informational. A reader can choose to
+ignore partition statistics information. Partition statistics support is not required to read the table correctly. A table can contain
+many partition statistics files associated with different table snapshots.
+
+Partition statistics files metadata within `partition-statistics` table metadata field is a struct with the following fields:
+
+| v1 | v2 | Field name                 | Type     | Description                                                                                           |
+|----|----|----------------------------|----------|-------------------------------------------------------------------------------------------------------|
+| _required_ | _required_ | **`snapshot-id`**          | `long`   | ID of the Iceberg table's snapshot the partition statistics file is associated with.                  |
+| _required_ | _required_ | **`statistics-file-path`** | `string` | Path of the partition statistics file. See [Partition statistics spec](../partition-statistics-spec). |
+| _required_ | _required_ | **`sequence-number`**      | `long`   | Sequence number of the Iceberg table's snapshot the partition statistics was computed from.           |

Review Comment:
   ack





[GitHub] [iceberg] ajantha-bhat commented on pull request #7105: Docs: Add partition stats spec

Posted by "ajantha-bhat (via GitHub)" <gi...@apache.org>.
ajantha-bhat commented on PR #7105:
URL: https://github.com/apache/iceberg/pull/7105#issuecomment-1468398129

   cc: @rdblue, @aokolnychyi, @RussellSpitzer, @flyrain, @jackye1995, @szehon-ho     




Re: [PR] Spec: Add partition stats spec [iceberg]

Posted by "ajantha-bhat (via GitHub)" <gi...@apache.org>.
ajantha-bhat commented on code in PR #7105:
URL: https://github.com/apache/iceberg/pull/7105#discussion_r1352482149


##########
format/spec.md:
##########
@@ -702,6 +703,58 @@ Blob metadata is a struct with the following fields:
 | _optional_ | _optional_ | **`properties`** | `map<string, string>` | Additional properties associated with the statistic. Subset of Blob properties in the Puffin file. |
 
 
+#### Partition statistics
+
+Partition statistics files are based on [Partition Statistics file spec](#partition-statistics-file). 
+Partition statistics are not required for reading or planning and readers may ignore them.
+Each table snapshot may be associated with at most one partition statistic file.
+A writer can optionally write the partition statistics file during each write operation, and 
+it must be registered in the table metadata file to be considered as a valid statistics file for the reader.
+
+`partition-statistics` field of table metadata is an optional list of struct with the following fields:
+
+| v1 | v2 | Field name | Type | Description |
+|----|----|------------|------|-------------|
+| _required_ | _required_ | **`snapshot-id`** | `long` | ID of the Iceberg table's snapshot the partition statistics file is associated with. |
+| _required_ | _required_ | **`statistics-file-path`** | `string` | Path of the partition statistics file. See [Partition Statistics file](#partition-statistics-file). |
+| _required_ | _required_ | **`file-size-in-bytes`** | `long` | Size of the partition statistics file. |
+
+#### Partition Statistics file
+
+Statistics information for each unique partition tuple is stored as a row in the default data file format of the table (for example, Parquet or ORC).

Review Comment:
   Russell commented that the format type should be mentioned explicitly. 
   
   I have removed the word "default" and reworded the sentence a bit. The implementation can decide whether to use the table's default format or one specified in a table property. 
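
   A minimal sketch of how an implementation might make that choice, in Python. The `partition-stats.format` property name is hypothetical and is not defined by this PR; `write.format.default` is the existing Iceberg table property for the default data file format, which falls back to Parquet when unset.

```python
from typing import Dict

# Hypothetical property name, used only for this illustration.
HYPOTHETICAL_STATS_FORMAT_PROPERTY = "partition-stats.format"
DEFAULT_FORMAT_PROPERTY = "write.format.default"


def partition_stats_format(table_properties: Dict[str, str]) -> str:
    # Prefer an explicit stats format if the (hypothetical) property is set.
    explicit = table_properties.get(HYPOTHETICAL_STATS_FORMAT_PROPERTY)
    if explicit is not None:
        return explicit.lower()
    # Otherwise follow the table's default data file format.
    return table_properties.get(DEFAULT_FORMAT_PROPERTY, "parquet").lower()
```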





[GitHub] [iceberg] ajantha-bhat commented on a diff in pull request #7105: Docs: Add partition stats spec

Posted by "ajantha-bhat (via GitHub)" <gi...@apache.org>.
ajantha-bhat commented on code in PR #7105:
URL: https://github.com/apache/iceberg/pull/7105#discussion_r1135903008


##########
format/partition-statistics-spec.md:
##########
@@ -0,0 +1,43 @@
+---
+title: "Partition Statistics Spec"
+url: partition-statistics-spec
+toc: true
+disableSidebar: true
+---
+<!--
+ - Licensed to the Apache Software Foundation (ASF) under one or more
+ - contributor license agreements.  See the NOTICE file distributed with
+ - this work for additional information regarding copyright ownership.
+ - The ASF licenses this file to You under the Apache License, Version 2.0
+ - (the "License"); you may not use this file except in compliance with
+ - the License.  You may obtain a copy of the License at
+ -
+ -   http://www.apache.org/licenses/LICENSE-2.0
+ -
+ - Unless required by applicable law or agreed to in writing, software
+ - distributed under the License is distributed on an "AS IS" BASIS,
+ - WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ - See the License for the specific language governing permissions and
+ - limitations under the License.
+ -->
+
+# Partition Statistics file format
+
+This is a specification for partition statistics files. It is designed to store statistics information 
+for every partition value as a row in the **table default format** sorted based on the `partition`.
+
+The schema of the file is as follows:
+
+| Field Id | Field Name              | Field Type   | isOptional | Doc                                                                                                                         |
+|----------|-------------------------|--------------|------------|-----------------------------------------------------------------------------------------------------------------------------|
+| 1        | partition               | StructType   | false      | See [PartitionData](https://github.com/apache/iceberg/blob/master/core/src/main/java/org/apache/iceberg/PartitionData.java) |                                                                                  
+| 2        | spec_id                 | IntegerType  | false      | partition spec id                                                                                                           |
+| 3        | data_record_count       | LongType     | false      | Count of records in data files                                                                                              |                                               
+| 4        | data_file_count         | IntegerType  | false      | Count of data files                                                                                                         |
+| 5        | pos_delete_record_count | LongType     | false      | Count of records in position delete files                                                                                   |                                                          
+| 6        | pos_delete_file_count   | IntegerType  | false      | Count of position delete files                                                                                              |                                   
+| 7        | eq_delete_record_count  | LongType     | false      | Count of records in equality delete files                                                                                   |                                  
+| 8        | eq_delete_file_count    | IntegerType  | false      | Count of equality delete files                                                                                              |                                

Review Comment:
   Thinking more about it: can the fields still be required in the schema, with the values set to 0 for these in v1? I am not sure we need to define the schema differently for each format version. 
   
   Let us wait for more opinions on this. 
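
   To make the option above concrete, here is a minimal Python sketch in which the delete counters stay required but are written as 0 when a partition has no delete files (for example, in a v1 table). The field names follow the quoted schema; the `PartitionStatsRow` class itself is an assumption for illustration, not part of the PR.

```python
from dataclasses import dataclass
from typing import Any, Tuple


@dataclass(frozen=True)
class PartitionStatsRow:
    partition: Tuple[Any, ...]
    spec_id: int
    data_record_count: int
    data_file_count: int
    # Always present in the schema; 0 means "no delete files".
    pos_delete_record_count: int = 0
    pos_delete_file_count: int = 0
    eq_delete_record_count: int = 0
    eq_delete_file_count: int = 0


# A row for a partition without deletes: only the data counters are supplied.
v1_row = PartitionStatsRow(
    partition=("2023-03-14",),
    spec_id=0,
    data_record_count=100,
    data_file_count=2,
)
```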





[GitHub] [iceberg] rdblue commented on a diff in pull request #7105: Spec: Add partition stats spec

Posted by "rdblue (via GitHub)" <gi...@apache.org>.
rdblue commented on code in PR #7105:
URL: https://github.com/apache/iceberg/pull/7105#discussion_r1222196508


##########
format/spec.md:
##########
@@ -702,6 +703,44 @@ Blob metadata is a struct with the following fields:
 | _optional_ | _optional_ | **`properties`** | `map<string, string>` | Additional properties associated with the statistic. Subset of Blob properties in the Puffin file. |
 
 
+#### Partition statistics
+
+Partition statistics files are the valid files based on [Partition Statistics file format](#partition-statistics-file-format). Partition statistics are informational. A reader can choose to

Review Comment:
   Looks like this was copied from Puffin. Puffin is a file format, but the partition stats files are not. Those are Parquet files with a specific schema.





[GitHub] [iceberg] rdblue commented on a diff in pull request #7105: Spec: Add partition stats spec

Posted by "rdblue (via GitHub)" <gi...@apache.org>.
rdblue commented on code in PR #7105:
URL: https://github.com/apache/iceberg/pull/7105#discussion_r1221840946


##########
format/partition-statistics-spec.md:
##########
@@ -0,0 +1,43 @@
+---
+title: "Partition Statistics Spec"
+url: partition-statistics-spec
+toc: true
+disableSidebar: true
+---
+<!--
+ - Licensed to the Apache Software Foundation (ASF) under one or more
+ - contributor license agreements.  See the NOTICE file distributed with
+ - this work for additional information regarding copyright ownership.
+ - The ASF licenses this file to You under the Apache License, Version 2.0
+ - (the "License"); you may not use this file except in compliance with
+ - the License.  You may obtain a copy of the License at
+ -
+ -   http://www.apache.org/licenses/LICENSE-2.0
+ -
+ - Unless required by applicable law or agreed to in writing, software
+ - distributed under the License is distributed on an "AS IS" BASIS,
+ - WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ - See the License for the specific language governing permissions and
+ - limitations under the License.
+ -->
+
+# Partition Statistics file format
+
+This is a specification for partition statistics files. It is designed to store statistics information 
+for every partition value as a row in the **table default format** sorted based on the `partition`.
+
+The schema of the file is as follows:
+
+| Field Id | Field Name                   | Field Type   | isOptional | Doc                                                                                                                         |
+|----------|------------------------------|--------------|------------|-----------------------------------------------------------------------------------------------------------------------------|
+| 1        | partition                    | StructType   | false      | See [PartitionData](https://github.com/apache/iceberg/blob/master/core/src/main/java/org/apache/iceberg/PartitionData.java) |                                                                                  
+| 2        | spec_id                      | IntegerType  | false      | Partition spec id                                                                                                           |
+| 3        | data_record_count            | LongType     | false      | Count of records in data files                                                                                              |                                               
+| 4        | data_file_count              | IntegerType  | false      | Count of data files                                                                                                         |
+| 5        | position_delete_record_count | LongType     | true       | Count of records in position delete files                                                                                   |                                                          
+| 6        | position_delete_file_count   | IntegerType  | true       | Count of position delete files                                                                                              |                                   
+| 7        | equality_delete_record_count | LongType     | true       | Count of records in equality delete files                                                                                   |                                  
+| 8        | equality_delete_file_count   | IntegerType  | true       | Count of equality delete files                                                                                              |                                

Review Comment:
   > I think computing the actual count can be a very expensive operation.
   
   I think that's the point. Maybe we should have more accurate data since this can be computed async.





[GitHub] [iceberg] ajantha-bhat commented on a diff in pull request #7105: Spec: Add partition stats spec

Posted by "ajantha-bhat (via GitHub)" <gi...@apache.org>.
ajantha-bhat commented on code in PR #7105:
URL: https://github.com/apache/iceberg/pull/7105#discussion_r1210532040


##########
format/spec.md:
##########
@@ -702,6 +703,44 @@ Blob metadata is a struct with the following fields:
 | _optional_ | _optional_ | **`properties`** | `map<string, string>` | Additional properties associated with the statistic. Subset of Blob properties in the Puffin file. |
 
 
+#### Partition statistics
+
+Partition statistics files are the valid files based on [Partition Statistics file format](#partition-statistics-file-format). Partition statistics are informational. A reader can choose to
+ignore partition statistics information. Partition statistics support is not required to read the table correctly. A table can contain
+many partition statistics files associated with different table snapshots.
+A writer can optionally write the partition statistics file during each write operation. If the statistics file is written for the specific snapshot, 
+It must be registered in the table metadata file to be considered as a valid statistics file for the reader.
+
+Partition statistics files metadata within `partition-statistics` table metadata field is a struct with the following fields:
+
+| v1 | v2 | Field name | Type | Description |
+|----|----|------------|------|-------------|
+| _required_ | _required_ | **`snapshot-id`** | `long` | ID of the Iceberg table's snapshot the partition statistics file is associated with. |
+| _required_ | _required_ | **`statistics-file-path`** | `string` | Path of the partition statistics file. See [Partition Statistics file format](#partition-statistics-file-format). |
+| _required_ | _required_ | **`max-data-sequence-number`** | `long` | Maximum data sequence number of the Iceberg table's snapshot the partition statistics was computed from. |
+
+#### Partition Statistics file format
+
+This is a specification for partition statistics files. It is designed to store statistics information
+for every partition value as a row in the **table default format** sorted based on the first partition column from `partition`.

Review Comment:
   I specified sorting by the first partition column because we don't support a sort order on a struct column. 
   
   If sorting by the first column is not enough, maybe I can implement a sort order for each child of the struct. 





[GitHub] [iceberg] ajantha-bhat commented on a diff in pull request #7105: Spec: Add partition stats spec

Posted by "ajantha-bhat (via GitHub)" <gi...@apache.org>.
ajantha-bhat commented on code in PR #7105:
URL: https://github.com/apache/iceberg/pull/7105#discussion_r1210054972


##########
format/spec.md:
##########
@@ -702,6 +703,21 @@ Blob metadata is a struct with the following fields:
 | _optional_ | _optional_ | **`properties`** | `map<string, string>` | Additional properties associated with the statistic. Subset of Blob properties in the Puffin file. |
 
 
+#### Partition statistics
+
+Partition statistics files are the valid files based on [Partition statistics spec](../partition-statistics-spec). Partition statistics are informational. A reader can choose to
+ignore partition statistics information. Partition statistics support is not required to read the table correctly. A table can contain
+many partition statistics files associated with different table snapshots.
+
+Partition statistics files metadata within `partition-statistics` table metadata field is a struct with the following fields:
+
+| v1 | v2 | Field name                 | Type     | Description                                                                                           |
+|----|----|----------------------------|----------|-------------------------------------------------------------------------------------------------------|
+| _required_ | _required_ | **`snapshot-id`**          | `long`   | ID of the Iceberg table's snapshot the partition statistics file is associated with.                  |
+| _required_ | _required_ | **`statistics-file-path`** | `string` | Path of the partition statistics file. See [Partition statistics spec](../partition-statistics-spec). |
+| _required_ | _required_ | **`sequence-number`**      | `long`   | Sequence number of the Iceberg table's snapshot the partition statistics was computed from.           |

Review Comment:
   changed to _max-data-sequence-number_





[GitHub] [iceberg] ajantha-bhat commented on pull request #7105: Spec: Add partition stats spec

Posted by "ajantha-bhat (via GitHub)" <gi...@apache.org>.
ajantha-bhat commented on PR #7105:
URL: https://github.com/apache/iceberg/pull/7105#issuecomment-1568168654

   @rdblue: I have addressed the comments. Please take a look at it again. Thanks.  




Re: [PR] Spec: Add partition stats spec [iceberg]

Posted by "ajantha-bhat (via GitHub)" <gi...@apache.org>.
ajantha-bhat commented on code in PR #7105:
URL: https://github.com/apache/iceberg/pull/7105#discussion_r1352492103


##########
format/spec.md:
##########
@@ -702,6 +703,58 @@ Blob metadata is a struct with the following fields:
 | _optional_ | _optional_ | **`properties`** | `map<string, string>` | Additional properties associated with the statistic. Subset of Blob properties in the Puffin file. |
 
 
+#### Partition statistics
+
+Partition statistics files are based on [Partition Statistics file spec](#partition-statistics-file). 
+Partition statistics are not required for reading or planning and readers may ignore them.
+Each table snapshot may be associated with at most one partition statistic file.
+A writer can optionally write the partition statistics file during each write operation, and 
+it must be registered in the table metadata file to be considered as a valid statistics file for the reader.
+
+`partition-statistics` field of table metadata is an optional list of struct with the following fields:
+
+| v1 | v2 | Field name | Type | Description |
+|----|----|------------|------|-------------|
+| _required_ | _required_ | **`snapshot-id`** | `long` | ID of the Iceberg table's snapshot the partition statistics file is associated with. |
+| _required_ | _required_ | **`statistics-file-path`** | `string` | Path of the partition statistics file. See [Partition Statistics file](#partition-statistics-file). |

Review Comment:
   ok. Updated as suggested. 
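
   For illustration, the rule that each snapshot may be associated with at most one partition statistics file could be checked as in the Python sketch below. The `validate_partition_statistics` helper is hypothetical; the three field names are the ones listed for the `partition-statistics` metadata struct in this revision of the diff.

```python
from collections import Counter
from typing import Any, Dict, List

REQUIRED_FIELDS = ("snapshot-id", "statistics-file-path", "file-size-in-bytes")


def validate_partition_statistics(entries: List[Dict[str, Any]]) -> None:
    # Every entry must carry all required fields.
    for entry in entries:
        missing = [name for name in REQUIRED_FIELDS if name not in entry]
        if missing:
            raise ValueError(f"missing required fields: {missing}")
    # At most one statistics file may be registered per snapshot.
    counts = Counter(entry["snapshot-id"] for entry in entries)
    duplicates = sorted(sid for sid, count in counts.items() if count > 1)
    if duplicates:
        raise ValueError(f"multiple statistics files for snapshots: {duplicates}")
```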





Re: [PR] Spec: Add partition stats spec [iceberg]

Posted by "jbonofre (via GitHub)" <gi...@apache.org>.
jbonofre commented on PR #7105:
URL: https://github.com/apache/iceberg/pull/7105#issuecomment-1775210294

   I took a fresh, complete look at this PR. @ajantha-bhat isolated the spec changes in this PR, which is a good approach (it focuses only on the spec and not on the impact on the implementation).
   As this PR:
   - focuses on the table/partition spec
   - adds new table/partition spec properties that are optional
   I think it's reasonable to merge it (or at least to have a new review).
   
   @rdblue @aokolnychyi would you mind taking another look? IMHO, it looks good and we can merge it. The impl/engine changes will come in other PRs (possibly iterating on the spec change if needed).




[GitHub] [iceberg] ajantha-bhat commented on a diff in pull request #7105: Spec: Add partition stats spec

Posted by "ajantha-bhat (via GitHub)" <gi...@apache.org>.
ajantha-bhat commented on code in PR #7105:
URL: https://github.com/apache/iceberg/pull/7105#discussion_r1225000209


##########
format/spec.md:
##########
@@ -702,6 +703,44 @@ Blob metadata is a struct with the following fields:
 | _optional_ | _optional_ | **`properties`** | `map<string, string>` | Additional properties associated with the statistic. Subset of Blob properties in the Puffin file. |
 
 
+#### Partition statistics
+
+Partition statistics files are based on [Partition Statistics file spec](#partition-statistics-file). Partition statistics are informational. A reader can choose to
+ignore partition statistics information. Partition statistics support is not required to read the table correctly. A table can contain
+many partition statistics files associated with different table snapshots.
+A writer can optionally write the partition statistics file during each write operation. If the statistics file is written for the specific snapshot, 
+It must be registered in the table metadata file to be considered as a valid statistics file for the reader.
+
+Partition statistics files metadata within `partition-statistics` table metadata field is a struct with the following fields:
+
+| v1 | v2 | Field name | Type | Description |
+|----|----|------------|------|-------------|
+| _required_ | _required_ | **`snapshot-id`** | `long` | ID of the Iceberg table's snapshot the partition statistics file is associated with. |
+| _required_ | _required_ | **`statistics-file-path`** | `string` | Path of the partition statistics file. See [Partition Statistics file](#partition-statistics-file). |
+| _required_ | _required_ | **`max-data-sequence-number`** | `long` | Maximum data sequence number of the Iceberg table's snapshot the partition statistics was computed from. |
+
+#### Partition Statistics file
+
+Statistics information for every partition value is stored as a row in the **table default format** 
+sorted based on the first partition column from `partition`.
+
+Partition statistics file store the statistics as a struct with the following fields:
+
+| v1 | v2 | Field id, name | Type | Description |
+|----|----|----------------|------|-------------|
+| _required_ | _required_ | **`1 partition`** | `struct<..>` | See [PartitionData](https://github.com/apache/iceberg/blob/master/core/src/main/java/org/apache/iceberg/PartitionData.java) |

Review Comment:
   updated.





[GitHub] [iceberg] rdblue commented on a diff in pull request #7105: Spec: Add partition stats spec

Posted by "rdblue (via GitHub)" <gi...@apache.org>.
rdblue commented on code in PR #7105:
URL: https://github.com/apache/iceberg/pull/7105#discussion_r1224846778


##########
format/spec.md:
##########
@@ -702,6 +703,44 @@ Blob metadata is a struct with the following fields:
 | _optional_ | _optional_ | **`properties`** | `map<string, string>` | Additional properties associated with the statistic. Subset of Blob properties in the Puffin file. |
 
 
+#### Partition statistics
+
+Partition statistics files are based on [Partition Statistics file spec](#partition-statistics-file). Partition statistics are informational. A reader can choose to
+ignore partition statistics information. Partition statistics support is not required to read the table correctly. A table can contain
+many partition statistics files associated with different table snapshots.
+A writer can optionally write the partition statistics file during each write operation. If the statistics file is written for the specific snapshot, 
+It must be registered in the table metadata file to be considered as a valid statistics file for the reader.
+
+Partition statistics files metadata within `partition-statistics` table metadata field is a struct with the following fields:
+
+| v1 | v2 | Field name | Type | Description |
+|----|----|------------|------|-------------|
+| _required_ | _required_ | **`snapshot-id`** | `long` | ID of the Iceberg table's snapshot the partition statistics file is associated with. |
+| _required_ | _required_ | **`statistics-file-path`** | `string` | Path of the partition statistics file. See [Partition Statistics file](#partition-statistics-file). |
+| _required_ | _required_ | **`max-data-sequence-number`** | `long` | Maximum data sequence number of the Iceberg table's snapshot the partition statistics was computed from. |
+
+#### Partition Statistics file
+
+Statistics information for every partition value is stored as a row in the **table default format** 
+sorted based on the first partition column from `partition`.
+
+Partition statistics file store the statistics as a struct with the following fields:
+
+| v1 | v2 | Field id, name | Type | Description |
+|----|----|----------------|------|-------------|
+| _required_ | _required_ | **`1 partition`** | `struct<..>` | See [PartitionData](https://github.com/apache/iceberg/blob/master/core/src/main/java/org/apache/iceberg/PartitionData.java) |
+| _required_ | _required_ | **`2 spec_id`** | `int` | Partition spec id |
+| _required_ | _required_ | **`3 data_record_count`** | `long` | Count of records in data files |
+| _required_ | _required_ | **`4 data_file_count`** | `int` | Count of data files |
+| _required_ | _required_ | **`5 data_file_size_in_bytes`** | `long` | Total size of data files in bytes |
+| _optional_ | _optional_ | **`6 position_delete_record_count`** | `long` | Count of records in position delete files |
+| _optional_ | _optional_ | **`7 position_delete_file_count`** | `int` | Count of position delete files |
+| _optional_ | _optional_ | **`8 equality_delete_record_count`** | `long` | Count of records in equality delete files |
+| _optional_ | _optional_ | **`9 equality_delete_file_count`** | `int` | Count of equality delete files |
+
+Each snapshot may create a new partition statistics file. The name of the partition statistics file is as follows:       
+`partition-stats-${snapshotId}.${tableDefaultFormat}`

Review Comment:
   Are there other examples of naming requirements?





[GitHub] [iceberg] ajantha-bhat commented on a diff in pull request #7105: Spec: Add partition stats spec

Posted by "ajantha-bhat (via GitHub)" <gi...@apache.org>.
ajantha-bhat commented on code in PR #7105:
URL: https://github.com/apache/iceberg/pull/7105#discussion_r1275247470


##########
format/spec.md:
##########
@@ -702,6 +703,51 @@ Blob metadata is a struct with the following fields:
 | _optional_ | _optional_ | **`properties`** | `map<string, string>` | Additional properties associated with the statistic. Subset of Blob properties in the Puffin file. |
 
 
+#### Partition statistics
+
+Partition statistics files are based on [Partition Statistics file spec](#partition-statistics-file). Partition statistics are informational. A reader can choose to
+ignore partition statistics information. Partition statistics support is not required to read the table correctly. A table can contain
+many partition statistics files associated with different table snapshots.
+A writer can optionally write the partition statistics file during each write operation. If the statistics file is written for the specific snapshot, 
+it must be accurate and must be registered in the table metadata file to be considered as a valid statistics file for the reader.
+
+Partition statistics files metadata within `partition-statistics` table metadata field is a struct with the following fields:
+
+| v1 | v2 | Field name | Type | Description |
+|----|----|------------|------|-------------|
+| _required_ | _required_ | **`snapshot-id`** | `long` | ID of the Iceberg table's snapshot the partition statistics file is associated with. |
+| _required_ | _required_ | **`statistics-file-path`** | `string` | Path of the partition statistics file. See [Partition Statistics file](#partition-statistics-file). |
+| _required_ | _required_ | **`max-data-sequence-number`** | `long` | Maximum data sequence number of the Iceberg table's snapshot the partition statistics was computed from. |
+
+#### Partition Statistics file
+
+Statistics information for every partition tuple is stored as a row in the **table default format**. 
+These rows are sorted based on the first partition column from `partition`. 
+Each unique partition tuple must have exactly one corresponding row, ensuring all partition tuples are present.
+
+Partition statistics file store the statistics as a struct with the following fields:
+
+| v1 | v2 | Field id, name | Type | Description |
+|----|----|----------------|------|-------------|
+| _required_ | _required_ | **`1 partition`** | `struct<..>` | Partition data tuple, schema based on the partition spec output using partition field ids for the struct field ids |
+| _required_ | _required_ | **`2 spec_id`** | `int` | Partition spec id |
+| _required_ | _required_ | **`3 data_record_count`** | `long` | Count of records in data files |
+| _required_ | _required_ | **`4 data_file_count`** | `int` | Count of data files |
+| _required_ | _required_ | **`5 total_data_file_size_in_bytes`** | `long` | Total size of data files in bytes |
+| _optional_ | _optional_ | **`6 position_delete_record_count`** | `long` | Count of records in position delete files |
+| _optional_ | _optional_ | **`7 position_delete_file_count`** | `int` | Count of position delete files |
+| _optional_ | _optional_ | **`8 equality_delete_record_count`** | `long` | Count of records in equality delete files |
+| _optional_ | _optional_ | **`9 equality_delete_file_count`** | `int` | Count of equality delete files |
+| _optional_ | _optional_ | **`10 last_updated_at`** | `timestamptz` | Commit time of snapshot that last updated this partition |

Review Comment:
   This keeps the schema the same as the partitions metadata table, so that some code can be reused. 
   https://github.com/apache/iceberg/blob/d28f899c802cb3b7b6bd4d6c44f6adf3a20a1bd9/core/src/main/java/org/apache/iceberg/PartitionsTable.java#L84





[GitHub] [iceberg] ajantha-bhat commented on a diff in pull request #7105: Spec: Add partition stats spec

Posted by "ajantha-bhat (via GitHub)" <gi...@apache.org>.
ajantha-bhat commented on code in PR #7105:
URL: https://github.com/apache/iceberg/pull/7105#discussion_r1209724616


##########
format/partition-statistics-spec.md:
##########
@@ -0,0 +1,43 @@
+---
+title: "Partition Statistics Spec"
+url: partition-statistics-spec
+toc: true
+disableSidebar: true
+---
+<!--
+ - Licensed to the Apache Software Foundation (ASF) under one or more
+ - contributor license agreements.  See the NOTICE file distributed with
+ - this work for additional information regarding copyright ownership.
+ - The ASF licenses this file to You under the Apache License, Version 2.0
+ - (the "License"); you may not use this file except in compliance with
+ - the License.  You may obtain a copy of the License at
+ -
+ -   http://www.apache.org/licenses/LICENSE-2.0
+ -
+ - Unless required by applicable law or agreed to in writing, software
+ - distributed under the License is distributed on an "AS IS" BASIS,
+ - WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ - See the License for the specific language governing permissions and
+ - limitations under the License.
+ -->
+
+# Partition Statistics file format
+
+This is a specification for partition statistics files. It is designed to store statistics information 
+for every partition value as a row in the **table default format** sorted based on the `partition`.
+
+The schema of the file is as follows:
+
+| Field Id | Field Name                   | Field Type   | isOptional | Doc                                                                                                                         |
+|----------|------------------------------|--------------|------------|-----------------------------------------------------------------------------------------------------------------------------|
+| 1        | partition                    | StructType   | false      | See [PartitionData](https://github.com/apache/iceberg/blob/master/core/src/main/java/org/apache/iceberg/PartitionData.java) |                                                                                  

Review Comment:
   Got it. I initially based this on the tables in the Puffin stats spec and improvised on top of them. 
   I will update it to match the tables in the spec.
   





[GitHub] [iceberg] flyrain commented on a diff in pull request #7105: Spec: Add partition stats spec

Posted by "flyrain (via GitHub)" <gi...@apache.org>.
flyrain commented on code in PR #7105:
URL: https://github.com/apache/iceberg/pull/7105#discussion_r1316466449


##########
format/spec.md:
##########
@@ -702,6 +703,45 @@ Blob metadata is a struct with the following fields:
 | _optional_ | _optional_ | **`properties`** | `map<string, string>` | Additional properties associated with the statistic. Subset of Blob properties in the Puffin file. |
 
 
+#### Partition statistics
+
+Partition statistics files are based on [Partition Statistics file spec](#partition-statistics-file). Partition statistics are informational. A reader can choose to
+ignore partition statistics information. Partition statistics support is not required to read the table correctly. A table can contain
+many partition statistics files associated with different table snapshots.
+A writer can optionally write the partition statistics file during each write operation. If the statistics file is written for the specific snapshot,
+it must be accurate and must be registered in the table metadata file to be considered as a valid statistics file for the reader.
+
+Partition statistics files metadata within `partition-statistics` table metadata field is a struct with the following fields:
+
+| v1 | v2 | Field name | Type | Description |
+|----|----|------------|------|-------------|
+| _required_ | _required_ | **`snapshot-id`** | `long` | ID of the Iceberg table's snapshot the partition statistics file is associated with. |
+| _required_ | _required_ | **`statistics-file-path`** | `string` | Path of the partition statistics file. See [Partition Statistics file](#partition-statistics-file). |
+| _required_ | _required_ | **`max-data-sequence-number`** | `long` | Maximum data sequence number of the Iceberg table's snapshot the partition statistics was computed from. |
+
+#### Partition Statistics file
+
+Statistics information for every partition tuple is stored as a row in the **table default format**.
+These rows are sorted (in ascending manner with NULL FIRST) based on the first partition column from `partition`
+to optimize filtering rows while scanning.
+Each unique partition tuple must have exactly one corresponding row, ensuring all partition tuples are present.
+
+Partition statistics file store the statistics as a struct with the following fields:
+
+| v1 | v2 | Field id, name | Type | Description |
+|----|----|----------------|------|-------------|
+| _required_ | _required_ | **`1 partition`** | `struct<..>` | Partition data tuple, schema based on the partition spec output using partition field ids for the struct field ids |
+| _required_ | _required_ | **`2 spec_id`** | `int` | Partition spec id |
+| _required_ | _required_ | **`3 data_record_count`** | `long` | Count of records in data files |

Review Comment:
   Should we emphasize that the `data_record_count` doesn't provide an exact record count in the presence of delete files? Without this clarification, readers might easily assume it represents the definitive record count for the partition.
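A minimal sketch of the caveat, assuming a hypothetical in-memory `PartitionStatsRow` that mirrors a few fields from the schema above (the class and function names are illustrative only and not part of the spec or the Iceberg API). It treats `data_record_count` as a raw count of records in data files and flags when delete records mean the live row count may be lower:

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class PartitionStatsRow:
    """Hypothetical in-memory view of one partition statistics row."""
    data_record_count: int
    position_delete_record_count: Optional[int] = None
    equality_delete_record_count: Optional[int] = None


def record_count_with_caveat(row: PartitionStatsRow) -> Tuple[int, bool]:
    """Return (data_record_count, may_overcount).

    may_overcount is True when the row reports any delete records, in which
    case data_record_count should not be read as the live row count.
    Assumption: deletes recorded under other partition tuples (for example,
    global deletes) are not considered in this sketch.
    """
    pos = row.position_delete_record_count or 0
    eq = row.equality_delete_record_count or 0
    return row.data_record_count, (pos + eq) > 0
```

A reader that needs an exact count would still have to apply the delete files; the flag only signals that the stored number is not definitive.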





[GitHub] [iceberg] aokolnychyi commented on a diff in pull request #7105: Spec: Add partition stats spec

Posted by "aokolnychyi (via GitHub)" <gi...@apache.org>.
aokolnychyi commented on code in PR #7105:
URL: https://github.com/apache/iceberg/pull/7105#discussion_r1331866313


##########
format/spec.md:
##########
@@ -702,6 +703,58 @@ Blob metadata is a struct with the following fields:
 | _optional_ | _optional_ | **`properties`** | `map<string, string>` | Additional properties associated with the statistic. Subset of Blob properties in the Puffin file. |
 
 
+#### Partition statistics
+
+Partition statistics files are based on [Partition Statistics file spec](#partition-statistics-file). 
+Partition statistics are not required for reading or planning and readers may ignore them.
+Each table snapshot may be associated with at most one partition statistic file.
+A writer can optionally write the partition statistics file during each write operation, and 
+it must be registered in the table metadata file to be considered as a valid statistics file for the reader.
+
+`partition-statistics` field of table metadata is an optional list of struct with the following fields:
+
+| v1 | v2 | Field name | Type | Description |
+|----|----|------------|------|-------------|
+| _required_ | _required_ | **`snapshot-id`** | `long` | ID of the Iceberg table's snapshot the partition statistics file is associated with. |
+| _required_ | _required_ | **`statistics-file-path`** | `string` | Path of the partition statistics file. See [Partition Statistics file](#partition-statistics-file). |
+| _required_ | _required_ | **`file-size-in-bytes`** | `long` | Size of the partition statistics file. |
+
+#### Partition Statistics file
+
+Statistics information for each unique partition tuple is stored as a row in the default data file format of the table (for example, Parquet or ORC).
+These rows must be sorted (in ascending manner with NULL FIRST) by `partition` field to optimize filtering rows while scanning.
+
+The schema of the partition statistics file is as follows:
+
+| v1 | v2 | Field id, name | Type | Description |
+|----|----|----------------|------|-------------|
+| _required_ | _required_ | **`1 partition`** | `struct<..>` | Partition data tuple, schema based on the unified partition type considering all specs in a table |
+| _required_ | _required_ | **`2 spec_id`** | `int` | Partition spec id |
+| _required_ | _required_ | **`3 data_record_count`** | `long` | Count of records in data files |
+| _required_ | _required_ | **`4 data_file_count`** | `int` | Count of data files |

Review Comment:
   We are a bit inconsistent throughout the code with naming: `data_files_count` vs `data_file_count`. In quite a few cases, we use plural words (as in the Action API and the spec for persisted manifest lists). Have we discussed the preferred naming to follow in the future?
   
   <img width="826" alt="image" src="https://github.com/apache/iceberg/assets/6235869/e07c7fdf-12e9-4e7e-9c70-f84f0b3b8d97">
   





[GitHub] [iceberg] ajantha-bhat commented on a diff in pull request #7105: Spec: Add partition stats spec

Posted by "ajantha-bhat (via GitHub)" <gi...@apache.org>.
ajantha-bhat commented on code in PR #7105:
URL: https://github.com/apache/iceberg/pull/7105#discussion_r1331948730


##########
format/spec.md:
##########
@@ -702,6 +703,58 @@ Blob metadata is a struct with the following fields:
 | _optional_ | _optional_ | **`properties`** | `map<string, string>` | Additional properties associated with the statistic. Subset of Blob properties in the Puffin file. |
 
 
+#### Partition statistics
+
+Partition statistics files are based on [Partition Statistics file spec](#partition-statistics-file). 
+Partition statistics are not required for reading or planning and readers may ignore them.
+Each table snapshot may be associated with at most one partition statistic file.
+A writer can optionally write the partition statistics file during each write operation, and 
+it must be registered in the table metadata file to be considered as a valid statistics file for the reader.
+
+`partition-statistics` field of table metadata is an optional list of struct with the following fields:
+
+| v1 | v2 | Field name | Type | Description |
+|----|----|------------|------|-------------|
+| _required_ | _required_ | **`snapshot-id`** | `long` | ID of the Iceberg table's snapshot the partition statistics file is associated with. |
+| _required_ | _required_ | **`statistics-file-path`** | `string` | Path of the partition statistics file. See [Partition Statistics file](#partition-statistics-file). |
+| _required_ | _required_ | **`file-size-in-bytes`** | `long` | Size of the partition statistics file. |
+
+#### Partition Statistics file
+
+Statistics information for each unique partition tuple is stored as a row in the default data file format of the table (for example, Parquet or ORC).
+These rows must be sorted (in ascending manner with NULL FIRST) by `partition` field to optimize filtering rows while scanning.
+
+The schema of the partition statistics file is as follows:
+
+| v1 | v2 | Field id, name | Type | Description |
+|----|----|----------------|------|-------------|
+| _required_ | _required_ | **`1 partition`** | `struct<..>` | Partition data tuple, schema based on the unified partition type considering all specs in a table |
+| _required_ | _required_ | **`2 spec_id`** | `int` | Partition spec id |

Review Comment:
   Yes, the partition tuple is a unified type. Hence, it is a coerced result from all the specs. 
   This spec id is there just in case we want to know the latest spec that has modified this partition. 
   
   Do you feel it is redundant and that we can remove it? 





[GitHub] [iceberg] ajantha-bhat commented on a diff in pull request #7105: Spec: Add partition stats spec

Posted by "ajantha-bhat (via GitHub)" <gi...@apache.org>.
ajantha-bhat commented on code in PR #7105:
URL: https://github.com/apache/iceberg/pull/7105#discussion_r1317477618


##########
format/spec.md:
##########
@@ -702,6 +703,41 @@ Blob metadata is a struct with the following fields:
 | _optional_ | _optional_ | **`properties`** | `map<string, string>` | Additional properties associated with the statistic. Subset of Blob properties in the Puffin file. |
 
 
+#### Partition statistics
+
+Partition statistics files are based on [Partition Statistics file spec](#partition-statistics-file). Partition statistics are informational. A reader can choose to
+ignore partition statistics information. Partition statistics support is not required to read the table correctly.
+Each table snapshot may be associated with at most one partition statistic file and the table can contain many partition statistics files associated with different table snapshots.
+A writer can optionally write the partition statistics file during each write operation. If the statistics file is written for the specific snapshot,
+it must be accurate and must be registered in the table metadata file to be considered as a valid statistics file for the reader.
+
+Partition statistics files metadata within `partition-statistics` table metadata field is a struct with the following fields:
+
+| v1 | v2 | Field name | Type | Description |
+|----|----|------------|------|-------------|
+| _required_ | _required_ | **`snapshot-id`** | `long` | ID of the Iceberg table's snapshot the partition statistics file is associated with. |
+| _required_ | _required_ | **`statistics-file-path`** | `string` | Path of the partition statistics file. See [Partition Statistics file](#partition-statistics-file). |
+| _required_ | _required_ | **`max-data-sequence-number`** | `long` | Maximum data sequence number of the Iceberg table's snapshot the partition statistics was computed from. |
+
+#### Partition Statistics file
+
+Statistics information for every partition tuple is stored as a row in the **table default format**.
+These rows are sorted (in ascending manner with NULL FIRST) based on all partition columns from `partition` in the same order
+to optimize filtering rows while scanning.
+Each unique partition tuple must have exactly one corresponding row, ensuring that statistics for all partitions are present.
+
+A partition statistics file stores statistics as a struct with the following fields:
+
+| v1 | v2 | Field id, name | Type | Description |
+|----|----|----------------|------|-------------|
+| _required_ | _required_ | **`1 partition`** | `struct<..>` | Partition data tuple, schema based on the partition spec output using partition field ids for the struct field ids |

Review Comment:
   It is the full unified partition type computed from `Partitioning#partitionType`. While writing the stats for the current snapshot, we coerce the partition from the previous partition stats file to that type, to support smooth partition evolution.
   
   https://github.com/apache/iceberg/blob/2b189759a902cdd8ad433d0c1643376b78ec867c/core/src/main/java/org/apache/iceberg/Partitioning.java#L242
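The coercion described above can be sketched roughly as follows. This is not the `Partitioning#partitionType` implementation; it assumes, for illustration only, that each spec is a list of `(field_id, name)` pairs and that a partition value is a dict keyed by partition field id. Both helper names are hypothetical.

```python
def unified_partition_fields(specs):
    """Union the partition fields of all specs, keyed by partition field id."""
    unified = {}
    for spec in specs:
        for field_id, name in spec:
            # Keep the first name seen for a field id (an assumption of this sketch).
            unified.setdefault(field_id, name)
    return sorted(unified.items())


def coerce_partition(partition, unified):
    """Project a partition value onto the unified fields.

    Fields that the writing spec did not have are filled with None.
    """
    return tuple(partition.get(field_id) for field_id, _ in unified)


# Example: the table was first partitioned by day, then a bucket field was added.
specs = [[(1000, "day")], [(1000, "day"), (1001, "bucket")]]
unified = unified_partition_fields(specs)
old_row = coerce_partition({1000: 19000}, unified)
new_row = coerce_partition({1000: 19000, 1001: 7}, unified)
```

With this shape, rows written under an older spec and rows written under the current spec share one schema, which is what lets a single statistics file cover all specs.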





[GitHub] [iceberg] RussellSpitzer commented on a diff in pull request #7105: Spec: Add partition stats spec

Posted by "RussellSpitzer (via GitHub)" <gi...@apache.org>.
RussellSpitzer commented on code in PR #7105:
URL: https://github.com/apache/iceberg/pull/7105#discussion_r1317696525


##########
format/partition-statistics-spec.md:
##########
@@ -0,0 +1,43 @@
+---
+title: "Partition Statistics Spec"
+url: partition-statistics-spec
+toc: true
+disableSidebar: true
+---
+<!--
+ - Licensed to the Apache Software Foundation (ASF) under one or more
+ - contributor license agreements.  See the NOTICE file distributed with
+ - this work for additional information regarding copyright ownership.
+ - The ASF licenses this file to You under the Apache License, Version 2.0
+ - (the "License"); you may not use this file except in compliance with
+ - the License.  You may obtain a copy of the License at
+ -
+ -   http://www.apache.org/licenses/LICENSE-2.0
+ -
+ - Unless required by applicable law or agreed to in writing, software
+ - distributed under the License is distributed on an "AS IS" BASIS,
+ - WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ - See the License for the specific language governing permissions and
+ - limitations under the License.
+ -->
+
+# Partition Statistics file format
+
+This is a specification for partition statistics files. It is designed to store statistics information 
+for every partition value as a row in the **table default format** sorted based on the `partition`.
+
+The schema of the file is as follows:
+
+| Field Id | Field Name                   | Field Type   | isOptional | Doc                                                                                                                         |
+|----------|------------------------------|--------------|------------|-----------------------------------------------------------------------------------------------------------------------------|
+| 1        | partition                    | StructType   | false      | See [PartitionData](https://github.com/apache/iceberg/blob/master/core/src/main/java/org/apache/iceberg/PartitionData.java) |                                                                                  
+| 2        | spec_id                      | IntegerType  | false      | Partition spec id                                                                                                           |
+| 3        | data_record_count            | LongType     | false      | Count of records in data files                                                                                              |                                               
+| 4        | data_file_count              | IntegerType  | false      | Count of data files                                                                                                         |
+| 5        | position_delete_record_count | LongType     | true       | Count of records in position delete files                                                                                   |                                                          
+| 6        | position_delete_file_count   | IntegerType  | true       | Count of position delete files                                                                                              |                                   
+| 7        | equality_delete_record_count | LongType     | true       | Count of records in equality delete files                                                                                   |                                  
+| 8        | equality_delete_file_count   | IntegerType  | true       | Count of equality delete files                                                                                              |                                

Review Comment:
   We would need to read all of the valid positional delete files to know the number of unique deletes for a data file. We wouldn't need to scan data files though, so that's nice.
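A minimal sketch of that counting step, assuming each position delete file has already been read into an iterable of `(file_path, pos)` pairs (the function name and the in-memory representation are hypothetical). The same position can appear in more than one delete file, so the records are deduplicated before counting:

```python
def count_unique_position_deletes(delete_files):
    """Count distinct (file_path, pos) pairs across all position delete files."""
    seen = set()
    for records in delete_files:
        for file_path, pos in records:
            seen.add((file_path, pos))
    return len(seen)


# Two delete files that both delete position 3 of the same data file.
delete_file_a = [("data/f1.parquet", 3), ("data/f1.parquet", 8)]
delete_file_b = [("data/f1.parquet", 3), ("data/f2.parquet", 0)]
unique_deletes = count_unique_position_deletes([delete_file_a, delete_file_b])
```

Only the delete files are read here; no data file is opened.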





[GitHub] [iceberg] flyrain commented on a diff in pull request #7105: Spec: Add partition stats spec

Posted by "flyrain (via GitHub)" <gi...@apache.org>.
flyrain commented on code in PR #7105:
URL: https://github.com/apache/iceberg/pull/7105#discussion_r1317716670


##########
format/partition-statistics-spec.md:
##########
@@ -0,0 +1,43 @@
+---
+title: "Partition Statistics Spec"
+url: partition-statistics-spec
+toc: true
+disableSidebar: true
+---
+<!--
+ - Licensed to the Apache Software Foundation (ASF) under one or more
+ - contributor license agreements.  See the NOTICE file distributed with
+ - this work for additional information regarding copyright ownership.
+ - The ASF licenses this file to You under the Apache License, Version 2.0
+ - (the "License"); you may not use this file except in compliance with
+ - the License.  You may obtain a copy of the License at
+ -
+ -   http://www.apache.org/licenses/LICENSE-2.0
+ -
+ - Unless required by applicable law or agreed to in writing, software
+ - distributed under the License is distributed on an "AS IS" BASIS,
+ - WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ - See the License for the specific language governing permissions and
+ - limitations under the License.
+ -->
+
+# Partition Statistics file format
+
+This is a specification for partition statistics files. It is designed to store statistics information 
+for every partition value as a row in the **table default format** sorted based on the `partition`.
+
+The schema of the file is as follows:
+
+| Field Id | Field Name                   | Field Type   | isOptional | Doc                                                                                                                         |
+|----------|------------------------------|--------------|------------|-----------------------------------------------------------------------------------------------------------------------------|
+| 1        | partition                    | StructType   | false      | See [PartitionData](https://github.com/apache/iceberg/blob/master/core/src/main/java/org/apache/iceberg/PartitionData.java) |                                                                                  
+| 2        | spec_id                      | IntegerType  | false      | Partition spec id                                                                                                           |
+| 3        | data_record_count            | LongType     | false      | Count of records in data files                                                                                              |                                               
+| 4        | data_file_count              | IntegerType  | false      | Count of data files                                                                                                         |
+| 5        | position_delete_record_count | LongType     | true       | Count of records in position delete files                                                                                   |                                                          
+| 6        | position_delete_file_count   | IntegerType  | true       | Count of position delete files                                                                                              |                                   
+| 7        | equality_delete_record_count | LongType     | true       | Count of records in equality delete files                                                                                   |                                  
+| 8        | equality_delete_file_count   | IntegerType  | true       | Count of equality delete files                                                                                              |                                

Review Comment:
   An additional scan of all position delete files in the changed partitions is needed. This may still be expensive for sync operations when a write touches a lot of partitions. 
   
   Admittedly, the precise actual record count is useful. I still think the separate metrics (row count, file size) for data files and delete files are valuable. Users could use these metrics to decide whether a compaction is needed, or to debug why a certain scan is slow. For example, if we see that there are several delete files in a partition, we may compact it for better scan performance. With that, I propose keeping them instead of removing them.
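
As a rough illustration of the compaction use case described above, the following sketch picks partitions whose delete file count or position delete ratio crosses a threshold. The `PartitionStatsRow` class, the `partitionsToCompact` method, and both thresholds are hypothetical and are not part of Iceberg or of this PR; only the counters mirror fields from the proposed schema. The ratio is a coarse estimate, since position delete records are not guaranteed to map one-to-one to live rows.

```java
import java.util.ArrayList;
import java.util.List;

class CompactionCandidates {

    /** Hypothetical in-memory view of one partition statistics row (not an Iceberg class). */
    static final class PartitionStatsRow {
        final String partition;
        final long dataRecordCount;
        final long positionDeleteRecordCount;
        final int positionDeleteFileCount;
        final int equalityDeleteFileCount;

        PartitionStatsRow(String partition, long dataRecordCount, long positionDeleteRecordCount,
                          int positionDeleteFileCount, int equalityDeleteFileCount) {
            this.partition = partition;
            this.dataRecordCount = dataRecordCount;
            this.positionDeleteRecordCount = positionDeleteRecordCount;
            this.positionDeleteFileCount = positionDeleteFileCount;
            this.equalityDeleteFileCount = equalityDeleteFileCount;
        }
    }

    /** Returns partitions with many delete files or a high position delete ratio (assumed thresholds). */
    static List<String> partitionsToCompact(List<PartitionStatsRow> rows,
                                            int maxDeleteFiles,
                                            double maxDeleteRatio) {
        List<String> result = new ArrayList<>();
        for (PartitionStatsRow row : rows) {
            int deleteFiles = row.positionDeleteFileCount + row.equalityDeleteFileCount;
            // Guard against empty partitions to avoid dividing by zero.
            double deleteRatio = row.dataRecordCount == 0
                    ? 0.0
                    : (double) row.positionDeleteRecordCount / row.dataRecordCount;
            if (deleteFiles >= maxDeleteFiles || deleteRatio > maxDeleteRatio) {
                result.add(row.partition);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<PartitionStatsRow> rows = new ArrayList<>();
        rows.add(new PartitionStatsRow("day=2023-03-01", 1000L, 10L, 1, 0));
        rows.add(new PartitionStatsRow("day=2023-03-02", 1000L, 50L, 4, 1));
        rows.add(new PartitionStatsRow("day=2023-03-03", 1000L, 400L, 1, 0));
        System.out.println(partitionsToCompact(rows, 3, 0.2));
    }
}
```

Keeping the data and delete counters separate is what makes this kind of check possible; a single net record count would hide how many delete files a scan has to merge.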





Re: [PR] Spec: Add partition stats spec [iceberg]

Posted by "aokolnychyi (via GitHub)" <gi...@apache.org>.
aokolnychyi commented on PR #7105:
URL: https://github.com/apache/iceberg/pull/7105#issuecomment-1789144339

   Thanks everyone involved! Thanks for pushing this, @ajantha-bhat!
   
   > I will first breakdown the steps/PRs needed for on demand generation in doc or in dev slack channel.
   
   Whatever works for you. If the implementation is simple enough, you may just go ahead and create a PR. If you feel there are a few items to be discussed first, please share the design doc with the dev list.




[GitHub] [iceberg] rdblue commented on a diff in pull request #7105: Spec: Add partition stats spec

Posted by "rdblue (via GitHub)" <gi...@apache.org>.
rdblue commented on code in PR #7105:
URL: https://github.com/apache/iceberg/pull/7105#discussion_r1275213440


##########
format/spec.md:
##########
@@ -702,6 +703,51 @@ Blob metadata is a struct with the following fields:
 | _optional_ | _optional_ | **`properties`** | `map<string, string>` | Additional properties associated with the statistic. Subset of Blob properties in the Puffin file. |
 
 
+#### Partition statistics
+
+Partition statistics files are based on [Partition Statistics file spec](#partition-statistics-file). Partition statistics are informational. A reader can choose to
+ignore partition statistics information. Partition statistics support is not required to read the table correctly. A table can contain
+many partition statistics files associated with different table snapshots.
+A writer can optionally write the partition statistics file during each write operation. If the statistics file is written for the specific snapshot, 
+it must be accurate and must be registered in the table metadata file to be considered as a valid statistics file for the reader.
+
+Partition statistics files metadata within `partition-statistics` table metadata field is a struct with the following fields:
+
+| v1 | v2 | Field name | Type | Description |
+|----|----|------------|------|-------------|
+| _required_ | _required_ | **`snapshot-id`** | `long` | ID of the Iceberg table's snapshot the partition statistics file is associated with. |
+| _required_ | _required_ | **`statistics-file-path`** | `string` | Path of the partition statistics file. See [Partition Statistics file](#partition-statistics-file). |
+| _required_ | _required_ | **`max-data-sequence-number`** | `long` | Maximum data sequence number of the Iceberg table's snapshot the partition statistics was computed from. |
+
+#### Partition Statistics file
+
+Statistics information for every partition tuple is stored as a row in the **table default format**. 
+These rows are sorted based on the first partition column from `partition`. 
+Each unique partition tuple must have exactly one corresponding row, ensuring all partition tuples are present.
+
+Partition statistics file store the statistics as a struct with the following fields:
+
+| v1 | v2 | Field id, name | Type | Description |
+|----|----|----------------|------|-------------|
+| _required_ | _required_ | **`1 partition`** | `struct<..>` | Partition data tuple, schema based on the partition spec output using partition field ids for the struct field ids |

Review Comment:
   What are the field IDs for the struct fields? They match the partition spec, right?





[GitHub] [iceberg] flyrain commented on a diff in pull request #7105: Docs: Add partition stats spec

Posted by "flyrain (via GitHub)" <gi...@apache.org>.
flyrain commented on code in PR #7105:
URL: https://github.com/apache/iceberg/pull/7105#discussion_r1135985618


##########
format/partition-statistics-spec.md:
##########
@@ -0,0 +1,43 @@
+---
+title: "Partition Statistics Spec"
+url: partition-statistics-spec
+toc: true
+disableSidebar: true
+---
+<!--
+ - Licensed to the Apache Software Foundation (ASF) under one or more
+ - contributor license agreements.  See the NOTICE file distributed with
+ - this work for additional information regarding copyright ownership.
+ - The ASF licenses this file to You under the Apache License, Version 2.0
+ - (the "License"); you may not use this file except in compliance with
+ - the License.  You may obtain a copy of the License at
+ -
+ -   http://www.apache.org/licenses/LICENSE-2.0
+ -
+ - Unless required by applicable law or agreed to in writing, software
+ - distributed under the License is distributed on an "AS IS" BASIS,
+ - WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ - See the License for the specific language governing permissions and
+ - limitations under the License.
+ -->
+
+# Partition Statistics file format
+
+This is a specification for partition statistics files. It is designed to store statistics information 
+for every partition value as a row in the **table default format** sorted based on the `partition`.
+
+The schema of the file is as follows:
+
+| Field Id | Field Name              | Field Type   | isOptional | Doc                                                                                                                         |
+|----------|-------------------------|--------------|------------|-----------------------------------------------------------------------------------------------------------------------------|
+| 1        | partition               | StructType   | false      | See [PartitionData](https://github.com/apache/iceberg/blob/master/core/src/main/java/org/apache/iceberg/PartitionData.java) |                                                                                  
+| 2        | spec_id                 | IntegerType  | false      | partition spec id                                                                                                           |
+| 3        | data_record_count       | LongType     | false      | Count of records in data files                                                                                              |                                               
+| 4        | data_file_count         | IntegerType  | false      | Count of data files                                                                                                         |
+| 5        | pos_delete_record_count | LongType     | false      | Count of records in position delete files                                                                                   |                                                          
+| 6        | pos_delete_file_count   | IntegerType  | false      | Count of position delete files                                                                                              |                                   
+| 7        | eq_delete_record_count  | LongType     | false      | Count of records in equality delete files                                                                                   |                                  
+| 8        | eq_delete_file_count    | IntegerType  | false      | Count of equality delete files                                                                                              |                                

Review Comment:
   I don't have a strong opinion on this. It appears to be the same from the user's perspective, while optional columns help reduce storage space. 





[GitHub] [iceberg] flyrain commented on a diff in pull request #7105: Docs: Add partition stats spec

Posted by "flyrain (via GitHub)" <gi...@apache.org>.
flyrain commented on code in PR #7105:
URL: https://github.com/apache/iceberg/pull/7105#discussion_r1135890688


##########
format/partition-statistics-spec.md:
##########
@@ -0,0 +1,43 @@
+---
+title: "Partition Statistics Spec"
+url: partition-statistics-spec
+toc: true
+disableSidebar: true
+---
+<!--
+ - Licensed to the Apache Software Foundation (ASF) under one or more
+ - contributor license agreements.  See the NOTICE file distributed with
+ - this work for additional information regarding copyright ownership.
+ - The ASF licenses this file to You under the Apache License, Version 2.0
+ - (the "License"); you may not use this file except in compliance with
+ - the License.  You may obtain a copy of the License at
+ -
+ -   http://www.apache.org/licenses/LICENSE-2.0
+ -
+ - Unless required by applicable law or agreed to in writing, software
+ - distributed under the License is distributed on an "AS IS" BASIS,
+ - WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ - See the License for the specific language governing permissions and
+ - limitations under the License.
+ -->
+
+# Partition Statistics file format
+
+This is a specification for partition statistics files. It is designed to store statistics information 
+for every partition value as a row in the **table default format** sorted based on the `partition`.
+
+The schema of the file is as follows:
+
+| Field Id | Field Name              | Field Type   | isOptional | Doc                                                                                                                         |
+|----------|-------------------------|--------------|------------|-----------------------------------------------------------------------------------------------------------------------------|
+| 1        | partition               | StructType   | false      | See [PartitionData](https://github.com/apache/iceberg/blob/master/core/src/main/java/org/apache/iceberg/PartitionData.java) |                                                                                  
+| 2        | spec_id                 | IntegerType  | false      | partition spec id                                                                                                           |
+| 3        | data_record_count       | LongType     | false      | Count of records in data files                                                                                              |                                               
+| 4        | data_file_count         | IntegerType  | false      | Count of data files                                                                                                         |
+| 5        | pos_delete_record_count | LongType     | false      | Count of records in position delete files                                                                                   |                                                          
+| 6        | pos_delete_file_count   | IntegerType  | false      | Count of position delete files                                                                                              |                                   
+| 7        | eq_delete_record_count  | LongType     | false      | Count of records in equality delete files                                                                                   |                                  
+| 8        | eq_delete_file_count    | IntegerType  | false      | Count of equality delete files                                                                                              |                                

Review Comment:
   Should these be optional, since v1 tables don't have them?





[GitHub] [iceberg] jackye1995 commented on a diff in pull request #7105: Docs: Add partition stats spec

Posted by "jackye1995 (via GitHub)" <gi...@apache.org>.
jackye1995 commented on code in PR #7105:
URL: https://github.com/apache/iceberg/pull/7105#discussion_r1136015044


##########
format/spec.md:
##########
@@ -702,6 +703,21 @@ Blob metadata is a struct with the following fields:
 | _optional_ | _optional_ | **`properties`** | `map<string, string>` | Additional properties associated with the statistic. Subset of Blob properties in the Puffin file. |
 
 
+#### Partition statistics
+
+Partition statistics files are the valid files based on [Partition statistics spec](../partition-statistics-spec). Partition statistics are informational. A reader can choose to
+ignore partition statistics information. Partition statistics support is not required to read the table correctly. A table can contain
+many partition statistics files associated with different table snapshots.
+
+Partition statistics files metadata within `partition-statistics` table metadata field is a struct with the following fields:
+
+| v1 | v2 | Field name                 | Type     | Description                                                                                           |
+|----|----|----------------------------|----------|-------------------------------------------------------------------------------------------------------|
+| _required_ | _required_ | **`snapshot-id`**          | `long`   | ID of the Iceberg table's snapshot the partition statistics file is associated with.                  |
+| _required_ | _required_ | **`statistics-file-path`** | `string` | Path of the partition statistics file. See [Partition statistics spec](../partition-statistics-spec). |
+| _required_ | _required_ | **`sequence-number`**      | `long`   | Sequence number of the Iceberg table's snapshot the partition statistics was computed from.           |

Review Comment:
   Compaction can write to the starting sequence number. Would that interfere with the usage of this value?





[GitHub] [iceberg] ajantha-bhat commented on a diff in pull request #7105: Spec: Add partition stats spec

Posted by "ajantha-bhat (via GitHub)" <gi...@apache.org>.
ajantha-bhat commented on code in PR #7105:
URL: https://github.com/apache/iceberg/pull/7105#discussion_r1225000154


##########
format/spec.md:
##########
@@ -702,6 +703,44 @@ Blob metadata is a struct with the following fields:
 | _optional_ | _optional_ | **`properties`** | `map<string, string>` | Additional properties associated with the statistic. Subset of Blob properties in the Puffin file. |
 
 
+#### Partition statistics
+
+Partition statistics files are based on [Partition Statistics file spec](#partition-statistics-file). Partition statistics are informational. A reader can choose to
+ignore partition statistics information. Partition statistics support is not required to read the table correctly. A table can contain
+many partition statistics files associated with different table snapshots.
+A writer can optionally write the partition statistics file during each write operation. If the statistics file is written for the specific snapshot, 
+It must be registered in the table metadata file to be considered as a valid statistics file for the reader.
+
+Partition statistics files metadata within `partition-statistics` table metadata field is a struct with the following fields:
+
+| v1 | v2 | Field name | Type | Description |
+|----|----|------------|------|-------------|
+| _required_ | _required_ | **`snapshot-id`** | `long` | ID of the Iceberg table's snapshot the partition statistics file is associated with. |
+| _required_ | _required_ | **`statistics-file-path`** | `string` | Path of the partition statistics file. See [Partition Statistics file](#partition-statistics-file). |
+| _required_ | _required_ | **`max-data-sequence-number`** | `long` | Maximum data sequence number of the Iceberg table's snapshot the partition statistics was computed from. |
+
+#### Partition Statistics file
+
+Statistics information for every partition value is stored as a row in the **table default format** 
+sorted based on the first partition column from `partition`.
+
+Partition statistics file store the statistics as a struct with the following fields:
+
+| v1 | v2 | Field id, name | Type | Description |
+|----|----|----------------|------|-------------|
+| _required_ | _required_ | **`1 partition`** | `struct<..>` | See [PartitionData](https://github.com/apache/iceberg/blob/master/core/src/main/java/org/apache/iceberg/PartitionData.java) |
+| _required_ | _required_ | **`2 spec_id`** | `int` | Partition spec id |
+| _required_ | _required_ | **`3 data_record_count`** | `long` | Count of records in data files |
+| _required_ | _required_ | **`4 data_file_count`** | `int` | Count of data files |
+| _required_ | _required_ | **`5 data_file_size_in_bytes`** | `long` | Total size of data files in bytes |
+| _optional_ | _optional_ | **`6 position_delete_record_count`** | `long` | Count of records in position delete files |
+| _optional_ | _optional_ | **`7 position_delete_file_count`** | `int` | Count of position delete files |
+| _optional_ | _optional_ | **`8 equality_delete_record_count`** | `long` | Count of records in equality delete files |
+| _optional_ | _optional_ | **`9 equality_delete_file_count`** | `int` | Count of equality delete files |
+
+Each snapshot may create a new partition statistics file. The name of the partition statistics file is as follows:       
+`partition-stats-${snapshotId}.${tableDefaultFormat}`

Review Comment:
   Added an example
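
For reference, the naming pattern `partition-stats-${snapshotId}.${tableDefaultFormat}` from the diff above can be sketched as follows. The class and method names are hypothetical, and lowercasing the format to get the extension is an assumption rather than something the spec text states.

```java
import java.util.Locale;

class PartitionStatsFileName {

    /** Builds a file name following partition-stats-${snapshotId}.${tableDefaultFormat}. */
    static String fileName(long snapshotId, String tableDefaultFormat) {
        // Assumption: the extension is the lowercase form of the table default format name.
        return "partition-stats-" + snapshotId + "." + tableDefaultFormat.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        System.out.println(fileName(3055729675574597004L, "PARQUET"));
    }
}
```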



##########
format/spec.md:
##########
@@ -702,6 +703,44 @@ Blob metadata is a struct with the following fields:
 | _optional_ | _optional_ | **`properties`** | `map<string, string>` | Additional properties associated with the statistic. Subset of Blob properties in the Puffin file. |
 
 
+#### Partition statistics
+
+Partition statistics files are based on [Partition Statistics file spec](#partition-statistics-file). Partition statistics are informational. A reader can choose to
+ignore partition statistics information. Partition statistics support is not required to read the table correctly. A table can contain
+many partition statistics files associated with different table snapshots.
+A writer can optionally write the partition statistics file during each write operation. If the statistics file is written for the specific snapshot, 
+It must be registered in the table metadata file to be considered as a valid statistics file for the reader.
+
+Partition statistics files metadata within `partition-statistics` table metadata field is a struct with the following fields:
+
+| v1 | v2 | Field name | Type | Description |
+|----|----|------------|------|-------------|
+| _required_ | _required_ | **`snapshot-id`** | `long` | ID of the Iceberg table's snapshot the partition statistics file is associated with. |
+| _required_ | _required_ | **`statistics-file-path`** | `string` | Path of the partition statistics file. See [Partition Statistics file](#partition-statistics-file). |
+| _required_ | _required_ | **`max-data-sequence-number`** | `long` | Maximum data sequence number of the Iceberg table's snapshot the partition statistics was computed from. |
+
+#### Partition Statistics file
+
+Statistics information for every partition value is stored as a row in the **table default format** 
+sorted based on the first partition column from `partition`.

Review Comment:
   Updated. 





[GitHub] [iceberg] ajantha-bhat commented on a diff in pull request #7105: Spec: Add partition stats spec

Posted by "ajantha-bhat (via GitHub)" <gi...@apache.org>.
ajantha-bhat commented on code in PR #7105:
URL: https://github.com/apache/iceberg/pull/7105#discussion_r1210056041


##########
format/spec.md:
##########
@@ -702,6 +703,21 @@ Blob metadata is a struct with the following fields:
 | _optional_ | _optional_ | **`properties`** | `map<string, string>` | Additional properties associated with the statistic. Subset of Blob properties in the Puffin file. |
 
 
+#### Partition statistics
+
+Partition statistics files are the valid files based on [Partition statistics spec](../partition-statistics-spec). Partition statistics are informational. A reader can choose to
+ignore partition statistics information. Partition statistics support is not required to read the table correctly. A table can contain
+many partition statistics files associated with different table snapshots.

Review Comment:
   Added some points. Please suggest if anything else needs to be added. 





[GitHub] [iceberg] szehon-ho commented on a diff in pull request #7105: Spec: Add partition stats spec

Posted by "szehon-ho (via GitHub)" <gi...@apache.org>.
szehon-ho commented on code in PR #7105:
URL: https://github.com/apache/iceberg/pull/7105#discussion_r1319942204


##########
format/spec.md:
##########
@@ -702,6 +703,49 @@ Blob metadata is a struct with the following fields:
 | _optional_ | _optional_ | **`properties`** | `map<string, string>` | Additional properties associated with the statistic. Subset of Blob properties in the Puffin file. |
 
 
+#### Partition statistics
+
+Partition statistics files are based on [Partition Statistics file spec](#partition-statistics-file). 
+Partition statistics are not required for reading or planning and readers may ignore them.
+Each table snapshot may be associated with at most one partition statistic file.
+A writer can optionally write the partition statistics file during each write operation. If the statistics file is written for the specific snapshot,

Review Comment:
   Just my opinion: I am not sure these two sentences add much value. Isn't this the case for any file reference in Iceberg?
   
     
   ```A writer can optionally write the partition statistics file during each write operation. If the statistics file is written for the specific snapshot, it must be registered in the table metadata file to be considered as a valid statistics file for the reader.```





[GitHub] [iceberg] ajantha-bhat commented on a diff in pull request #7105: Spec: Add partition stats spec

Posted by "ajantha-bhat (via GitHub)" <gi...@apache.org>.
ajantha-bhat commented on code in PR #7105:
URL: https://github.com/apache/iceberg/pull/7105#discussion_r1317133916


##########
format/spec.md:
##########
@@ -702,6 +703,45 @@ Blob metadata is a struct with the following fields:
 | _optional_ | _optional_ | **`properties`** | `map<string, string>` | Additional properties associated with the statistic. Subset of Blob properties in the Puffin file. |
 
 
+#### Partition statistics
+
+Partition statistics files are based on [Partition Statistics file spec](#partition-statistics-file). Partition statistics are informational. A reader can choose to
+ignore partition statistics information. Partition statistics support is not required to read the table correctly. A table can contain
+many partition statistics files associated with different table snapshots.
+A writer can optionally write the partition statistics file during each write operation. If the statistics file is written for the specific snapshot,
+it must be accurate and must be registered in the table metadata file to be considered as a valid statistics file for the reader.
+
+Partition statistics files metadata within `partition-statistics` table metadata field is a struct with the following fields:
+
+| v1 | v2 | Field name | Type | Description |
+|----|----|------------|------|-------------|
+| _required_ | _required_ | **`snapshot-id`** | `long` | ID of the Iceberg table's snapshot the partition statistics file is associated with. |
+| _required_ | _required_ | **`statistics-file-path`** | `string` | Path of the partition statistics file. See [Partition Statistics file](#partition-statistics-file). |
+| _required_ | _required_ | **`max-data-sequence-number`** | `long` | Maximum data sequence number of the Iceberg table's snapshot the partition statistics was computed from. |
+
+#### Partition Statistics file
+
+Statistics information for every partition tuple is stored as a row in the **table default format**.
+These rows are sorted (in ascending manner with NULL FIRST) based on the first partition column from `partition`

Review Comment:
   Complex type (struct) sorting is not supported.
   But we can form a sort order with individual children of the struct. 
   
   I changed it back to sort by entire struct now. 
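
To make the "sort order with individual children of the struct" idea concrete, here is a minimal sketch of a field-by-field comparison in ascending order with nulls first. It models a partition tuple as a plain `List<Object>`, which is an assumption for illustration only; the class and method names are hypothetical and do not reflect how Iceberg represents or sorts partition data.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class PartitionTupleOrder {

    /** Compares two partition tuples child by child: ascending, nulls first. */
    @SuppressWarnings({"unchecked", "rawtypes"})
    static int compareTuples(List<Object> left, List<Object> right) {
        int fields = Math.min(left.size(), right.size());
        for (int i = 0; i < fields; i++) {
            Object l = left.get(i);
            Object r = right.get(i);
            if (l == null && r == null) {
                continue; // both null: move on to the next child
            }
            if (l == null) {
                return -1; // null sorts before any non-null value
            }
            if (r == null) {
                return 1;
            }
            // Assumes both values at the same position share a comparable type.
            int cmp = ((Comparable) l).compareTo(r);
            if (cmp != 0) {
                return cmp;
            }
        }
        return Integer.compare(left.size(), right.size());
    }

    public static void main(String[] args) {
        List<List<Object>> rows = new ArrayList<>();
        rows.add(Arrays.<Object>asList("b", 1));
        rows.add(Arrays.<Object>asList(null, 5));
        rows.add(Arrays.<Object>asList("a", 2));
        rows.add(Arrays.<Object>asList("a", null));
        rows.sort(PartitionTupleOrder::compareTuples);
        System.out.println(rows);
    }
}
```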





[GitHub] [iceberg] ajantha-bhat commented on a diff in pull request #7105: Spec: Add partition stats spec

Posted by "ajantha-bhat (via GitHub)" <gi...@apache.org>.
ajantha-bhat commented on code in PR #7105:
URL: https://github.com/apache/iceberg/pull/7105#discussion_r1317409889


##########
format/spec.md:
##########
@@ -702,6 +703,41 @@ Blob metadata is a struct with the following fields:
 | _optional_ | _optional_ | **`properties`** | `map<string, string>` | Additional properties associated with the statistic. Subset of Blob properties in the Puffin file. |
 
 
+#### Partition statistics
+
+Partition statistics files are based on [Partition Statistics file spec](#partition-statistics-file). Partition statistics are informational. A reader can choose to
+ignore partition statistics information. Partition statistics support is not required to read the table correctly.
+Each table snapshot may be associated with at most one partition statistic file and the table can contain many partition statistics files associated with different table snapshots.
+A writer can optionally write the partition statistics file during each write operation. If the statistics file is written for the specific snapshot,
+it must be accurate and must be registered in the table metadata file to be considered as a valid statistics file for the reader.
+
+Partition statistics files metadata within `partition-statistics` table metadata field is a struct with the following fields:
+
+| v1 | v2 | Field name | Type | Description |
+|----|----|------------|------|-------------|
+| _required_ | _required_ | **`snapshot-id`** | `long` | ID of the Iceberg table's snapshot the partition statistics file is associated with. |
+| _required_ | _required_ | **`statistics-file-path`** | `string` | Path of the partition statistics file. See [Partition Statistics file](#partition-statistics-file). |
+| _required_ | _required_ | **`max-data-sequence-number`** | `long` | Maximum data sequence number of the Iceberg table's snapshot the partition statistics was computed from. |
+
+#### Partition Statistics file
+
+Statistics information for every partition tuple is stored as a row in the **table default format**.
+These rows are sorted (in ascending manner with NULL FIRST) based on all partition columns from `partition` in the same order
+to optimize filtering rows while scanning.
+Each unique partition tuple must have exactly one corresponding row, ensuring that statistics for all partitions are present.
+
+A partition statistics file stores statistics as a struct with the following fields:

Review Comment:
   The first line (L 724) already states that the rows are stored in the **table default format**.





[GitHub] [iceberg] szehon-ho commented on a diff in pull request #7105: Spec: Add partition stats spec

Posted by "szehon-ho (via GitHub)" <gi...@apache.org>.
szehon-ho commented on code in PR #7105:
URL: https://github.com/apache/iceberg/pull/7105#discussion_r1318005645


##########
format/partition-statistics-spec.md:
##########
@@ -0,0 +1,43 @@
+---
+title: "Partition Statistics Spec"
+url: partition-statistics-spec
+toc: true
+disableSidebar: true
+---
+<!--
+ - Licensed to the Apache Software Foundation (ASF) under one or more
+ - contributor license agreements.  See the NOTICE file distributed with
+ - this work for additional information regarding copyright ownership.
+ - The ASF licenses this file to You under the Apache License, Version 2.0
+ - (the "License"); you may not use this file except in compliance with
+ - the License.  You may obtain a copy of the License at
+ -
+ -   http://www.apache.org/licenses/LICENSE-2.0
+ -
+ - Unless required by applicable law or agreed to in writing, software
+ - distributed under the License is distributed on an "AS IS" BASIS,
+ - WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ - See the License for the specific language governing permissions and
+ - limitations under the License.
+ -->
+
+# Partition Statistics file format
+
+This is a specification for partition statistics files. It is designed to store statistics information 
+for every partition value as a row in the **table default format** sorted based on the `partition`.
+
+The schema of the file is as follows:
+
+| Field Id | Field Name                   | Field Type   | isOptional | Doc                                                                                                                         |
+|----------|------------------------------|--------------|------------|-----------------------------------------------------------------------------------------------------------------------------|
+| 1        | partition                    | StructType   | false      | See [PartitionData](https://github.com/apache/iceberg/blob/master/core/src/main/java/org/apache/iceberg/PartitionData.java) |                                                                                  
+| 2        | spec_id                      | IntegerType  | false      | Partition spec id                                                                                                           |
+| 3        | data_record_count            | LongType     | false      | Count of records in data files                                                                                              |                                               
+| 4        | data_file_count              | IntegerType  | false      | Count of data files                                                                                                         |
+| 5        | position_delete_record_count | LongType     | true       | Count of records in position delete files                                                                                   |                                                          
+| 6        | position_delete_file_count   | IntegerType  | true       | Count of position delete files                                                                                              |                                   
+| 7        | equality_delete_record_count | LongType     | true       | Count of records in equality delete files                                                                                   |                                  
+| 8        | equality_delete_file_count   | IntegerType  | true       | Count of equality delete files                                                                                              |                                

Review Comment:
   > Partitions metadata table is also an async call w.r.t write. But we still don't apply the actual delete files.
   @szehon-ho: Do you have any suggestions?
   
   I am still not very convinced that the Partitions metadata table should have that column, as it would really reduce the performance of that table. It's a metadata table, so it's sync in the sense that it needs to be calculated when the user queries it.
   
   I slightly prefer having more stats rather than fewer, i.e. total_record_count, data_file_record_count, eq_delete_record_count, pos_delete_record_count. Internally we can definitely make use of the delete record counts in planning, for example to decide whether to cache deletes. Users can also use them, as @flyrain mentioned. It is true that we can already access all file record counts via manifest entries, so maybe an argument can be made against having them here. But as they are cheap to get, I don't see a strong reason to hide them from the user in partition stats.
   
   The total_record_count seems the most valuable but is expensive to compute. I also agree with @flyrain that the total count may not be available in the writer. For example, with row-level operation writers, I am not sure how the writer would know how many total data records are in the partition after the write.
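
To make the "cheap to get" point concrete, here is a rough Python sketch of how the per-file counts could be rolled up per `(partition, spec_id)` in a single pass, without applying any delete files. Everything here is an assumption for illustration: `FileEntry` is a hypothetical stand-in for a manifest entry, the `content` strings are made up, and only the output field names follow the schema quoted above. A true total_record_count is deliberately absent, since it would require applying the deletes.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, Tuple


@dataclass(frozen=True)
class FileEntry:
    """Hypothetical, simplified stand-in for a manifest entry."""
    partition: Tuple
    spec_id: int
    content: str  # assumed values: "data", "position_deletes", "equality_deletes"
    record_count: int


# Maps the assumed content kinds to the field-name prefixes of the schema.
FIELD_PREFIX = {
    "data": "data",
    "position_deletes": "position_delete",
    "equality_deletes": "equality_delete",
}


def aggregate_partition_stats(
    entries: Iterable[FileEntry],
) -> Dict[Tuple[Tuple, int], Dict[str, int]]:
    """Sum record and file counts per (partition, spec_id) in one pass."""
    stats = defaultdict(lambda: defaultdict(int))
    for entry in entries:
        prefix = FIELD_PREFIX[entry.content]
        row = stats[(entry.partition, entry.spec_id)]
        row[f"{prefix}_record_count"] += entry.record_count
        row[f"{prefix}_file_count"] += 1
    # Convert nested defaultdicts to plain dicts for the caller.
    return {key: dict(row) for key, row in stats.items()}


entries = [
    FileEntry(("2023-03-14",), 0, "data", 100),
    FileEntry(("2023-03-14",), 0, "data", 50),
    FileEntry(("2023-03-14",), 0, "position_deletes", 7),
]
print(aggregate_partition_stats(entries))
```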
   
   





[GitHub] [iceberg] flyrain commented on a diff in pull request #7105: Spec: Add partition stats spec

Posted by "flyrain (via GitHub)" <gi...@apache.org>.
flyrain commented on code in PR #7105:
URL: https://github.com/apache/iceberg/pull/7105#discussion_r1316339821


##########
format/spec.md:
##########
@@ -702,6 +703,45 @@ Blob metadata is a struct with the following fields:
 | _optional_ | _optional_ | **`properties`** | `map<string, string>` | Additional properties associated with the statistic. Subset of Blob properties in the Puffin file. |
 
 
+#### Partition statistics
+
+Partition statistics files are based on [Partition Statistics file spec](#partition-statistics-file). Partition statistics are informational. A reader can choose to
+ignore partition statistics information. Partition statistics support is not required to read the table correctly. A table can contain

Review Comment:
   Can we be more specific by saying the following?
   ```
   Each table snapshot will have either a single partition statistic file or none at all.
   ```
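
As a sketch of what the suggested wording implies for the `partition-statistics` list in table metadata, the check below reports any snapshot that is registered with more than one partition statistics file. It is a hedged Python illustration only: `duplicated_snapshot_ids` is a hypothetical helper, the entries are modeled as plain dicts, the paths are made up, and the field names are the ones used in this revision of the PR.

```python
from collections import Counter
from typing import Dict, List


def duplicated_snapshot_ids(partition_statistics: List[Dict]) -> List[int]:
    """Return snapshot IDs that appear in more than one entry, sorted ascending."""
    counts = Counter(entry["snapshot-id"] for entry in partition_statistics)
    return sorted(sid for sid, n in counts.items() if n > 1)


# Hypothetical metadata entries; snapshot 1 is registered twice.
entries = [
    {"snapshot-id": 1, "statistics-file-path": "stats/a"},
    {"snapshot-id": 2, "statistics-file-path": "stats/b"},
    {"snapshot-id": 1, "statistics-file-path": "stats/c"},
]
print(duplicated_snapshot_ids(entries))
```

An empty result means every snapshot has either a single partition statistics file or none at all.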





[GitHub] [iceberg] aokolnychyi commented on a diff in pull request #7105: Spec: Add partition stats spec

Posted by "aokolnychyi (via GitHub)" <gi...@apache.org>.
aokolnychyi commented on code in PR #7105:
URL: https://github.com/apache/iceberg/pull/7105#discussion_r1331845491


##########
format/spec.md:
##########
@@ -702,6 +703,58 @@ Blob metadata is a struct with the following fields:
 | _optional_ | _optional_ | **`properties`** | `map<string, string>` | Additional properties associated with the statistic. Subset of Blob properties in the Puffin file. |
 
 
+#### Partition statistics
+
+Partition statistics files are based on [Partition Statistics file spec](#partition-statistics-file). 
+Partition statistics are not required for reading or planning and readers may ignore them.
+Each table snapshot may be associated with at most one partition statistic file.
+A writer can optionally write the partition statistics file during each write operation, and 
+it must be registered in the table metadata file to be considered as a valid statistics file for the reader.
+
+`partition-statistics` field of table metadata is an optional list of struct with the following fields:
+
+| v1 | v2 | Field name | Type | Description |
+|----|----|------------|------|-------------|
+| _required_ | _required_ | **`snapshot-id`** | `long` | ID of the Iceberg table's snapshot the partition statistics file is associated with. |
+| _required_ | _required_ | **`statistics-file-path`** | `string` | Path of the partition statistics file. See [Partition Statistics file](#partition-statistics-file). |

Review Comment:
   It looks like we call it `statistics-path` in table stats.
   Is there a particular reason to call it `statistics-file-path` here?





[GitHub] [iceberg] aokolnychyi commented on a diff in pull request #7105: Spec: Add partition stats spec

Posted by "aokolnychyi (via GitHub)" <gi...@apache.org>.
aokolnychyi commented on code in PR #7105:
URL: https://github.com/apache/iceberg/pull/7105#discussion_r1331843365


##########
format/spec.md:
##########
@@ -702,6 +703,58 @@ Blob metadata is a struct with the following fields:
 | _optional_ | _optional_ | **`properties`** | `map<string, string>` | Additional properties associated with the statistic. Subset of Blob properties in the Puffin file. |
 
 
+#### Partition statistics
+
+Partition statistics files are based on [Partition Statistics file spec](#partition-statistics-file). 
+Partition statistics are not required for reading or planning and readers may ignore them.
+Each table snapshot may be associated with at most one partition statistic file.
+A writer can optionally write the partition statistics file during each write operation, and 
+it must be registered in the table metadata file to be considered as a valid statistics file for the reader.
+
+`partition-statistics` field of table metadata is an optional list of struct with the following fields:
+
+| v1 | v2 | Field name | Type | Description |

Review Comment:
   Does this need to account for encryption and store `key-metadata`? Just asking, no need to add it.





[GitHub] [iceberg] rdblue commented on a diff in pull request #7105: Spec: Add partition stats spec

Posted by "rdblue (via GitHub)" <gi...@apache.org>.
rdblue commented on code in PR #7105:
URL: https://github.com/apache/iceberg/pull/7105#discussion_r1209507462


##########
format/partition-statistics-spec.md:
##########
@@ -0,0 +1,43 @@
+---
+title: "Partition Statistics Spec"
+url: partition-statistics-spec
+toc: true
+disableSidebar: true
+---
+<!--
+ - Licensed to the Apache Software Foundation (ASF) under one or more
+ - contributor license agreements.  See the NOTICE file distributed with
+ - this work for additional information regarding copyright ownership.
+ - The ASF licenses this file to You under the Apache License, Version 2.0
+ - (the "License"); you may not use this file except in compliance with
+ - the License.  You may obtain a copy of the License at
+ -
+ -   http://www.apache.org/licenses/LICENSE-2.0
+ -
+ - Unless required by applicable law or agreed to in writing, software
+ - distributed under the License is distributed on an "AS IS" BASIS,
+ - WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ - See the License for the specific language governing permissions and
+ - limitations under the License.
+ -->
+
+# Partition Statistics file format

Review Comment:
   For other metadata files, we don't have separate spec documents. Could you incorporate this into the Iceberg spec?





[GitHub] [iceberg] rdblue commented on a diff in pull request #7105: Spec: Add partition stats spec

Posted by "rdblue (via GitHub)" <gi...@apache.org>.
rdblue commented on code in PR #7105:
URL: https://github.com/apache/iceberg/pull/7105#discussion_r1209508695


##########
format/spec.md:
##########
@@ -702,6 +703,21 @@ Blob metadata is a struct with the following fields:
 | _optional_ | _optional_ | **`properties`** | `map<string, string>` | Additional properties associated with the statistic. Subset of Blob properties in the Puffin file. |
 
 
+#### Partition statistics
+
+Partition statistics files are the valid files based on [Partition statistics spec](../partition-statistics-spec). Partition statistics are informational. A reader can choose to
+ignore partition statistics information. Partition statistics support is not required to read the table correctly. A table can contain
+many partition statistics files associated with different table snapshots.
+
+Partition statistics files metadata within `partition-statistics` table metadata field is a struct with the following fields:
+
+| v1 | v2 | Field name                 | Type     | Description                                                                                           |
+|----|----|----------------------------|----------|-------------------------------------------------------------------------------------------------------|

Review Comment:
   No need for the extra whitespace on these lines.





[GitHub] [iceberg] ajantha-bhat commented on a diff in pull request #7105: Spec: Add partition stats spec

Posted by "ajantha-bhat (via GitHub)" <gi...@apache.org>.
ajantha-bhat commented on code in PR #7105:
URL: https://github.com/apache/iceberg/pull/7105#discussion_r1209641992


##########
format/partition-statistics-spec.md:
##########
@@ -0,0 +1,43 @@
+---
+title: "Partition Statistics Spec"
+url: partition-statistics-spec
+toc: true
+disableSidebar: true
+---
+<!--
+ - Licensed to the Apache Software Foundation (ASF) under one or more
+ - contributor license agreements.  See the NOTICE file distributed with
+ - this work for additional information regarding copyright ownership.
+ - The ASF licenses this file to You under the Apache License, Version 2.0
+ - (the "License"); you may not use this file except in compliance with
+ - the License.  You may obtain a copy of the License at
+ -
+ -   http://www.apache.org/licenses/LICENSE-2.0
+ -
+ - Unless required by applicable law or agreed to in writing, software
+ - distributed under the License is distributed on an "AS IS" BASIS,
+ - WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ - See the License for the specific language governing permissions and
+ - limitations under the License.
+ -->
+
+# Partition Statistics file format
+
+This is a specification for partition statistics files. It is designed to store statistics information 
+for every partition value as a row in the **table default format** sorted based on the `partition`.
+
+The schema of the file is as follows:
+
+| Field Id | Field Name                   | Field Type   | isOptional | Doc                                                                                                                         |
+|----------|------------------------------|--------------|------------|-----------------------------------------------------------------------------------------------------------------------------|
+| 1        | partition                    | StructType   | false      | See [PartitionData](https://github.com/apache/iceberg/blob/master/core/src/main/java/org/apache/iceberg/PartitionData.java) |                                                                                  
+| 2        | spec_id                      | IntegerType  | false      | Partition spec id                                                                                                           |
+| 3        | data_record_count            | LongType     | false      | Count of records in data files                                                                                              |                                               
+| 4        | data_file_count              | IntegerType  | false      | Count of data files                                                                                                         |
+| 5        | position_delete_record_count | LongType     | true       | Count of records in position delete files                                                                                   |                                                          
+| 6        | position_delete_file_count   | IntegerType  | true       | Count of position delete files                                                                                              |                                   
+| 7        | equality_delete_record_count | LongType     | true       | Count of records in equality delete files                                                                                   |                                  
+| 8        | equality_delete_file_count   | IntegerType  | true       | Count of equality delete files                                                                                              |                                

Review Comment:
   I think computing the actual count can be a very expensive operation, especially when equality deletes are involved.
   I have followed the same schema that I implemented for the `Partitions` metadata table.





[GitHub] [iceberg] ajantha-bhat commented on a diff in pull request #7105: Spec: Add partition stats spec

Posted by "ajantha-bhat (via GitHub)" <gi...@apache.org>.
ajantha-bhat commented on code in PR #7105:
URL: https://github.com/apache/iceberg/pull/7105#discussion_r1136597924


##########
format/partition-statistics-spec.md:
##########
@@ -0,0 +1,43 @@
+---
+title: "Partition Statistics Spec"
+url: partition-statistics-spec
+toc: true
+disableSidebar: true
+---
+<!--
+ - Licensed to the Apache Software Foundation (ASF) under one or more
+ - contributor license agreements.  See the NOTICE file distributed with
+ - this work for additional information regarding copyright ownership.
+ - The ASF licenses this file to You under the Apache License, Version 2.0
+ - (the "License"); you may not use this file except in compliance with
+ - the License.  You may obtain a copy of the License at
+ -
+ -   http://www.apache.org/licenses/LICENSE-2.0
+ -
+ - Unless required by applicable law or agreed to in writing, software
+ - distributed under the License is distributed on an "AS IS" BASIS,
+ - WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ - See the License for the specific language governing permissions and
+ - limitations under the License.
+ -->
+
+# Partition Statistics file format
+
+This is a specification for partition statistics files. It is designed to store statistics information 
+for every partition value as a row in the **table default format** sorted based on the `partition`.
+
+The schema of the file is as follows:
+
+| Field Id | Field Name              | Field Type   | isOptional | Doc                                                                                                                         |
+|----------|-------------------------|--------------|------------|-----------------------------------------------------------------------------------------------------------------------------|
+| 1        | partition               | StructType   | false      | See [PartitionData](https://github.com/apache/iceberg/blob/master/core/src/main/java/org/apache/iceberg/PartitionData.java) |                                                                                  
+| 2        | spec_id                 | IntegerType  | false      | partition spec id                                                                                                           |
+| 3        | data_record_count       | LongType     | false      | Count of records in data files                                                                                              |                                               
+| 4        | data_file_count         | IntegerType  | false      | Count of data files                                                                                                         |
+| 5        | pos_delete_record_count | LongType     | false      | Count of records in position delete files                                                                                   |                                                          
+| 6        | pos_delete_file_count   | IntegerType  | false      | Count of position delete files                                                                                              |                                   
+| 7        | eq_delete_record_count  | LongType     | false      | Count of records in equality delete files                                                                                   |                                  
+| 8        | eq_delete_file_count    | IntegerType  | false      | Count of equality delete files                                                                                              |                                

Review Comment:
   I have updated the optional fields now. 





[GitHub] [iceberg] jackye1995 commented on pull request #7105: Spec: Add partition stats spec

Posted by "jackye1995 (via GitHub)" <gi...@apache.org>.
jackye1995 commented on PR #7105:
URL: https://github.com/apache/iceberg/pull/7105#issuecomment-1482226297

   > We may end up supporting both copy-on-write (single output file) and merge-on-read (multiple output files) as it boils down to whether the user wants faster read time or writing time.
   
   This makes me wonder whether we should really use file storage for this. Technically, it sounds like we should use a key-value store for `(partition, spec-id) -> (statistics values)` instead of file-based storage.
   
   Maybe we should build an interface for it and make file storage just one specific implementation, instead of saying it has to be backed by a file or multiple files, because that way it has to be inefficient in either read or write. Just a thought.
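
A minimal sketch of what such an interface might look like, with storage as one pluggable implementation, is given below. This is purely illustrative Python: `PartitionStatsStore` and `InMemoryPartitionStatsStore` are hypothetical names that do not exist in Iceberg, and the in-memory dict only stands in for whatever file-based or key-value backend would actually be used.

```python
from abc import ABC, abstractmethod
from typing import Dict, Optional, Tuple

Stats = Dict[str, int]


class PartitionStatsStore(ABC):
    """Hypothetical interface: (partition, spec-id) -> statistics values."""

    @abstractmethod
    def get(self, partition: Tuple, spec_id: int) -> Optional[Stats]:
        """Return the stats for one partition, or None if none are stored."""

    @abstractmethod
    def put(self, partition: Tuple, spec_id: int, stats: Stats) -> None:
        """Insert or overwrite the stats for one partition."""


class InMemoryPartitionStatsStore(PartitionStatsStore):
    """Toy implementation; a file-backed or key-value backend would replace it."""

    def __init__(self) -> None:
        self._rows: Dict[Tuple[Tuple, int], Stats] = {}

    def get(self, partition: Tuple, spec_id: int) -> Optional[Stats]:
        return self._rows.get((partition, spec_id))

    def put(self, partition: Tuple, spec_id: int, stats: Stats) -> None:
        # Store a copy so later mutation by the caller does not leak in.
        self._rows[(partition, spec_id)] = dict(stats)


store = InMemoryPartitionStatsStore()
store.put(("2023-03-14",), 0, {"data_record_count": 150, "data_file_count": 2})
print(store.get(("2023-03-14",), 0))
```

With this shape, the copy-on-write versus merge-on-read trade-off becomes an implementation detail of the backend rather than part of the contract seen by readers.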




Re: [PR] Spec: Add partition stats spec [iceberg]

Posted by "ajantha-bhat (via GitHub)" <gi...@apache.org>.
ajantha-bhat commented on code in PR #7105:
URL: https://github.com/apache/iceberg/pull/7105#discussion_r1352485101


##########
format/spec.md:
##########
@@ -702,6 +703,58 @@ Blob metadata is a struct with the following fields:
 | _optional_ | _optional_ | **`properties`** | `map<string, string>` | Additional properties associated with the statistic. Subset of Blob properties in the Puffin file. |
 
 
+#### Partition statistics
+
+Partition statistics files are based on [Partition Statistics file spec](#partition-statistics-file). 

Review Comment:
   True, changed to capitalize only the headers.





Re: [PR] Spec: Add partition stats spec [iceberg]

Posted by "aokolnychyi (via GitHub)" <gi...@apache.org>.
aokolnychyi commented on PR #7105:
URL: https://github.com/apache/iceberg/pull/7105#issuecomment-1756079392

   I added this PR to our community sync. I am not sure I will be there this week but I'll sync with Russell and Yufei afterwards.




Re: [PR] Spec: Add partition stats spec [iceberg]

Posted by "ajantha-bhat (via GitHub)" <gi...@apache.org>.
ajantha-bhat commented on code in PR #7105:
URL: https://github.com/apache/iceberg/pull/7105#discussion_r1352500873


##########
format/spec.md:
##########
@@ -702,6 +703,58 @@ Blob metadata is a struct with the following fields:
 | _optional_ | _optional_ | **`properties`** | `map<string, string>` | Additional properties associated with the statistic. Subset of Blob properties in the Puffin file. |
 
 
+#### Partition statistics
+
+Partition statistics files are based on [Partition Statistics file spec](#partition-statistics-file). 
+Partition statistics are not required for reading or planning and readers may ignore them.
+Each table snapshot may be associated with at most one partition statistic file.
+A writer can optionally write the partition statistics file during each write operation, and 
+it must be registered in the table metadata file to be considered as a valid statistics file for the reader.
+
+`partition-statistics` field of table metadata is an optional list of struct with the following fields:
+
+| v1 | v2 | Field name | Type | Description |
+|----|----|------------|------|-------------|
+| _required_ | _required_ | **`snapshot-id`** | `long` | ID of the Iceberg table's snapshot the partition statistics file is associated with. |
+| _required_ | _required_ | **`statistics-file-path`** | `string` | Path of the partition statistics file. See [Partition Statistics file](#partition-statistics-file). |
+| _required_ | _required_ | **`file-size-in-bytes`** | `long` | Size of the partition statistics file. |
+
+#### Partition Statistics file

Review Comment:
   Yes, updated.
   
   Also updated the header `Table statistics` -> `Table Statistics`.





Re: [PR] Spec: Add partition stats spec [iceberg]

Posted by "ajantha-bhat (via GitHub)" <gi...@apache.org>.
ajantha-bhat commented on code in PR #7105:
URL: https://github.com/apache/iceberg/pull/7105#discussion_r1352477202


##########
format/spec.md:
##########
@@ -702,6 +703,58 @@ Blob metadata is a struct with the following fields:
 | _optional_ | _optional_ | **`properties`** | `map<string, string>` | Additional properties associated with the statistic. Subset of Blob properties in the Puffin file. |
 
 
+#### Partition statistics
+
+Partition statistics files are based on [Partition Statistics file spec](#partition-statistics-file). 
+Partition statistics are not required for reading or planning and readers may ignore them.
+Each table snapshot may be associated with at most one partition statistic file.
+A writer can optionally write the partition statistics file during each write operation, and 
+it must be registered in the table metadata file to be considered as a valid statistics file for the reader.
+
+`partition-statistics` field of table metadata is an optional list of struct with the following fields:
+
+| v1 | v2 | Field name | Type | Description |
+|----|----|------------|------|-------------|
+| _required_ | _required_ | **`snapshot-id`** | `long` | ID of the Iceberg table's snapshot the partition statistics file is associated with. |
+| _required_ | _required_ | **`statistics-file-path`** | `string` | Path of the partition statistics file. See [Partition Statistics file](#partition-statistics-file). |
+| _required_ | _required_ | **`file-size-in-bytes`** | `long` | Size of the partition statistics file. |
+
+#### Partition Statistics file
+
+Statistics information for each unique partition tuple is stored as a row in the default data file format of the table (for example, Parquet or ORC).
+These rows must be sorted (in ascending manner with NULL FIRST) by `partition` field to optimize filtering rows while scanning.
+
+The schema of the partition statistics file is as follows:
+
+| v1 | v2 | Field id, name | Type | Description |
+|----|----|----------------|------|-------------|
+| _required_ | _required_ | **`1 partition`** | `struct<..>` | Partition data tuple, schema based on the unified partition type considering all specs in a table |
+| _required_ | _required_ | **`2 spec_id`** | `int` | Partition spec id |
+| _required_ | _required_ | **`3 data_record_count`** | `long` | Count of records in data files |
+| _required_ | _required_ | **`4 data_file_count`** | `int` | Count of data files |
+| _required_ | _required_ | **`5 total_data_file_size_in_bytes`** | `long` | Total size of data files in bytes |
+| _optional_ | _optional_ | **`6 position_delete_record_count`** | `long` | Count of records in position delete files |
+| _optional_ | _optional_ | **`7 position_delete_file_count`** | `int` | Count of position delete files |
+| _optional_ | _optional_ | **`8 equality_delete_record_count`** | `long` | Count of records in equality delete files |
+| _optional_ | _optional_ | **`9 equality_delete_file_count`** | `int` | Count of equality delete files |
+| _optional_ | _optional_ | **`10 total_record_count`** | `long` | Accurate count of records in a partition after applying the delete files if any |

Review Comment:
   You are right. That is why the field is optional in the schema.
   
   The implementation will not populate it by default. Whether it is populated can be controlled by a property or by the way the file is written: for example, an asynchronous write can compute it, but incremental synchronous writes will not.
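
To make the behaviour described above concrete, here is a minimal Python sketch, not the Iceberg implementation. It aggregates per-partition rows using the field names from the schema table in the diff and leaves `total_record_count` unset unless the caller supplies exact counts. `FileEntry`, `PartitionStats`, `compute_stats`, and the `live_counts` argument are hypothetical names introduced only for illustration, and the partition tuple is simplified to a tuple of optional integers.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Optional, Tuple

Partition = Tuple[Optional[int], ...]  # simplified stand-in for the partition struct


@dataclass
class FileEntry:  # hypothetical: one live file tracked by the snapshot
    partition: Partition
    spec_id: int
    content: str  # "data", "position_deletes", or "equality_deletes"
    record_count: int
    file_size_in_bytes: int


@dataclass
class PartitionStats:  # field names follow the schema table in the diff
    partition: Partition
    spec_id: int
    data_record_count: int = 0
    data_file_count: int = 0
    total_data_file_size_in_bytes: int = 0
    position_delete_record_count: int = 0
    position_delete_file_count: int = 0
    equality_delete_record_count: int = 0
    equality_delete_file_count: int = 0
    total_record_count: Optional[int] = None  # optional: unset by default


def sort_key(partition: Partition):
    # Ascending with nulls first: a null maps to (0, 0), a value to (1, value).
    return tuple((0, 0) if v is None else (1, v) for v in partition)


def compute_stats(
    files: Iterable[FileEntry],
    live_counts: Optional[Dict[Partition, int]] = None,
) -> List[PartitionStats]:
    stats: Dict[Partition, PartitionStats] = {}
    for f in files:
        row = stats.setdefault(f.partition, PartitionStats(f.partition, f.spec_id))
        if f.content == "data":
            row.data_record_count += f.record_count
            row.data_file_count += 1
            row.total_data_file_size_in_bytes += f.file_size_in_bytes
        elif f.content == "position_deletes":
            row.position_delete_record_count += f.record_count
            row.position_delete_file_count += 1
        elif f.content == "equality_deletes":
            row.equality_delete_record_count += f.record_count
            row.equality_delete_file_count += 1
    rows = sorted(stats.values(), key=lambda r: sort_key(r.partition))
    # Only a caller that has applied the deletes (assumed here to be an
    # asynchronous job) passes exact counts; otherwise the field stays None.
    if live_counts is not None:
        for row in rows:
            row.total_record_count = live_counts.get(row.partition)
    return rows
```

Calling `compute_stats(files)` models the incremental path, while `compute_stats(files, live_counts)` models a writer that has already applied the delete files and knows the exact row count per partition.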





[GitHub] [iceberg] rdblue commented on a diff in pull request #7105: Spec: Add partition stats spec

Posted by "rdblue (via GitHub)" <gi...@apache.org>.
rdblue commented on code in PR #7105:
URL: https://github.com/apache/iceberg/pull/7105#discussion_r1275214268


##########
format/spec.md:
##########
@@ -702,6 +703,51 @@ Blob metadata is a struct with the following fields:
 | _optional_ | _optional_ | **`properties`** | `map<string, string>` | Additional properties associated with the statistic. Subset of Blob properties in the Puffin file. |
 
 
+#### Partition statistics
+
+Partition statistics files are based on [Partition Statistics file spec](#partition-statistics-file). Partition statistics are informational. A reader can choose to
+ignore partition statistics information. Partition statistics support is not required to read the table correctly. A table can contain
+many partition statistics files associated with different table snapshots.
+A writer can optionally write the partition statistics file during each write operation. If the statistics file is written for the specific snapshot, 
+it must be accurate and must be registered in the table metadata file to be considered as a valid statistics file for the reader.
+
+Partition statistics files metadata within `partition-statistics` table metadata field is a struct with the following fields:
+
+| v1 | v2 | Field name | Type | Description |
+|----|----|------------|------|-------------|
+| _required_ | _required_ | **`snapshot-id`** | `long` | ID of the Iceberg table's snapshot the partition statistics file is associated with. |
+| _required_ | _required_ | **`statistics-file-path`** | `string` | Path of the partition statistics file. See [Partition Statistics file](#partition-statistics-file). |
+| _required_ | _required_ | **`max-data-sequence-number`** | `long` | Maximum data sequence number of the Iceberg table's snapshot the partition statistics was computed from. |
+
+#### Partition Statistics file
+
+Statistics information for every partition tuple is stored as a row in the **table default format**. 
+These rows are sorted based on the first partition column from `partition`. 
+Each unique partition tuple must have exactly one corresponding row, ensuring all partition tuples are present.
+
+Partition statistics file store the statistics as a struct with the following fields:
+
+| v1 | v2 | Field id, name | Type | Description |
+|----|----|----------------|------|-------------|
+| _required_ | _required_ | **`1 partition`** | `struct<..>` | Partition data tuple, schema based on the partition spec output using partition field ids for the struct field ids |
+| _required_ | _required_ | **`2 spec_id`** | `int` | Partition spec id |
+| _required_ | _required_ | **`3 data_record_count`** | `long` | Count of records in data files |
+| _required_ | _required_ | **`4 data_file_count`** | `int` | Count of data files |
+| _required_ | _required_ | **`5 total_data_file_size_in_bytes`** | `long` | Total size of data files in bytes |
+| _optional_ | _optional_ | **`6 position_delete_record_count`** | `long` | Count of records in position delete files |
+| _optional_ | _optional_ | **`7 position_delete_file_count`** | `int` | Count of position delete files |
+| _optional_ | _optional_ | **`8 equality_delete_record_count`** | `long` | Count of records in equality delete files |
+| _optional_ | _optional_ | **`9 equality_delete_file_count`** | `int` | Count of equality delete files |
+| _optional_ | _optional_ | **`10 last_updated_at`** | `timestamptz` | Commit time of snapshot that last updated this partition |

Review Comment:
   In other places, we use a timestamp in milliseconds from the Unix epoch internally. Why not use that here?
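
As an illustration of the alternative suggested above, the following Python sketch stores a commit time as a `long` count of milliseconds since the Unix epoch instead of a `timestamptz` value. The helper names `to_epoch_millis` and `from_epoch_millis` are hypothetical and are not part of any Iceberg API.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_epoch_millis(ts: datetime) -> int:
    """Convert a timezone-aware timestamp to milliseconds since the Unix epoch."""
    if ts.tzinfo is None:
        raise ValueError("timestamp must be timezone-aware")
    # Floor division of two timedeltas yields an exact integer.
    return (ts - EPOCH) // timedelta(milliseconds=1)


def from_epoch_millis(millis: int) -> datetime:
    """Convert milliseconds since the Unix epoch back to a UTC timestamp."""
    return EPOCH + timedelta(milliseconds=millis)
```

The round trip is lossless only at millisecond precision; any sub-millisecond part of the original timestamp is truncated.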





[GitHub] [iceberg] aokolnychyi commented on pull request #7105: Spec: Add partition stats spec

Posted by "aokolnychyi (via GitHub)" <gi...@apache.org>.
aokolnychyi commented on PR #7105:
URL: https://github.com/apache/iceberg/pull/7105#issuecomment-1470860750

   I have yet to run that test to check whether one file is sufficient. I also worry about the cost of incremental updates if we have to replace the entire file.




[GitHub] [iceberg] ajantha-bhat commented on a diff in pull request #7105: Spec: Add partition stats spec

Posted by "ajantha-bhat (via GitHub)" <gi...@apache.org>.
ajantha-bhat commented on code in PR #7105:
URL: https://github.com/apache/iceberg/pull/7105#discussion_r1136558884


##########
format/partition-statistics-spec.md:
##########
@@ -0,0 +1,43 @@
+---
+title: "Partition Statistics Spec"
+url: partition-statistics-spec
+toc: true
+disableSidebar: true
+---
+<!--
+ - Licensed to the Apache Software Foundation (ASF) under one or more
+ - contributor license agreements.  See the NOTICE file distributed with
+ - this work for additional information regarding copyright ownership.
+ - The ASF licenses this file to You under the Apache License, Version 2.0
+ - (the "License"); you may not use this file except in compliance with
+ - the License.  You may obtain a copy of the License at
+ -
+ -   http://www.apache.org/licenses/LICENSE-2.0
+ -
+ - Unless required by applicable law or agreed to in writing, software
+ - distributed under the License is distributed on an "AS IS" BASIS,
+ - WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ - See the License for the specific language governing permissions and
+ - limitations under the License.
+ -->
+
+# Partition Statistics file format
+
+This is a specification for partition statistics files. It is designed to store statistics information 
+for every partition value as a row in the **table default format** sorted based on the `partition`.
+
+The schema of the file is as follows:
+
+| Field Id | Field Name              | Field Type   | isOptional | Doc                                                                                                                         |
+|----------|-------------------------|--------------|------------|-----------------------------------------------------------------------------------------------------------------------------|
+| 1        | partition               | StructType   | false      | See [PartitionData](https://github.com/apache/iceberg/blob/master/core/src/main/java/org/apache/iceberg/PartitionData.java) |                                                                                  
+| 2        | spec_id                 | IntegerType  | false      | partition spec id                                                                                                           |
+| 3        | data_record_count       | LongType     | false      | Count of records in data files                                                                                              |                                               
+| 4        | data_file_count         | IntegerType  | false      | Count of data files                                                                                                         |
+| 5        | pos_delete_record_count | LongType     | false      | Count of records in position delete files                                                                                   |                                                          
+| 6        | pos_delete_file_count   | IntegerType  | false      | Count of position delete files                                                                                              |                                   
+| 7        | eq_delete_record_count  | LongType     | false      | Count of records in equality delete files                                                                                   |                                  
+| 8        | eq_delete_file_count    | IntegerType  | false      | Count of equality delete files                                                                                              |                                

Review Comment:
   Partition stats are themselves optional for the reader.
   I believe field ids 1 to 4 can be required and all others can be optional.





[GitHub] [iceberg] rdblue commented on a diff in pull request #7105: Spec: Add partition stats spec

Posted by "rdblue (via GitHub)" <gi...@apache.org>.
rdblue commented on code in PR #7105:
URL: https://github.com/apache/iceberg/pull/7105#discussion_r1209513382


##########
format/spec.md:
##########
@@ -702,6 +703,21 @@ Blob metadata is a struct with the following fields:
 | _optional_ | _optional_ | **`properties`** | `map<string, string>` | Additional properties associated with the statistic. Subset of Blob properties in the Puffin file. |
 
 
+#### Partition statistics
+
+Partition statistics files are the valid files based on [Partition statistics spec](../partition-statistics-spec). Partition statistics are informational. A reader can choose to
+ignore partition statistics information. Partition statistics support is not required to read the table correctly. A table can contain
+many partition statistics files associated with different table snapshots.
+
+Partition statistics files metadata within `partition-statistics` table metadata field is a struct with the following fields:
+
+| v1 | v2 | Field name                 | Type     | Description                                                                                           |
+|----|----|----------------------------|----------|-------------------------------------------------------------------------------------------------------|
+| _required_ | _required_ | **`snapshot-id`**          | `long`   | ID of the Iceberg table's snapshot the partition statistics file is associated with.                  |
+| _required_ | _required_ | **`statistics-file-path`** | `string` | Path of the partition statistics file. See [Partition statistics spec](../partition-statistics-spec). |
+| _required_ | _required_ | **`sequence-number`**      | `long`   | Sequence number of the Iceberg table's snapshot the partition statistics was computed from.           |

Review Comment:
   We always assign a new sequence number, even if we are compacting and setting the data sequence number for new files. So there's some nuance here.
   
   We could use something like "max data sequence number". That would give us an idea when these are equivalent. If a given snapshot's max data sequence number is equal to a known file, the stats should be identical.
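
A rough Python sketch of the idea above, under stated assumptions: each registered stats file records the maximum data sequence number it was computed from, and a snapshot whose live files have the same maximum can reuse that file. `PartitionStatsFile`, `max_data_sequence_number`, and `find_reusable_stats` are hypothetical names, and the equivalence itself is the suggestion being discussed here, not an established guarantee.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class PartitionStatsFile:  # hypothetical mirror of the metadata struct in the diff
    snapshot_id: int
    statistics_file_path: str
    max_data_sequence_number: int


def max_data_sequence_number(data_sequence_numbers: Iterable[int]) -> int:
    """Largest data sequence number among a snapshot's live files (0 if none)."""
    return max(data_sequence_numbers, default=0)


def find_reusable_stats(
    snapshot_max_seq: int,
    registered: Iterable[PartitionStatsFile],
) -> Optional[PartitionStatsFile]:
    """Return a registered stats file computed at the same max data sequence number."""
    for stats_file in registered:
        if stats_file.max_data_sequence_number == snapshot_max_seq:
            return stats_file
    return None
```

For example, a compaction commit gets a new snapshot sequence number, but the rewritten files keep their earlier data sequence numbers, so the maximum is unchanged and the lookup finds the existing file. An append raises the maximum, and the lookup returns `None`.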





[GitHub] [iceberg] rdblue commented on a diff in pull request #7105: Spec: Add partition stats spec

Posted by "rdblue (via GitHub)" <gi...@apache.org>.
rdblue commented on code in PR #7105:
URL: https://github.com/apache/iceberg/pull/7105#discussion_r1209508531


##########
format/partition-statistics-spec.md:
##########
@@ -0,0 +1,43 @@
+---
+title: "Partition Statistics Spec"
+url: partition-statistics-spec
+toc: true
+disableSidebar: true
+---
+<!--
+ - Licensed to the Apache Software Foundation (ASF) under one or more
+ - contributor license agreements.  See the NOTICE file distributed with
+ - this work for additional information regarding copyright ownership.
+ - The ASF licenses this file to You under the Apache License, Version 2.0
+ - (the "License"); you may not use this file except in compliance with
+ - the License.  You may obtain a copy of the License at
+ -
+ -   http://www.apache.org/licenses/LICENSE-2.0
+ -
+ - Unless required by applicable law or agreed to in writing, software
+ - distributed under the License is distributed on an "AS IS" BASIS,
+ - WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ - See the License for the specific language governing permissions and
+ - limitations under the License.
+ -->
+
+# Partition Statistics file format
+
+This is a specification for partition statistics files. It is designed to store statistics information 
+for every partition value as a row in the **table default format** sorted based on the `partition`.
+
+The schema of the file is as follows:
+
+| Field Id | Field Name                   | Field Type   | isOptional | Doc                                                                                                                         |

Review Comment:
   Please follow the structure of tables used by the Iceberg spec.





[GitHub] [iceberg] ajantha-bhat commented on a diff in pull request #7105: Spec: Add partition stats spec

Posted by "ajantha-bhat (via GitHub)" <gi...@apache.org>.
ajantha-bhat commented on code in PR #7105:
URL: https://github.com/apache/iceberg/pull/7105#discussion_r1209642134


##########
format/partition-statistics-spec.md:
##########
@@ -0,0 +1,43 @@
+---
+title: "Partition Statistics Spec"
+url: partition-statistics-spec
+toc: true
+disableSidebar: true
+---
+<!--
+ - Licensed to the Apache Software Foundation (ASF) under one or more
+ - contributor license agreements.  See the NOTICE file distributed with
+ - this work for additional information regarding copyright ownership.
+ - The ASF licenses this file to You under the Apache License, Version 2.0
+ - (the "License"); you may not use this file except in compliance with
+ - the License.  You may obtain a copy of the License at
+ -
+ -   http://www.apache.org/licenses/LICENSE-2.0
+ -
+ - Unless required by applicable law or agreed to in writing, software
+ - distributed under the License is distributed on an "AS IS" BASIS,
+ - WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ - See the License for the specific language governing permissions and
+ - limitations under the License.
+ -->
+
+# Partition Statistics file format
+
+This is a specification for partition statistics files. It is designed to store statistics information 
+for every partition value as a row in the **table default format** sorted based on the `partition`.
+
+The schema of the file is as follows:
+
+| Field Id | Field Name                   | Field Type   | isOptional | Doc                                                                                                                         |
+|----------|------------------------------|--------------|------------|-----------------------------------------------------------------------------------------------------------------------------|
+| 1        | partition                    | StructType   | false      | See [PartitionData](https://github.com/apache/iceberg/blob/master/core/src/main/java/org/apache/iceberg/PartitionData.java) |                                                                                  
+| 2        | spec_id                      | IntegerType  | false      | Partition spec id                                                                                                           |
+| 3        | data_record_count            | LongType     | false      | Count of records in data files                                                                                              |                                               

Review Comment:
   Yeah, we can have it. I will add it.





[GitHub] [iceberg] zhongyujiang commented on a diff in pull request #7105: Spec: Add partition stats spec

Posted by "zhongyujiang (via GitHub)" <gi...@apache.org>.
zhongyujiang commented on code in PR #7105:
URL: https://github.com/apache/iceberg/pull/7105#discussion_r1219331736


##########
format/spec.md:
##########
@@ -702,6 +703,44 @@ Blob metadata is a struct with the following fields:
 | _optional_ | _optional_ | **`properties`** | `map<string, string>` | Additional properties associated with the statistic. Subset of Blob properties in the Puffin file. |
 
 
+#### Partition statistics
+
+Partition statistics files are the valid files based on [Partition Statistics file format](#partition-statistics-file-format). Partition statistics are informational. A reader can choose to
+ignore partition statistics information. Partition statistics support is not required to read the table correctly. A table can contain
+many partition statistics files associated with different table snapshots.
+A writer can optionally write the partition statistics file during each write operation. If the statistics file is written for the specific snapshot, 
+It must be registered in the table metadata file to be considered as a valid statistics file for the reader.
+
+Partition statistics files metadata within `partition-statistics` table metadata field is a struct with the following fields:
+
+| v1 | v2 | Field name | Type | Description |
+|----|----|------------|------|-------------|
+| _required_ | _required_ | **`snapshot-id`** | `long` | ID of the Iceberg table's snapshot the partition statistics file is associated with. |
+| _required_ | _required_ | **`statistics-file-path`** | `string` | Path of the partition statistics file. See [Partition Statistics file format](#partition-statistics-file-format). |
+| _required_ | _required_ | **`max-data-sequence-number`** | `long` | Maximum data sequence number of the Iceberg table's snapshot the partition statistics was computed from. |
+
+#### Partition Statistics file format
+
+This is a specification for partition statistics files. It is designed to store statistics information
+for every partition value as a row in the **table default format** sorted based on the first partition column from `partition`.
+
+Partition statistics file store the statistics as a struct with the following fields:
+
+| v1 | v2 | Field id, name | Type | Description |
+|----|----|----------------|------|-------------|
+| _required_ | _required_ | **`1 partition`** | `struct<..>` | See [PartitionData](https://github.com/apache/iceberg/blob/master/core/src/main/java/org/apache/iceberg/PartitionData.java) |

Review Comment:
   The schema of `PartitionData` is based on a specific partition spec, so I think this should declare that a partition stats file can only store partition stats for a single partition spec, right? Just like the manifest [spec](https://iceberg.apache.org/spec/#manifests). Then we don't need a `spec_id` field; we only need to store it in the file's metadata instead.





[GitHub] [iceberg] ajantha-bhat commented on a diff in pull request #7105: Docs: Add partition stats spec

Posted by "ajantha-bhat (via GitHub)" <gi...@apache.org>.
ajantha-bhat commented on code in PR #7105:
URL: https://github.com/apache/iceberg/pull/7105#discussion_r1135892887


##########
format/partition-statistics-spec.md:
##########
@@ -0,0 +1,43 @@
+---
+title: "Partition Statistics Spec"
+url: partition-statistics-spec
+toc: true
+disableSidebar: true
+---
+<!--
+ - Licensed to the Apache Software Foundation (ASF) under one or more
+ - contributor license agreements.  See the NOTICE file distributed with
+ - this work for additional information regarding copyright ownership.
+ - The ASF licenses this file to You under the Apache License, Version 2.0
+ - (the "License"); you may not use this file except in compliance with
+ - the License.  You may obtain a copy of the License at
+ -
+ -   http://www.apache.org/licenses/LICENSE-2.0
+ -
+ - Unless required by applicable law or agreed to in writing, software
+ - distributed under the License is distributed on an "AS IS" BASIS,
+ - WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ - See the License for the specific language governing permissions and
+ - limitations under the License.
+ -->
+
+# Partition Statistics file format
+
+This is a specification for partition statistics files. It is designed to store statistics information 
+for every partition value as a row in the **table default format** sorted based on the `partition`.
+
+The schema of the file is as follows:
+
+| Field Id | Field Name              | Field Type   | isOptional | Doc                                                                                                                         |
+|----------|-------------------------|--------------|------------|-----------------------------------------------------------------------------------------------------------------------------|
+| 1        | partition               | StructType   | false      | See [PartitionData](https://github.com/apache/iceberg/blob/master/core/src/main/java/org/apache/iceberg/PartitionData.java) |                                                                                  
+| 2        | spec_id                 | IntegerType  | false      | partition spec id                                                                                                           |
+| 3        | data_record_count       | LongType     | false      | Count of records in data files                                                                                              |                                               
+| 4        | data_file_count         | IntegerType  | false      | Count of data files                                                                                                         |
+| 5        | pos_delete_record_count | LongType     | false      | Count of records in position delete files                                                                                   |                                                          
+| 6        | pos_delete_file_count   | IntegerType  | false      | Count of position delete files                                                                                              |                                   
+| 7        | eq_delete_record_count  | LongType     | false      | Count of records in equality delete files                                                                                   |                                  
+| 8        | eq_delete_file_count    | IntegerType  | false      | Count of equality delete files                                                                                              |                                

Review Comment:
   Ah. Good catch. I will update it. 





[GitHub] [iceberg] jackye1995 commented on pull request #7105: Spec: Add partition stats spec

Posted by "jackye1995 (via GitHub)" <gi...@apache.org>.
jackye1995 commented on PR #7105:
URL: https://github.com/apache/iceberg/pull/7105#issuecomment-1482286193

   Sorry, I totally overlooked that. I would also -1 using a specific external dependency like RocksDB or HBase; that was probably why I quickly skipped those options.
   
   But I feel the semantics required for partition stats just do not fit a file storage system. As you said, it ends up having to choose between CoW and MoR, which seems like too much complexity just to manage some additional stats.
   
   I think we can start from a file storage (FileIO) based solution, but the spec should be at a higher level so that it could be backed by more efficient solutions.
   
   I guess the same argument applies to things like the manifest list. Today, rolling up the manifest list is a bottleneck for write operations, and a design backed by a key-value store could solve that bottleneck. Maybe we should think about that and try to solve these cases together? Just like we have `FileIO`, which works really well with object storage semantics, we could have something like `VersionedListStore` that works well with any mutable but versioned list.
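
To show roughly what such an abstraction might look like, here is a small in-memory Python sketch. Only the name `VersionedListStore` comes from the comment above; the methods (`current_version`, `read`, `commit`) and the optimistic base-version check are assumptions made for illustration, and nothing like this exists in Iceberg today.

```python
from typing import Dict, Generic, Iterable, List, TypeVar

T = TypeVar("T")


class InMemoryVersionedListStore(Generic[T]):
    """Toy versioned list: every commit creates a new immutable version."""

    def __init__(self) -> None:
        self._versions: Dict[int, List[T]] = {0: []}
        self._current = 0

    def current_version(self) -> int:
        return self._current

    def read(self, version: int) -> List[T]:
        # Return a copy so callers cannot mutate a stored version.
        return list(self._versions[version])

    def commit(
        self,
        base_version: int,
        additions: Iterable[T],
        removals: Iterable[T] = (),
    ) -> int:
        # Optimistic concurrency: reject commits based on a stale version.
        if base_version != self._current:
            raise ValueError("stale base version")
        removed = set(removals)  # entries must be hashable
        kept = [x for x in self._versions[base_version] if x not in removed]
        self._current += 1
        self._versions[self._current] = kept + list(additions)
        return self._current
```

This toy copies the whole list on every commit, which is exactly the cost a real key-value backed store would aim to avoid; it only shows the contract, not an efficient backing.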




[GitHub] [iceberg] ajantha-bhat commented on a diff in pull request #7105: Spec: Add partition stats spec

Posted by "ajantha-bhat (via GitHub)" <gi...@apache.org>.
ajantha-bhat commented on code in PR #7105:
URL: https://github.com/apache/iceberg/pull/7105#discussion_r1317409889


##########
format/spec.md:
##########
@@ -702,6 +703,41 @@ Blob metadata is a struct with the following fields:
 | _optional_ | _optional_ | **`properties`** | `map<string, string>` | Additional properties associated with the statistic. Subset of Blob properties in the Puffin file. |
 
 
+#### Partition statistics
+
+Partition statistics files are based on [Partition Statistics file spec](#partition-statistics-file). Partition statistics are informational. A reader can choose to
+ignore partition statistics information. Partition statistics support is not required to read the table correctly.
+Each table snapshot may be associated with at most one partition statistic file and the table can contain many partition statistics files associated with different table snapshots.
+A writer can optionally write the partition statistics file during each write operation. If the statistics file is written for the specific snapshot,
+it must be accurate and must be registered in the table metadata file to be considered as a valid statistics file for the reader.
+
+Partition statistics files metadata within `partition-statistics` table metadata field is a struct with the following fields:
+
+| v1 | v2 | Field name | Type | Description |
+|----|----|------------|------|-------------|
+| _required_ | _required_ | **`snapshot-id`** | `long` | ID of the Iceberg table's snapshot the partition statistics file is associated with. |
+| _required_ | _required_ | **`statistics-file-path`** | `string` | Path of the partition statistics file. See [Partition Statistics file](#partition-statistics-file). |
+| _required_ | _required_ | **`max-data-sequence-number`** | `long` | Maximum data sequence number of the Iceberg table's snapshot the partition statistics was computed from. |
+
+#### Partition Statistics file
+
+Statistics information for every partition tuple is stored as a row in the **table default format**.
+These rows are sorted (in ascending manner with NULL FIRST) based on all partition columns from `partition` in the same order
+to optimize filtering rows while scanning.
+Each unique partition tuple must have exactly one corresponding row, ensuring that statistics for all partitions are present.
+
+A partition statistics file stores statistics as a struct with the following fields:

Review Comment:
   The first line already states that the file uses the **table default format**.





[GitHub] [iceberg] ajantha-bhat commented on a diff in pull request #7105: Spec: Add partition stats spec

Posted by "ajantha-bhat (via GitHub)" <gi...@apache.org>.
ajantha-bhat commented on code in PR #7105:
URL: https://github.com/apache/iceberg/pull/7105#discussion_r1317477618


##########
format/spec.md:
##########
@@ -702,6 +703,41 @@ Blob metadata is a struct with the following fields:
 | _optional_ | _optional_ | **`properties`** | `map<string, string>` | Additional properties associated with the statistic. Subset of Blob properties in the Puffin file. |
 
 
+#### Partition statistics
+
+Partition statistics files are based on [Partition Statistics file spec](#partition-statistics-file). Partition statistics are informational. A reader can choose to
+ignore partition statistics information. Partition statistics support is not required to read the table correctly.
+Each table snapshot may be associated with at most one partition statistic file and the table can contain many partition statistics files associated with different table snapshots.
+A writer can optionally write the partition statistics file during each write operation. If the statistics file is written for the specific snapshot,
+it must be accurate and must be registered in the table metadata file to be considered as a valid statistics file for the reader.
+
+Partition statistics files metadata within `partition-statistics` table metadata field is a struct with the following fields:
+
+| v1 | v2 | Field name | Type | Description |
+|----|----|------------|------|-------------|
+| _required_ | _required_ | **`snapshot-id`** | `long` | ID of the Iceberg table's snapshot the partition statistics file is associated with. |
+| _required_ | _required_ | **`statistics-file-path`** | `string` | Path of the partition statistics file. See [Partition Statistics file](#partition-statistics-file). |
+| _required_ | _required_ | **`max-data-sequence-number`** | `long` | Maximum data sequence number of the Iceberg table's snapshot the partition statistics was computed from. |
+
+#### Partition Statistics file
+
+Statistics information for every partition tuple is stored as a row in the **table default format**.
+These rows are sorted (in ascending manner with NULL FIRST) based on all partition columns from `partition` in the same order
+to optimize filtering rows while scanning.
+Each unique partition tuple must have exactly one corresponding row, ensuring that statistics for all partitions are present.
+
+A partition statistics file stores statistics as a struct with the following fields:
+
+| v1 | v2 | Field id, name | Type | Description |
+|----|----|----------------|------|-------------|
+| _required_ | _required_ | **`1 partition`** | `struct<..>` | Partition data tuple, schema based on the partition spec output using partition field ids for the struct field ids |

Review Comment:
   It is the full unified spec, computed from `Partitioning#partitionType`:
   
   https://github.com/apache/iceberg/blob/2b189759a902cdd8ad433d0c1643376b78ec867c/core/src/main/java/org/apache/iceberg/Partitioning.java#L242
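
As a rough illustration of what "unified" means here, the sketch below unions the partition fields of several specs by partition field id. It is a simplified model, not the actual `Partitioning#partitionType` code: the dict layout, the sample field ids, and the choice to order the result by field id are assumptions of this sketch.

```python
def unified_partition_fields(specs):
    """Union the partition fields of all specs, keyed by partition field id."""
    unified = {}
    for spec in specs:
        for field_id, name, type_name in spec["fields"]:
            # Keep the first definition seen for each field id.
            unified.setdefault(field_id, (name, type_name))
    # Order by field id so the resulting struct layout is deterministic.
    return [(field_id, *unified[field_id]) for field_id in sorted(unified)]


# Hypothetical table whose partitioning evolved from (a) to (a, b).
specs = [
    {"spec-id": 0, "fields": [(1000, "a", "int")]},
    {"spec-id": 1, "fields": [(1000, "a", "int"), (1001, "b", "string")]},
]
unified = unified_partition_fields(specs)
```

With these sample specs the unified struct has both `a` and `b`, even though rows written under spec 0 only carry a value for `a`.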





[GitHub] [iceberg] ajantha-bhat commented on a diff in pull request #7105: Spec: Add partition stats spec

Posted by "ajantha-bhat (via GitHub)" <gi...@apache.org>.
ajantha-bhat commented on code in PR #7105:
URL: https://github.com/apache/iceberg/pull/7105#discussion_r1317479283


##########
format/spec.md:
##########
@@ -702,6 +703,41 @@ Blob metadata is a struct with the following fields:
 | _optional_ | _optional_ | **`properties`** | `map<string, string>` | Additional properties associated with the statistic. Subset of Blob properties in the Puffin file. |
 
 
+#### Partition statistics
+
+Partition statistics files are based on [Partition Statistics file spec](#partition-statistics-file). Partition statistics are informational. A reader can choose to
+ignore partition statistics information. Partition statistics support is not required to read the table correctly.
+Each table snapshot may be associated with at most one partition statistic file and the table can contain many partition statistics files associated with different table snapshots.
+A writer can optionally write the partition statistics file during each write operation. If the statistics file is written for the specific snapshot,
+it must be accurate and must be registered in the table metadata file to be considered as a valid statistics file for the reader.
+
+Partition statistics files metadata within `partition-statistics` table metadata field is a struct with the following fields:
+
+| v1 | v2 | Field name | Type | Description |
+|----|----|------------|------|-------------|
+| _required_ | _required_ | **`snapshot-id`** | `long` | ID of the Iceberg table's snapshot the partition statistics file is associated with. |
+| _required_ | _required_ | **`statistics-file-path`** | `string` | Path of the partition statistics file. See [Partition Statistics file](#partition-statistics-file). |
+| _required_ | _required_ | **`max-data-sequence-number`** | `long` | Maximum data sequence number of the Iceberg table's snapshot the partition statistics was computed from. |
+
+#### Partition Statistics file
+
+Statistics information for every partition tuple is stored as a row in the **table default format**.
+These rows are sorted (in ascending manner with NULL FIRST) based on all partition columns from `partition` in the same order

Review Comment:
   (A,B)
   
   It is the full unified spec, computed from `Partitioning#partitionType`. We coerce the partitions from the previous partition stats file while writing the stats for the current snapshot, to support smooth partition evolution.
   
   https://github.com/apache/iceberg/blob/2b189759a902cdd8ad433d0c1643376b78ec867c/core/src/main/java/org/apache/iceberg/Partitioning.java#L242
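
A small sketch of the coercion and the ascending, NULL FIRST ordering described in the quoted spec text. The field ids, the dict-based partition representation, and the helper names are hypothetical; the real implementation works on Iceberg struct types.

```python
def coerce(partition, unified_ids):
    """Project a partition (field id -> value) onto the unified field ids, filling gaps with None."""
    return tuple(partition.get(field_id) for field_id in unified_ids)


def nulls_first_key(row):
    # (False, ...) sorts before (True, value), so None comes first in ascending order.
    return tuple((value is not None, value if value is not None else 0) for value in row)


# Unified partition type (a, b); the first two partitions were written when only `a` existed.
unified_ids = [1000, 1001]
partitions = [
    {1000: 2},
    {1000: 1},
    {1000: 1, 1001: "x"},
    {1000: None, 1001: "y"},
]
rows = sorted((coerce(p, unified_ids) for p in partitions), key=nulls_first_key)
```

Partitions written before `b` existed end up with `None` in that position, and they sort ahead of rows with the same `a` and a non-null `b`.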





[GitHub] [iceberg] flyrain commented on a diff in pull request #7105: Spec: Add partition stats spec

Posted by "flyrain (via GitHub)" <gi...@apache.org>.
flyrain commented on code in PR #7105:
URL: https://github.com/apache/iceberg/pull/7105#discussion_r1317695173


##########
format/partition-statistics-spec.md:
##########
@@ -0,0 +1,43 @@
+---
+title: "Partition Statistics Spec"
+url: partition-statistics-spec
+toc: true
+disableSidebar: true
+---
+<!--
+ - Licensed to the Apache Software Foundation (ASF) under one or more
+ - contributor license agreements.  See the NOTICE file distributed with
+ - this work for additional information regarding copyright ownership.
+ - The ASF licenses this file to You under the Apache License, Version 2.0
+ - (the "License"); you may not use this file except in compliance with
+ - the License.  You may obtain a copy of the License at
+ -
+ -   http://www.apache.org/licenses/LICENSE-2.0
+ -
+ - Unless required by applicable law or agreed to in writing, software
+ - distributed under the License is distributed on an "AS IS" BASIS,
+ - WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ - See the License for the specific language governing permissions and
+ - limitations under the License.
+ -->
+
+# Partition Statistics file format
+
+This is a specification for partition statistics files. It is designed to store statistics information 
+for every partition value as a row in the **table default format** sorted based on the `partition`.
+
+The schema of the file is as follows:
+
+| Field Id | Field Name                   | Field Type   | isOptional | Doc                                                                                                                         |
+|----------|------------------------------|--------------|------------|-----------------------------------------------------------------------------------------------------------------------------|
+| 1        | partition                    | StructType   | false      | See [PartitionData](https://github.com/apache/iceberg/blob/master/core/src/main/java/org/apache/iceberg/PartitionData.java) |                                                                                  
+| 2        | spec_id                      | IntegerType  | false      | Partition spec id                                                                                                           |
+| 3        | data_record_count            | LongType     | false      | Count of records in data files                                                                                              |                                               
+| 4        | data_file_count              | IntegerType  | false      | Count of data files                                                                                                         |
+| 5        | position_delete_record_count | LongType     | true       | Count of records in position delete files                                                                                   |                                                          
+| 6        | position_delete_file_count   | IntegerType  | true       | Count of position delete files                                                                                              |                                   
+| 7        | equality_delete_record_count | LongType     | true       | Count of records in equality delete files                                                                                   |                                  
+| 8        | equality_delete_file_count   | IntegerType  | true       | Count of equality delete files                                                                                              |                                

Review Comment:
   > Writers that produce positional deletes would be able to produce this count directly during writes.
   
   This may not be true. For example, for a data file `df1`, we could have two positional delete files (`pdf1`, `pdf2`) applying to it. `pdf1` and `pdf2` may delete the same row in `df1`. The writer may not know if a row has been deleted previously. Not 100% sure though. Needs input from @szehon-ho @aokolnychyi.
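
The concern can be shown with a toy model, assuming each position delete file is just a list of `(data_file, row_position)` pairs (the file contents below are made up):

```python
# Each position delete file is modeled as a list of (data_file, row_position) pairs.
pdf1 = [("df1", 0), ("df1", 5)]
pdf2 = [("df1", 5), ("df1", 9)]  # position 5 is deleted a second time

# What a writer can add up from per-file record counts alone.
position_delete_record_count = len(pdf1) + len(pdf2)

# Rows actually removed from df1 once duplicate deletes are collapsed.
distinct_deleted_rows = len(set(pdf1) | set(pdf2))
```

Here the summed record count is 4 while only 3 distinct rows are deleted, so subtracting the delete record count from the data record count would undercount live rows unless the writer checks existing deletes.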





[GitHub] [iceberg] flyrain commented on a diff in pull request #7105: Spec: Add partition stats spec

Posted by "flyrain (via GitHub)" <gi...@apache.org>.
flyrain commented on code in PR #7105:
URL: https://github.com/apache/iceberg/pull/7105#discussion_r1316339821


##########
format/spec.md:
##########
@@ -702,6 +703,45 @@ Blob metadata is a struct with the following fields:
 | _optional_ | _optional_ | **`properties`** | `map<string, string>` | Additional properties associated with the statistic. Subset of Blob properties in the Puffin file. |
 
 
+#### Partition statistics
+
+Partition statistics files are based on [Partition Statistics file spec](#partition-statistics-file). Partition statistics are informational. A reader can choose to
+ignore partition statistics information. Partition statistics support is not required to read the table correctly. A table can contain

Review Comment:
   Can we be more specific by saying:
   ```
   Each table snapshot may have either a single partition statistic file or none at all.
   ```





[GitHub] [iceberg] ajantha-bhat commented on a diff in pull request #7105: Spec: Add partition stats spec

Posted by "ajantha-bhat (via GitHub)" <gi...@apache.org>.
ajantha-bhat commented on code in PR #7105:
URL: https://github.com/apache/iceberg/pull/7105#discussion_r1316798888


##########
format/partition-statistics-spec.md:
##########
@@ -0,0 +1,43 @@
+---
+title: "Partition Statistics Spec"
+url: partition-statistics-spec
+toc: true
+disableSidebar: true
+---
+<!--
+ - Licensed to the Apache Software Foundation (ASF) under one or more
+ - contributor license agreements.  See the NOTICE file distributed with
+ - this work for additional information regarding copyright ownership.
+ - The ASF licenses this file to You under the Apache License, Version 2.0
+ - (the "License"); you may not use this file except in compliance with
+ - the License.  You may obtain a copy of the License at
+ -
+ -   http://www.apache.org/licenses/LICENSE-2.0
+ -
+ - Unless required by applicable law or agreed to in writing, software
+ - distributed under the License is distributed on an "AS IS" BASIS,
+ - WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ - See the License for the specific language governing permissions and
+ - limitations under the License.
+ -->
+
+# Partition Statistics file format
+
+This is a specification for partition statistics files. It is designed to store statistics information 
+for every partition value as a row in the **table default format** sorted based on the `partition`.
+
+The schema of the file is as follows:
+
+| Field Id | Field Name                   | Field Type   | isOptional | Doc                                                                                                                         |
+|----------|------------------------------|--------------|------------|-----------------------------------------------------------------------------------------------------------------------------|
+| 1        | partition                    | StructType   | false      | See [PartitionData](https://github.com/apache/iceberg/blob/master/core/src/main/java/org/apache/iceberg/PartitionData.java) |                                                                                  
+| 2        | spec_id                      | IntegerType  | false      | Partition spec id                                                                                                           |
+| 3        | data_record_count            | LongType     | false      | Count of records in data files                                                                                              |                                               
+| 4        | data_file_count              | IntegerType  | false      | Count of data files                                                                                                         |
+| 5        | position_delete_record_count | LongType     | true       | Count of records in position delete files                                                                                   |                                                          
+| 6        | position_delete_file_count   | IntegerType  | true       | Count of position delete files                                                                                              |                                   
+| 7        | equality_delete_record_count | LongType     | true       | Count of records in equality delete files                                                                                   |                                  
+| 8        | equality_delete_file_count   | IntegerType  | true       | Count of equality delete files                                                                                              |                                

Review Comment:
   @sgcowell: If the record count is an actual count, should the file count include positional deletes too? The query planner needs to know the expected amount of IO (based on the file counts).





[GitHub] [iceberg] RussellSpitzer commented on a diff in pull request #7105: Spec: Add partition stats spec

Posted by "RussellSpitzer (via GitHub)" <gi...@apache.org>.
RussellSpitzer commented on code in PR #7105:
URL: https://github.com/apache/iceberg/pull/7105#discussion_r1317389637


##########
format/spec.md:
##########
@@ -702,6 +703,41 @@ Blob metadata is a struct with the following fields:
 | _optional_ | _optional_ | **`properties`** | `map<string, string>` | Additional properties associated with the statistic. Subset of Blob properties in the Puffin file. |
 
 
+#### Partition statistics
+
+Partition statistics files are based on [Partition Statistics file spec](#partition-statistics-file). Partition statistics are informational. A reader can choose to

Review Comment:
   I wonder if we can simplify this a bit. I think we end up being a bit repetitive here. Can we get just one sentence per point?
   
   1. Partition statistics are not required for reading or planning and readers may ignore them.
   2. There is at most one partition statistics file per snapshot.
   3. Any snapshot listed in the `snapshots` of the metadata may have a partition statistics file.
   4. (?) Partition statistics may not be associated with a snapshot that doesn't exist in the metadata. // Do we want to make this a requirement?
   
   Then we have some optional implementation guidance:
   
   1. Partition statistics can be added to the metadata during write operations.
   
   That's just my feedback; I am leaning towards a more straightforward description here.
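
Points 2 and 4 above are mechanical enough to check. The sketch below validates them against table metadata modeled as a plain dict; the helper name and sample values are hypothetical, and point 4 is treated as a check only because the comment proposes it, not because the spec requires it.

```python
def validate_partition_statistics(metadata):
    """Return a list of problems found in the `partition-statistics` entries."""
    snapshot_ids = {s["snapshot-id"] for s in metadata.get("snapshots", [])}
    seen = set()
    problems = []
    for entry in metadata.get("partition-statistics", []):
        snapshot_id = entry["snapshot-id"]
        # Point 2: at most one partition statistics file per snapshot.
        if snapshot_id in seen:
            problems.append(f"snapshot {snapshot_id} has more than one partition statistics file")
        seen.add(snapshot_id)
        # Point 4 (proposed, not settled): the snapshot must exist in the metadata.
        if snapshot_id not in snapshot_ids:
            problems.append(f"snapshot {snapshot_id} is not listed in snapshots")
    return problems


metadata = {
    "snapshots": [{"snapshot-id": 1}, {"snapshot-id": 2}],
    "partition-statistics": [
        {"snapshot-id": 1, "statistics-path": "s3://bucket/stats-1.parquet"},
        {"snapshot-id": 1, "statistics-path": "s3://bucket/stats-1b.parquet"},
        {"snapshot-id": 3, "statistics-path": "s3://bucket/stats-3.parquet"},
    ],
}
problems = validate_partition_statistics(metadata)
```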
   
   





[GitHub] [iceberg] ajantha-bhat commented on a diff in pull request #7105: Spec: Add partition stats spec

Posted by "ajantha-bhat (via GitHub)" <gi...@apache.org>.
ajantha-bhat commented on code in PR #7105:
URL: https://github.com/apache/iceberg/pull/7105#discussion_r1332569127


##########
format/spec.md:
##########
@@ -702,6 +703,58 @@ Blob metadata is a struct with the following fields:
 | _optional_ | _optional_ | **`properties`** | `map<string, string>` | Additional properties associated with the statistic. Subset of Blob properties in the Puffin file. |
 
 
+#### Partition statistics
+
+Partition statistics files are based on [Partition Statistics file spec](#partition-statistics-file). 
+Partition statistics are not required for reading or planning and readers may ignore them.
+Each table snapshot may be associated with at most one partition statistic file.
+A writer can optionally write the partition statistics file during each write operation, and 
+it must be registered in the table metadata file to be considered as a valid statistics file for the reader.
+
+`partition-statistics` field of table metadata is an optional list of struct with the following fields:
+
+| v1 | v2 | Field name | Type | Description |
+|----|----|------------|------|-------------|
+| _required_ | _required_ | **`snapshot-id`** | `long` | ID of the Iceberg table's snapshot the partition statistics file is associated with. |
+| _required_ | _required_ | **`statistics-file-path`** | `string` | Path of the partition statistics file. See [Partition Statistics file](#partition-statistics-file). |
+| _required_ | _required_ | **`file-size-in-bytes`** | `long` | Size of the partition statistics file. |
+
+#### Partition Statistics file
+
+Statistics information for each unique partition tuple is stored as a row in the default data file format of the table (for example, Parquet or ORC).
+These rows must be sorted (in ascending manner with NULL FIRST) by `partition` field to optimize filtering rows while scanning.
+
+The schema of the partition statistics file is as follows:
+
+| v1 | v2 | Field id, name | Type | Description |
+|----|----|----------------|------|-------------|
+| _required_ | _required_ | **`1 partition`** | `struct<..>` | Partition data tuple, schema based on the unified partition type considering all specs in a table |
+| _required_ | _required_ | **`2 spec_id`** | `int` | Partition spec id |
+| _required_ | _required_ | **`3 data_record_count`** | `long` | Count of records in data files |
+| _required_ | _required_ | **`4 data_file_count`** | `int` | Count of data files |

Review Comment:
   Thanks @flyrain, I noticed that too initially. But after digging a bit more, the internet says both are valid.
   
   So I decided we can go with either one, but we have to standardise it. We may need to check what Spark, Hive, and other products follow as the standard.





Re: [PR] Spec: Add partition stats spec [iceberg]

Posted by "ajantha-bhat (via GitHub)" <gi...@apache.org>.
ajantha-bhat commented on PR #7105:
URL: https://github.com/apache/iceberg/pull/7105#issuecomment-1748239349

   @rdblue: Can you also please take a look?  




Re: [PR] Spec: Add partition stats spec [iceberg]

Posted by "ajantha-bhat (via GitHub)" <gi...@apache.org>.
ajantha-bhat commented on PR #7105:
URL: https://github.com/apache/iceberg/pull/7105#issuecomment-1755391052

   @aokolnychyi: Thanks for the detailed review and also going through the POC PRs. 
   I have addressed all the comments. Please have a look again.  




Re: [PR] Spec: Add partition stats spec [iceberg]

Posted by "aokolnychyi (via GitHub)" <gi...@apache.org>.
aokolnychyi commented on PR #7105:
URL: https://github.com/apache/iceberg/pull/7105#issuecomment-1753540465

   Getting to this today!




Re: [PR] Spec: Add partition stats spec [iceberg]

Posted by "aokolnychyi (via GitHub)" <gi...@apache.org>.
aokolnychyi commented on code in PR #7105:
URL: https://github.com/apache/iceberg/pull/7105#discussion_r1378874945


##########
format/spec.md:
##########
@@ -702,6 +703,58 @@ Blob metadata is a struct with the following fields:
 | _optional_ | _optional_ | **`properties`** | `map<string, string>` | Additional properties associated with the statistic. Subset of Blob properties in the Puffin file. |
 
 
+#### Partition Statistics
+
+Partition statistics files are based on [Partition statistics file spec](#partition-statistics-file). 

Review Comment:
   Are we using a capital letter for `Partition statistics file spec` on purpose?



##########
format/spec.md:
##########
@@ -702,6 +703,58 @@ Blob metadata is a struct with the following fields:
 | _optional_ | _optional_ | **`properties`** | `map<string, string>` | Additional properties associated with the statistic. Subset of Blob properties in the Puffin file. |
 
 
+#### Partition Statistics
+
+Partition statistics files are based on [Partition statistics file spec](#partition-statistics-file). 
+Partition statistics are not required for reading or planning and readers may ignore them.
+Each table snapshot may be associated with at most one partition statistic file.

Review Comment:
   Minor: `one partition statistic file` -> `one partition statistics file` to be consistent with other places?



##########
format/spec.md:
##########
@@ -702,6 +703,58 @@ Blob metadata is a struct with the following fields:
 | _optional_ | _optional_ | **`properties`** | `map<string, string>` | Additional properties associated with the statistic. Subset of Blob properties in the Puffin file. |
 
 
+#### Partition Statistics
+
+Partition statistics files are based on [Partition statistics file spec](#partition-statistics-file). 
+Partition statistics are not required for reading or planning and readers may ignore them.
+Each table snapshot may be associated with at most one partition statistic file.
+A writer can optionally write the partition statistics file during each write operation, or it can also be computed on demand.
+Partition statistics file must be registered in the table metadata file to be considered as a valid statistics file for the reader.
+
+`partition-statistics` field of table metadata is an optional list of struct with the following fields:
+
+| v1 | v2 | Field name | Type | Description |
+|----|----|------------|------|-------------|
+| _required_ | _required_ | **`snapshot-id`** | `long` | ID of the Iceberg table's snapshot the partition statistics file is associated with. |
+| _required_ | _required_ | **`statistics-path`** | `string` | Path of the partition statistics file. See [Partition statistics file](#partition-statistics-file). |
+| _required_ | _required_ | **`file-size-in-bytes`** | `long` | Size of the partition statistics file. |
+
+#### Partition Statistics File
+
+Statistics information for each unique partition tuple is stored as a row in any of the data file format of the table (for example, Parquet or ORC).
+These rows must be sorted (in ascending manner with NULL FIRST) by `partition` field to optimize filtering rows while scanning.
+
+The schema of the partition statistics file is as follows:
+
+| v1 | v2 | Field id, name | Type | Description |
+|----|----|----------------|------|-------------|
+| _required_ | _required_ | **`1 partition`** | `struct<..>` | Partition data tuple, schema based on the unified partition type considering all specs in a table |
+| _required_ | _required_ | **`2 spec_id`** | `int` | Partition spec id |
+| _required_ | _required_ | **`3 data_record_count`** | `long` | Count of records in data files |
+| _required_ | _required_ | **`4 data_file_count`** | `int` | Count of data files |
+| _required_ | _required_ | **`5 total_data_file_size_in_bytes`** | `long` | Total size of data files in bytes |
+| _optional_ | _optional_ | **`6 position_delete_record_count`** | `long` | Count of records in position delete files |
+| _optional_ | _optional_ | **`7 position_delete_file_count`** | `int` | Count of position delete files |
+| _optional_ | _optional_ | **`8 equality_delete_record_count`** | `long` | Count of records in equality delete files |
+| _optional_ | _optional_ | **`9 equality_delete_file_count`** | `int` | Count of equality delete files |
+| _optional_ | _optional_ | **`10 total_record_count`** | `long` | Accurate count of records in a partition after applying the delete files if any |
+| _optional_ | _optional_ | **`11 last_updated_at`** | `long` | Timestamp in milliseconds from the unix epoch when the partition was last updated |
+| _optional_ | _optional_ | **`12 last_updated_snapshot_id`** | `long` | ID of snapshot that last updated this partition |
+
+Note that partition data tuple's schema is based on the partition spec output using partition field ids for the struct field ids.
+The unified partition type is a struct containing all fields that have ever been a part of any spec in the table 
+and sorted by the field ids in ascending order.  
+In other words, the struct fields represent a union of all known partition fields sorted in ascending order by the field ids.
+
+For Example,

Review Comment:
   Minor: `For Example` -> `For example`?



##########
format/spec.md:
##########
@@ -702,6 +703,58 @@ Blob metadata is a struct with the following fields:
 | _optional_ | _optional_ | **`properties`** | `map<string, string>` | Additional properties associated with the statistic. Subset of Blob properties in the Puffin file. |
 
 
+#### Partition Statistics
+
+Partition statistics files are based on [Partition statistics file spec](#partition-statistics-file). 
+Partition statistics are not required for reading or planning and readers may ignore them.
+Each table snapshot may be associated with at most one partition statistic file.
+A writer can optionally write the partition statistics file during each write operation, or it can also be computed on demand.
+Partition statistics file must be registered in the table metadata file to be considered as a valid statistics file for the reader.
+
+`partition-statistics` field of table metadata is an optional list of struct with the following fields:

Review Comment:
   Minor: `an optional list of struct` -> `an optional list of structs`?



##########
format/spec.md:
##########
@@ -702,6 +703,58 @@ Blob metadata is a struct with the following fields:
 | _optional_ | _optional_ | **`properties`** | `map<string, string>` | Additional properties associated with the statistic. Subset of Blob properties in the Puffin file. |
 
 
+#### Partition Statistics
+
+Partition statistics files are based on [Partition statistics file spec](#partition-statistics-file). 
+Partition statistics are not required for reading or planning and readers may ignore them.
+Each table snapshot may be associated with at most one partition statistic file.
+A writer can optionally write the partition statistics file during each write operation, or it can also be computed on demand.
+Partition statistics file must be registered in the table metadata file to be considered as a valid statistics file for the reader.
+
+`partition-statistics` field of table metadata is an optional list of struct with the following fields:
+
+| v1 | v2 | Field name | Type | Description |
+|----|----|------------|------|-------------|
+| _required_ | _required_ | **`snapshot-id`** | `long` | ID of the Iceberg table's snapshot the partition statistics file is associated with. |
+| _required_ | _required_ | **`statistics-path`** | `string` | Path of the partition statistics file. See [Partition statistics file](#partition-statistics-file). |
+| _required_ | _required_ | **`file-size-in-bytes`** | `long` | Size of the partition statistics file. |
+
+#### Partition Statistics File
+
+Statistics information for each unique partition tuple is stored as a row in any of the data file format of the table (for example, Parquet or ORC).
+These rows must be sorted (in ascending manner with NULL FIRST) by `partition` field to optimize filtering rows while scanning.
+
+The schema of the partition statistics file is as follows:
+
+| v1 | v2 | Field id, name | Type | Description |
+|----|----|----------------|------|-------------|
+| _required_ | _required_ | **`1 partition`** | `struct<..>` | Partition data tuple, schema based on the unified partition type considering all specs in a table |
+| _required_ | _required_ | **`2 spec_id`** | `int` | Partition spec id |
+| _required_ | _required_ | **`3 data_record_count`** | `long` | Count of records in data files |
+| _required_ | _required_ | **`4 data_file_count`** | `int` | Count of data files |
+| _required_ | _required_ | **`5 total_data_file_size_in_bytes`** | `long` | Total size of data files in bytes |
+| _optional_ | _optional_ | **`6 position_delete_record_count`** | `long` | Count of records in position delete files |
+| _optional_ | _optional_ | **`7 position_delete_file_count`** | `int` | Count of position delete files |
+| _optional_ | _optional_ | **`8 equality_delete_record_count`** | `long` | Count of records in equality delete files |
+| _optional_ | _optional_ | **`9 equality_delete_file_count`** | `int` | Count of equality delete files |
+| _optional_ | _optional_ | **`10 total_record_count`** | `long` | Accurate count of records in a partition after applying the delete files if any |
+| _optional_ | _optional_ | **`11 last_updated_at`** | `long` | Timestamp in milliseconds from the unix epoch when the partition was last updated |
+| _optional_ | _optional_ | **`12 last_updated_snapshot_id`** | `long` | ID of snapshot that last updated this partition |
+
+Note that partition data tuple's schema is based on the partition spec output using partition field ids for the struct field ids.
+The unified partition type is a struct containing all fields that have ever been a part of any spec in the table 
+and sorted by the field ids in ascending order.  
+In other words, the struct fields represent a union of all known partition fields sorted in ascending order by the field ids.
+
+For Example,
+1) spec#0 has two fields {field#1, field#2}
+and then the table has evolved into spec#1 which has three fields {field#1, field#2, field#3}.
+The unified partition type looks like Struct<field#1, field#2, field#3>

Review Comment:
   Minor: Missing `.` at the end?



##########
format/spec.md:
##########
@@ -702,6 +703,58 @@ Blob metadata is a struct with the following fields:
 | _optional_ | _optional_ | **`properties`** | `map<string, string>` | Additional properties associated with the statistic. Subset of Blob properties in the Puffin file. |
 
 
+#### Partition Statistics
+
+Partition statistics files are based on [Partition statistics file spec](#partition-statistics-file). 
+Partition statistics are not required for reading or planning and readers may ignore them.
+Each table snapshot may be associated with at most one partition statistic file.
+A writer can optionally write the partition statistics file during each write operation, or it can also be computed on demand.
+Partition statistics file must be registered in the table metadata file to be considered as a valid statistics file for the reader.
+
+`partition-statistics` field of table metadata is an optional list of struct with the following fields:
+
+| v1 | v2 | Field name | Type | Description |
+|----|----|------------|------|-------------|
+| _required_ | _required_ | **`snapshot-id`** | `long` | ID of the Iceberg table's snapshot the partition statistics file is associated with. |
+| _required_ | _required_ | **`statistics-path`** | `string` | Path of the partition statistics file. See [Partition statistics file](#partition-statistics-file). |
+| _required_ | _required_ | **`file-size-in-bytes`** | `long` | Size of the partition statistics file. |
+
+#### Partition Statistics File
+
+Statistics information for each unique partition tuple is stored as a row in any of the data file format of the table (for example, Parquet or ORC).
+These rows must be sorted (in ascending manner with NULL FIRST) by `partition` field to optimize filtering rows while scanning.
+
+The schema of the partition statistics file is as follows:
+
+| v1 | v2 | Field id, name | Type | Description |
+|----|----|----------------|------|-------------|
+| _required_ | _required_ | **`1 partition`** | `struct<..>` | Partition data tuple, schema based on the unified partition type considering all specs in a table |
+| _required_ | _required_ | **`2 spec_id`** | `int` | Partition spec id |
+| _required_ | _required_ | **`3 data_record_count`** | `long` | Count of records in data files |
+| _required_ | _required_ | **`4 data_file_count`** | `int` | Count of data files |
+| _required_ | _required_ | **`5 total_data_file_size_in_bytes`** | `long` | Total size of data files in bytes |
+| _optional_ | _optional_ | **`6 position_delete_record_count`** | `long` | Count of records in position delete files |
+| _optional_ | _optional_ | **`7 position_delete_file_count`** | `int` | Count of position delete files |
+| _optional_ | _optional_ | **`8 equality_delete_record_count`** | `long` | Count of records in equality delete files |
+| _optional_ | _optional_ | **`9 equality_delete_file_count`** | `int` | Count of equality delete files |
+| _optional_ | _optional_ | **`10 total_record_count`** | `long` | Accurate count of records in a partition after applying the delete files if any |
+| _optional_ | _optional_ | **`11 last_updated_at`** | `long` | Timestamp in milliseconds from the unix epoch when the partition was last updated |
+| _optional_ | _optional_ | **`12 last_updated_snapshot_id`** | `long` | ID of snapshot that last updated this partition |
+
+Note that partition data tuple's schema is based on the partition spec output using partition field ids for the struct field ids.
+The unified partition type is a struct containing all fields that have ever been a part of any spec in the table 
+and sorted by the field ids in ascending order.  
+In other words, the struct fields represent a union of all known partition fields sorted in ascending order by the field ids.
+
+For Example,
+1) spec#0 has two fields {field#1, field#2}
+and then the table has evolved into spec#1 which has three fields {field#1, field#2, field#3}.
+The unified partition type looks like Struct<field#1, field#2, field#3>
+
+2) spec#0 has two fields {field#1, field#2}
+and then the table has evolved into spec#1 which has just one field {field#2}.
+The unified partition type looks like Struct<field#1, field#2>

Review Comment:
   Minor: Missing `.` at the end?
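
A minimal sketch of how the unified partition type in the two examples above could be derived, assuming each spec is modeled as a mapping from partition field id to field name. The function name `unified_partition_type` and the dict-based spec model are illustrative assumptions, not Iceberg APIs.

```python
from typing import Dict, Iterable, List, Tuple


def unified_partition_type(specs: Iterable[Dict[int, str]]) -> List[Tuple[int, str]]:
    """Return the union of all partition fields across specs, sorted by field id.

    Each spec is a hypothetical {field_id: field_name} mapping. A field that was
    dropped in a later spec is still kept, because it was once part of a spec.
    """
    fields: Dict[int, str] = {}
    for spec in specs:
        for field_id, name in spec.items():
            # Keep the first name seen for a given field id.
            fields.setdefault(field_id, name)
    # Sorting (id, name) pairs orders them by field id in ascending order.
    return sorted(fields.items())


# Example 1: spec#0 = {field#1, field#2}, spec#1 = {field#1, field#2, field#3}
example_1 = unified_partition_type([
    {1: "field#1", 2: "field#2"},
    {1: "field#1", 2: "field#2", 3: "field#3"},
])

# Example 2: spec#0 = {field#1, field#2}, spec#1 = {field#2}
example_2 = unified_partition_type([
    {1: "field#1", 2: "field#2"},
    {2: "field#2"},
])
```

In the second case the dropped `field#1` still appears in the result, which matches the `Struct<field#1, field#2>` shown in the hunk.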



##########
format/spec.md:
##########
@@ -702,6 +703,58 @@ Blob metadata is a struct with the following fields:
 | _optional_ | _optional_ | **`properties`** | `map<string, string>` | Additional properties associated with the statistic. Subset of Blob properties in the Puffin file. |
 
 
+#### Partition Statistics
+
+Partition statistics files are based on [Partition statistics file spec](#partition-statistics-file). 
+Partition statistics are not required for reading or planning and readers may ignore them.
+Each table snapshot may be associated with at most one partition statistic file.
+A writer can optionally write the partition statistics file during each write operation, or it can also be computed on demand.
+Partition statistics file must be registered in the table metadata file to be considered as a valid statistics file for the reader.
+
+`partition-statistics` field of table metadata is an optional list of struct with the following fields:
+
+| v1 | v2 | Field name | Type | Description |
+|----|----|------------|------|-------------|
+| _required_ | _required_ | **`snapshot-id`** | `long` | ID of the Iceberg table's snapshot the partition statistics file is associated with. |
+| _required_ | _required_ | **`statistics-path`** | `string` | Path of the partition statistics file. See [Partition statistics file](#partition-statistics-file). |
+| _required_ | _required_ | **`file-size-in-bytes`** | `long` | Size of the partition statistics file. |
+
+#### Partition Statistics File
+
+Statistics information for each unique partition tuple is stored as a row in any of the data file format of the table (for example, Parquet or ORC).
+These rows must be sorted (in ascending manner with NULL FIRST) by `partition` field to optimize filtering rows while scanning.
+
+The schema of the partition statistics file is as follows:
+
+| v1 | v2 | Field id, name | Type | Description |
+|----|----|----------------|------|-------------|
+| _required_ | _required_ | **`1 partition`** | `struct<..>` | Partition data tuple, schema based on the unified partition type considering all specs in a table |
+| _required_ | _required_ | **`2 spec_id`** | `int` | Partition spec id |
+| _required_ | _required_ | **`3 data_record_count`** | `long` | Count of records in data files |
+| _required_ | _required_ | **`4 data_file_count`** | `int` | Count of data files |
+| _required_ | _required_ | **`5 total_data_file_size_in_bytes`** | `long` | Total size of data files in bytes |
+| _optional_ | _optional_ | **`6 position_delete_record_count`** | `long` | Count of records in position delete files |
+| _optional_ | _optional_ | **`7 position_delete_file_count`** | `int` | Count of position delete files |
+| _optional_ | _optional_ | **`8 equality_delete_record_count`** | `long` | Count of records in equality delete files |
+| _optional_ | _optional_ | **`9 equality_delete_file_count`** | `int` | Count of equality delete files |
+| _optional_ | _optional_ | **`10 total_record_count`** | `long` | Accurate count of records in a partition after applying the delete files if any |
+| _optional_ | _optional_ | **`11 last_updated_at`** | `long` | Timestamp in milliseconds from the unix epoch when the partition was last updated |
+| _optional_ | _optional_ | **`12 last_updated_snapshot_id`** | `long` | ID of snapshot that last updated this partition |
+
+Note that partition data tuple's schema is based on the partition spec output using partition field ids for the struct field ids.
+The unified partition type is a struct containing all fields that have ever been a part of any spec in the table 
+and sorted by the field ids in ascending order.  
+In other words, the struct fields represent a union of all known partition fields sorted in ascending order by the field ids.

Review Comment:
   Question: Is this new line intentional? Seems like it all belongs to the same paragraph.
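
A small sketch of the row ordering the hunk asks for (ascending by `partition`, nulls first), assuming a partition tuple is modeled as a Python tuple whose positions follow the unified partition type and hold mutually comparable values. The helper `partition_sort_key` and the dict-based row layout are illustrative assumptions, not part of the proposed spec.

```python
from typing import Any, Dict, List, Tuple


def partition_sort_key(partition: Tuple[Any, ...]) -> Tuple[Tuple[int, Any], ...]:
    """Build a sort key that orders None before any non-null value.

    Each position becomes (0, 0) for None or (1, value) otherwise, so the
    leading flag decides the order whenever exactly one side is null.
    """
    return tuple((0, 0) if value is None else (1, value) for value in partition)


# Hypothetical rows; only the fields needed for the ordering are shown.
rows: List[Dict[str, Any]] = [
    {"partition": ("b", 2), "spec_id": 1, "data_record_count": 10},
    {"partition": (None, 5), "spec_id": 0, "data_record_count": 3},
    {"partition": ("a", None), "spec_id": 1, "data_record_count": 7},
    {"partition": ("a", 1), "spec_id": 1, "data_record_count": 4},
]

sorted_rows = sorted(rows, key=lambda row: partition_sort_key(row["partition"]))
sorted_partitions = [row["partition"] for row in sorted_rows]
```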





Re: [PR] Spec: Add partition stats spec [iceberg]

Posted by "ajantha-bhat (via GitHub)" <gi...@apache.org>.
ajantha-bhat commented on PR #7105:
URL: https://github.com/apache/iceberg/pull/7105#issuecomment-1789103695

   @aokolnychyi: I have addressed the nits and pushed. PR is ready. 
   
   > I think the next step would be to create an action to generate these files on demand. Afterwards, there would be a lot more clarity on how to do this synchronously. Doing the action first would not require much changes to the core library, so we will get something working faster. I'll help reviewing.
   
   Totally agree. Synchronous writing can be the last step. 
   I will first break down the steps/PRs needed for on-demand generation in a doc or in the dev Slack channel. Let us continue there. Thanks.
    

