You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2022/11/24 17:27:28 UTC

[GitHub] [iceberg] ajantha-bhat opened a new pull request, #6267: Core: Update StatisticsFile interface in TableMetadata spec

ajantha-bhat opened a new pull request, #6267:
URL: https://github.com/apache/iceberg/pull/6267

   Note that this is a DRAFT PR, 
   I just wanted to use it for discussions. 
   Once we agree on the changes. I can optimize and add the test cases. 
   
   Background:
   1. We have a plan to reuse the statistics file between two snapshots in case of rewrite data files. 
   But there is no interface to get the current statistics file for the current snapshot. 
   [Spec](https://github.com/apache/iceberg/blob/master/format/spec.md#table-statistics) has a snapshot id in two places, one in `StatisticsFile` and another in its `blob metadata`. 
   To support the reuse of statistics files, we should have the referenced snapshot id in `StatisticsFile`, not the computed-from snapshot id. Hence, updated the spec. 
   
   Note that PR https://github.com/apache/iceberg/pull/6090 is stuck because of confusion around stats file reuse. 
   
   2. Added an interface to get the current statistics file for the current snapshot. This can return null when the stats are not written by the writer for the latest snapshot. 
   
   3. Added an interface to get the checkpoint-snapshot-id for the table. It can return a snapshot id for which the stats file was successfully written. It can return -1 if none of the snapshots has a stats file written. 
   
   Later when we introduce an async way of generating stats using ANALYZE TABLE or CALL procedure. 
   Stats generation can use inputs from these APIs to know whether to compute stats from the beginning or from the checkpoint.
   
   cc: @rdblue, @findepi  
    


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] ajantha-bhat commented on pull request #6267: Docs: Update spec about statistics file snapshot id

Posted by GitBox <gi...@apache.org>.
ajantha-bhat commented on PR #6267:
URL: https://github.com/apache/iceberg/pull/6267#issuecomment-1347529570

   @rdblue: Thanks for the review and suggestions. I have kept it as just the document (spec) update now.  


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] ajantha-bhat commented on a diff in pull request #6267: Core: Update StatisticsFile interface in TableMetadata spec

Posted by GitBox <gi...@apache.org>.
ajantha-bhat commented on code in PR #6267:
URL: https://github.com/apache/iceberg/pull/6267#discussion_r1032019350


##########
format/spec.md:
##########
@@ -684,7 +684,7 @@ Statistics files metadata within `statistics` table metadata field is a struct w
 
 | v1 | v2 | Field name | Type | Description |
 |----|----|------------|------|-------------|
-| _required_ | _required_ | **`snapshot-id`** | `string` | ID of the Iceberg table's snapshot the statistics were computed from. |
+| _required_ | _required_ | **`snapshot-id`** | `string` | ID of the Iceberg table's snapshot that is using this statistics file. |

Review Comment:
   below line number 699, has a `snapshot-id` required field in `Blob metadata` which stores the snapshot-id where the stats are actually computed from. 
   
   So, this field in `StatisticsFile` can be the referred snapshot id. 
   So, Incase of rewrite data files, there can be two `StatisticsFile` with two different snapshot id pointing to the same stats file location. This way the expire snapshot can know that the stats file is still referenced. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] ajantha-bhat commented on a diff in pull request #6267: Core: Update StatisticsFile interface in TableMetadata spec

Posted by GitBox <gi...@apache.org>.
ajantha-bhat commented on code in PR #6267:
URL: https://github.com/apache/iceberg/pull/6267#discussion_r1032599124


##########
format/spec.md:
##########
@@ -684,7 +684,7 @@ Statistics files metadata within `statistics` table metadata field is a struct w
 
 | v1 | v2 | Field name | Type | Description |
 |----|----|------------|------|-------------|
-| _required_ | _required_ | **`snapshot-id`** | `string` | ID of the Iceberg table's snapshot the statistics were computed from. |
+| _required_ | _required_ | **`snapshot-id`** | `string` | ID of the Iceberg table's snapshot that is using this statistics file. |

Review Comment:
   Good wording suggestion. I have updated it.  



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] ajantha-bhat commented on a diff in pull request #6267: Core: Update StatisticsFile interface in TableMetadata spec

Posted by GitBox <gi...@apache.org>.
ajantha-bhat commented on code in PR #6267:
URL: https://github.com/apache/iceberg/pull/6267#discussion_r1038756796


##########
core/src/test/java/org/apache/iceberg/TestTableMetadata.java:
##########
@@ -1274,20 +1275,20 @@ public void testSetStatistics() {
     Assert.assertEquals("Statistics file snapshot", 43L, statisticsFile.snapshotId());
     Assert.assertEquals("Statistics file path", "/some/path/to/stats/file", statisticsFile.path());
 
-    TableMetadata withStatisticsReplaced =
+    TableMetadata withStatisticsAppended =
         TableMetadata.buildFrom(withStatistics)
             .setStatistics(
                 43,
                 new GenericStatisticsFile(
                     43, "/some/path/to/stats/file2", 128, 27, ImmutableList.of()))
             .build();
 
-    Assertions.assertThat(withStatisticsReplaced.statisticsFiles())
-        .as("There should be one statistics file registered")
-        .hasSize(1);
-    statisticsFile = Iterables.getOnlyElement(withStatisticsReplaced.statisticsFiles());
-    Assert.assertEquals("Statistics file snapshot", 43L, statisticsFile.snapshotId());
-    Assert.assertEquals("Statistics file path", "/some/path/to/stats/file2", statisticsFile.path());
+    Assertions.assertThat(withStatisticsAppended.statisticsFiles())
+        .as("There should be two statistics files registered")
+        .hasSize(2)

Review Comment:
   As I fixed the overwrite to append. There will be two stats file for this snapshot id. Hence, updated the testcases. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] ajantha-bhat commented on a diff in pull request #6267: Core: Update StatisticsFile interface in TableMetadata spec

Posted by GitBox <gi...@apache.org>.
ajantha-bhat commented on code in PR #6267:
URL: https://github.com/apache/iceberg/pull/6267#discussion_r1044051493


##########
core/src/main/java/org/apache/iceberg/TableMetadata.java:
##########
@@ -793,6 +822,18 @@ private static Map<String, SnapshotRef> validateRefs(
     return inputRefs;
   }
 
+  private long computeStatisticsCheckpointSnapshotID() {

Review Comment:
   I didn't consider the branching. True. I will remove it. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a diff in pull request #6267: Core: Update StatisticsFile interface in TableMetadata spec

Posted by GitBox <gi...@apache.org>.
rdblue commented on code in PR #6267:
URL: https://github.com/apache/iceberg/pull/6267#discussion_r1043907581


##########
core/src/main/java/org/apache/iceberg/TableMetadata.java:
##########
@@ -793,6 +822,18 @@ private static Map<String, SnapshotRef> validateRefs(
     return inputRefs;
   }
 
+  private long computeStatisticsCheckpointSnapshotID() {
+    if (statisticsFilesById.isEmpty()) {
+      return -1;
+    }
+    for (Snapshot snapshot : Lists.reverse(snapshots)) {

Review Comment:
   Please update this to conform to style guidelines. There should be empty newlines between control flow blocks and the following statements.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on pull request #6267: Core: Update StatisticsFile interface in TableMetadata spec

Posted by GitBox <gi...@apache.org>.
rdblue commented on PR #6267:
URL: https://github.com/apache/iceberg/pull/6267#issuecomment-1343495749

   @ajantha-bhat, I'm fine with the update to the spec wording, but I don't think the rest of these changes are needed. There should be only one stats file per snapshot in metadata. There is no value in having multiple stats files and implementations should merge them.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] ajantha-bhat commented on a diff in pull request #6267: Core: Update StatisticsFile interface in TableMetadata spec

Posted by GitBox <gi...@apache.org>.
ajantha-bhat commented on code in PR #6267:
URL: https://github.com/apache/iceberg/pull/6267#discussion_r1038756728


##########
core/src/main/java/org/apache/iceberg/TableMetadata.java:
##########
@@ -1183,14 +1222,16 @@ public Builder setStatistics(long snapshotId, StatisticsFile statisticsFile) {
           "snapshotId does not match: %s vs %s",
           snapshotId,
           statisticsFile.snapshotId());
-      statisticsFiles.put(statisticsFile.snapshotId(), ImmutableList.of(statisticsFile));

Review Comment:
   The interface supports returning multiple stats file for one snapshot. But was always overwriting instead of appending. So, fixed it.  



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] ajantha-bhat commented on pull request #6267: Core: Update StatisticsFile interface in TableMetadata spec

Posted by GitBox <gi...@apache.org>.
ajantha-bhat commented on PR #6267:
URL: https://github.com/apache/iceberg/pull/6267#issuecomment-1326717415

   @rdblue, @findepi: Please let me know your thoughts on this.   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a diff in pull request #6267: Core: Update StatisticsFile interface in TableMetadata spec

Posted by GitBox <gi...@apache.org>.
rdblue commented on code in PR #6267:
URL: https://github.com/apache/iceberg/pull/6267#discussion_r1043909685


##########
core/src/test/java/org/apache/iceberg/TestTableMetadata.java:
##########
@@ -1274,20 +1275,20 @@ public void testSetStatistics() {
     Assert.assertEquals("Statistics file snapshot", 43L, statisticsFile.snapshotId());
     Assert.assertEquals("Statistics file path", "/some/path/to/stats/file", statisticsFile.path());
 
-    TableMetadata withStatisticsReplaced =
+    TableMetadata withStatisticsAppended =
         TableMetadata.buildFrom(withStatistics)
             .setStatistics(
                 43,
                 new GenericStatisticsFile(
                     43, "/some/path/to/stats/file2", 128, 27, ImmutableList.of()))
             .build();
 
-    Assertions.assertThat(withStatisticsReplaced.statisticsFiles())
-        .as("There should be one statistics file registered")
-        .hasSize(1);
-    statisticsFile = Iterables.getOnlyElement(withStatisticsReplaced.statisticsFiles());
-    Assert.assertEquals("Statistics file snapshot", 43L, statisticsFile.snapshotId());
-    Assert.assertEquals("Statistics file path", "/some/path/to/stats/file2", statisticsFile.path());
+    Assertions.assertThat(withStatisticsAppended.statisticsFiles())
+        .as("There should be two statistics files registered")
+        .hasSize(2)

Review Comment:
   I think this is incorrect.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a diff in pull request #6267: Core: Update StatisticsFile interface in TableMetadata spec

Posted by GitBox <gi...@apache.org>.
rdblue commented on code in PR #6267:
URL: https://github.com/apache/iceberg/pull/6267#discussion_r1043908619


##########
core/src/main/java/org/apache/iceberg/TableMetadata.java:
##########
@@ -793,6 +822,18 @@ private static Map<String, SnapshotRef> validateRefs(
     return inputRefs;
   }
 
+  private long computeStatisticsCheckpointSnapshotID() {

Review Comment:
   This doesn't make sense to me and I suspect it should be removed. There is no guaranteed order to `snapshots` and the snapshots in that list can be from different branches. The first snapshot with a valid stats file has no meaning.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] ajantha-bhat commented on a diff in pull request #6267: Core: Update StatisticsFile interface in TableMetadata spec

Posted by GitBox <gi...@apache.org>.
ajantha-bhat commented on code in PR #6267:
URL: https://github.com/apache/iceberg/pull/6267#discussion_r1044050182


##########
core/src/main/java/org/apache/iceberg/TableMetadata.java:
##########
@@ -1183,14 +1222,16 @@ public Builder setStatistics(long snapshotId, StatisticsFile statisticsFile) {
           "snapshotId does not match: %s vs %s",
           snapshotId,
           statisticsFile.snapshotId());
-      statisticsFiles.put(statisticsFile.snapshotId(), ImmutableList.of(statisticsFile));

Review Comment:
   @rdblue, @findepi:  
   It is just as per the previous existing interface in the builder. 
   we had `Map<Long, List<StatisticsFile>> statisticsFiles` in builder. 
   If we wanted one stats file per interface then no need to have list for map value?
   
   Also is it possible that there will be one partition level stats (avro/parquet file) and there is one table level NDV file per snapshot. In this case, no need to merge them but can be used based on the need. So, still ok to have multiple stats per snapshot id?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on pull request #6267: Core: Update StatisticsFile interface in TableMetadata spec

Posted by GitBox <gi...@apache.org>.
rdblue commented on PR #6267:
URL: https://github.com/apache/iceberg/pull/6267#issuecomment-1347376734

   > I think we still need an interface to get the current stats file for snapshot id currentStatisticsFiles()
   
   No, I don't think this is valuable. This depends too much on some definition of "current", which is about as useful as asking for _any_ stats file. This should be more specific. I agree that we want some way to get a stats file, but I don't think there's much value in the "current" idea.
   
   > It was just as per the previous existing interface in the builder.
   
   I wouldn't worry too much about intermediate representations. What we have in the current API is fine for now. I don't think that we should make the changes in this PR.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] ajantha-bhat commented on pull request #6267: Core: Update StatisticsFile interface in TableMetadata spec

Posted by GitBox <gi...@apache.org>.
ajantha-bhat commented on PR #6267:
URL: https://github.com/apache/iceberg/pull/6267#issuecomment-1336131494

   @rdblue, @findepi: PR is ready for review. This should help in unblocking #6090, #6091   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] ajantha-bhat commented on a diff in pull request #6267: Core: Update StatisticsFile interface in TableMetadata spec

Posted by GitBox <gi...@apache.org>.
ajantha-bhat commented on code in PR #6267:
URL: https://github.com/apache/iceberg/pull/6267#discussion_r1038756561


##########
core/src/main/java/org/apache/iceberg/TableMetadata.java:
##########
@@ -823,7 +860,7 @@ public static class Builder {
     private long currentSnapshotId;
     private List<Snapshot> snapshots;
     private final Map<String, SnapshotRef> refs;
-    private final Map<Long, List<StatisticsFile>> statisticsFiles;
+    private final Map<Long, List<StatisticsFile>> statisticsFilesById;

Review Comment:
   I thought better to rename this similar to `snapshotsById`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] findepi commented on a diff in pull request #6267: Core: Update StatisticsFile interface in TableMetadata spec

Posted by GitBox <gi...@apache.org>.
findepi commented on code in PR #6267:
URL: https://github.com/apache/iceberg/pull/6267#discussion_r1032538469


##########
format/spec.md:
##########
@@ -684,7 +684,7 @@ Statistics files metadata within `statistics` table metadata field is a struct w
 
 | v1 | v2 | Field name | Type | Description |
 |----|----|------------|------|-------------|
-| _required_ | _required_ | **`snapshot-id`** | `string` | ID of the Iceberg table's snapshot the statistics were computed from. |
+| _required_ | _required_ | **`snapshot-id`** | `string` | ID of the Iceberg table's snapshot that is using this statistics file. |

Review Comment:
   "ID of the Iceberg table's snapshot the statistics file is associated with" ?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] ajantha-bhat commented on pull request #6267: Core: Update StatisticsFile interface in TableMetadata spec

Posted by GitBox <gi...@apache.org>.
ajantha-bhat commented on PR #6267:
URL: https://github.com/apache/iceberg/pull/6267#issuecomment-1343786472

   > @ajantha-bhat, I'm fine with the update to the spec wording, but I don't think the rest of these changes are needed. 
   
   I think we still need an interface to get the current stats file for snapshot id `currentStatisticsFiles()`
   
   > There should be only one stats file per snapshot in metadata. 
   There is no value in having multiple stats files and implementations should merge them.
   
   Maybe true. But It was just as per the previous existing interface in the builder.
   we had `Map<Long, List<StatisticsFile>> statisticsFiles` in builder.
   If we wanted one stats file per interface then no need to have list for map value?
   
   Also is it possible that there will be one partition level stats (avro/parquet file) and there is one table level NDV file per snapshot. In this case, no need to merge them but can be used based on the need. So, still ok to have multiple stats per snapshot id?
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a diff in pull request #6267: Core: Update StatisticsFile interface in TableMetadata spec

Posted by GitBox <gi...@apache.org>.
rdblue commented on code in PR #6267:
URL: https://github.com/apache/iceberg/pull/6267#discussion_r1043909381


##########
core/src/main/java/org/apache/iceberg/TableMetadata.java:
##########
@@ -1183,14 +1222,16 @@ public Builder setStatistics(long snapshotId, StatisticsFile statisticsFile) {
           "snapshotId does not match: %s vs %s",
           snapshotId,
           statisticsFile.snapshotId());
-      statisticsFiles.put(statisticsFile.snapshotId(), ImmutableList.of(statisticsFile));

Review Comment:
   This is not correct. There should be only one stats file for a snapshot that contains all information. Allowing multiple stats files for a snapshot is going to lead to lazy implementations that don't merge the files and slower job planning.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on pull request #6267: Docs: Update spec about statistics file snapshot id

Posted by GitBox <gi...@apache.org>.
rdblue commented on PR #6267:
URL: https://github.com/apache/iceberg/pull/6267#issuecomment-1356910794

   Thanks, @ajantha-bhat! Merging this.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue merged pull request #6267: Docs: Update spec about statistics file snapshot id

Posted by GitBox <gi...@apache.org>.
rdblue merged PR #6267:
URL: https://github.com/apache/iceberg/pull/6267


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org