You are viewing a plain text version of this content. The canonical link for it is here.

Posted to issues@iceberg.apache.org by "dramaticlly (via GitHub)" <gi...@apache.org> on 2023/05/10 23:09:03 UTC

[GitHub] [iceberg] dramaticlly opened a new pull request, #7581: Add last updated timestamp and snapshotId for partition table

dramaticlly opened a new pull request, #7581:
URL: https://github.com/apache/iceberg/pull/7581

   close #7560
   
   ## What
   add 2 column to partition metadata table
   - `last_updated_at` with timestamp with timezone type (similar to `committed_at` in snapshot metadata table)
   - `last_updated_snapshot_id` with long type  
   
   ## How
   In Partition metadata table, we moved from iterating over ContentFiles from ManifestReader to ManifestEntry instead, since entry contain both content file and snapshotId, which are valuable to provide last updated timestamp and snapshot information. We group by file for each partition boundary and get the highest snapshot commit timestamp so that we know when exactly is the partition last modified. 
   
   ## Why
   This can be potentially useful for iceberg table ingested by streaming application such as flink, to understand which iceberg partition is "sealed" based on timestamp. However, the write such as data compaction can also move the last update timestamp forward without actual data change, which snapshot id here can be useful for further analysis
   
   @szehon-ho if you can help review your idea :)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] dramaticlly commented on a diff in pull request #7581: Add last updated timestamp and snapshotId for partition table

Posted by "dramaticlly (via GitHub)" <gi...@apache.org>.

dramaticlly commented on code in PR #7581:
URL: https://github.com/apache/iceberg/pull/7581#discussion_r1237776518


##########
core/src/main/java/org/apache/iceberg/PartitionsTable.java:
##########
@@ -142,13 +158,14 @@ private static StaticDataTask.Row convertPartition(Partition partition) {
   private static Iterable<Partition> partitions(Table table, StaticTableScan scan) {
     Types.StructType partitionType = Partitioning.partitionType(table);
     PartitionMap partitions = new PartitionMap(partitionType);
-
-    try (CloseableIterable<ContentFile<?>> files = planFiles(scan)) {
-      for (ContentFile<?> file : files) {
+    try (CloseableIterable<ManifestEntry<? extends ContentFile<?>>> entries = planEntries(scan)) {
+      for (ManifestEntry<? extends ContentFile<?>> entry : entries) {
+        Snapshot snapshot = table.snapshot(entry.snapshotId());
+        ContentFile<?> file = entry.file();

Review Comment:
   I just realized the entry file status can change from added to existing without change its data_file to snapshot-id mapping when following your thought. So this imply that we do not need to filter entry by its status and I also added a new unit test to cover on how rewrite manifests and expire snapshot will impact our last updated timestamp and snapshotId in partitions table



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] dramaticlly commented on a diff in pull request #7581: Add last updated timestamp and snapshotId for partition table

Posted by "dramaticlly (via GitHub)" <gi...@apache.org>.

dramaticlly commented on code in PR #7581:
URL: https://github.com/apache/iceberg/pull/7581#discussion_r1191649893


##########
core/src/main/java/org/apache/iceberg/PartitionsTable.java:
##########
@@ -139,13 +155,14 @@ private static StaticDataTask.Row convertPartition(Partition partition) {
   private static Iterable<Partition> partitions(Table table, StaticTableScan scan) {
     Types.StructType partitionType = Partitioning.partitionType(table);
     PartitionMap partitions = new PartitionMap(partitionType);
-
-    try (CloseableIterable<ContentFile<?>> files = planFiles(scan)) {
-      for (ContentFile<?> file : files) {
+    try (CloseableIterable<ManifestEntry<? extends ContentFile<?>>> entries = planEntries(scan)) {
+      for (ManifestEntry<? extends ContentFile<?>> entry : entries) {
+        Snapshot snapshot = table.snapshot(entry.snapshotId());

Review Comment:
   hey @puchengy , I think you raised a good point. 
   
   Here's my thoughts, the snapshot expiration removes snapshots together with metadata and data files only reachable by such snapshot. I was under the impression that the manifest files (and manifest entries) we are iterating here shall not provide snapshotId which is already expired. 
   
   Let me know what do you think?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] szehon-ho commented on a diff in pull request #7581: Add last updated timestamp and snapshotId for partition table

Posted by "szehon-ho (via GitHub)" <gi...@apache.org>.

szehon-ho commented on code in PR #7581:
URL: https://github.com/apache/iceberg/pull/7581#discussion_r1201717572


##########
core/src/main/java/org/apache/iceberg/PartitionsTable.java:
##########
@@ -47,6 +47,16 @@ public class PartitionsTable extends BaseMetadataTable {
         new Schema(
             Types.NestedField.required(1, "partition", Partitioning.partitionType(table)),
             Types.NestedField.required(4, "spec_id", Types.IntegerType.get()),
+            Types.NestedField.required(
+                9,
+                "last_updated_at",
+                Types.TimestampType.withZone(),
+                "Partition last updated timestamp"),
+            Types.NestedField.required(
+                10,
+                "last_updated_snapshot_id",

Review Comment:
   Hm, what do we think to have both?  (and null if its expired, as per discussion here https://github.com/apache/iceberg/pull/7581#discussion_r1201257740).  If we have it, I feel its more useful than the time, but agree we don't always have it.
   
   I think sequence number will be good too, but do we mean fileSequenceNumber or dataSequenceNumber.  Maybe worth another pr if there's more discussion there?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] szehon-ho commented on a diff in pull request #7581: Add last updated timestamp and snapshotId for partition table

Posted by "szehon-ho (via GitHub)" <gi...@apache.org>.

szehon-ho commented on code in PR #7581:
URL: https://github.com/apache/iceberg/pull/7581#discussion_r1239123194


##########
core/src/main/java/org/apache/iceberg/PartitionsTable.java:
##########
@@ -249,19 +273,31 @@ static class Partition {
     private int posDeleteFileCount;
     private long eqDeleteRecordCount;
     private int eqDeleteFileCount;
+    private Long lastUpdatedMs;
+    private Long lastUpdatedSnapshotId;
 
     Partition(StructLike key, Types.StructType keyType) {
       this.partitionData = toPartitionData(key, keyType);
       this.specId = 0;
-      this.dataRecordCount = 0;
+      this.dataRecordCount = 0L;
       this.dataFileCount = 0;
-      this.posDeleteRecordCount = 0;
+      this.posDeleteRecordCount = 0L;
       this.posDeleteFileCount = 0;
-      this.eqDeleteRecordCount = 0;
+      this.eqDeleteRecordCount = 0L;
       this.eqDeleteFileCount = 0;
+      this.lastUpdatedMs = null;

Review Comment:
   Nit: is it necessary? 



##########
core/src/main/java/org/apache/iceberg/PartitionsTable.java:
##########
@@ -72,7 +72,17 @@ public class PartitionsTable extends BaseMetadataTable {
                 8,
                 "equality_delete_file_count",
                 Types.IntegerType.get(),
-                "Count of equality delete files"));
+                "Count of equality delete files"),
+            Types.NestedField.optional(
+                9,
+                "last_updated_ms",
+                Types.TimestampType.withZone(),
+                "Partition last updated timestamp"),

Review Comment:
   Nit: Commit time of snapshot that last updated this partition.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] szehon-ho commented on pull request #7581: Add last updated timestamp and snapshotId for partition table

Posted by "szehon-ho (via GitHub)" <gi...@apache.org>.

szehon-ho commented on PR #7581:
URL: https://github.com/apache/iceberg/pull/7581#issuecomment-1605005042

   Merged, thanks @dramaticlly for the great work ! 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] dramaticlly commented on a diff in pull request #7581: Add last updated timestamp and snapshotId for partition table

Posted by "dramaticlly (via GitHub)" <gi...@apache.org>.

dramaticlly commented on code in PR #7581:
URL: https://github.com/apache/iceberg/pull/7581#discussion_r1238927241


##########
core/src/main/java/org/apache/iceberg/PartitionsTable.java:
##########
@@ -49,6 +49,16 @@ public class PartitionsTable extends BaseMetadataTable {
         new Schema(
             Types.NestedField.required(1, "partition", Partitioning.partitionType(table)),
             Types.NestedField.required(4, "spec_id", Types.IntegerType.get()),
+            Types.NestedField.required(

Review Comment:
   thank you Szehon, that make sense and I updated to mark both optional and leave its default to null like parent_id in snapshots metadata table



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] ajantha-bhat commented on pull request #7581: Add last updated timestamp and snapshotId for partition table

Posted by "ajantha-bhat (via GitHub)" <gi...@apache.org>.

ajantha-bhat commented on PR #7581:
URL: https://github.com/apache/iceberg/pull/7581#issuecomment-1721239624

   Yes, It will be in the upcoming 1.4.0 release. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] puchengy commented on a diff in pull request #7581: Add last updated timestamp and snapshotId for partition table

Posted by "puchengy (via GitHub)" <gi...@apache.org>.

puchengy commented on code in PR #7581:
URL: https://github.com/apache/iceberg/pull/7581#discussion_r1190473529


##########
core/src/main/java/org/apache/iceberg/PartitionsTable.java:
##########
@@ -139,13 +155,14 @@ private static StaticDataTask.Row convertPartition(Partition partition) {
   private static Iterable<Partition> partitions(Table table, StaticTableScan scan) {
     Types.StructType partitionType = Partitioning.partitionType(table);
     PartitionMap partitions = new PartitionMap(partitionType);
-
-    try (CloseableIterable<ContentFile<?>> files = planFiles(scan)) {
-      for (ContentFile<?> file : files) {
+    try (CloseableIterable<ManifestEntry<? extends ContentFile<?>>> entries = planEntries(scan)) {
+      for (ManifestEntry<? extends ContentFile<?>> entry : entries) {
+        Snapshot snapshot = table.snapshot(entry.snapshotId());

Review Comment:
   In the case of snapshots got expired, the return value can be null.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] szehon-ho commented on a diff in pull request #7581: Add last updated timestamp and snapshotId for partition table

Posted by "szehon-ho (via GitHub)" <gi...@apache.org>.

szehon-ho commented on code in PR #7581:
URL: https://github.com/apache/iceberg/pull/7581#discussion_r1203161846


##########
core/src/main/java/org/apache/iceberg/PartitionsTable.java:
##########
@@ -47,6 +47,16 @@ public class PartitionsTable extends BaseMetadataTable {
         new Schema(
             Types.NestedField.required(1, "partition", Partitioning.partitionType(table)),
             Types.NestedField.required(4, "spec_id", Types.IntegerType.get()),
+            Types.NestedField.required(
+                9,
+                "last_updated_at",
+                Types.TimestampType.withZone(),
+                "Partition last updated timestamp"),
+            Types.NestedField.required(
+                10,
+                "last_updated_snapshot_id",

Review Comment:
   @RussellSpitzer @aokolnychyi @flyrain  any thoughts here, what would make more sense on Partition table?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] szehon-ho commented on a diff in pull request #7581: Add last updated timestamp and snapshotId for partition table

Posted by "szehon-ho (via GitHub)" <gi...@apache.org>.

szehon-ho commented on code in PR #7581:
URL: https://github.com/apache/iceberg/pull/7581#discussion_r1203161846


##########
core/src/main/java/org/apache/iceberg/PartitionsTable.java:
##########
@@ -47,6 +47,16 @@ public class PartitionsTable extends BaseMetadataTable {
         new Schema(
             Types.NestedField.required(1, "partition", Partitioning.partitionType(table)),
             Types.NestedField.required(4, "spec_id", Types.IntegerType.get()),
+            Types.NestedField.required(
+                9,
+                "last_updated_at",
+                Types.TimestampType.withZone(),
+                "Partition last updated timestamp"),
+            Types.NestedField.required(
+                10,
+                "last_updated_snapshot_id",

Review Comment:
   @RussellSpitzer @aokolnychyi any thoughts here, what would make more sense on Partition table?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] szehon-ho merged pull request #7581: Add last updated timestamp and snapshotId for partition table

Posted by "szehon-ho (via GitHub)" <gi...@apache.org>.

szehon-ho merged PR #7581:
URL: https://github.com/apache/iceberg/pull/7581


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] dramaticlly commented on a diff in pull request #7581: Add last updated timestamp and snapshotId for partition table

Posted by "dramaticlly (via GitHub)" <gi...@apache.org>.

dramaticlly commented on code in PR #7581:
URL: https://github.com/apache/iceberg/pull/7581#discussion_r1239134332


##########
core/src/main/java/org/apache/iceberg/PartitionsTable.java:
##########
@@ -249,19 +273,31 @@ static class Partition {
     private int posDeleteFileCount;
     private long eqDeleteRecordCount;
     private int eqDeleteFileCount;
+    private Long lastUpdatedMs;
+    private Long lastUpdatedSnapshotId;
 
     Partition(StructLike key, Types.StructType keyType) {
       this.partitionData = toPartitionData(key, keyType);
       this.specId = 0;
-      this.dataRecordCount = 0;
+      this.dataRecordCount = 0L;
       this.dataFileCount = 0;
-      this.posDeleteRecordCount = 0;
+      this.posDeleteRecordCount = 0L;
       this.posDeleteFileCount = 0;
-      this.eqDeleteRecordCount = 0;
+      this.eqDeleteRecordCount = 0L;
       this.eqDeleteFileCount = 0;
+      this.lastUpdatedMs = null;

Review Comment:
   removed as it was default



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] dramaticlly commented on a diff in pull request #7581: Add last updated timestamp and snapshotId for partition table

Posted by "dramaticlly (via GitHub)" <gi...@apache.org>.

dramaticlly commented on code in PR #7581:
URL: https://github.com/apache/iceberg/pull/7581#discussion_r1240147358


##########
spark/v3.1/spark/src/test/java/org/apache/iceberg/spark/source/TestIcebergSourceTablesBase.java:
##########
@@ -1283,6 +1296,158 @@ public void testUnpartitionedPartitionsTable() {
     TestHelpers.assertEqualsSafe(expectedSchema, expectedRow, actual.get(0));
   }
 
+  @Test
+  public void testPartitionsTableLastUpdatedSnapshot() {
+    TableIdentifier tableIdentifier = TableIdentifier.of("db", "partitions_test");
+    Table table = createTable(tableIdentifier, SCHEMA, SPEC);
+    Table partitionsTable = loadTable(tableIdentifier, "partitions");
+    Dataset<Row> df1 =
+        spark.createDataFrame(
+            Lists.newArrayList(new SimpleRecord(1, "1"), new SimpleRecord(2, "2")),
+            SimpleRecord.class);
+    Dataset<Row> df2 =
+        spark.createDataFrame(Lists.newArrayList(new SimpleRecord(2, "20")), SimpleRecord.class);
+
+    df1.select("id", "data")
+        .write()
+        .format("iceberg")
+        .mode("append")
+        .save(loadLocation(tableIdentifier));
+
+    table.refresh();
+    long firstCommitId = table.currentSnapshot().snapshotId();
+
+    // add a second file
+    df2.select("id", "data")
+        .write()
+        .format("iceberg")
+        .mode("append")
+        .save(loadLocation(tableIdentifier));
+
+    table.refresh();
+    long secondCommitId = table.currentSnapshot().snapshotId();
+
+    // check if rewrite manifest does not override metadata about data file's creating snapshot
+    RewriteManifests.Result rewriteManifestResult =
+        SparkActions.get().rewriteManifests(table).execute();
+    Assert.assertEquals(
+        "rewrite replaced 2 manifests",
+        2,
+        Iterables.size(rewriteManifestResult.rewrittenManifests()));
+    Assert.assertEquals(
+        "rewrite added 1 manifests", 1, Iterables.size(rewriteManifestResult.addedManifests()));
+
+    List<Row> actual =
+        spark
+            .read()
+            .format("iceberg")
+            .load(loadLocation(tableIdentifier, "partitions"))
+            .orderBy("partition.id")
+            .collectAsList();
+
+    GenericRecordBuilder builder =
+        new GenericRecordBuilder(AvroSchemaUtil.convert(partitionsTable.schema(), "partitions"));
+    GenericRecordBuilder partitionBuilder =
+        new GenericRecordBuilder(
+            AvroSchemaUtil.convert(
+                partitionsTable.schema().findType("partition").asStructType(), "partition"));
+    List<GenericData.Record> expected = Lists.newArrayList();
+    expected.add(
+        builder
+            .set("partition", partitionBuilder.set("id", 1).build())
+            .set("record_count", 1L)
+            .set("file_count", 1)
+            .set("position_delete_record_count", 0L)
+            .set("position_delete_file_count", 0)
+            .set("equality_delete_record_count", 0L)
+            .set("equality_delete_file_count", 0)
+            .set("spec_id", 0)
+            .set("last_updated_ms", table.snapshot(firstCommitId).timestampMillis() * 1000)
+            .set("last_updated_snapshot_id", firstCommitId)
+            .build());
+    expected.add(
+        builder
+            .set("partition", partitionBuilder.set("id", 2).build())
+            .set("record_count", 2L)
+            .set("file_count", 2)
+            .set("position_delete_record_count", 0L)
+            .set("position_delete_file_count", 0)
+            .set("equality_delete_record_count", 0L)
+            .set("equality_delete_file_count", 0)
+            .set("spec_id", 0)
+            .set("last_updated_ms", table.snapshot(secondCommitId).timestampMillis() * 1000)
+            .set("last_updated_snapshot_id", secondCommitId)
+            .build());
+
+    Assert.assertEquals("Partitions table should have two rows", 2, expected.size());
+    Assert.assertEquals("Actual results should have two rows", 2, actual.size());
+    for (int i = 0; i < 2; i += 1) {
+      TestHelpers.assertEqualsSafe(
+          partitionsTable.schema().asStruct(), expected.get(i), actual.get(i));
+    }
+
+    // check time travel
+    List<Row> actualAfterFirstCommit =
+        spark
+            .read()
+            .format("iceberg")
+            .option(SparkReadOptions.SNAPSHOT_ID, String.valueOf(secondCommitId))
+            .load(loadLocation(tableIdentifier, "partitions"))
+            .orderBy("partition.id")
+            .collectAsList();
+
+    Assert.assertEquals("Actual results should have one row", 2, actualAfterFirstCommit.size());

Review Comment:
   actually I think time travel is not really needed here as method is already very long, let me just drop this section assert against `actualAfterFirstCommit`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] szehon-ho commented on a diff in pull request #7581: Add last updated timestamp and snapshotId for partition table

Posted by "szehon-ho (via GitHub)" <gi...@apache.org>.

szehon-ho commented on code in PR #7581:
URL: https://github.com/apache/iceberg/pull/7581#discussion_r1207628850


##########
core/src/main/java/org/apache/iceberg/PartitionsTable.java:
##########
@@ -47,6 +47,16 @@ public class PartitionsTable extends BaseMetadataTable {
         new Schema(
             Types.NestedField.required(1, "partition", Partitioning.partitionType(table)),
             Types.NestedField.required(4, "spec_id", Types.IntegerType.get()),
+            Types.NestedField.required(
+                9,
+                "last_updated_at",
+                Types.TimestampType.withZone(),
+                "Partition last updated timestamp"),
+            Types.NestedField.required(
+                10,
+                "last_updated_snapshot_id",

Review Comment:
   After some thoughts, I think both will be null if the snapshot is expired.  But I think we are no worse off than the alternative, which is to join the entries + snapshots table for the user to find the last update time.  This also will be null if snapshot is expired.
   
   Perhaps we can have some more persistent storage of snapshot metadata , even after expire?
   
   I think for this case (lastUpdateTime, lastSnapshotId), it is ok to proceed, typically snapshots live several days at least, and I would imagine the user of the tool is interested in the last updated partition, and can run it before the snapshot is expired.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] puchengy commented on a diff in pull request #7581: Add last updated timestamp and snapshotId for partition table

Posted by "puchengy (via GitHub)" <gi...@apache.org>.

puchengy commented on code in PR #7581:
URL: https://github.com/apache/iceberg/pull/7581#discussion_r1191727852


##########
core/src/main/java/org/apache/iceberg/PartitionsTable.java:
##########
@@ -139,13 +155,14 @@ private static StaticDataTask.Row convertPartition(Partition partition) {
   private static Iterable<Partition> partitions(Table table, StaticTableScan scan) {
     Types.StructType partitionType = Partitioning.partitionType(table);
     PartitionMap partitions = new PartitionMap(partitionType);
-
-    try (CloseableIterable<ContentFile<?>> files = planFiles(scan)) {
-      for (ContentFile<?> file : files) {
+    try (CloseableIterable<ManifestEntry<? extends ContentFile<?>>> entries = planEntries(scan)) {
+      for (ManifestEntry<? extends ContentFile<?>> entry : entries) {
+        Snapshot snapshot = table.snapshot(entry.snapshotId());

Review Comment:
   I am not sure either, but we can quickly test this out.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] szehon-ho commented on a diff in pull request #7581: Add last updated timestamp and snapshotId for partition table

Posted by "szehon-ho (via GitHub)" <gi...@apache.org>.

szehon-ho commented on code in PR #7581:
URL: https://github.com/apache/iceberg/pull/7581#discussion_r1231401197


##########
core/src/main/java/org/apache/iceberg/PartitionsTable.java:
##########
@@ -49,6 +49,16 @@ public class PartitionsTable extends BaseMetadataTable {
         new Schema(
             Types.NestedField.required(1, "partition", Partitioning.partitionType(table)),
             Types.NestedField.required(4, "spec_id", Types.IntegerType.get()),
+            Types.NestedField.required(
+                9,
+                "last_updated",
+                Types.TimestampType.withZone(),
+                "Partition last updated timestamp"),
+            Types.NestedField.required(
+                10,
+                "last_updated_snapshot_id",
+                Types.LongType.get(),
+                "Partition last updated snapshot id"),

Review Comment:
   Nit: can we improve the doc?   (Id of snapshot that last updated this partition)



##########
core/src/main/java/org/apache/iceberg/PartitionsTable.java:
##########
@@ -49,6 +49,16 @@ public class PartitionsTable extends BaseMetadataTable {
         new Schema(
             Types.NestedField.required(1, "partition", Partitioning.partitionType(table)),
             Types.NestedField.required(4, "spec_id", Types.IntegerType.get()),
+            Types.NestedField.required(
+                9,
+                "last_updated",

Review Comment:
   nit: how about 'last_updated_ms'?   (matches Table metadata and also as below, we have a suffix for snapshot_id)



##########
core/src/main/java/org/apache/iceberg/PartitionsTable.java:
##########
@@ -142,13 +158,14 @@ private static StaticDataTask.Row convertPartition(Partition partition) {
   private static Iterable<Partition> partitions(Table table, StaticTableScan scan) {
     Types.StructType partitionType = Partitioning.partitionType(table);
     PartitionMap partitions = new PartitionMap(partitionType);
-
-    try (CloseableIterable<ContentFile<?>> files = planFiles(scan)) {
-      for (ContentFile<?> file : files) {
+    try (CloseableIterable<ManifestEntry<? extends ContentFile<?>>> entries = planEntries(scan)) {
+      for (ManifestEntry<? extends ContentFile<?>> entry : entries) {
+        Snapshot snapshot = table.snapshot(entry.snapshotId());
+        ContentFile<?> file = entry.file();

Review Comment:
   This is not done yet, right?  (Something equivalent to ManifestReader.liveEntries())



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] szehon-ho commented on a diff in pull request #7581: Add last updated timestamp and snapshotId for partition table

Posted by "szehon-ho (via GitHub)" <gi...@apache.org>.

szehon-ho commented on code in PR #7581:
URL: https://github.com/apache/iceberg/pull/7581#discussion_r1201254518


##########
core/src/main/java/org/apache/iceberg/PartitionsTable.java:
##########
@@ -47,6 +47,16 @@ public class PartitionsTable extends BaseMetadataTable {
         new Schema(
             Types.NestedField.required(1, "partition", Partitioning.partitionType(table)),
             Types.NestedField.required(4, "spec_id", Types.IntegerType.get()),
+            Types.NestedField.required(
+                9,
+                "last_updated_at",

Review Comment:
   How about 'last_updated' to be more precise, and match table metadata json?



##########
core/src/main/java/org/apache/iceberg/PartitionsTable.java:
##########
@@ -246,19 +268,27 @@ static class Partition {
     private int posDeleteFileCount;
     private long eqDeleteRecordCount;
     private int eqDeleteFileCount;
+    private long lastUpdatedAt;
+    private long lastUpdatedSnapshotId;
 
     Partition(StructLike key, Types.StructType keyType) {
       this.partitionData = toPartitionData(key, keyType);
       this.specId = 0;
-      this.dataRecordCount = 0;
+      this.dataRecordCount = 0L;
       this.dataFileCount = 0;
-      this.posDeleteRecordCount = 0;
+      this.posDeleteRecordCount = 0L;
       this.posDeleteFileCount = 0;
-      this.eqDeleteRecordCount = 0;
+      this.eqDeleteRecordCount = 0L;
       this.eqDeleteFileCount = 0;
+      this.lastUpdatedAt = 0L;
+      this.lastUpdatedSnapshotId = 0L;
     }
 
-    void update(ContentFile<?> file) {
+    void update(ContentFile<?> file, Snapshot snapshot) {
+      if (snapshot.timestampMillis() * 1000 > this.lastUpdatedAt) {

Review Comment:
   opt: maybe make a variable snapshotCommitTime = snapshot.timestampMillis * 1000?



##########
core/src/main/java/org/apache/iceberg/PartitionsTable.java:
##########
@@ -155,21 +172,26 @@ private static Iterable<Partition> partitions(Table table, StaticTableScan scan)
   }
 
   @VisibleForTesting
-  static CloseableIterable<ContentFile<?>> planFiles(StaticTableScan scan) {
+  static CloseableIterable<ManifestEntry<? extends ContentFile<?>>> planEntries(

Review Comment:
   Is it possible to remove `? extends ContentFile` here?



##########
core/src/main/java/org/apache/iceberg/PartitionsTable.java:
##########
@@ -155,21 +172,26 @@ private static Iterable<Partition> partitions(Table table, StaticTableScan scan)
   }
 
   @VisibleForTesting
-  static CloseableIterable<ContentFile<?>> planFiles(StaticTableScan scan) {
+  static CloseableIterable<ManifestEntry<? extends ContentFile<?>>> planEntries(
+      StaticTableScan scan) {
     Table table = scan.table();
 
     CloseableIterable<ManifestFile> filteredManifests =
         filteredManifests(scan, table, scan.snapshot().allManifests(table.io()));
 
-    Iterable<CloseableIterable<ContentFile<?>>> tasks =
+    Iterable<CloseableIterable<ManifestEntry<? extends ContentFile<?>>>> tasks =
         CloseableIterable.transform(
             filteredManifests,
             manifest ->
                 CloseableIterable.transform(
                     ManifestFiles.open(manifest, table.io(), table.specs())
                         .caseSensitive(scan.isCaseSensitive())
-                        .select(scanColumns(manifest.content())), // don't select stats columns
-                    t -> (ContentFile<?>) t));
+                        .select(scanColumns(manifest.content())) // don't select stats columns
+                        .entries(),
+                    t ->

Review Comment:
   maybe worth breaking into new method now?  ie, manifest -> entries(manifest, scan)



##########
core/src/main/java/org/apache/iceberg/PartitionsTable.java:
##########
@@ -139,13 +155,14 @@ private static StaticDataTask.Row convertPartition(Partition partition) {
   private static Iterable<Partition> partitions(Table table, StaticTableScan scan) {
     Types.StructType partitionType = Partitioning.partitionType(table);
     PartitionMap partitions = new PartitionMap(partitionType);
-
-    try (CloseableIterable<ContentFile<?>> files = planFiles(scan)) {
-      for (ContentFile<?> file : files) {
+    try (CloseableIterable<ManifestEntry<? extends ContentFile<?>>> entries = planEntries(scan)) {
+      for (ManifestEntry<? extends ContentFile<?>> entry : entries) {
+        Snapshot snapshot = table.snapshot(entry.snapshotId());

Review Comment:
   Hm.. i think it may be possible.  Eg, snapshot1 => add DataFile1, snapshot2 => add DataFile2, 
   
   ExpireSnapshot may move Snapshot1, just DataFile1 is not removed.  
   
   Maybe we could make a null here if we don't find a snapshot?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] szehon-ho commented on a diff in pull request #7581: Add last updated timestamp and snapshotId for partition table

Posted by "szehon-ho (via GitHub)" <gi...@apache.org>.

szehon-ho commented on code in PR #7581:
URL: https://github.com/apache/iceberg/pull/7581#discussion_r1240121585


##########
spark/v3.1/spark/src/test/java/org/apache/iceberg/spark/source/TestIcebergSourceTablesBase.java:
##########
@@ -1283,6 +1296,158 @@ public void testUnpartitionedPartitionsTable() {
     TestHelpers.assertEqualsSafe(expectedSchema, expectedRow, actual.get(0));
   }
 
+  @Test
+  public void testPartitionsTableLastUpdatedSnapshot() {
+    TableIdentifier tableIdentifier = TableIdentifier.of("db", "partitions_test");
+    Table table = createTable(tableIdentifier, SCHEMA, SPEC);
+    Table partitionsTable = loadTable(tableIdentifier, "partitions");
+    Dataset<Row> df1 =
+        spark.createDataFrame(
+            Lists.newArrayList(new SimpleRecord(1, "1"), new SimpleRecord(2, "2")),
+            SimpleRecord.class);
+    Dataset<Row> df2 =
+        spark.createDataFrame(Lists.newArrayList(new SimpleRecord(2, "20")), SimpleRecord.class);
+
+    df1.select("id", "data")
+        .write()
+        .format("iceberg")
+        .mode("append")
+        .save(loadLocation(tableIdentifier));
+
+    table.refresh();
+    long firstCommitId = table.currentSnapshot().snapshotId();
+
+    // add a second file
+    df2.select("id", "data")
+        .write()
+        .format("iceberg")
+        .mode("append")
+        .save(loadLocation(tableIdentifier));
+
+    table.refresh();
+    long secondCommitId = table.currentSnapshot().snapshotId();
+
+    // check if rewrite manifest does not override metadata about data file's creating snapshot
+    RewriteManifests.Result rewriteManifestResult =
+        SparkActions.get().rewriteManifests(table).execute();
+    Assert.assertEquals(
+        "rewrite replaced 2 manifests",
+        2,
+        Iterables.size(rewriteManifestResult.rewrittenManifests()));
+    Assert.assertEquals(
+        "rewrite added 1 manifests", 1, Iterables.size(rewriteManifestResult.addedManifests()));
+
+    List<Row> actual =
+        spark
+            .read()
+            .format("iceberg")
+            .load(loadLocation(tableIdentifier, "partitions"))
+            .orderBy("partition.id")
+            .collectAsList();
+
+    GenericRecordBuilder builder =
+        new GenericRecordBuilder(AvroSchemaUtil.convert(partitionsTable.schema(), "partitions"));
+    GenericRecordBuilder partitionBuilder =
+        new GenericRecordBuilder(
+            AvroSchemaUtil.convert(
+                partitionsTable.schema().findType("partition").asStructType(), "partition"));
+    List<GenericData.Record> expected = Lists.newArrayList();
+    expected.add(
+        builder
+            .set("partition", partitionBuilder.set("id", 1).build())
+            .set("record_count", 1L)
+            .set("file_count", 1)
+            .set("position_delete_record_count", 0L)
+            .set("position_delete_file_count", 0)
+            .set("equality_delete_record_count", 0L)
+            .set("equality_delete_file_count", 0)
+            .set("spec_id", 0)
+            .set("last_updated_ms", table.snapshot(firstCommitId).timestampMillis() * 1000)
+            .set("last_updated_snapshot_id", firstCommitId)
+            .build());
+    expected.add(
+        builder
+            .set("partition", partitionBuilder.set("id", 2).build())
+            .set("record_count", 2L)
+            .set("file_count", 2)
+            .set("position_delete_record_count", 0L)
+            .set("position_delete_file_count", 0)
+            .set("equality_delete_record_count", 0L)
+            .set("equality_delete_file_count", 0)
+            .set("spec_id", 0)
+            .set("last_updated_ms", table.snapshot(secondCommitId).timestampMillis() * 1000)
+            .set("last_updated_snapshot_id", secondCommitId)
+            .build());
+
+    Assert.assertEquals("Partitions table should have two rows", 2, expected.size());
+    Assert.assertEquals("Actual results should have two rows", 2, actual.size());
+    for (int i = 0; i < 2; i += 1) {
+      TestHelpers.assertEqualsSafe(
+          partitionsTable.schema().asStruct(), expected.get(i), actual.get(i));
+    }
+
+    // check time travel
+    List<Row> actualAfterFirstCommit =
+        spark
+            .read()
+            .format("iceberg")
+            .option(SparkReadOptions.SNAPSHOT_ID, String.valueOf(secondCommitId))
+            .load(loadLocation(tableIdentifier, "partitions"))
+            .orderBy("partition.id")
+            .collectAsList();
+
+    Assert.assertEquals("Actual results should have one row", 2, actualAfterFirstCommit.size());

Review Comment:
   Hey , sorry , was doing another pass, is this a problem (size is actually 2 instead of 1)?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] szehon-ho commented on a diff in pull request #7581: Add last updated timestamp and snapshotId for partition table

Posted by "szehon-ho (via GitHub)" <gi...@apache.org>.

szehon-ho commented on code in PR #7581:
URL: https://github.com/apache/iceberg/pull/7581#discussion_r1207629153


##########
core/src/main/java/org/apache/iceberg/PartitionsTable.java:
##########
@@ -142,13 +158,14 @@ private static StaticDataTask.Row convertPartition(Partition partition) {
   private static Iterable<Partition> partitions(Table table, StaticTableScan scan) {
     Types.StructType partitionType = Partitioning.partitionType(table);
     PartitionMap partitions = new PartitionMap(partitionType);
-
-    try (CloseableIterable<ContentFile<?>> files = planFiles(scan)) {
-      for (ContentFile<?> file : files) {
+    try (CloseableIterable<ManifestEntry<? extends ContentFile<?>>> entries = planEntries(scan)) {
+      for (ManifestEntry<? extends ContentFile<?>> entry : entries) {
+        Snapshot snapshot = table.snapshot(entry.snapshotId());
+        ContentFile<?> file = entry.file();

Review Comment:
   Let's check if the entry status is not EXISTING, otherwise its a bit meaningless.  And let's add a test to capture this case.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] szehon-ho commented on a diff in pull request #7581: Add last updated timestamp and snapshotId for partition table

Posted by "szehon-ho (via GitHub)" <gi...@apache.org>.

szehon-ho commented on code in PR #7581:
URL: https://github.com/apache/iceberg/pull/7581#discussion_r1237874259


##########
spark/v3.3/spark/src/test/java/org/apache/iceberg/spark/source/TestIcebergSourceTablesBase.java:
##########
@@ -1404,6 +1424,158 @@ public void testPartitionsTable() {
     }
   }
 
+  @Test
+  public void testPartitionsTableLastUpdatedSnapshot() {
+    TableIdentifier tableIdentifier = TableIdentifier.of("db", "partitions_test");
+    Table table = createTable(tableIdentifier, SCHEMA, SPEC);
+    Table partitionsTable = loadTable(tableIdentifier, "partitions");
+    Dataset<Row> df1 =
+        spark.createDataFrame(
+            Lists.newArrayList(new SimpleRecord(1, "1"), new SimpleRecord(2, "2")),
+            SimpleRecord.class);
+    Dataset<Row> df2 =
+        spark.createDataFrame(Lists.newArrayList(new SimpleRecord(2, "20")), SimpleRecord.class);
+
+    df1.select("id", "data")
+        .write()
+        .format("iceberg")
+        .mode("append")
+        .save(loadLocation(tableIdentifier));
+
+    table.refresh();
+    long firstCommitId = table.currentSnapshot().snapshotId();
+
+    // add a second file
+    df2.select("id", "data")
+        .write()
+        .format("iceberg")
+        .mode("append")
+        .save(loadLocation(tableIdentifier));
+
+    table.refresh();
+    long secondCommitId = table.currentSnapshot().snapshotId();
+
+    // rewrite manifest carry over the data file to its committing snapshot relationship

Review Comment:
   Minor: How about 'check if rewrite manifest does not override metadata about data file's creating snapshot'  Is it more clear?



##########
core/src/main/java/org/apache/iceberg/PartitionsTable.java:
##########
@@ -49,6 +49,16 @@ public class PartitionsTable extends BaseMetadataTable {
         new Schema(
             Types.NestedField.required(1, "partition", Partitioning.partitionType(table)),
             Types.NestedField.required(4, "spec_id", Types.IntegerType.get()),
+            Types.NestedField.required(

Review Comment:
   Question: did we consider making it optional, and returning null?  A precedent may be snapshots table parentId field.



##########
core/src/test/java/org/apache/iceberg/MetadataTableScanTestBase.java:
##########
@@ -81,18 +81,20 @@ protected void validateTaskScanResiduals(TableScan scan, boolean ignoreResiduals
   }
 
   protected void validateSingleFieldPartition(
-      CloseableIterable<ContentFile<?>> files, int partitionValue) {
+      CloseableIterable<ManifestEntry<? extends ContentFile<?>>> files, int partitionValue) {

Review Comment:
   Nit: do you think we can remove "extends ContentFile<?>" here as well?  Compiler seems to be able to figure it out without it.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] dramaticlly commented on a diff in pull request #7581: Add last updated timestamp and snapshotId for partition table

Posted by "dramaticlly (via GitHub)" <gi...@apache.org>.

dramaticlly commented on code in PR #7581:
URL: https://github.com/apache/iceberg/pull/7581#discussion_r1238995687


##########
core/src/main/java/org/apache/iceberg/PartitionsTable.java:
##########
@@ -49,6 +49,16 @@ public class PartitionsTable extends BaseMetadataTable {
         new Schema(
             Types.NestedField.required(1, "partition", Partitioning.partitionType(table)),
             Types.NestedField.required(4, "spec_id", Types.IntegerType.get()),
+            Types.NestedField.optional(

Review Comment:
   yeah that make sense, especially those 2 columns are optional now.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] szehon-ho commented on pull request #7581: Add last updated timestamp and snapshotId for partition table

Posted by "szehon-ho (via GitHub)" <gi...@apache.org>.

szehon-ho commented on PR #7581:
URL: https://github.com/apache/iceberg/pull/7581#issuecomment-1603467216

   Looks great, will merge tomorrow if tests pass and to wait any other comments.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] szehon-ho commented on a diff in pull request #7581: Add last updated timestamp and snapshotId for partition table

Posted by "szehon-ho (via GitHub)" <gi...@apache.org>.

szehon-ho commented on code in PR #7581:
URL: https://github.com/apache/iceberg/pull/7581#discussion_r1207628850


##########
core/src/main/java/org/apache/iceberg/PartitionsTable.java:
##########
@@ -47,6 +47,16 @@ public class PartitionsTable extends BaseMetadataTable {
         new Schema(
             Types.NestedField.required(1, "partition", Partitioning.partitionType(table)),
             Types.NestedField.required(4, "spec_id", Types.IntegerType.get()),
+            Types.NestedField.required(
+                9,
+                "last_updated_at",
+                Types.TimestampType.withZone(),
+                "Partition last updated timestamp"),
+            Types.NestedField.required(
+                10,
+                "last_updated_snapshot_id",

Review Comment:
   After some thoughts, I think both will be null if the snapshot is expired.  But I think we are no worse off than the alternative, which is to join the entries + snapshots table for the user to find the last update time.
   
   And I think for this case (lastUpdateTime, lastSnapshotId), it is ok to proceed, typically snapshots live several days at least, and I would imagine the user of the tool is interested in the last updated partition.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] dramaticlly commented on a diff in pull request #7581: Add last updated timestamp and snapshotId for partition table

Posted by "dramaticlly (via GitHub)" <gi...@apache.org>.

dramaticlly commented on code in PR #7581:
URL: https://github.com/apache/iceberg/pull/7581#discussion_r1202686519


##########
core/src/main/java/org/apache/iceberg/PartitionsTable.java:
##########
@@ -47,6 +47,16 @@ public class PartitionsTable extends BaseMetadataTable {
         new Schema(
             Types.NestedField.required(1, "partition", Partitioning.partitionType(table)),
             Types.NestedField.required(4, "spec_id", Types.IntegerType.get()),
+            Types.NestedField.required(
+                9,
+                "last_updated_at",
+                Types.TimestampType.withZone(),
+                "Partition last updated timestamp"),
+            Types.NestedField.required(
+                10,
+                "last_updated_snapshot_id",

Review Comment:
   > a) Is it a good idea to keep the snapshot id? Because regularly running expire_snapshots can clean up the snapshots and we may not be able to map what operation these files were created from, even with the snapshot id.
   > 
   > b) There was also an ask for "latest sequence number" associated with that partition from the community users during partition stats discussion.
   > 
   > Do you think modified time is enough and no need for the sequence number?
   
   My initial thought process is like the last updated timestamp is helpful by itself but if there's doubt around the timestamp, it's better to provide a reference to allow for further investigation. Here we derived last updated timestamp from snapshot, so providing snapshotId enable a way to look up further information about snapshot (if it's a rewrite data operation or is it an append from late arrival data).
   
   With respect to the periodic snapshot expiration, I think partition can have null snapshot based on referenced snapshotId if it was already expired, but it seems only applicable to your data outlive your snapshot. i.e if you run data compaction along side your snapshot expiration, or if you also periodically delete your partition (like if it's daily partitioned and your dataset have a retention period) together with your snapshot expiration, it seem to be fine. 
   
   As for file sequence number, I think it might be helpful but by itself it seem to be hard to use compare to timestamp and snapshotId.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] dramaticlly commented on pull request #7581: Add last updated timestamp and snapshotId for partition table

Posted by "dramaticlly (via GitHub)" <gi...@apache.org>.

dramaticlly commented on PR #7581:
URL: https://github.com/apache/iceberg/pull/7581#issuecomment-1542925181

   CC @puchengy @ajantha-bhat if you guys are interested


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] ajantha-bhat commented on a diff in pull request #7581: Add last updated timestamp and snapshotId for partition table

Posted by "ajantha-bhat (via GitHub)" <gi...@apache.org>.

ajantha-bhat commented on code in PR #7581:
URL: https://github.com/apache/iceberg/pull/7581#discussion_r1201668626


##########
core/src/main/java/org/apache/iceberg/PartitionsTable.java:
##########
@@ -47,6 +47,16 @@ public class PartitionsTable extends BaseMetadataTable {
         new Schema(
             Types.NestedField.required(1, "partition", Partitioning.partitionType(table)),
             Types.NestedField.required(4, "spec_id", Types.IntegerType.get()),
+            Types.NestedField.required(
+                9,
+                "last_updated_at",
+                Types.TimestampType.withZone(),
+                "Partition last updated timestamp"),
+            Types.NestedField.required(
+                10,
+                "last_updated_snapshot_id",

Review Comment:
   a) Is it a good idea to keep the snapshot id? Because regularly running expire_snapshots can clean up the snapshots and we may not be able to map what operation these files were created from, even with the snapshot id. 
   
   
   b) There was also an ask for "latest sequence number" associated with that partition from the community users during partition stats discussion. 
   
   Do you think modified time is enough and no need for the sequence number? 



##########
core/src/main/java/org/apache/iceberg/PartitionsTable.java:
##########
@@ -47,6 +47,16 @@ public class PartitionsTable extends BaseMetadataTable {
         new Schema(
             Types.NestedField.required(1, "partition", Partitioning.partitionType(table)),
             Types.NestedField.required(4, "spec_id", Types.IntegerType.get()),
+            Types.NestedField.required(
+                9,
+                "last_updated_at",
+                Types.TimestampType.withZone(),
+                "Partition last updated timestamp"),
+            Types.NestedField.required(
+                10,
+                "last_updated_snapshot_id",

Review Comment:
   also cc: @szehon-ho 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] ajantha-bhat commented on a diff in pull request #7581: Add last updated timestamp and snapshotId for partition table

Posted by "ajantha-bhat (via GitHub)" <gi...@apache.org>.

ajantha-bhat commented on code in PR #7581:
URL: https://github.com/apache/iceberg/pull/7581#discussion_r1201742560


##########
core/src/main/java/org/apache/iceberg/PartitionsTable.java:
##########
@@ -47,6 +47,16 @@ public class PartitionsTable extends BaseMetadataTable {
         new Schema(
             Types.NestedField.required(1, "partition", Partitioning.partitionType(table)),
             Types.NestedField.required(4, "spec_id", Types.IntegerType.get()),
+            Types.NestedField.required(
+                9,
+                "last_updated_at",
+                Types.TimestampType.withZone(),
+                "Partition last updated timestamp"),
+            Types.NestedField.required(
+                10,
+                "last_updated_snapshot_id",

Review Comment:
   I am just worried that most of the snapshots will be expired and we end up not using that field much. 
   The main purpose of storing the snapshot id is for finding what operation has last updated this partition id? In that case, we can store the operation type itself directly maybe. 
   
   > I think sequence number will be good too, but do you mean fileSequenceNumber or dataSequenceNumber? Maybe worth another pr if there's more discussion there?
   
   I guess it is fileSequenceNumber.
   
   Yeah, we can have a separate discussion. I think `data_file_size_in_bytes` per partition can also be one more good candidate for storing here. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] szehon-ho commented on a diff in pull request #7581: Add last updated timestamp and snapshotId for partition table

Posted by "szehon-ho (via GitHub)" <gi...@apache.org>.

szehon-ho commented on code in PR #7581:
URL: https://github.com/apache/iceberg/pull/7581#discussion_r1201717572


##########
core/src/main/java/org/apache/iceberg/PartitionsTable.java:
##########
@@ -47,6 +47,16 @@ public class PartitionsTable extends BaseMetadataTable {
         new Schema(
             Types.NestedField.required(1, "partition", Partitioning.partitionType(table)),
             Types.NestedField.required(4, "spec_id", Types.IntegerType.get()),
+            Types.NestedField.required(
+                9,
+                "last_updated_at",
+                Types.TimestampType.withZone(),
+                "Partition last updated timestamp"),
+            Types.NestedField.required(
+                10,
+                "last_updated_snapshot_id",

Review Comment:
   Hm, what do we think to have both?  (and null if its expired, as per discussion here https://github.com/apache/iceberg/pull/7581#discussion_r1201257740).  If we have snapshot_id, I feel its more useful than the last_update_time, but agree we don't always have it.
   
   I think sequence number will be good too, but do you mean fileSequenceNumber or dataSequenceNumber?  Maybe worth another pr if there's more discussion there?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] szehon-ho commented on a diff in pull request #7581: Add last updated timestamp and snapshotId for partition table

Posted by "szehon-ho (via GitHub)" <gi...@apache.org>.

szehon-ho commented on code in PR #7581:
URL: https://github.com/apache/iceberg/pull/7581#discussion_r1238967269


##########
core/src/main/java/org/apache/iceberg/PartitionsTable.java:
##########
@@ -49,6 +49,16 @@ public class PartitionsTable extends BaseMetadataTable {
         new Schema(
             Types.NestedField.required(1, "partition", Partitioning.partitionType(table)),
             Types.NestedField.required(4, "spec_id", Types.IntegerType.get()),
+            Types.NestedField.optional(

Review Comment:
   Sorry we may need to change some test, but do you think we can add it at the end?  I know we sometimes add in the middle if it makes sense, but there's always the small risk some user is selecting some columns by position, so I would say we would put it in the middle only if there's a compelling reason.  Or did you think it makes more sense here?
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] dramaticlly commented on a diff in pull request #7581: Add last updated timestamp and snapshotId for partition table

Posted by "dramaticlly (via GitHub)" <gi...@apache.org>.

dramaticlly commented on code in PR #7581:
URL: https://github.com/apache/iceberg/pull/7581#discussion_r1202676998


##########
core/src/main/java/org/apache/iceberg/PartitionsTable.java:
##########
@@ -139,13 +155,14 @@ private static StaticDataTask.Row convertPartition(Partition partition) {
   private static Iterable<Partition> partitions(Table table, StaticTableScan scan) {
     Types.StructType partitionType = Partitioning.partitionType(table);
     PartitionMap partitions = new PartitionMap(partitionType);
-
-    try (CloseableIterable<ContentFile<?>> files = planFiles(scan)) {
-      for (ContentFile<?> file : files) {
+    try (CloseableIterable<ManifestEntry<? extends ContentFile<?>>> entries = planEntries(scan)) {
+      for (ManifestEntry<? extends ContentFile<?>> entry : entries) {
+        Snapshot snapshot = table.snapshot(entry.snapshotId());

Review Comment:
   I did some testing, looks like @szehon-ho is correct for the case above. The snapshot which written the given dataFile1 can be expired while snapshot loaded from table would be null and its last updated timestamp would be useless (but it only apply to the case where snapshot is expired but data is never rewritten or retained forever) I added a null check in update logic below.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] szehon-ho commented on a diff in pull request #7581: Add last updated timestamp and snapshotId for partition table

Posted by "szehon-ho (via GitHub)" <gi...@apache.org>.

szehon-ho commented on code in PR #7581:
URL: https://github.com/apache/iceberg/pull/7581#discussion_r1201717572


##########
core/src/main/java/org/apache/iceberg/PartitionsTable.java:
##########
@@ -47,6 +47,16 @@ public class PartitionsTable extends BaseMetadataTable {
         new Schema(
             Types.NestedField.required(1, "partition", Partitioning.partitionType(table)),
             Types.NestedField.required(4, "spec_id", Types.IntegerType.get()),
+            Types.NestedField.required(
+                9,
+                "last_updated_at",
+                Types.TimestampType.withZone(),
+                "Partition last updated timestamp"),
+            Types.NestedField.required(
+                10,
+                "last_updated_snapshot_id",

Review Comment:
   Hm, what do we think to have both?  (and null if its expired, as per discussion here https://github.com/apache/iceberg/pull/7581#discussion_r1201257740).  If we have snapshot_id, I feel its more useful than the last_update_time, but agree we don't always have it.
   
   I think sequence number will be good too, but do we mean fileSequenceNumber or dataSequenceNumber.  Maybe worth another pr if there's more discussion there?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] szehon-ho commented on a diff in pull request #7581: Add last updated timestamp and snapshotId for partition table

Posted by "szehon-ho (via GitHub)" <gi...@apache.org>.

szehon-ho commented on code in PR #7581:
URL: https://github.com/apache/iceberg/pull/7581#discussion_r1207628850


##########
core/src/main/java/org/apache/iceberg/PartitionsTable.java:
##########
@@ -47,6 +47,16 @@ public class PartitionsTable extends BaseMetadataTable {
         new Schema(
             Types.NestedField.required(1, "partition", Partitioning.partitionType(table)),
             Types.NestedField.required(4, "spec_id", Types.IntegerType.get()),
+            Types.NestedField.required(
+                9,
+                "last_updated_at",
+                Types.TimestampType.withZone(),
+                "Partition last updated timestamp"),
+            Types.NestedField.required(
+                10,
+                "last_updated_snapshot_id",

Review Comment:
   After some thoughts, I think both will be null if the snapshot is expired.  But I think we are no worse off than the alternative, which is to join the entries + snapshots table for the user to find the last update time.  This also will be null if snapshot is expired.
   
   Perhaps we can have some more persistent storage of snapshot metadata , even after expire?
   
   I think for this case (lastUpdateTime, lastSnapshotId), it is ok to proceed, typically snapshots live several days at least, and I would imagine the user of the tool is interested in the last updated partition.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] szehon-ho commented on a diff in pull request #7581: Add last updated timestamp and snapshotId for partition table

Posted by "szehon-ho (via GitHub)" <gi...@apache.org>.

szehon-ho commented on code in PR #7581:
URL: https://github.com/apache/iceberg/pull/7581#discussion_r1238967269


##########
core/src/main/java/org/apache/iceberg/PartitionsTable.java:
##########
@@ -49,6 +49,16 @@ public class PartitionsTable extends BaseMetadataTable {
         new Schema(
             Types.NestedField.required(1, "partition", Partitioning.partitionType(table)),
             Types.NestedField.required(4, "spec_id", Types.IntegerType.get()),
+            Types.NestedField.optional(

Review Comment:
   Sorry we may need to change some test, but do you think we can add it at the end?  I know we sometimes add in the middle if it makes sense, but there's always the small risk some user is selecting some columns by position, so I would say we would put it in the middle only if there's a logical reason.  Or did you put it here for some logical reason?  
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] dramaticlly commented on a diff in pull request #7581: Add last updated timestamp and snapshotId for partition table

Posted by "dramaticlly (via GitHub)" <gi...@apache.org>.

dramaticlly commented on code in PR #7581:
URL: https://github.com/apache/iceberg/pull/7581#discussion_r1240132918


##########
spark/v3.1/spark/src/test/java/org/apache/iceberg/spark/source/TestIcebergSourceTablesBase.java:
##########
@@ -1283,6 +1296,158 @@ public void testUnpartitionedPartitionsTable() {
     TestHelpers.assertEqualsSafe(expectedSchema, expectedRow, actual.get(0));
   }
 
+  @Test
+  public void testPartitionsTableLastUpdatedSnapshot() {
+    TableIdentifier tableIdentifier = TableIdentifier.of("db", "partitions_test");
+    Table table = createTable(tableIdentifier, SCHEMA, SPEC);
+    Table partitionsTable = loadTable(tableIdentifier, "partitions");
+    Dataset<Row> df1 =
+        spark.createDataFrame(
+            Lists.newArrayList(new SimpleRecord(1, "1"), new SimpleRecord(2, "2")),
+            SimpleRecord.class);
+    Dataset<Row> df2 =
+        spark.createDataFrame(Lists.newArrayList(new SimpleRecord(2, "20")), SimpleRecord.class);
+
+    df1.select("id", "data")
+        .write()
+        .format("iceberg")
+        .mode("append")
+        .save(loadLocation(tableIdentifier));
+
+    table.refresh();
+    long firstCommitId = table.currentSnapshot().snapshotId();
+
+    // add a second file
+    df2.select("id", "data")
+        .write()
+        .format("iceberg")
+        .mode("append")
+        .save(loadLocation(tableIdentifier));
+
+    table.refresh();
+    long secondCommitId = table.currentSnapshot().snapshotId();
+
+    // check if rewrite manifest does not override metadata about data file's creating snapshot
+    RewriteManifests.Result rewriteManifestResult =
+        SparkActions.get().rewriteManifests(table).execute();
+    Assert.assertEquals(
+        "rewrite replaced 2 manifests",
+        2,
+        Iterables.size(rewriteManifestResult.rewrittenManifests()));
+    Assert.assertEquals(
+        "rewrite added 1 manifests", 1, Iterables.size(rewriteManifestResult.addedManifests()));
+
+    List<Row> actual =
+        spark
+            .read()
+            .format("iceberg")
+            .load(loadLocation(tableIdentifier, "partitions"))
+            .orderBy("partition.id")
+            .collectAsList();
+
+    GenericRecordBuilder builder =
+        new GenericRecordBuilder(AvroSchemaUtil.convert(partitionsTable.schema(), "partitions"));
+    GenericRecordBuilder partitionBuilder =
+        new GenericRecordBuilder(
+            AvroSchemaUtil.convert(
+                partitionsTable.schema().findType("partition").asStructType(), "partition"));
+    List<GenericData.Record> expected = Lists.newArrayList();
+    expected.add(
+        builder
+            .set("partition", partitionBuilder.set("id", 1).build())
+            .set("record_count", 1L)
+            .set("file_count", 1)
+            .set("position_delete_record_count", 0L)
+            .set("position_delete_file_count", 0)
+            .set("equality_delete_record_count", 0L)
+            .set("equality_delete_file_count", 0)
+            .set("spec_id", 0)
+            .set("last_updated_ms", table.snapshot(firstCommitId).timestampMillis() * 1000)
+            .set("last_updated_snapshot_id", firstCommitId)
+            .build());
+    expected.add(
+        builder
+            .set("partition", partitionBuilder.set("id", 2).build())
+            .set("record_count", 2L)
+            .set("file_count", 2)
+            .set("position_delete_record_count", 0L)
+            .set("position_delete_file_count", 0)
+            .set("equality_delete_record_count", 0L)
+            .set("equality_delete_file_count", 0)
+            .set("spec_id", 0)
+            .set("last_updated_ms", table.snapshot(secondCommitId).timestampMillis() * 1000)
+            .set("last_updated_snapshot_id", secondCommitId)
+            .build());
+
+    Assert.assertEquals("Partitions table should have two rows", 2, expected.size());
+    Assert.assertEquals("Actual results should have two rows", 2, actual.size());
+    for (int i = 0; i < 2; i += 1) {
+      TestHelpers.assertEqualsSafe(
+          partitionsTable.schema().asStruct(), expected.get(i), actual.get(i));
+    }
+
+    // check time travel
+    List<Row> actualAfterFirstCommit =
+        spark
+            .read()
+            .format("iceberg")
+            .option(SparkReadOptions.SNAPSHOT_ID, String.valueOf(secondCommitId))
+            .load(loadLocation(tableIdentifier, "partitions"))
+            .orderBy("partition.id")
+            .collectAsList();
+
+    Assert.assertEquals("Actual results should have one row", 2, actualAfterFirstCommit.size());

Review Comment:
   yeah I guess assertion logic is correct but forgot to update the comment, we shall expect 2 rows since snapshotAfterFirstCommit have 2 records sitting in 2 different partitions id=1 and id=2.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] miR172 commented on pull request #7581: Add last updated timestamp and snapshotId for partition table

Posted by "miR172 (via GitHub)" <gi...@apache.org>.

miR172 commented on PR #7581:
URL: https://github.com/apache/iceberg/pull/7581#issuecomment-1721226239

   This is great! Thank you so much for adding this stat -- it is making our life so much easier. 
   
   When can we expect it to be included in a release? I noticed this wasn't in 1.3.1


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org