You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@iceberg.apache.org by "ajantha-bhat (via GitHub)" <gi...@apache.org> on 2023/03/06 04:41:41 UTC

[GitHub] [iceberg] ajantha-bhat opened a new pull request, #6661: Core: Support delete file stats in partitions metadata table

ajantha-bhat opened a new pull request, #6661:
URL: https://github.com/apache/iceberg/pull/6661

   Currently [partitions metadata table](https://iceberg.apache.org/docs/latest/spark-queries/#partitions) only has the data file stats 
   ```
   file_count
   record_count
   ```
   
   When the delete files are present, these stats are inaccurate (as we don't decrement these values). 
   So, capture the delete file stats to give a rough idea about why these stats are inaccurate.
   **Note that we are not applying the deletes to the data file and computing the effective result as it will be a very expensive operation. Users are suggested to execute rewrite_data_files periodically to apply the delete files to the data files.**
   
   Delete file stats to be added:
   
   ```
   pos_delete_file_count
   pos_delete_record_count
   eq_delete_file_count
   eq_delete_record_count
   ```
   
   Note:
   - Docs will be updated in a follow-up PR, probably after renaming file_count, and record_count.
   - The same schema will also be used for the partition stats feature during implementation.
   
   Fixes #6042 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] ajantha-bhat commented on a diff in pull request #6661: Core: Support delete file stats in partitions metadata table

Posted by "ajantha-bhat (via GitHub)" <gi...@apache.org>.
ajantha-bhat commented on code in PR #6661:
URL: https://github.com/apache/iceberg/pull/6661#discussion_r1161811991


##########
core/src/main/java/org/apache/iceberg/PartitionsTable.java:
##########
@@ -90,54 +118,60 @@ private DataTask task(StaticTableScan scan) {
 
   private static StaticDataTask.Row convertPartition(Partition partition) {
     return StaticDataTask.Row.of(
-        partition.key, partition.specId, partition.dataRecordCount, partition.dataFileCount);
+        partition.key,
+        partition.specId,
+        partition.dataRecordCount,
+        partition.dataFileCount,
+        partition.posDeleteRecordCount,
+        partition.posDeleteFileCount,
+        partition.eqDeleteRecordCount,
+        partition.eqDeleteFileCount);
   }
 
   private static Iterable<Partition> partitions(Table table, StaticTableScan scan) {
     Types.StructType normalizedPartitionType = Partitioning.partitionType(table);
     PartitionMap partitions = new PartitionMap();
 
-    // cache a position map needed by each partition spec to normalize partitions to final schema
-    Map<Integer, int[]> normalizedPositionsBySpec =
-        Maps.newHashMapWithExpectedSize(table.specs().size());
+    try (CloseableIterable<DataFile> datafiles = planDataFiles(scan);

Review Comment:
   I wanted to just keep `planFiles` which returns `CloseableIterable<ContentFile<?>>` but I couldn't' succeed at it during transforming it into `ParallelIterable` 



##########
core/src/main/java/org/apache/iceberg/PartitionsTable.java:
##########
@@ -90,54 +118,60 @@ private DataTask task(StaticTableScan scan) {
 
   private static StaticDataTask.Row convertPartition(Partition partition) {
     return StaticDataTask.Row.of(
-        partition.key, partition.specId, partition.dataRecordCount, partition.dataFileCount);
+        partition.key,
+        partition.specId,
+        partition.dataRecordCount,
+        partition.dataFileCount,
+        partition.posDeleteRecordCount,
+        partition.posDeleteFileCount,
+        partition.eqDeleteRecordCount,
+        partition.eqDeleteFileCount);
   }
 
   private static Iterable<Partition> partitions(Table table, StaticTableScan scan) {
     Types.StructType normalizedPartitionType = Partitioning.partitionType(table);
     PartitionMap partitions = new PartitionMap();
 
-    // cache a position map needed by each partition spec to normalize partitions to final schema
-    Map<Integer, int[]> normalizedPositionsBySpec =
-        Maps.newHashMapWithExpectedSize(table.specs().size());
+    try (CloseableIterable<DataFile> datafiles = planDataFiles(scan);

Review Comment:
   I wanted to just keep `planFiles` which returns `CloseableIterable<ContentFile<?>>` but I couldn't succeed at it during transforming it into `ParallelIterable` 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] szehon-ho commented on a diff in pull request #6661: Core: Support delete file stats in partitions metadata table

Posted by "szehon-ho (via GitHub)" <gi...@apache.org>.
szehon-ho commented on code in PR #6661:
URL: https://github.com/apache/iceberg/pull/6661#discussion_r1185693123


##########
core/src/main/java/org/apache/iceberg/PartitionsTable.java:
##########
@@ -93,31 +129,66 @@ private static StaticDataTask.Row convertPartition(Partition partition) {
         partition.partitionData,
         partition.specId,
         partition.dataRecordCount,
-        partition.dataFileCount);
+        partition.dataFileCount,
+        partition.posDeleteRecordCount,
+        partition.posDeleteFileCount,
+        partition.eqDeleteRecordCount,
+        partition.eqDeleteFileCount);
   }
 
   private static Iterable<Partition> partitions(Table table, StaticTableScan scan) {
     Types.StructType partitionType = Partitioning.partitionType(table);
     PartitionMap partitions = new PartitionMap(partitionType);
 
-    CloseableIterable<DataFile> datafiles = planDataFiles(scan);
-    for (DataFile dataFile : datafiles) {
-      StructLike partition =
-          PartitionUtil.coercePartition(
-              partitionType, table.specs().get(dataFile.specId()), dataFile.partition());
-      partitions.get(partition).update(dataFile);
+    try (CloseableIterable<ContentFile<?>> files = planFiles(scan)) {

Review Comment:
   Thanks for closing this



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] ajantha-bhat closed pull request #6661: Core: Support delete file stats in partitions metadata table

Posted by "ajantha-bhat (via GitHub)" <gi...@apache.org>.
ajantha-bhat closed pull request #6661: Core: Support delete file stats in partitions metadata table
URL: https://github.com/apache/iceberg/pull/6661


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] ajantha-bhat commented on a diff in pull request #6661: Core: Support delete file stats in partitions metadata table

Posted by "ajantha-bhat (via GitHub)" <gi...@apache.org>.
ajantha-bhat commented on code in PR #6661:
URL: https://github.com/apache/iceberg/pull/6661#discussion_r1183535207


##########
core/src/main/java/org/apache/iceberg/PartitionsTable.java:
##########
@@ -47,7 +50,27 @@ public class PartitionsTable extends BaseMetadataTable {
             Types.NestedField.required(
                 2, "record_count", Types.LongType.get(), "Count of records in data files"),
             Types.NestedField.required(
-                3, "file_count", Types.IntegerType.get(), "Count of data files"));
+                3, "file_count", Types.IntegerType.get(), "Count of data files"),
+            Types.NestedField.required(
+                5,
+                "pos_delete_record_count",

Review Comment:
   fixed.



##########
core/src/main/java/org/apache/iceberg/PartitionsTable.java:
##########
@@ -90,54 +118,60 @@ private DataTask task(StaticTableScan scan) {
 
   private static StaticDataTask.Row convertPartition(Partition partition) {
     return StaticDataTask.Row.of(
-        partition.key, partition.specId, partition.dataRecordCount, partition.dataFileCount);
+        partition.key,
+        partition.specId,
+        partition.dataRecordCount,
+        partition.dataFileCount,
+        partition.posDeleteRecordCount,
+        partition.posDeleteFileCount,
+        partition.eqDeleteRecordCount,
+        partition.eqDeleteFileCount);
   }
 
   private static Iterable<Partition> partitions(Table table, StaticTableScan scan) {
     Types.StructType normalizedPartitionType = Partitioning.partitionType(table);
     PartitionMap partitions = new PartitionMap();
 
-    // cache a position map needed by each partition spec to normalize partitions to final schema
-    Map<Integer, int[]> normalizedPositionsBySpec =
-        Maps.newHashMapWithExpectedSize(table.specs().size());
+    try (CloseableIterable<DataFile> datafiles = planDataFiles(scan);

Review Comment:
   updated



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] ajantha-bhat commented on pull request #6661: Core: Support delete file stats in partitions metadata table

Posted by "ajantha-bhat (via GitHub)" <gi...@apache.org>.
ajantha-bhat commented on PR #6661:
URL: https://github.com/apache/iceberg/pull/6661#issuecomment-1535063079

   I had to squash the commits for easy rebase and conflict resolution. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] szehon-ho commented on a diff in pull request #6661: Core: Support delete file stats in partitions metadata table

Posted by "szehon-ho (via GitHub)" <gi...@apache.org>.
szehon-ho commented on code in PR #6661:
URL: https://github.com/apache/iceberg/pull/6661#discussion_r1185548566


##########
core/src/main/java/org/apache/iceberg/PartitionsTable.java:
##########
@@ -93,31 +129,55 @@ private static StaticDataTask.Row convertPartition(Partition partition) {
         partition.partitionData,
         partition.specId,
         partition.dataRecordCount,
-        partition.dataFileCount);
+        partition.dataFileCount,
+        partition.posDeleteRecordCount,
+        partition.posDeleteFileCount,
+        partition.eqDeleteRecordCount,
+        partition.eqDeleteFileCount);
   }
 
   private static Iterable<Partition> partitions(Table table, StaticTableScan scan) {
     Types.StructType partitionType = Partitioning.partitionType(table);
     PartitionMap partitions = new PartitionMap(partitionType);
 
-    CloseableIterable<DataFile> datafiles = planDataFiles(scan);
-    for (DataFile dataFile : datafiles) {
-      StructLike partition =
-          PartitionUtil.coercePartition(
-              partitionType, table.specs().get(dataFile.specId()), dataFile.partition());
-      partitions.get(partition).update(dataFile);
+    try (CloseableIterable<ContentFile<?>> files = planFiles(scan)) {
+      for (ContentFile<?> file : files) {
+        StructLike partition =
+            PartitionUtil.coercePartition(
+                partitionType, table.specs().get(file.specId()), file.partition());
+        partitions.get(partition).update(file);
+      }
+    } catch (IOException e) {
+      throw new UncheckedIOException(e);
     }
 
     return partitions.all();
   }
 
   @VisibleForTesting
-  static CloseableIterable<DataFile> planDataFiles(StaticTableScan scan) {
+  static CloseableIterable<ContentFile<?>> planFiles(StaticTableScan scan) {
     Table table = scan.table();
-    Snapshot snapshot = scan.snapshot();
 
-    CloseableIterable<ManifestFile> dataManifests =
-        CloseableIterable.withNoopClose(snapshot.dataManifests(table.io()));
+    CloseableIterable<ManifestFile> filteredManifests =
+        filteredManifests(scan, table, scan.snapshot().allManifests(table.io()));
+
+    Iterable<CloseableIterable<ContentFile<?>>> tasks =
+        CloseableIterable.transform(
+            filteredManifests,
+            manifest ->
+                CloseableIterable.transform(
+                    ManifestFiles.open(manifest, table.io(), table.specs())
+                        .caseSensitive(scan.isCaseSensitive())
+                        .select(BaseScan.DELETE_SCAN_COLUMNS), // don't select stats columns

Review Comment:
   Nit: can do a switch on content to get either DELETE_SCAN_COLUMNS or SCAN_COLUMNS to make it clearer.



##########
core/src/main/java/org/apache/iceberg/PartitionsTable.java:
##########
@@ -58,7 +81,13 @@ public TableScan newScan() {
   @Override
   public Schema schema() {
     if (table().spec().fields().size() < 1) {
-      return schema.select("record_count", "file_count");

Review Comment:
   I think there's actually a bug here, and it only checks latest spec for Unpartitioned.  if we have partition fields before but removed them, we will not show them.  See other metadata tables like BaseFilesTable.schema.  
   
   Anyway its unrelated, but made https://github.com/apache/iceberg/issues/7533 to track it.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] szehon-ho commented on a diff in pull request #6661: Core: Support delete file stats in partitions metadata table

Posted by "szehon-ho (via GitHub)" <gi...@apache.org>.
szehon-ho commented on code in PR #6661:
URL: https://github.com/apache/iceberg/pull/6661#discussion_r1185548566


##########
core/src/main/java/org/apache/iceberg/PartitionsTable.java:
##########
@@ -93,31 +129,55 @@ private static StaticDataTask.Row convertPartition(Partition partition) {
         partition.partitionData,
         partition.specId,
         partition.dataRecordCount,
-        partition.dataFileCount);
+        partition.dataFileCount,
+        partition.posDeleteRecordCount,
+        partition.posDeleteFileCount,
+        partition.eqDeleteRecordCount,
+        partition.eqDeleteFileCount);
   }
 
   private static Iterable<Partition> partitions(Table table, StaticTableScan scan) {
     Types.StructType partitionType = Partitioning.partitionType(table);
     PartitionMap partitions = new PartitionMap(partitionType);
 
-    CloseableIterable<DataFile> datafiles = planDataFiles(scan);
-    for (DataFile dataFile : datafiles) {
-      StructLike partition =
-          PartitionUtil.coercePartition(
-              partitionType, table.specs().get(dataFile.specId()), dataFile.partition());
-      partitions.get(partition).update(dataFile);
+    try (CloseableIterable<ContentFile<?>> files = planFiles(scan)) {
+      for (ContentFile<?> file : files) {
+        StructLike partition =
+            PartitionUtil.coercePartition(
+                partitionType, table.specs().get(file.specId()), file.partition());
+        partitions.get(partition).update(file);
+      }
+    } catch (IOException e) {
+      throw new UncheckedIOException(e);
     }
 
     return partitions.all();
   }
 
   @VisibleForTesting
-  static CloseableIterable<DataFile> planDataFiles(StaticTableScan scan) {
+  static CloseableIterable<ContentFile<?>> planFiles(StaticTableScan scan) {
     Table table = scan.table();
-    Snapshot snapshot = scan.snapshot();
 
-    CloseableIterable<ManifestFile> dataManifests =
-        CloseableIterable.withNoopClose(snapshot.dataManifests(table.io()));
+    CloseableIterable<ManifestFile> filteredManifests =
+        filteredManifests(scan, table, scan.snapshot().allManifests(table.io()));
+
+    Iterable<CloseableIterable<ContentFile<?>>> tasks =
+        CloseableIterable.transform(
+            filteredManifests,
+            manifest ->
+                CloseableIterable.transform(
+                    ManifestFiles.open(manifest, table.io(), table.specs())
+                        .caseSensitive(scan.isCaseSensitive())
+                        .select(BaseScan.DELETE_SCAN_COLUMNS), // don't select stats columns

Review Comment:
   Nit: can do a switch on content to get either DELETE_SCAN_COLUMNS or SCAN_COLUMNS to make it clearer.  Looks a bit weird, but I guess it works as DELETE_SCAN_COLUMNS is superset of SCAN_COLUMNS .



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] szehon-ho commented on a diff in pull request #6661: Core: Support delete file stats in partitions metadata table

Posted by "szehon-ho (via GitHub)" <gi...@apache.org>.
szehon-ho commented on code in PR #6661:
URL: https://github.com/apache/iceberg/pull/6661#discussion_r1132981630


##########
core/src/main/java/org/apache/iceberg/PartitionsTable.java:
##########
@@ -220,21 +257,53 @@ Iterable<Partition> all() {
 
   static class Partition {
     private final StructLike key;
-    private long recordCount;
-    private int fileCount;
     private int specId;
+    private long dataRecordCount;
+    private int dataFileCount;
+
+    private final Set<DeleteFile> equalityDeleteFiles;
+    private final Set<DeleteFile> positionDeleteFiles;
 
     Partition(StructLike key) {
       this.key = key;
-      this.recordCount = 0;
-      this.fileCount = 0;
       this.specId = 0;
+      this.dataRecordCount = 0;
+      this.dataFileCount = 0;
+      this.positionDeleteFiles = Sets.newHashSet();
+      this.equalityDeleteFiles = Sets.newHashSet();
+    }
+
+    private void update(FileScanTask task) {

Review Comment:
   Actually that way I thought will be quite expensive (two pass).  
   
   Probably the only way to effectively do it , until this whole table is migrated over to some kind of view of 'files' table, is to rewrite the PartitionsTableScan to directly use the underlying code:  ManifestReader.readDeleteManifest() / ManifestReader.read(), and then go through those iterators, instead of using the ManifestGroup.planFiles() / FileScanTask way.
   
   That way, we can iterate through the DataFile/ DeleteFile, and collect delete files/data files in one pass.  Any thoughts?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] szehon-ho commented on a diff in pull request #6661: Core: Support delete file stats in partitions metadata table

Posted by "szehon-ho (via GitHub)" <gi...@apache.org>.
szehon-ho commented on code in PR #6661:
URL: https://github.com/apache/iceberg/pull/6661#discussion_r1183129681


##########
core/src/main/java/org/apache/iceberg/PartitionsTable.java:
##########
@@ -90,54 +118,60 @@ private DataTask task(StaticTableScan scan) {
 
   private static StaticDataTask.Row convertPartition(Partition partition) {
     return StaticDataTask.Row.of(
-        partition.key, partition.specId, partition.dataRecordCount, partition.dataFileCount);
+        partition.key,
+        partition.specId,
+        partition.dataRecordCount,
+        partition.dataFileCount,
+        partition.posDeleteRecordCount,
+        partition.posDeleteFileCount,
+        partition.eqDeleteRecordCount,
+        partition.eqDeleteFileCount);
   }
 
   private static Iterable<Partition> partitions(Table table, StaticTableScan scan) {
     Types.StructType normalizedPartitionType = Partitioning.partitionType(table);
     PartitionMap partitions = new PartitionMap();
 
-    // cache a position map needed by each partition spec to normalize partitions to final schema
-    Map<Integer, int[]> normalizedPositionsBySpec =
-        Maps.newHashMapWithExpectedSize(table.specs().size());
+    try (CloseableIterable<DataFile> datafiles = planDataFiles(scan);

Review Comment:
   @ajantha-bhat can we see about this?  Will it work to wrap the manifest iterable to get the right type in planDataFiles()?
   
   ```
   CloseableIterable.transform(ManifestFiles.read(manifest, table.io(), table.specs())
           .caseSensitive(scan.isCaseSensitive())
           .select(BaseScan.SCAN_COLUMNS).entries(), t -> (ContentFile<?>) t);
   ```
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on a diff in pull request #6661: Core: Support delete file stats in partitions metadata table

Posted by "RussellSpitzer (via GitHub)" <gi...@apache.org>.
RussellSpitzer commented on code in PR #6661:
URL: https://github.com/apache/iceberg/pull/6661#discussion_r1088136888


##########
core/src/main/java/org/apache/iceberg/PartitionsTable.java:
##########
@@ -47,7 +48,11 @@ public class PartitionsTable extends BaseMetadataTable {
             Types.NestedField.required(1, "partition", Partitioning.partitionType(table)),
             Types.NestedField.required(2, "record_count", Types.LongType.get()),
             Types.NestedField.required(3, "file_count", Types.IntegerType.get()),
-            Types.NestedField.required(4, "spec_id", Types.IntegerType.get()));

Review Comment:
   In the past we have general said that readers should not rely on position of fields in tables, only on names. That said every time we do something like this I think we end up breaking somebody. 
   
   If we are gonna move things it probably should go before file and record count, though rather than after all the stats. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] ajantha-bhat commented on a diff in pull request #6661: Core: Support delete file stats in partitions metadata table

Posted by "ajantha-bhat (via GitHub)" <gi...@apache.org>.
ajantha-bhat commented on code in PR #6661:
URL: https://github.com/apache/iceberg/pull/6661#discussion_r1087862183


##########
core/src/main/java/org/apache/iceberg/PartitionsTable.java:
##########
@@ -47,7 +48,11 @@ public class PartitionsTable extends BaseMetadataTable {
             Types.NestedField.required(1, "partition", Partitioning.partitionType(table)),
             Types.NestedField.required(2, "record_count", Types.LongType.get()),

Review Comment:
   I would like to rename `record_count` to `data_record_count` and `file_count` to `data_file_count` to convey a clear message that these stats are not the effective/overall file and record count but just for the data files. 
   
   Not in this PR but as an immediate follow-up PR as I want to keep the review effort to be minimal. 



##########
core/src/main/java/org/apache/iceberg/PartitionsTable.java:
##########
@@ -47,7 +48,11 @@ public class PartitionsTable extends BaseMetadataTable {
             Types.NestedField.required(1, "partition", Partitioning.partitionType(table)),
             Types.NestedField.required(2, "record_count", Types.LongType.get()),
             Types.NestedField.required(3, "file_count", Types.IntegerType.get()),
-            Types.NestedField.required(4, "spec_id", Types.IntegerType.get()));

Review Comment:
   I thought moving the spec-id to the last looks cleaner as all the stats will be together. 
   
   As these are not stored tables and are computed freshly on a query, I thought no need to worry about compatibility. Let me know If I am wrong. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] ajantha-bhat commented on pull request #6661: Core: Support delete file stats in partitions metadata table

Posted by "ajantha-bhat (via GitHub)" <gi...@apache.org>.
ajantha-bhat commented on PR #6661:
URL: https://github.com/apache/iceberg/pull/6661#issuecomment-1450306142

   @szehon-ho:  
   a) I have addressed the duplicate delete file problem using `HashSet` instead of `PositionDeletesScan` because we don't have a similar logic for equality deletes and code looks odd. Maybe later we can replace the logic as it is internal logic.
   
   b) I have derived the refactoring work of this PR into a separate one #6975 to reduce the review effort on this PR (we need to merge this first)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] szehon-ho commented on a diff in pull request #6661: Core: Support delete file stats in partitions metadata table

Posted by "szehon-ho (via GitHub)" <gi...@apache.org>.
szehon-ho commented on code in PR #6661:
URL: https://github.com/apache/iceberg/pull/6661#discussion_r1135890172


##########
core/src/main/java/org/apache/iceberg/PartitionsTable.java:
##########
@@ -220,21 +257,53 @@ Iterable<Partition> all() {
 
   static class Partition {
     private final StructLike key;
-    private long recordCount;
-    private int fileCount;
     private int specId;
+    private long dataRecordCount;
+    private int dataFileCount;
+
+    private final Set<DeleteFile> equalityDeleteFiles;
+    private final Set<DeleteFile> positionDeleteFiles;
 
     Partition(StructLike key) {
       this.key = key;
-      this.recordCount = 0;
-      this.fileCount = 0;
       this.specId = 0;
+      this.dataRecordCount = 0;
+      this.dataFileCount = 0;
+      this.positionDeleteFiles = Sets.newHashSet();
+      this.equalityDeleteFiles = Sets.newHashSet();
+    }
+
+    private void update(FileScanTask task) {

Review Comment:
   Yea, I am thinking something along the lines of what currently BaseFilesTable/ ManifestsTable does, sure let me take a look , when I get a chance.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] szehon-ho commented on a diff in pull request #6661: Core: Support delete file stats in partitions metadata table

Posted by "szehon-ho (via GitHub)" <gi...@apache.org>.
szehon-ho commented on code in PR #6661:
URL: https://github.com/apache/iceberg/pull/6661#discussion_r1132981630


##########
core/src/main/java/org/apache/iceberg/PartitionsTable.java:
##########
@@ -220,21 +257,53 @@ Iterable<Partition> all() {
 
   static class Partition {
     private final StructLike key;
-    private long recordCount;
-    private int fileCount;
     private int specId;
+    private long dataRecordCount;
+    private int dataFileCount;
+
+    private final Set<DeleteFile> equalityDeleteFiles;
+    private final Set<DeleteFile> positionDeleteFiles;
 
     Partition(StructLike key) {
       this.key = key;
-      this.recordCount = 0;
-      this.fileCount = 0;
       this.specId = 0;
+      this.dataRecordCount = 0;
+      this.dataFileCount = 0;
+      this.positionDeleteFiles = Sets.newHashSet();
+      this.equalityDeleteFiles = Sets.newHashSet();
+    }
+
+    private void update(FileScanTask task) {

Review Comment:
   Actually that way I thought will be quite expensive (two pass).  
   
   Probably the only way to effectively do it , until this whole table is migrated over to some kind of view of 'files' table, is to rewrite the PartitionsTableScan to directly use the underlying code:  ManifestReader.readDeleteManifest() / ManifestReader.read(), and then go through it, instead of using the ManifestGroup.planFiles() way.  
   
   That way, we can iterate through , and collect delete files/data files in one pass.  Any thoughts?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] ajantha-bhat commented on a diff in pull request #6661: Core: Support delete file stats in partitions metadata table

Posted by "ajantha-bhat (via GitHub)" <gi...@apache.org>.
ajantha-bhat commented on code in PR #6661:
URL: https://github.com/apache/iceberg/pull/6661#discussion_r1184589305


##########
core/src/main/java/org/apache/iceberg/PartitionsTable.java:
##########
@@ -229,18 +281,41 @@ static class Partition {
     private int specId;
     private long dataRecordCount;
     private int dataFileCount;
+    private long posDeleteRecordCount;
+    private int posDeleteFileCount;
+    private long eqDeleteRecordCount;
+    private int eqDeleteFileCount;
 
     Partition(StructLike key) {
       this.key = key;
       this.specId = 0;
       this.dataRecordCount = 0;
       this.dataFileCount = 0;
+      this.posDeleteRecordCount = 0;
+      this.posDeleteFileCount = 0;
+      this.eqDeleteRecordCount = 0;
+      this.eqDeleteFileCount = 0;
     }
 
-    void update(DataFile file) {
-      this.dataRecordCount += file.recordCount();
-      this.dataFileCount += 1;
-      this.specId = file.specId();
+    void update(ContentFile<?> file) {
+      switch (file.content()) {
+        case DATA:
+          this.dataRecordCount += file.recordCount();
+          this.dataFileCount += 1;
+          this.specId = file.specId();
+          break;
+        case POSITION_DELETES:
+          this.posDeleteRecordCount = file.recordCount();
+          this.posDeleteFileCount += 1;
+          break;

Review Comment:
   For this partition value, while updating the data file count, the Spec id would have been updated. 
   I don't think there will be delete files without data file entries. So, I assumed that again updating here would be redundant.  WDYT? 
    



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] szehon-ho commented on a diff in pull request #6661: Core: Support delete file stats in partitions metadata table

Posted by "szehon-ho (via GitHub)" <gi...@apache.org>.
szehon-ho commented on code in PR #6661:
URL: https://github.com/apache/iceberg/pull/6661#discussion_r1169501029


##########
core/src/main/java/org/apache/iceberg/PartitionsTable.java:
##########
@@ -90,54 +118,60 @@ private DataTask task(StaticTableScan scan) {
 
   private static StaticDataTask.Row convertPartition(Partition partition) {
     return StaticDataTask.Row.of(
-        partition.key, partition.specId, partition.dataRecordCount, partition.dataFileCount);
+        partition.key,
+        partition.specId,
+        partition.dataRecordCount,
+        partition.dataFileCount,
+        partition.posDeleteRecordCount,
+        partition.posDeleteFileCount,
+        partition.eqDeleteRecordCount,
+        partition.eqDeleteFileCount);
   }
 
   private static Iterable<Partition> partitions(Table table, StaticTableScan scan) {
     Types.StructType normalizedPartitionType = Partitioning.partitionType(table);
     PartitionMap partitions = new PartitionMap();
 
-    // cache a position map needed by each partition spec to normalize partitions to final schema
-    Map<Integer, int[]> normalizedPositionsBySpec =
-        Maps.newHashMapWithExpectedSize(table.specs().size());
+    try (CloseableIterable<DataFile> datafiles = planDataFiles(scan);

Review Comment:
   I would say, to try this if we can, both to clean the code and also as we don't do another scan. 
   
   I think it may work by doing 'CloseableIterables.transform()' in the worst case to cast the type



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] szehon-ho commented on pull request #6661: Core: Support delete file stats in partitions metadata table

Posted by "szehon-ho (via GitHub)" <gi...@apache.org>.
szehon-ho commented on PR #6661:
URL: https://github.com/apache/iceberg/pull/6661#issuecomment-1536617546

   Merged, thanks @ajantha-bhat for the work!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on a diff in pull request #6661: Core: Support delete file stats in partitions metadata table

Posted by "RussellSpitzer (via GitHub)" <gi...@apache.org>.
RussellSpitzer commented on code in PR #6661:
URL: https://github.com/apache/iceberg/pull/6661#discussion_r1132657790


##########
core/src/main/java/org/apache/iceberg/PartitionsTable.java:
##########
@@ -220,21 +257,53 @@ Iterable<Partition> all() {
 
   static class Partition {
     private final StructLike key;
-    private long recordCount;
-    private int fileCount;
     private int specId;
+    private long dataRecordCount;
+    private int dataFileCount;
+
+    private final Set<DeleteFile> equalityDeleteFiles;
+    private final Set<DeleteFile> positionDeleteFiles;
 
     Partition(StructLike key) {
       this.key = key;
-      this.recordCount = 0;
-      this.fileCount = 0;
       this.specId = 0;
+      this.dataRecordCount = 0;
+      this.dataFileCount = 0;
+      this.positionDeleteFiles = Sets.newHashSet();
+      this.equalityDeleteFiles = Sets.newHashSet();
+    }
+
+    private void update(FileScanTask task) {

Review Comment:
   I think I would strongly want to consider an approach which either disposes of the set's when the partition info is done being constructed or doesn't use this set approach. It looks like in the current implementation we end up keeping the entire set of delete file objects in memory indefinitely. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] szehon-ho commented on a diff in pull request #6661: Core: Support delete file stats in partitions metadata table

Posted by "szehon-ho (via GitHub)" <gi...@apache.org>.
szehon-ho commented on code in PR #6661:
URL: https://github.com/apache/iceberg/pull/6661#discussion_r1093546826


##########
core/src/main/java/org/apache/iceberg/PartitionsTable.java:
##########
@@ -236,5 +273,17 @@ void update(DataFile file) {
       this.fileCount += 1;
       this.specId = file.specId();
     }
+
+    void update(DeleteFile file) {
+      if (file.content() == FileContent.POSITION_DELETES) {
+        this.posDeleteRecordCount += file.recordCount();

Review Comment:
   Yea I think this has the  same problem I mentioned in the earlier pr : https://github.com/apache/iceberg/pull/4005  (which was doing the same thing).  Summing will not give a unique delete file count.
   
   I think one option as I mentioned it to use the new PositionDeletes scan : https://github.com/apache/iceberg/pull/4812  , which you can then use to stream and update partitions.
   
   It's probably less memory intensive than trying to cache every existing delete file in a set, which would be the other solution for de-duping.
   
   Would like to see what @RussellSpitzer thinks 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] ajantha-bhat commented on a diff in pull request #6661: Core: Support delete file stats in partitions metadata table

Posted by "ajantha-bhat (via GitHub)" <gi...@apache.org>.
ajantha-bhat commented on code in PR #6661:
URL: https://github.com/apache/iceberg/pull/6661#discussion_r1135925049


##########
core/src/main/java/org/apache/iceberg/PartitionsTable.java:
##########
@@ -220,21 +257,53 @@ Iterable<Partition> all() {
 
   static class Partition {
     private final StructLike key;
-    private long recordCount;
-    private int fileCount;
     private int specId;
+    private long dataRecordCount;
+    private int dataFileCount;
+
+    private final Set<DeleteFile> equalityDeleteFiles;
+    private final Set<DeleteFile> positionDeleteFiles;
 
     Partition(StructLike key) {
       this.key = key;
-      this.recordCount = 0;
-      this.fileCount = 0;
       this.specId = 0;
+      this.dataRecordCount = 0;
+      this.dataFileCount = 0;
+      this.positionDeleteFiles = Sets.newHashSet();
+      this.equalityDeleteFiles = Sets.newHashSet();
+    }
+
+    private void update(FileScanTask task) {

Review Comment:
   Thanks. I will also explore those files.
   
   BTW I have updated to Map<String, Integer> instead of set<DataFile> and cleared it after the usage now. But looking forward to solving it as you suggested by changing the `planFiles`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] ajantha-bhat commented on pull request #6661: Core: Support delete file stats in partitions metadata table

Posted by "ajantha-bhat (via GitHub)" <gi...@apache.org>.
ajantha-bhat commented on PR #6661:
URL: https://github.com/apache/iceberg/pull/6661#issuecomment-1501942540

   @szehon-ho, @RussellSpitzer, @jackye1995: I have reworked the PR based on #7189 now. Please review this PR again.     


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] szehon-ho commented on a diff in pull request #6661: Core: Support delete file stats in partitions metadata table

Posted by "szehon-ho (via GitHub)" <gi...@apache.org>.
szehon-ho commented on code in PR #6661:
URL: https://github.com/apache/iceberg/pull/6661#discussion_r1123598992


##########
core/src/main/java/org/apache/iceberg/PartitionsTable.java:
##########
@@ -220,21 +251,53 @@ Iterable<Partition> all() {
 
   static class Partition {
     private final StructLike key;
-    private long recordCount;
-    private int fileCount;
     private int specId;
+    private long dataRecordCount;
+    private int dataFileCount;
+
+    private final Set<DeleteFile> equalityDeleteFiles;
+    private final Set<DeleteFile> positionDeleteFiles;

Review Comment:
   So I am not so confident, that if for example you have two FileScanTask, returning deletes() that are the same, will the two DeleteFile object be different or not?  Java default equals() is instance equality, isnt it?  And even if it does work, will break if we implement those at some point.  Hence the suggestion to use something with established hashCode/equals.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] szehon-ho commented on a diff in pull request #6661: Core: Support delete file stats in partitions metadata table

Posted by "szehon-ho (via GitHub)" <gi...@apache.org>.
szehon-ho commented on code in PR #6661:
URL: https://github.com/apache/iceberg/pull/6661#discussion_r1123598992


##########
core/src/main/java/org/apache/iceberg/PartitionsTable.java:
##########
@@ -220,21 +251,53 @@ Iterable<Partition> all() {
 
   static class Partition {
     private final StructLike key;
-    private long recordCount;
-    private int fileCount;
     private int specId;
+    private long dataRecordCount;
+    private int dataFileCount;
+
+    private final Set<DeleteFile> equalityDeleteFiles;
+    private final Set<DeleteFile> positionDeleteFiles;

Review Comment:
   So I am not so confident, that if for example you have two FileScanTask, returning deletes() that are the same logical file, will the two DeleteFile object be different or not?  Java default equals() is instance equality, isnt it?  And even if it does work, will break if we implement those at some point.  Hence the suggestion to use something with established hashCode/equals.



##########
core/src/main/java/org/apache/iceberg/PartitionsTable.java:
##########
@@ -220,21 +251,53 @@ Iterable<Partition> all() {
 
   static class Partition {
     private final StructLike key;
-    private long recordCount;
-    private int fileCount;
     private int specId;
+    private long dataRecordCount;
+    private int dataFileCount;
+
+    private final Set<DeleteFile> equalityDeleteFiles;
+    private final Set<DeleteFile> positionDeleteFiles;

Review Comment:
   So I am not so confident, that if for example you have two FileScanTask, returning deletes() that are the same logical file, will the two DeleteFile object be different or not?  Java default equals() is instance equality, isnt it?  And even if it does work, may break if we implement those at some point.  Hence the suggestion to use something with established hashCode/equals.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] ajantha-bhat commented on a diff in pull request #6661: Core: Support delete file stats in partitions metadata table

Posted by "ajantha-bhat (via GitHub)" <gi...@apache.org>.
ajantha-bhat commented on code in PR #6661:
URL: https://github.com/apache/iceberg/pull/6661#discussion_r1123065113


##########
core/src/main/java/org/apache/iceberg/PartitionsTable.java:
##########
@@ -220,21 +251,53 @@ Iterable<Partition> all() {
 
   static class Partition {
     private final StructLike key;
-    private long recordCount;
-    private int fileCount;
     private int specId;
+    private long dataRecordCount;
+    private int dataFileCount;
+
+    private final Set<DeleteFile> equalityDeleteFiles;
+    private final Set<DeleteFile> positionDeleteFiles;
 
     Partition(StructLike key) {
       this.key = key;
-      this.recordCount = 0;
-      this.fileCount = 0;
       this.specId = 0;
+      this.dataRecordCount = 0;
+      this.dataFileCount = 0;
+      this.positionDeleteFiles = Sets.newHashSet();
+      this.equalityDeleteFiles = Sets.newHashSet();
+    }
+
+    private void update(FileScanTask task) {
+      this.dataRecordCount += task.file().recordCount();
+      this.dataFileCount += 1;
+      this.specId = task.file().specId();
+      if (TableScanUtil.hasDeletes(task)) {

Review Comment:
   I think it internally calls exactly that. So, no need to change? 
   https://github.com/apache/iceberg/blob/fa0bcdf2593bd70425a4cae0148b26f96b1e46b1/core/src/main/java/org/apache/iceberg/util/TableScanUtil.java#L68-L70 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] gaborkaszab commented on a diff in pull request #6661: Core: Support delete file stats in partitions metadata table

Posted by "gaborkaszab (via GitHub)" <gi...@apache.org>.
gaborkaszab commented on code in PR #6661:
URL: https://github.com/apache/iceberg/pull/6661#discussion_r1097102496


##########
core/src/main/java/org/apache/iceberg/PartitionsTable.java:
##########
@@ -47,7 +48,11 @@ public class PartitionsTable extends BaseMetadataTable {
             Types.NestedField.required(1, "partition", Partitioning.partitionType(table)),
             Types.NestedField.required(2, "record_count", Types.LongType.get()),

Review Comment:
   Wouldn't it be a breaking change? If someone already has a client that expects the field names 'record_count' and 'file_count' then it would break between minor Iceberg releases. I think changes like this should happen in a major release and before that we might want to mark these fields as deprecated while also introducing the ones with the new names.
   I'm open for opinions from others, though.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] szehon-ho commented on a diff in pull request #6661: Core: Support delete file stats in partitions metadata table

Posted by "szehon-ho (via GitHub)" <gi...@apache.org>.
szehon-ho commented on code in PR #6661:
URL: https://github.com/apache/iceberg/pull/6661#discussion_r1122537693


##########
core/src/main/java/org/apache/iceberg/PartitionsTable.java:
##########
@@ -220,21 +251,53 @@ Iterable<Partition> all() {
 
   static class Partition {
     private final StructLike key;
-    private long recordCount;
-    private int fileCount;
     private int specId;
+    private long dataRecordCount;
+    private int dataFileCount;
+
+    private final Set<DeleteFile> equalityDeleteFiles;
+    private final Set<DeleteFile> positionDeleteFiles;

Review Comment:
   Can you double check DeleteFile equal/hashcode are working correctly, if conflict?  Should we rather do something safer, like paths?  Maybe it will even be better for memory?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] szehon-ho commented on a diff in pull request #6661: Core: Support delete file stats in partitions metadata table

Posted by "szehon-ho (via GitHub)" <gi...@apache.org>.
szehon-ho commented on code in PR #6661:
URL: https://github.com/apache/iceberg/pull/6661#discussion_r1123597227


##########
core/src/main/java/org/apache/iceberg/PartitionsTable.java:
##########
@@ -220,21 +251,53 @@ Iterable<Partition> all() {
 
   static class Partition {
     private final StructLike key;
-    private long recordCount;
-    private int fileCount;
     private int specId;
+    private long dataRecordCount;
+    private int dataFileCount;
+
+    private final Set<DeleteFile> equalityDeleteFiles;
+    private final Set<DeleteFile> positionDeleteFiles;
 
     Partition(StructLike key) {
       this.key = key;
-      this.recordCount = 0;
-      this.fileCount = 0;
       this.specId = 0;
+      this.dataRecordCount = 0;
+      this.dataFileCount = 0;
+      this.positionDeleteFiles = Sets.newHashSet();
+      this.equalityDeleteFiles = Sets.newHashSet();
+    }
+
+    private void update(FileScanTask task) {
+      this.dataRecordCount += task.file().recordCount();
+      this.dataFileCount += 1;
+      this.specId = task.file().specId();
+      if (TableScanUtil.hasDeletes(task)) {

Review Comment:
   You are right, I was reading the other TableScanUtil.hasDeletes method



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] ajantha-bhat commented on a diff in pull request #6661: Core: Support delete file stats in partitions metadata table

Posted by "ajantha-bhat (via GitHub)" <gi...@apache.org>.
ajantha-bhat commented on code in PR #6661:
URL: https://github.com/apache/iceberg/pull/6661#discussion_r1124002815


##########
core/src/main/java/org/apache/iceberg/PartitionsTable.java:
##########
@@ -220,21 +251,53 @@ Iterable<Partition> all() {
 
   static class Partition {
     private final StructLike key;
-    private long recordCount;
-    private int fileCount;
     private int specId;
+    private long dataRecordCount;
+    private int dataFileCount;
+
+    private final Set<DeleteFile> equalityDeleteFiles;
+    private final Set<DeleteFile> positionDeleteFiles;

Review Comment:
   The same delete file object from the context is reused while making the fileScan here. 
   https://github.com/apache/iceberg/blob/d42d1e89c0616c203f7ad29f002811ddd440e14f/core/src/main/java/org/apache/iceberg/ManifestGroup.java#L351
   
   hence, I believe the default equals() is enough. 
   
   I don't disagree about having the equals and hashcode for `DataFiles` and `DeleteFiles`. But code is widely using the Set<DataFile> and Set<DeleteFile> already. So, if we are adding it. It should be in a separate PR/discussion. 
   
   Let us see what others think on this. 
   cc: @RussellSpitzer, @jackye1995  



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] ajantha-bhat commented on a diff in pull request #6661: Core: Support delete file stats in partitions metadata table

Posted by "ajantha-bhat (via GitHub)" <gi...@apache.org>.
ajantha-bhat commented on code in PR #6661:
URL: https://github.com/apache/iceberg/pull/6661#discussion_r1104038806


##########
core/src/main/java/org/apache/iceberg/PartitionsTable.java:
##########
@@ -236,5 +273,17 @@ void update(DataFile file) {
       this.fileCount += 1;
       this.specId = file.specId();
     }
+
+    void update(DeleteFile file) {
+      if (file.content() == FileContent.POSITION_DELETES) {
+        this.posDeleteRecordCount += file.recordCount();

Review Comment:
   yes. Please check it. Thanks. I will also try to understand duplication of file count problem this week. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] szehon-ho commented on a diff in pull request #6661: Core: Support delete file stats in partitions metadata table

Posted by "szehon-ho (via GitHub)" <gi...@apache.org>.
szehon-ho commented on code in PR #6661:
URL: https://github.com/apache/iceberg/pull/6661#discussion_r1169501029


##########
core/src/main/java/org/apache/iceberg/PartitionsTable.java:
##########
@@ -90,54 +118,60 @@ private DataTask task(StaticTableScan scan) {
 
   private static StaticDataTask.Row convertPartition(Partition partition) {
     return StaticDataTask.Row.of(
-        partition.key, partition.specId, partition.dataRecordCount, partition.dataFileCount);
+        partition.key,
+        partition.specId,
+        partition.dataRecordCount,
+        partition.dataFileCount,
+        partition.posDeleteRecordCount,
+        partition.posDeleteFileCount,
+        partition.eqDeleteRecordCount,
+        partition.eqDeleteFileCount);
   }
 
   private static Iterable<Partition> partitions(Table table, StaticTableScan scan) {
     Types.StructType normalizedPartitionType = Partitioning.partitionType(table);
     PartitionMap partitions = new PartitionMap();
 
-    // cache a position map needed by each partition spec to normalize partitions to final schema
-    Map<Integer, int[]> normalizedPositionsBySpec =
-        Maps.newHashMapWithExpectedSize(table.specs().size());
+    try (CloseableIterable<DataFile> datafiles = planDataFiles(scan);

Review Comment:
   I would say, to try this if we can, both to clean the code and also as we don't do another scan for perf reasons.
   
   I think it may work by doing 'CloseableIterables.transform()' in the worst case to cast the type



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] szehon-ho commented on a diff in pull request #6661: Core: Support delete file stats in partitions metadata table

Posted by "szehon-ho (via GitHub)" <gi...@apache.org>.
szehon-ho commented on code in PR #6661:
URL: https://github.com/apache/iceberg/pull/6661#discussion_r1132708087


##########
core/src/main/java/org/apache/iceberg/PartitionsTable.java:
##########
@@ -220,21 +257,53 @@ Iterable<Partition> all() {
 
   static class Partition {
     private final StructLike key;
-    private long recordCount;
-    private int fileCount;
     private int specId;
+    private long dataRecordCount;
+    private int dataFileCount;
+
+    private final Set<DeleteFile> equalityDeleteFiles;
+    private final Set<DeleteFile> positionDeleteFiles;
 
     Partition(StructLike key) {
       this.key = key;
-      this.recordCount = 0;
-      this.fileCount = 0;
       this.specId = 0;
+      this.dataRecordCount = 0;
+      this.dataFileCount = 0;
+      this.positionDeleteFiles = Sets.newHashSet();
+      this.equalityDeleteFiles = Sets.newHashSet();
+    }
+
+    private void update(FileScanTask task) {

Review Comment:
   I also had the similar concern, and thought of using PositionDeletesTableScan, but it seems there is no equivalent EqualityDeletesTableScan yet.  
   
   Maybe another approach here could be, going through each partition and using DeleteFileIndex one by one.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] szehon-ho commented on a diff in pull request #6661: Core: Support delete file stats in partitions metadata table

Posted by "szehon-ho (via GitHub)" <gi...@apache.org>.
szehon-ho commented on code in PR #6661:
URL: https://github.com/apache/iceberg/pull/6661#discussion_r1132981630


##########
core/src/main/java/org/apache/iceberg/PartitionsTable.java:
##########
@@ -220,21 +257,53 @@ Iterable<Partition> all() {
 
   static class Partition {
     private final StructLike key;
-    private long recordCount;
-    private int fileCount;
     private int specId;
+    private long dataRecordCount;
+    private int dataFileCount;
+
+    private final Set<DeleteFile> equalityDeleteFiles;
+    private final Set<DeleteFile> positionDeleteFiles;
 
     Partition(StructLike key) {
       this.key = key;
-      this.recordCount = 0;
-      this.fileCount = 0;
       this.specId = 0;
+      this.dataRecordCount = 0;
+      this.dataFileCount = 0;
+      this.positionDeleteFiles = Sets.newHashSet();
+      this.equalityDeleteFiles = Sets.newHashSet();
+    }
+
+    private void update(FileScanTask task) {

Review Comment:
   Actually that way I thought will be quite expensive (two pass).  
   
   Probably the only way to effectively do it , until this whole table is migrated over to some kind of view of 'files' table, is to rewrite the PartitionsTableScan to directly use the underlying code:  ManifestReader.readDeleteManifest() / ManifestReader.read(), and then go through those iterators, instead of using the ManifestGroup.planFiles() / FileScanTask way.
   
   That way, we can iterate through the DataFile/ DeleteFile, and collect delete files/data files in one pass without keeping them in memory.  It's definitely do-able but will be a bit more work though.  Any thoughts?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] szehon-ho commented on a diff in pull request #6661: Core: Support delete file stats in partitions metadata table

Posted by "szehon-ho (via GitHub)" <gi...@apache.org>.
szehon-ho commented on code in PR #6661:
URL: https://github.com/apache/iceberg/pull/6661#discussion_r1125356413


##########
core/src/main/java/org/apache/iceberg/PartitionsTable.java:
##########
@@ -220,21 +251,53 @@ Iterable<Partition> all() {
 
   static class Partition {
     private final StructLike key;
-    private long recordCount;
-    private int fileCount;
     private int specId;
+    private long dataRecordCount;
+    private int dataFileCount;
+
+    private final Set<DeleteFile> equalityDeleteFiles;
+    private final Set<DeleteFile> positionDeleteFiles;

Review Comment:
   Yea it may work due to cache inside DeleteIndex, but I don't know.  I see Set<DeleteFile> but it seems those case are where they are limited to not conflict with each other except instance equality.  I would also be interested in the best way here. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] ajantha-bhat commented on a diff in pull request #6661: Core: Support delete file stats in partitions metadata table

Posted by "ajantha-bhat (via GitHub)" <gi...@apache.org>.
ajantha-bhat commented on code in PR #6661:
URL: https://github.com/apache/iceberg/pull/6661#discussion_r1124002815


##########
core/src/main/java/org/apache/iceberg/PartitionsTable.java:
##########
@@ -220,21 +251,53 @@ Iterable<Partition> all() {
 
   static class Partition {
     private final StructLike key;
-    private long recordCount;
-    private int fileCount;
     private int specId;
+    private long dataRecordCount;
+    private int dataFileCount;
+
+    private final Set<DeleteFile> equalityDeleteFiles;
+    private final Set<DeleteFile> positionDeleteFiles;

Review Comment:
   The same delete file object from the context is reused while making the fileScan here. 
   https://github.com/apache/iceberg/blob/d42d1e89c0616c203f7ad29f002811ddd440e14f/core/src/main/java/org/apache/iceberg/ManifestGroup.java#L351
   
   hence, I believe the default equals() is enough. 
   
   I don't disagree about having the equals and hashcode for `DataFile` and `DeleteFile`. But code is widely using the `Set<DataFile>` and `Set<DeleteFile>` already. So, if we are adding it. It should be in a separate PR/discussion. 
   
   Let us see what others think on this. 
   cc: @RussellSpitzer, @jackye1995  



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] szehon-ho commented on a diff in pull request #6661: Core: Support delete file stats in partitions metadata table

Posted by "szehon-ho (via GitHub)" <gi...@apache.org>.
szehon-ho commented on code in PR #6661:
URL: https://github.com/apache/iceberg/pull/6661#discussion_r1184409917


##########
core/src/main/java/org/apache/iceberg/PartitionsTable.java:
##########
@@ -229,18 +281,41 @@ static class Partition {
     private int specId;
     private long dataRecordCount;
     private int dataFileCount;
+    private long posDeleteRecordCount;
+    private int posDeleteFileCount;
+    private long eqDeleteRecordCount;
+    private int eqDeleteFileCount;
 
     Partition(StructLike key) {
       this.key = key;
       this.specId = 0;
       this.dataRecordCount = 0;
       this.dataFileCount = 0;
+      this.posDeleteRecordCount = 0;
+      this.posDeleteFileCount = 0;
+      this.eqDeleteRecordCount = 0;
+      this.eqDeleteFileCount = 0;
     }
 
-    void update(DataFile file) {
-      this.dataRecordCount += file.recordCount();
-      this.dataFileCount += 1;
-      this.specId = file.specId();
+    void update(ContentFile<?> file) {
+      switch (file.content()) {
+        case DATA:
+          this.dataRecordCount += file.recordCount();
+          this.dataFileCount += 1;
+          this.specId = file.specId();
+          break;
+        case POSITION_DELETES:
+          this.posDeleteRecordCount = file.recordCount();
+          this.posDeleteFileCount += 1;
+          break;

Review Comment:
   How about specId here and below?



##########
spark/v3.4/spark/src/test/java/org/apache/iceberg/spark/source/TestIcebergSourceTablesBase.java:
##########
@@ -1382,6 +1404,76 @@ public void testPartitionsTable() {
       TestHelpers.assertEqualsSafe(
           partitionsTable.schema().asStruct(), expected.get(i), actual.get(i));
     }
+
+    testDeleteStats(tableIdentifier, table, partitionsTable, builder, partitionBuilder, expected);

Review Comment:
   Could we make this test independently?  As the existing test is already quite long, and it's doing more table modifications inside the new method.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] ajantha-bhat commented on a diff in pull request #6661: Core: Support delete file stats in partitions metadata table

Posted by "ajantha-bhat (via GitHub)" <gi...@apache.org>.
ajantha-bhat commented on code in PR #6661:
URL: https://github.com/apache/iceberg/pull/6661#discussion_r1185665936


##########
core/src/main/java/org/apache/iceberg/PartitionsTable.java:
##########
@@ -93,31 +129,55 @@ private static StaticDataTask.Row convertPartition(Partition partition) {
         partition.partitionData,
         partition.specId,
         partition.dataRecordCount,
-        partition.dataFileCount);
+        partition.dataFileCount,
+        partition.posDeleteRecordCount,
+        partition.posDeleteFileCount,
+        partition.eqDeleteRecordCount,
+        partition.eqDeleteFileCount);
   }
 
   private static Iterable<Partition> partitions(Table table, StaticTableScan scan) {
     Types.StructType partitionType = Partitioning.partitionType(table);
     PartitionMap partitions = new PartitionMap(partitionType);
 
-    CloseableIterable<DataFile> datafiles = planDataFiles(scan);
-    for (DataFile dataFile : datafiles) {
-      StructLike partition =
-          PartitionUtil.coercePartition(
-              partitionType, table.specs().get(dataFile.specId()), dataFile.partition());
-      partitions.get(partition).update(dataFile);
+    try (CloseableIterable<ContentFile<?>> files = planFiles(scan)) {
+      for (ContentFile<?> file : files) {
+        StructLike partition =
+            PartitionUtil.coercePartition(
+                partitionType, table.specs().get(file.specId()), file.partition());
+        partitions.get(partition).update(file);
+      }
+    } catch (IOException e) {
+      throw new UncheckedIOException(e);
     }
 
     return partitions.all();
   }
 
   @VisibleForTesting
-  static CloseableIterable<DataFile> planDataFiles(StaticTableScan scan) {
+  static CloseableIterable<ContentFile<?>> planFiles(StaticTableScan scan) {
     Table table = scan.table();
-    Snapshot snapshot = scan.snapshot();
 
-    CloseableIterable<ManifestFile> dataManifests =
-        CloseableIterable.withNoopClose(snapshot.dataManifests(table.io()));
+    CloseableIterable<ManifestFile> filteredManifests =
+        filteredManifests(scan, table, scan.snapshot().allManifests(table.io()));
+
+    Iterable<CloseableIterable<ContentFile<?>>> tasks =
+        CloseableIterable.transform(
+            filteredManifests,
+            manifest ->
+                CloseableIterable.transform(
+                    ManifestFiles.open(manifest, table.io(), table.specs())
+                        .caseSensitive(scan.isCaseSensitive())
+                        .select(BaseScan.DELETE_SCAN_COLUMNS), // don't select stats columns

Review Comment:
   updated it with the switch. 
   
   `DELETE_SCAN_COLUMNS` has only content type as an extra field. Which can work for both the data and the delete file. 
   Agree that the name looks weird when used for generic content. `CONTENT_SCAN_COLUMNS` would have been the more suitable name for it. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] ajantha-bhat commented on a diff in pull request #6661: Core: Support delete file stats in partitions metadata table

Posted by "ajantha-bhat (via GitHub)" <gi...@apache.org>.
ajantha-bhat commented on code in PR #6661:
URL: https://github.com/apache/iceberg/pull/6661#discussion_r1185661945


##########
core/src/main/java/org/apache/iceberg/PartitionsTable.java:
##########
@@ -58,7 +81,13 @@ public TableScan newScan() {
   @Override
   public Schema schema() {
     if (table().spec().fields().size() < 1) {
-      return schema.select("record_count", "file_count");

Review Comment:
   Good catch. I will explore this in the follow-up.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] szehon-ho commented on a diff in pull request #6661: Core: Support delete file stats in partitions metadata table

Posted by "szehon-ho (via GitHub)" <gi...@apache.org>.
szehon-ho commented on code in PR #6661:
URL: https://github.com/apache/iceberg/pull/6661#discussion_r1132981630


##########
core/src/main/java/org/apache/iceberg/PartitionsTable.java:
##########
@@ -220,21 +257,53 @@ Iterable<Partition> all() {
 
   static class Partition {
     private final StructLike key;
-    private long recordCount;
-    private int fileCount;
     private int specId;
+    private long dataRecordCount;
+    private int dataFileCount;
+
+    private final Set<DeleteFile> equalityDeleteFiles;
+    private final Set<DeleteFile> positionDeleteFiles;
 
     Partition(StructLike key) {
       this.key = key;
-      this.recordCount = 0;
-      this.fileCount = 0;
       this.specId = 0;
+      this.dataRecordCount = 0;
+      this.dataFileCount = 0;
+      this.positionDeleteFiles = Sets.newHashSet();
+      this.equalityDeleteFiles = Sets.newHashSet();
+    }
+
+    private void update(FileScanTask task) {

Review Comment:
   Actually that way I thought will be quite expensive (two pass).  
   
   Probably the only way to effectively do it , until this whole table is migrated over to some kind of view of 'files' table, is to rewrite the PartitionsTableScan to directly use the underlying code:  ManifestReader.readDeleteManifest() / ManifestReader.read(), and then go through those iterators, instead of using the ManifestGroup.planFiles() / FileScanTask way.
   
   That way, we can iterate through the DataFile/ DeleteFile, and collect number of delete files/data files in one pass without keeping them in memory.  It's definitely do-able but will be a bit more work though.  Any thoughts?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] szehon-ho commented on a diff in pull request #6661: Core: Support delete file stats in partitions metadata table

Posted by "szehon-ho (via GitHub)" <gi...@apache.org>.
szehon-ho commented on code in PR #6661:
URL: https://github.com/apache/iceberg/pull/6661#discussion_r1093546826


##########
core/src/main/java/org/apache/iceberg/PartitionsTable.java:
##########
@@ -236,5 +273,17 @@ void update(DataFile file) {
       this.fileCount += 1;
       this.specId = file.specId();
     }
+
+    void update(DeleteFile file) {
+      if (file.content() == FileContent.POSITION_DELETES) {
+        this.posDeleteRecordCount += file.recordCount();

Review Comment:
   Yea I think this has the  same problem I mentioned in the earlier pr : https://github.com/apache/iceberg/pull/4005  (which was doing the same thing).  Summing will not give a unique delete file count.
   
   I think one option as I mentioned is to use the new PositionDeletes scan : https://github.com/apache/iceberg/pull/6365  , which you can then use to stream and update partitions.
   
   It's probably less memory intensive than trying to cache every existing delete file in a set, which would be the other solution for de-duping.
   
   Would like to see what @RussellSpitzer thinks 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] ajantha-bhat commented on a diff in pull request #6661: Core: Support delete file stats in partitions metadata table

Posted by "ajantha-bhat (via GitHub)" <gi...@apache.org>.
ajantha-bhat commented on code in PR #6661:
URL: https://github.com/apache/iceberg/pull/6661#discussion_r1095635593


##########
core/src/main/java/org/apache/iceberg/PartitionsTable.java:
##########
@@ -47,7 +48,11 @@ public class PartitionsTable extends BaseMetadataTable {
             Types.NestedField.required(1, "partition", Partitioning.partitionType(table)),
             Types.NestedField.required(2, "record_count", Types.LongType.get()),
             Types.NestedField.required(3, "file_count", Types.IntegerType.get()),
-            Types.NestedField.required(4, "spec_id", Types.IntegerType.get()));

Review Comment:
   Thanks for linking the previous discussions. I will keep the name and id as it is while moving the field



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] ajantha-bhat commented on a diff in pull request #6661: Core: Support delete file stats in partitions metadata table

Posted by "ajantha-bhat (via GitHub)" <gi...@apache.org>.
ajantha-bhat commented on code in PR #6661:
URL: https://github.com/apache/iceberg/pull/6661#discussion_r1095640760


##########
core/src/main/java/org/apache/iceberg/PartitionsTable.java:
##########
@@ -236,5 +273,17 @@ void update(DataFile file) {
       this.fileCount += 1;
       this.specId = file.specId();
     }
+
+    void update(DeleteFile file) {
+      if (file.content() == FileContent.POSITION_DELETES) {
+        this.posDeleteRecordCount += file.recordCount();

Review Comment:
   oh. I didn't know there was a PR that worked on this. 
   Thanks for pointing to it. I will take a look at it. 
   
   What about de-dup for equality deletes? 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] szehon-ho commented on a diff in pull request #6661: Core: Support delete file stats in partitions metadata table

Posted by "szehon-ho (via GitHub)" <gi...@apache.org>.
szehon-ho commented on code in PR #6661:
URL: https://github.com/apache/iceberg/pull/6661#discussion_r1169503010


##########
core/src/main/java/org/apache/iceberg/PartitionsTable.java:
##########
@@ -47,7 +50,27 @@ public class PartitionsTable extends BaseMetadataTable {
             Types.NestedField.required(
                 2, "record_count", Types.LongType.get(), "Count of records in data files"),
             Types.NestedField.required(
-                3, "file_count", Types.IntegerType.get(), "Count of data files"));
+                3, "file_count", Types.IntegerType.get(), "Count of data files"),
+            Types.NestedField.required(
+                5,
+                "pos_delete_record_count",

Review Comment:
   I think 'position_delete', 'equality_delete' may be better in the full form, to match SnapshotSummary.  Maybe even 'position_delete_count' , 'equality_delete_count' (SnapshotSummary has added_position_deletes, added_equality_deletes)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] szehon-ho commented on a diff in pull request #6661: Core: Support delete file stats in partitions metadata table

Posted by "szehon-ho (via GitHub)" <gi...@apache.org>.
szehon-ho commented on code in PR #6661:
URL: https://github.com/apache/iceberg/pull/6661#discussion_r1099372207


##########
core/src/main/java/org/apache/iceberg/PartitionsTable.java:
##########
@@ -47,7 +48,11 @@ public class PartitionsTable extends BaseMetadataTable {
             Types.NestedField.required(1, "partition", Partitioning.partitionType(table)),
             Types.NestedField.required(2, "record_count", Types.LongType.get()),

Review Comment:
   Yea I'm not too sure about renaming existing fields, I feel that's definitely more likely to break something than changing positions around.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] szehon-ho commented on a diff in pull request #6661: Core: Support delete file stats in partitions metadata table

Posted by "szehon-ho (via GitHub)" <gi...@apache.org>.
szehon-ho commented on code in PR #6661:
URL: https://github.com/apache/iceberg/pull/6661#discussion_r1103059479


##########
core/src/main/java/org/apache/iceberg/PartitionsTable.java:
##########
@@ -236,5 +273,17 @@ void update(DataFile file) {
       this.fileCount += 1;
       this.specId = file.specId();
     }
+
+    void update(DeleteFile file) {
+      if (file.content() == FileContent.POSITION_DELETES) {
+        this.posDeleteRecordCount += file.recordCount();

Review Comment:
   Yea that's not there yet.  Let me refer to @aokolnychyi @RussellSpitzer  to see which is better here, maybe going with a set is the only approach for now.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] ajantha-bhat commented on a diff in pull request #6661: Core: Support delete file stats in partitions metadata table

Posted by "ajantha-bhat (via GitHub)" <gi...@apache.org>.
ajantha-bhat commented on code in PR #6661:
URL: https://github.com/apache/iceberg/pull/6661#discussion_r1123011566


##########
core/src/main/java/org/apache/iceberg/PartitionsTable.java:
##########
@@ -220,21 +251,53 @@ Iterable<Partition> all() {
 
   static class Partition {
     private final StructLike key;
-    private long recordCount;
-    private int fileCount;
     private int specId;
+    private long dataRecordCount;
+    private int dataFileCount;
+
+    private final Set<DeleteFile> equalityDeleteFiles;
+    private final Set<DeleteFile> positionDeleteFiles;

Review Comment:
   https://github.com/apache/iceberg/blob/76c1df0cfde27167d31c112c81daa2d4fcc96686/api/src/main/java/org/apache/iceberg/RewriteFiles.java#L71
   
   `Set<DeleteFile>` was already used in the existing code, as seen above. 
   But there are no equals() and hashcode() impl for these classes. Which means it is using the default ones. 
   
   Note that `Set<DataFile>` also exist in the code and it also doesn't have equal/hashcode impl. 
   
   I am ok with using Paths. But just wondering why it doesn't exist and maybe we can handle these in a follow-up PR instead of this one. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] ajantha-bhat commented on pull request #6661: Core: Support delete file stats in partitions metadata table

Posted by "ajantha-bhat (via GitHub)" <gi...@apache.org>.
ajantha-bhat commented on PR #6661:
URL: https://github.com/apache/iceberg/pull/6661#issuecomment-1455440255

   build failed due to flaky failure. Debugging in a separate PR
   https://github.com/apache/iceberg/pull/7010
   
   Will just retrigger the build. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] szehon-ho commented on a diff in pull request #6661: Core: Support delete file stats in partitions metadata table

Posted by "szehon-ho (via GitHub)" <gi...@apache.org>.
szehon-ho commented on code in PR #6661:
URL: https://github.com/apache/iceberg/pull/6661#discussion_r1125356413


##########
core/src/main/java/org/apache/iceberg/PartitionsTable.java:
##########
@@ -220,21 +251,53 @@ Iterable<Partition> all() {
 
   static class Partition {
     private final StructLike key;
-    private long recordCount;
-    private int fileCount;
     private int specId;
+    private long dataRecordCount;
+    private int dataFileCount;
+
+    private final Set<DeleteFile> equalityDeleteFiles;
+    private final Set<DeleteFile> positionDeleteFiles;

Review Comment:
   Yea it may work due to cache inside DeleteIndex, but I don't know.  I see `Set<DeleteFile>` but it seems those case are where they are limited to not conflict with each other except instance equality.  I would also be interested in the best way here.  Will ping @aokolnychyi  to take a look as well.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] ajantha-bhat commented on pull request #6661: Core: Support delete file stats in partitions metadata table

Posted by "ajantha-bhat (via GitHub)" <gi...@apache.org>.
ajantha-bhat commented on PR #6661:
URL: https://github.com/apache/iceberg/pull/6661#issuecomment-1405302607

   cc: @szehon-ho, @RussellSpitzer, @rdblue, @flyrain    


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] szehon-ho commented on a diff in pull request #6661: Core: Support delete file stats in partitions metadata table

Posted by "szehon-ho (via GitHub)" <gi...@apache.org>.
szehon-ho commented on code in PR #6661:
URL: https://github.com/apache/iceberg/pull/6661#discussion_r1093540298


##########
core/src/main/java/org/apache/iceberg/PartitionsTable.java:
##########
@@ -47,7 +48,11 @@ public class PartitionsTable extends BaseMetadataTable {
             Types.NestedField.required(1, "partition", Partitioning.partitionType(table)),
             Types.NestedField.required(2, "record_count", Types.LongType.get()),
             Types.NestedField.required(3, "file_count", Types.IntegerType.get()),
-            Types.NestedField.required(4, "spec_id", Types.IntegerType.get()));

Review Comment:
   So being said, we should not change the id of spec_id here, but ok to move it to after partition as you guys are saying.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] ajantha-bhat commented on pull request #6661: Core: Support delete file stats in partitions metadata table

Posted by "ajantha-bhat (via GitHub)" <gi...@apache.org>.
ajantha-bhat commented on PR #6661:
URL: https://github.com/apache/iceberg/pull/6661#issuecomment-1532959988

   @szehon-ho: PR is ready for review. Thanks.  


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] szehon-ho commented on a diff in pull request #6661: Core: Support delete file stats in partitions metadata table

Posted by "szehon-ho (via GitHub)" <gi...@apache.org>.
szehon-ho commented on code in PR #6661:
URL: https://github.com/apache/iceberg/pull/6661#discussion_r1122537693


##########
core/src/main/java/org/apache/iceberg/PartitionsTable.java:
##########
@@ -220,21 +251,53 @@ Iterable<Partition> all() {
 
   static class Partition {
     private final StructLike key;
-    private long recordCount;
-    private int fileCount;
     private int specId;
+    private long dataRecordCount;
+    private int dataFileCount;
+
+    private final Set<DeleteFile> equalityDeleteFiles;
+    private final Set<DeleteFile> positionDeleteFiles;

Review Comment:
   Can you double check DeleteFile equal/hashcode are working correctly, if conflict?  Should we rather do something safer, like paths?



##########
core/src/main/java/org/apache/iceberg/PartitionsTable.java:
##########
@@ -220,21 +251,53 @@ Iterable<Partition> all() {
 
   static class Partition {
     private final StructLike key;
-    private long recordCount;
-    private int fileCount;
     private int specId;
+    private long dataRecordCount;
+    private int dataFileCount;
+
+    private final Set<DeleteFile> equalityDeleteFiles;
+    private final Set<DeleteFile> positionDeleteFiles;
 
     Partition(StructLike key) {
       this.key = key;
-      this.recordCount = 0;
-      this.fileCount = 0;
       this.specId = 0;
+      this.dataRecordCount = 0;
+      this.dataFileCount = 0;
+      this.positionDeleteFiles = Sets.newHashSet();
+      this.equalityDeleteFiles = Sets.newHashSet();
+    }
+
+    private void update(FileScanTask task) {
+      this.dataRecordCount += task.file().recordCount();
+      this.dataFileCount += 1;
+      this.specId = task.file().specId();
+      if (TableScanUtil.hasDeletes(task)) {

Review Comment:
   Why not just, if task.deletes().size() > 0?   I think we are getting FileScanTask here, not CombinedScanTask, isnt it?



##########
core/src/main/java/org/apache/iceberg/PartitionsTable.java:
##########
@@ -45,9 +48,17 @@ public class PartitionsTable extends BaseMetadataTable {
     this.schema =
         new Schema(
             Types.NestedField.required(1, "partition", Partitioning.partitionType(table)),
-            Types.NestedField.required(2, "record_count", Types.LongType.get()),
-            Types.NestedField.required(3, "file_count", Types.IntegerType.get()),
-            Types.NestedField.required(4, "spec_id", Types.IntegerType.get()));
+            Types.NestedField.required(4, "spec_id", Types.IntegerType.get()),
+            Types.NestedField.required(
+                2,
+                "record_count",
+                Types.LongType.get(),
+                "data record count without applying the deletes"),

Review Comment:
   Can we do a positive description like:  "count of records in data files".  Not sure "without applying deletes" sounds very clear.
   
   If we add comments to "pos_delete_record_count" and "eq_delete_record_count". "count of records in position delete files", etc, I think that should be better?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on a diff in pull request #6661: Core: Support delete file stats in partitions metadata table

Posted by "RussellSpitzer (via GitHub)" <gi...@apache.org>.
RussellSpitzer commented on code in PR #6661:
URL: https://github.com/apache/iceberg/pull/6661#discussion_r1132655933


##########
core/src/main/java/org/apache/iceberg/PartitionsTable.java:
##########
@@ -220,21 +251,53 @@ Iterable<Partition> all() {
 
   static class Partition {
     private final StructLike key;
-    private long recordCount;
-    private int fileCount;
     private int specId;
+    private long dataRecordCount;
+    private int dataFileCount;
+
+    private final Set<DeleteFile> equalityDeleteFiles;
+    private final Set<DeleteFile> positionDeleteFiles;

Review Comment:
   I'd be more worried about the memory usage here, seems like we have to keep the entire set of all delete files in memory (for all partitions) while we are building this table?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] ajantha-bhat commented on pull request #6661: Core: Support delete file stats in partitions metadata table

Posted by "ajantha-bhat (via GitHub)" <gi...@apache.org>.
ajantha-bhat commented on PR #6661:
URL: https://github.com/apache/iceberg/pull/6661#issuecomment-1529288390

   I was out of office last week. Hopefully, I can work on it this week and get it merged.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] szehon-ho commented on pull request #6661: Core: Support delete file stats in partitions metadata table

Posted by "szehon-ho (via GitHub)" <gi...@apache.org>.
szehon-ho commented on PR #6661:
URL: https://github.com/apache/iceberg/pull/6661#issuecomment-1528178614

   HI, @ajantha-bhat do you still plan to work on this?  We are also interested in this and can also give a try to parameterize the method.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] ajantha-bhat commented on a diff in pull request #6661: Core: Support delete file stats in partitions metadata table

Posted by "ajantha-bhat (via GitHub)" <gi...@apache.org>.
ajantha-bhat commented on code in PR #6661:
URL: https://github.com/apache/iceberg/pull/6661#discussion_r1088486957


##########
core/src/main/java/org/apache/iceberg/PartitionsTable.java:
##########
@@ -47,7 +48,11 @@ public class PartitionsTable extends BaseMetadataTable {
             Types.NestedField.required(1, "partition", Partitioning.partitionType(table)),
             Types.NestedField.required(2, "record_count", Types.LongType.get()),
             Types.NestedField.required(3, "file_count", Types.IntegerType.get()),
-            Types.NestedField.required(4, "spec_id", Types.IntegerType.get()));

Review Comment:
   Ok. Thanks. I think we can move spec_id to field id = 2 (before file and record count) 
   
   @szehon-ho : What is your opinion on this? 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] ajantha-bhat commented on a diff in pull request #6661: Core: Support delete file stats in partitions metadata table

Posted by "ajantha-bhat (via GitHub)" <gi...@apache.org>.
ajantha-bhat commented on code in PR #6661:
URL: https://github.com/apache/iceberg/pull/6661#discussion_r1135459554


##########
core/src/main/java/org/apache/iceberg/PartitionsTable.java:
##########
@@ -220,21 +251,53 @@ Iterable<Partition> all() {
 
   static class Partition {
     private final StructLike key;
-    private long recordCount;
-    private int fileCount;
     private int specId;
+    private long dataRecordCount;
+    private int dataFileCount;
+
+    private final Set<DeleteFile> equalityDeleteFiles;
+    private final Set<DeleteFile> positionDeleteFiles;

Review Comment:
   done.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] ajantha-bhat commented on a diff in pull request #6661: Core: Support delete file stats in partitions metadata table

Posted by "ajantha-bhat (via GitHub)" <gi...@apache.org>.
ajantha-bhat commented on code in PR #6661:
URL: https://github.com/apache/iceberg/pull/6661#discussion_r1135420154


##########
core/src/main/java/org/apache/iceberg/PartitionsTable.java:
##########
@@ -220,21 +257,53 @@ Iterable<Partition> all() {
 
   static class Partition {
     private final StructLike key;
-    private long recordCount;
-    private int fileCount;
     private int specId;
+    private long dataRecordCount;
+    private int dataFileCount;
+
+    private final Set<DeleteFile> equalityDeleteFiles;
+    private final Set<DeleteFile> positionDeleteFiles;
 
     Partition(StructLike key) {
       this.key = key;
-      this.recordCount = 0;
-      this.fileCount = 0;
       this.specId = 0;
+      this.dataRecordCount = 0;
+      this.dataFileCount = 0;
+      this.positionDeleteFiles = Sets.newHashSet();
+      this.equalityDeleteFiles = Sets.newHashSet();
+    }
+
+    private void update(FileScanTask task) {

Review Comment:
   > Probably the only way to effectively do it , until this whole table is migrated over to some kind of view of 'files' table, is to rewrite the PartitionsTableScan to directly use the underlying code: ManifestReader.readDeleteManifest() / ManifestReader.read(), and then go through those iterators, instead of using the ManifestGroup.planFiles() / FileScanTask way.
   
   @szehon-ho: I have spent some time and realized that I am not super familiar with this side of code.  Would you like to contribute a PR for data files for replacing `ManifestGroup.planFiles()`? I can then extend it to delete files and handle these stats updates. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] ajantha-bhat commented on a diff in pull request #6661: Core: Support delete file stats in partitions metadata table

Posted by "ajantha-bhat (via GitHub)" <gi...@apache.org>.
ajantha-bhat commented on code in PR #6661:
URL: https://github.com/apache/iceberg/pull/6661#discussion_r1111743858


##########
core/src/main/java/org/apache/iceberg/PartitionsTable.java:
##########
@@ -47,7 +48,11 @@ public class PartitionsTable extends BaseMetadataTable {
             Types.NestedField.required(1, "partition", Partitioning.partitionType(table)),
             Types.NestedField.required(2, "record_count", Types.LongType.get()),

Review Comment:
   yeah. Got it. I will just update the document that these are data file fields instead of actually renaming the fields.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] ajantha-bhat commented on a diff in pull request #6661: Core: Support delete file stats in partitions metadata table

Posted by "ajantha-bhat (via GitHub)" <gi...@apache.org>.
ajantha-bhat commented on code in PR #6661:
URL: https://github.com/apache/iceberg/pull/6661#discussion_r1184708937


##########
spark/v3.4/spark/src/test/java/org/apache/iceberg/spark/source/TestIcebergSourceTablesBase.java:
##########
@@ -1382,6 +1404,76 @@ public void testPartitionsTable() {
       TestHelpers.assertEqualsSafe(
           partitionsTable.schema().asStruct(), expected.get(i), actual.get(i));
     }
+
+    testDeleteStats(tableIdentifier, table, partitionsTable, builder, partitionBuilder, expected);

Review Comment:
   done.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] szehon-ho merged pull request #6661: Core: Support delete file stats in partitions metadata table

Posted by "szehon-ho (via GitHub)" <gi...@apache.org>.
szehon-ho merged PR #6661:
URL: https://github.com/apache/iceberg/pull/6661


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] ajantha-bhat commented on a diff in pull request #6661: Core: Support delete file stats in partitions metadata table

Posted by "ajantha-bhat (via GitHub)" <gi...@apache.org>.
ajantha-bhat commented on code in PR #6661:
URL: https://github.com/apache/iceberg/pull/6661#discussion_r1184709717


##########
core/src/main/java/org/apache/iceberg/PartitionsTable.java:
##########
@@ -229,18 +281,41 @@ static class Partition {
     private int specId;
     private long dataRecordCount;
     private int dataFileCount;
+    private long posDeleteRecordCount;
+    private int posDeleteFileCount;
+    private long eqDeleteRecordCount;
+    private int eqDeleteFileCount;
 
     Partition(StructLike key) {
       this.key = key;
       this.specId = 0;
       this.dataRecordCount = 0;
       this.dataFileCount = 0;
+      this.posDeleteRecordCount = 0;
+      this.posDeleteFileCount = 0;
+      this.eqDeleteRecordCount = 0;
+      this.eqDeleteFileCount = 0;
     }
 
-    void update(DataFile file) {
-      this.dataRecordCount += file.recordCount();
-      this.dataFileCount += 1;
-      this.specId = file.specId();
+    void update(ContentFile<?> file) {
+      switch (file.content()) {
+        case DATA:
+          this.dataRecordCount += file.recordCount();
+          this.dataFileCount += 1;
+          this.specId = file.specId();
+          break;
+        case POSITION_DELETES:
+          this.posDeleteRecordCount = file.recordCount();
+          this.posDeleteFileCount += 1;
+          break;

Review Comment:
   updated it now thinking if the delete happens after the partition evolution, it should reflect the latest spec id. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] szehon-ho commented on a diff in pull request #6661: Core: Support delete file stats in partitions metadata table

Posted by "szehon-ho (via GitHub)" <gi...@apache.org>.
szehon-ho commented on code in PR #6661:
URL: https://github.com/apache/iceberg/pull/6661#discussion_r1184411901


##########
spark/v3.4/spark/src/test/java/org/apache/iceberg/spark/source/TestIcebergSourceTablesBase.java:
##########
@@ -1382,6 +1404,76 @@ public void testPartitionsTable() {
       TestHelpers.assertEqualsSafe(
           partitionsTable.schema().asStruct(), expected.get(i), actual.get(i));
     }
+
+    testDeleteStats(tableIdentifier, table, partitionsTable, builder, partitionBuilder, expected);

Review Comment:
   Could we make this another test?  As the existing test is already quite long and hard to read, and it's doing more table modifications inside the new method, seems like it fits another test.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] szehon-ho commented on a diff in pull request #6661: Core: Support delete file stats in partitions metadata table

Posted by "szehon-ho (via GitHub)" <gi...@apache.org>.
szehon-ho commented on code in PR #6661:
URL: https://github.com/apache/iceberg/pull/6661#discussion_r1093539407


##########
core/src/main/java/org/apache/iceberg/PartitionsTable.java:
##########
@@ -47,7 +48,11 @@ public class PartitionsTable extends BaseMetadataTable {
             Types.NestedField.required(1, "partition", Partitioning.partitionType(table)),
             Types.NestedField.required(2, "record_count", Types.LongType.get()),
             Types.NestedField.required(3, "file_count", Types.IntegerType.get()),
-            Types.NestedField.required(4, "spec_id", Types.IntegerType.get()));

Review Comment:
   @RussellSpitzer I think it was , we should keep ids/names but not necessarily order.  Ref: https://github.com/apache/iceberg/pull/3015#issuecomment-951133690
   
   Didn't see we broke anything that time, we resolved https://github.com/apache/iceberg/pull/3015#discussion_r753631713 eventually as a classpath issue with old classes still in play.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on a diff in pull request #6661: Core: Support delete file stats in partitions metadata table

Posted by "RussellSpitzer (via GitHub)" <gi...@apache.org>.
RussellSpitzer commented on code in PR #6661:
URL: https://github.com/apache/iceberg/pull/6661#discussion_r1132659720


##########
core/src/main/java/org/apache/iceberg/PartitionsTable.java:
##########
@@ -220,21 +251,53 @@ Iterable<Partition> all() {
 
   static class Partition {
     private final StructLike key;
-    private long recordCount;
-    private int fileCount;
     private int specId;
+    private long dataRecordCount;
+    private int dataFileCount;
+
+    private final Set<DeleteFile> equalityDeleteFiles;
+    private final Set<DeleteFile> positionDeleteFiles;

Review Comment:
   I think i agree in this case, it's probably best to go forward with just storing their paths.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] szehon-ho commented on a diff in pull request #6661: Core: Support delete file stats in partitions metadata table

Posted by "szehon-ho (via GitHub)" <gi...@apache.org>.
szehon-ho commented on code in PR #6661:
URL: https://github.com/apache/iceberg/pull/6661#discussion_r1125356413


##########
core/src/main/java/org/apache/iceberg/PartitionsTable.java:
##########
@@ -220,21 +251,53 @@ Iterable<Partition> all() {
 
   static class Partition {
     private final StructLike key;
-    private long recordCount;
-    private int fileCount;
     private int specId;
+    private long dataRecordCount;
+    private int dataFileCount;
+
+    private final Set<DeleteFile> equalityDeleteFiles;
+    private final Set<DeleteFile> positionDeleteFiles;

Review Comment:
   Yea it may work due to cache inside DeleteIndex, but I don't know.  I see `Set<DeleteFile>` but it seems those case are where they are limited to not conflict with each other except instance equality.  I would also be interested in the best way here. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] ajantha-bhat closed pull request #6661: Core: Support delete file stats in partitions metadata table

Posted by "ajantha-bhat (via GitHub)" <gi...@apache.org>.
ajantha-bhat closed pull request #6661: Core: Support delete file stats in partitions metadata table
URL: https://github.com/apache/iceberg/pull/6661


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] ajantha-bhat commented on a diff in pull request #6661: Core: Support delete file stats in partitions metadata table

Posted by "ajantha-bhat (via GitHub)" <gi...@apache.org>.
ajantha-bhat commented on code in PR #6661:
URL: https://github.com/apache/iceberg/pull/6661#discussion_r1124002815


##########
core/src/main/java/org/apache/iceberg/PartitionsTable.java:
##########
@@ -220,21 +251,53 @@ Iterable<Partition> all() {
 
   static class Partition {
     private final StructLike key;
-    private long recordCount;
-    private int fileCount;
     private int specId;
+    private long dataRecordCount;
+    private int dataFileCount;
+
+    private final Set<DeleteFile> equalityDeleteFiles;
+    private final Set<DeleteFile> positionDeleteFiles;

Review Comment:
   The same delete file object from the context is reused while making the fileScan here. 
   https://github.com/apache/iceberg/blob/d42d1e89c0616c203f7ad29f002811ddd440e14f/core/src/main/java/org/apache/iceberg/ManifestGroup.java#L351
   
   hence, I believe the default equals() is enough. 
   
   I don't disagree about having the equals and hashcode for `DataFiles` and `DeleteFiles`. But code is widely using the `Set<DataFile>` and `Set<DeleteFile>` already. So, if we are adding it. It should be in a separate PR/discussion. 
   
   Let us see what others think on this. 
   cc: @RussellSpitzer, @jackye1995  



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] jackye1995 commented on a diff in pull request #6661: Core: Support delete file stats in partitions metadata table

Posted by "jackye1995 (via GitHub)" <gi...@apache.org>.
jackye1995 commented on code in PR #6661:
URL: https://github.com/apache/iceberg/pull/6661#discussion_r1143984302


##########
core/src/main/java/org/apache/iceberg/PartitionsTable.java:
##########
@@ -49,7 +50,19 @@ public class PartitionsTable extends BaseMetadataTable {
             Types.NestedField.required(
                 2, "record_count", Types.LongType.get(), "Count of records in data files"),
             Types.NestedField.required(
-                3, "file_count", Types.IntegerType.get(), "Count of data files"));
+                3, "file_count", Types.IntegerType.get(), "Count of data files"),
+            Types.NestedField.required(
+                5,
+                "pos_delete_record_count",
+                Types.LongType.get(),
+                "Count of records in position delete files"),
+            Types.NestedField.required(6, "pos_delete_file_count", Types.IntegerType.get()),

Review Comment:
   can we also add descriptions for the file count fields?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] szehon-ho commented on pull request #6661: Core: Support delete file stats in partitions metadata table

Posted by "szehon-ho (via GitHub)" <gi...@apache.org>.
szehon-ho commented on PR #6661:
URL: https://github.com/apache/iceberg/pull/6661#issuecomment-1478847442

   FYI, we are looking at refactor of existing partitions table to make this task easier ( @dramaticlly also mentioned interest in looking at it as well and may contribute)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] ajantha-bhat commented on a diff in pull request #6661: Core: Support delete file stats in partitions metadata table

Posted by "ajantha-bhat (via GitHub)" <gi...@apache.org>.
ajantha-bhat commented on code in PR #6661:
URL: https://github.com/apache/iceberg/pull/6661#discussion_r1142878835


##########
core/src/main/java/org/apache/iceberg/PartitionsTable.java:
##########
@@ -49,7 +50,19 @@ public class PartitionsTable extends BaseMetadataTable {
             Types.NestedField.required(
                 2, "record_count", Types.LongType.get(), "Count of records in data files"),
             Types.NestedField.required(
-                3, "file_count", Types.IntegerType.get(), "Count of data files"));
+                3, "file_count", Types.IntegerType.get(), "Count of data files"),
+            Types.NestedField.required(
+                5,

Review Comment:
   Note:
   spec_id (4) is present at the top.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] ajantha-bhat commented on a diff in pull request #6661: Core: Support delete file stats in partitions metadata table

Posted by "ajantha-bhat (via GitHub)" <gi...@apache.org>.
ajantha-bhat commented on code in PR #6661:
URL: https://github.com/apache/iceberg/pull/6661#discussion_r1135420154


##########
core/src/main/java/org/apache/iceberg/PartitionsTable.java:
##########
@@ -220,21 +257,53 @@ Iterable<Partition> all() {
 
   static class Partition {
     private final StructLike key;
-    private long recordCount;
-    private int fileCount;
     private int specId;
+    private long dataRecordCount;
+    private int dataFileCount;
+
+    private final Set<DeleteFile> equalityDeleteFiles;
+    private final Set<DeleteFile> positionDeleteFiles;
 
     Partition(StructLike key) {
       this.key = key;
-      this.recordCount = 0;
-      this.fileCount = 0;
       this.specId = 0;
+      this.dataRecordCount = 0;
+      this.dataFileCount = 0;
+      this.positionDeleteFiles = Sets.newHashSet();
+      this.equalityDeleteFiles = Sets.newHashSet();
+    }
+
+    private void update(FileScanTask task) {

Review Comment:
   > Probably the only way to effectively do it , until this whole table is migrated over to some kind of view of 'files' table, is to rewrite the PartitionsTableScan to directly use the underlying code: ManifestReader.readDeleteManifest() / ManifestReader.read(), and then go through those iterators, instead of using the ManifestGroup.planFiles() / FileScanTask way.
   
   @szehon-ho: I have spent some time and realized that I am not super familiar with this side of code.  Would you like to contribute a PR for data files for this? I can then extend it to delete files and handle these stats updates. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org