You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@iceberg.apache.org by "szehon-ho (via GitHub)" <gi...@apache.org> on 2023/03/02 02:37:59 UTC

[GitHub] [iceberg] szehon-ho commented on a diff in pull request #6661: Core: Support delete file stats in partitions metadata table

szehon-ho commented on code in PR #6661:
URL: https://github.com/apache/iceberg/pull/6661#discussion_r1122537693


##########
core/src/main/java/org/apache/iceberg/PartitionsTable.java:
##########
@@ -220,21 +251,53 @@ Iterable<Partition> all() {
 
   static class Partition {
     private final StructLike key;
-    private long recordCount;
-    private int fileCount;
     private int specId;
+    private long dataRecordCount;
+    private int dataFileCount;
+
+    private final Set<DeleteFile> equalityDeleteFiles;
+    private final Set<DeleteFile> positionDeleteFiles;

Review Comment:
   Can you double check DeleteFile equal/hashcode are working correctly, if conflict?  Should we rather do something safer, like paths?



##########
core/src/main/java/org/apache/iceberg/PartitionsTable.java:
##########
@@ -220,21 +251,53 @@ Iterable<Partition> all() {
 
   static class Partition {
     private final StructLike key;
-    private long recordCount;
-    private int fileCount;
     private int specId;
+    private long dataRecordCount;
+    private int dataFileCount;
+
+    private final Set<DeleteFile> equalityDeleteFiles;
+    private final Set<DeleteFile> positionDeleteFiles;
 
     Partition(StructLike key) {
       this.key = key;
-      this.recordCount = 0;
-      this.fileCount = 0;
       this.specId = 0;
+      this.dataRecordCount = 0;
+      this.dataFileCount = 0;
+      this.positionDeleteFiles = Sets.newHashSet();
+      this.equalityDeleteFiles = Sets.newHashSet();
+    }
+
+    private void update(FileScanTask task) {
+      this.dataRecordCount += task.file().recordCount();
+      this.dataFileCount += 1;
+      this.specId = task.file().specId();
+      if (TableScanUtil.hasDeletes(task)) {

Review Comment:
   Why not just, if task.deletes().size() > 0?   I think we are getting FileScanTask here, not CombinedScanTask, isnt it?



##########
core/src/main/java/org/apache/iceberg/PartitionsTable.java:
##########
@@ -45,9 +48,17 @@ public class PartitionsTable extends BaseMetadataTable {
     this.schema =
         new Schema(
             Types.NestedField.required(1, "partition", Partitioning.partitionType(table)),
-            Types.NestedField.required(2, "record_count", Types.LongType.get()),
-            Types.NestedField.required(3, "file_count", Types.IntegerType.get()),
-            Types.NestedField.required(4, "spec_id", Types.IntegerType.get()));
+            Types.NestedField.required(4, "spec_id", Types.IntegerType.get()),
+            Types.NestedField.required(
+                2,
+                "record_count",
+                Types.LongType.get(),
+                "data record count without applying the deletes"),

Review Comment:
   Can we do a positive description like:  "count of records in data files".  Not sure "without applying deletes" sounds very clear.
   
   If we add comments to "pos_delete_record_count" and "eq_delete_record_count". "count of records in position delete files", etc, I think that should be better?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org