You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@iceberg.apache.org by "RussellSpitzer (via GitHub)" <gi...@apache.org> on 2023/01/27 19:08:57 UTC

[GitHub] [iceberg] RussellSpitzer opened a new pull request, #6680: Core: DeleteWithFilter fails on HashCode Collision

RussellSpitzer opened a new pull request, #6680:
URL: https://github.com/apache/iceberg/pull/6680

   Fixes #6670 
   
   When we determine the ResidualEvaluators for manifest entries in file we use a computeIfAbsent method using the file object's PartitionData as a key. The underlying issue here is that when being read using ManifestReader the PartitionData object is reused. This means once placed within the Map the value of the PartitionData changes every time a new entry is read. Because the original hashcode is correct, this isn't a problem until two values collide. Once they do the second key will end up retrieving the value of the first key, the underlying key retrieved will then also be equal because of the ManifestReader Container re-use. If the First Key was "always true" but the second key should be "false" the second key will return true and delete a file it should not.
   
   To fix this we only place a copy of the PartitionData inside the map instead of the PartitionData object itself. We can't use computeIfAbsent unless we want to make a brand new PartitionData object for every entry.
   
   
   To note the reason this has not been hit frequently before is that we must hit all of following conditions
   
   * There needs to be a conflict of structlikewrapper.wrap(PartitionData) hashcodes
   * The two files with the conflicting partition hashcodes MUST be in the same manifest
   * They must have different behaviors for the delete condition (eg if both collision values should be delete or not deleted then it isn't a problem)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] szehon-ho commented on a diff in pull request #6680: Core: DeleteWithFilter fails on HashCode Collision

Posted by "szehon-ho (via GitHub)" <gi...@apache.org>.
szehon-ho commented on code in PR #6680:
URL: https://github.com/apache/iceberg/pull/6680#discussion_r1089477227


##########
core/src/main/java/org/apache/iceberg/ManifestFilterManager.java:
##########
@@ -510,16 +510,19 @@ private Pair<InclusiveMetricsEvaluator, StrictMetricsEvaluator> metricsEvaluator
       // in other words, ResidualEvaluator returns a part of the expression that needs to be
       // evaluated
       // for rows in the given partition using metrics
-      return metricsEvaluators.computeIfAbsent(
-          file.partition(),
-          partition -> {
-            Expression residual = residualEvaluator.residualFor(partition);
-            InclusiveMetricsEvaluator inclusive =
-                new InclusiveMetricsEvaluator(tableSchema, residual, caseSensitive);
-            StrictMetricsEvaluator strict =
-                new StrictMetricsEvaluator(tableSchema, residual, caseSensitive);
-            return Pair.of(inclusive, strict);
-          });
+      PartitionData partition = (PartitionData) file.partition();
+      if (!metricsEvaluators.containsKey(partition)) {

Review Comment:
   OK makes sense



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] szehon-ho commented on a diff in pull request #6680: Core: DeleteWithFilter fails on HashCode Collision

Posted by "szehon-ho (via GitHub)" <gi...@apache.org>.
szehon-ho commented on code in PR #6680:
URL: https://github.com/apache/iceberg/pull/6680#discussion_r1089478498


##########
core/src/test/java/org/apache/iceberg/TestDeleteFiles.java:
##########
@@ -349,6 +352,54 @@ public void testDeleteFilesOnIndependentBranches() {
         statuses(Status.EXISTING, Status.DELETED, Status.DELETED));
   }
 
+  @Test
+  public void testDeleteWithCollision() {
+    Schema schema = new Schema(Types.NestedField.of(0, false, "x", Types.StringType.get()));
+    PartitionSpec spec = PartitionSpec.builderFor(schema).identity("x").build();
+    Table collisionTable =
+        TestTables.create(tableDir, "hashcollision", schema, spec, formatVersion);
+
+    PartitionData partitionOne = new PartitionData(spec.partitionType());
+    partitionOne.set(0, "Aa");
+    PartitionData partitionTwo = new PartitionData(spec.partitionType());
+    partitionTwo.set(0, "BB");
+
+    Assert.assertEquals(
+        StructLikeWrapper.forType(spec.partitionType()).set(partitionOne).hashCode(),
+        StructLikeWrapper.forType(spec.partitionType()).set(partitionTwo).hashCode());
+
+    Metrics metrics = new Metrics(1L, null, null, null, null);
+
+    DataFile testFileOne =
+        DataFiles.builder(spec)
+            .withPartition(partitionOne)
+            .withPath("/g1.parquet")
+            .withFileSizeInBytes(100)
+            .withMetrics(metrics)

Review Comment:
   Can't we remove metrics variable to simplify the test?  
   
   I see other DataFiles being built in TableTestBase without it, wonder why we need it here.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on a diff in pull request #6680: Core: DeleteWithFilter fails on HashCode Collision

Posted by "RussellSpitzer (via GitHub)" <gi...@apache.org>.
RussellSpitzer commented on code in PR #6680:
URL: https://github.com/apache/iceberg/pull/6680#discussion_r1089811943


##########
core/src/main/java/org/apache/iceberg/ManifestFilterManager.java:
##########
@@ -510,16 +510,19 @@ private Pair<InclusiveMetricsEvaluator, StrictMetricsEvaluator> metricsEvaluator
       // in other words, ResidualEvaluator returns a part of the expression that needs to be
       // evaluated
       // for rows in the given partition using metrics
-      return metricsEvaluators.computeIfAbsent(
-          file.partition(),
-          partition -> {
-            Expression residual = residualEvaluator.residualFor(partition);
-            InclusiveMetricsEvaluator inclusive =
-                new InclusiveMetricsEvaluator(tableSchema, residual, caseSensitive);
-            StrictMetricsEvaluator strict =
-                new StrictMetricsEvaluator(tableSchema, residual, caseSensitive);
-            return Pair.of(inclusive, strict);
-          });
+      PartitionData partition = (PartitionData) file.partition();

Review Comment:
   Using StructCopy is actually going to be a bit of a bigger change if we want to keep the interface package private since we need to move it into the `org.apache.iceberg` package.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on a diff in pull request #6680: Core: DeleteWithFilter fails on HashCode Collision

Posted by "RussellSpitzer (via GitHub)" <gi...@apache.org>.
RussellSpitzer commented on code in PR #6680:
URL: https://github.com/apache/iceberg/pull/6680#discussion_r1089416208


##########
core/src/test/java/org/apache/iceberg/TestDeleteFiles.java:
##########
@@ -349,6 +352,54 @@ public void testDeleteFilesOnIndependentBranches() {
         statuses(Status.EXISTING, Status.DELETED, Status.DELETED));
   }
 
+  @Test
+  public void testDeleteWithCollision() {
+    Schema schema = new Schema(Types.NestedField.of(0, false, "x", Types.StringType.get()));
+    PartitionSpec spec = PartitionSpec.builderFor(schema).identity("x").build();
+    Table collisionTable =
+        TestTables.create(tableDir, "hashcollision", schema, spec, formatVersion);
+
+    PartitionData partitionOne = new PartitionData(spec.partitionType());
+    partitionOne.set(0, "Aa");
+    PartitionData partitionTwo = new PartitionData(spec.partitionType());
+    partitionTwo.set(0, "BB");
+
+    Assert.assertEquals(
+        StructLikeWrapper.forType(spec.partitionType()).set(partitionOne).hashCode(),
+        StructLikeWrapper.forType(spec.partitionType()).set(partitionTwo).hashCode());
+
+    Metrics metrics = new Metrics(1L, null, null, null, null);
+
+    DataFile testFileOne =
+        DataFiles.builder(spec)
+            .withPartition(partitionOne)
+            .withPath("/g1.parquet")
+            .withFileSizeInBytes(100)
+            .withMetrics(metrics)
+            .build();
+    DataFile testFileTwo =
+        DataFiles.builder(spec)
+            .withPartition(partitionTwo)
+            .withMetrics(metrics)
+            .withFileSizeInBytes(100)
+            .withPath("/g2.parquet")
+            .build();
+
+    collisionTable.newFastAppend().appendFile(testFileOne).appendFile(testFileTwo).commit();
+
+    Assert.assertEquals(
+        "We should have two files",
+        2,
+        Lists.newArrayList(collisionTable.newScan().planFiles().iterator()).size());
+
+    collisionTable.newDelete().deleteFromRowFilter(Expressions.equal("x", "BB")).commit();
+
+    Assert.assertEquals(
+        "We should have deleted one of them",
+        1,
+        Lists.newArrayList(collisionTable.newScan().planFiles().iterator()).size());

Review Comment:
   Sure, that's a great idea.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] szehon-ho commented on a diff in pull request #6680: Core: DeleteWithFilter fails on HashCode Collision

Posted by "szehon-ho (via GitHub)" <gi...@apache.org>.
szehon-ho commented on code in PR #6680:
URL: https://github.com/apache/iceberg/pull/6680#discussion_r1089478498


##########
core/src/test/java/org/apache/iceberg/TestDeleteFiles.java:
##########
@@ -349,6 +352,54 @@ public void testDeleteFilesOnIndependentBranches() {
         statuses(Status.EXISTING, Status.DELETED, Status.DELETED));
   }
 
+  @Test
+  public void testDeleteWithCollision() {
+    Schema schema = new Schema(Types.NestedField.of(0, false, "x", Types.StringType.get()));
+    PartitionSpec spec = PartitionSpec.builderFor(schema).identity("x").build();
+    Table collisionTable =
+        TestTables.create(tableDir, "hashcollision", schema, spec, formatVersion);
+
+    PartitionData partitionOne = new PartitionData(spec.partitionType());
+    partitionOne.set(0, "Aa");
+    PartitionData partitionTwo = new PartitionData(spec.partitionType());
+    partitionTwo.set(0, "BB");
+
+    Assert.assertEquals(
+        StructLikeWrapper.forType(spec.partitionType()).set(partitionOne).hashCode(),
+        StructLikeWrapper.forType(spec.partitionType()).set(partitionTwo).hashCode());
+
+    Metrics metrics = new Metrics(1L, null, null, null, null);
+
+    DataFile testFileOne =
+        DataFiles.builder(spec)
+            .withPartition(partitionOne)
+            .withPath("/g1.parquet")
+            .withFileSizeInBytes(100)
+            .withMetrics(metrics)

Review Comment:
   Can we remove metrics variable to simplify the test?  
   
   I see other DataFiles being built in TableTestBase without it.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] dramaticlly commented on a diff in pull request #6680: Core: DeleteWithFilter fails on HashCode Collision

Posted by "dramaticlly (via GitHub)" <gi...@apache.org>.
dramaticlly commented on code in PR #6680:
URL: https://github.com/apache/iceberg/pull/6680#discussion_r1089341074


##########
core/src/test/java/org/apache/iceberg/TestDeleteFiles.java:
##########
@@ -349,6 +352,54 @@ public void testDeleteFilesOnIndependentBranches() {
         statuses(Status.EXISTING, Status.DELETED, Status.DELETED));
   }
 
+  @Test
+  public void testDeleteWithCollision() {
+    Schema schema = new Schema(Types.NestedField.of(0, false, "x", Types.StringType.get()));
+    PartitionSpec spec = PartitionSpec.builderFor(schema).identity("x").build();
+    Table collisionTable =
+        TestTables.create(tableDir, "hashcollision", schema, spec, formatVersion);
+
+    PartitionData partitionOne = new PartitionData(spec.partitionType());
+    partitionOne.set(0, "Aa");
+    PartitionData partitionTwo = new PartitionData(spec.partitionType());
+    partitionTwo.set(0, "BB");
+
+    Assert.assertEquals(
+        StructLikeWrapper.forType(spec.partitionType()).set(partitionOne).hashCode(),
+        StructLikeWrapper.forType(spec.partitionType()).set(partitionTwo).hashCode());
+
+    Metrics metrics = new Metrics(1L, null, null, null, null);
+
+    DataFile testFileOne =
+        DataFiles.builder(spec)
+            .withPartition(partitionOne)
+            .withPath("/g1.parquet")
+            .withFileSizeInBytes(100)
+            .withMetrics(metrics)
+            .build();
+    DataFile testFileTwo =
+        DataFiles.builder(spec)
+            .withPartition(partitionTwo)
+            .withMetrics(metrics)
+            .withFileSizeInBytes(100)
+            .withPath("/g2.parquet")
+            .build();
+
+    collisionTable.newFastAppend().appendFile(testFileOne).appendFile(testFileTwo).commit();
+
+    Assert.assertEquals(
+        "We should have two files",
+        2,
+        Lists.newArrayList(collisionTable.newScan().planFiles().iterator()).size());
+
+    collisionTable.newDelete().deleteFromRowFilter(Expressions.equal("x", "BB")).commit();
+
+    Assert.assertEquals(
+        "We should have deleted one of them",
+        1,
+        Lists.newArrayList(collisionTable.newScan().planFiles().iterator()).size());

Review Comment:
   nit: might be a bit redundant, but I am curious if we shall verify the remaining files is correct and we deleted the desired partition instead?
   
   I am thinking about maybe add something like 
   ```java
   Assert.assertEquals(
     partitionOne,
     collisionTable.newScan().planFiles().iterator().next().file().partition());
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on a diff in pull request #6680: Core: DeleteWithFilter fails on HashCode Collision

Posted by "RussellSpitzer (via GitHub)" <gi...@apache.org>.
RussellSpitzer commented on code in PR #6680:
URL: https://github.com/apache/iceberg/pull/6680#discussion_r1089493935


##########
core/src/test/java/org/apache/iceberg/TestDeleteFiles.java:
##########
@@ -349,6 +355,56 @@ public void testDeleteFilesOnIndependentBranches() {
         statuses(Status.EXISTING, Status.DELETED, Status.DELETED));
   }
 
+  @Test
+  public void testDeleteWithCollision() {
+    Schema schema = new Schema(Types.NestedField.of(0, false, "x", Types.StringType.get()));
+    PartitionSpec spec = PartitionSpec.builderFor(schema).identity("x").build();
+    Table collisionTable =
+        TestTables.create(tableDir, "hashcollision", schema, spec, formatVersion);
+
+    PartitionData partitionOne = new PartitionData(spec.partitionType());
+    partitionOne.set(0, "Aa");
+    PartitionData partitionTwo = new PartitionData(spec.partitionType());
+    partitionTwo.set(0, "BB");
+
+    Assert.assertEquals(
+        StructLikeWrapper.forType(spec.partitionType()).set(partitionOne).hashCode(),
+        StructLikeWrapper.forType(spec.partitionType()).set(partitionTwo).hashCode());
+
+    DataFile testFileOne =
+        DataFiles.builder(spec)
+            .withPartition(partitionOne)
+            .withPath("/g1.parquet")
+            .withFileSizeInBytes(100)
+            .withRecordCount(1)
+            .build();
+
+    DataFile testFileTwo =
+        DataFiles.builder(spec)
+            .withPartition(partitionTwo)
+            .withRecordCount(1)
+            .withFileSizeInBytes(100)
+            .withPath("/g2.parquet")
+            .build();
+
+    collisionTable.newFastAppend().appendFile(testFileOne).appendFile(testFileTwo).commit();
+
+    Assert.assertEquals(
+        "We should have two files",
+        2,
+        Lists.newArrayList(collisionTable.newScan().planFiles().iterator()).size());
+
+    collisionTable.newDelete().deleteFromRowFilter(Expressions.equal("x", "BB")).commit();
+
+    List<StructLike> partitions =
+        Lists.newArrayList(collisionTable.newScan().planFiles().iterator()).stream()
+            .map(s -> ((PartitionData) s.partition()).copy())

Review Comment:
   Not that it matters here but let's be a good example



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a diff in pull request #6680: Core: DeleteWithFilter fails on HashCode Collision

Posted by "rdblue (via GitHub)" <gi...@apache.org>.
rdblue commented on code in PR #6680:
URL: https://github.com/apache/iceberg/pull/6680#discussion_r1089805842


##########
core/src/main/java/org/apache/iceberg/ManifestFilterManager.java:
##########
@@ -510,16 +510,19 @@ private Pair<InclusiveMetricsEvaluator, StrictMetricsEvaluator> metricsEvaluator
       // in other words, ResidualEvaluator returns a part of the expression that needs to be
       // evaluated
       // for rows in the given partition using metrics
-      return metricsEvaluators.computeIfAbsent(
-          file.partition(),
-          partition -> {
-            Expression residual = residualEvaluator.residualFor(partition);
-            InclusiveMetricsEvaluator inclusive =
-                new InclusiveMetricsEvaluator(tableSchema, residual, caseSensitive);
-            StrictMetricsEvaluator strict =
-                new StrictMetricsEvaluator(tableSchema, residual, caseSensitive);
-            return Pair.of(inclusive, strict);
-          });
+      PartitionData partition = (PartitionData) file.partition();

Review Comment:
   Should we use `StructLikeMap` here instead of relying on `PartitionData` in a regular map? We could use a pattern like in `FanoutWriter` to keep these by partition spec ID.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] szehon-ho commented on a diff in pull request #6680: Core: DeleteWithFilter fails on HashCode Collision

Posted by "szehon-ho (via GitHub)" <gi...@apache.org>.
szehon-ho commented on code in PR #6680:
URL: https://github.com/apache/iceberg/pull/6680#discussion_r1089519883


##########
core/src/test/java/org/apache/iceberg/TestDeleteFiles.java:
##########
@@ -349,6 +355,56 @@ public void testDeleteFilesOnIndependentBranches() {
         statuses(Status.EXISTING, Status.DELETED, Status.DELETED));
   }
 
+  @Test
+  public void testDeleteWithCollision() {
+    Schema schema = new Schema(Types.NestedField.of(0, false, "x", Types.StringType.get()));
+    PartitionSpec spec = PartitionSpec.builderFor(schema).identity("x").build();
+    Table collisionTable =
+        TestTables.create(tableDir, "hashcollision", schema, spec, formatVersion);
+
+    PartitionData partitionOne = new PartitionData(spec.partitionType());
+    partitionOne.set(0, "Aa");
+    PartitionData partitionTwo = new PartitionData(spec.partitionType());
+    partitionTwo.set(0, "BB");
+
+    Assert.assertEquals(
+        StructLikeWrapper.forType(spec.partitionType()).set(partitionOne).hashCode(),
+        StructLikeWrapper.forType(spec.partitionType()).set(partitionTwo).hashCode());
+
+    DataFile testFileOne =
+        DataFiles.builder(spec)
+            .withPartition(partitionOne)
+            .withPath("/g1.parquet")
+            .withFileSizeInBytes(100)
+            .withRecordCount(1)
+            .build();
+
+    DataFile testFileTwo =
+        DataFiles.builder(spec)
+            .withPartition(partitionTwo)
+            .withRecordCount(1)
+            .withFileSizeInBytes(100)
+            .withPath("/g2.parquet")
+            .build();
+
+    collisionTable.newFastAppend().appendFile(testFileOne).appendFile(testFileTwo).commit();
+
+    Assert.assertEquals(
+        "We should have two files",
+        2,
+        Lists.newArrayList(collisionTable.newScan().planFiles().iterator()).size());
+
+    collisionTable.newDelete().deleteFromRowFilter(Expressions.equal("x", "BB")).commit();
+
+    List<StructLike> partitions =
+        Lists.newArrayList(collisionTable.newScan().planFiles().iterator()).stream()
+            .map(s -> ((PartitionData) s.partition()).copy())
+            .collect(Collectors.toList());
+
+    Assert.assertEquals(
+        "We should have deleted partitionTwo of them", ImmutableList.of(partitionOne), partitions);

Review Comment:
   Run on here?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on a diff in pull request #6680: Core: DeleteWithFilter fails on HashCode Collision

Posted by "RussellSpitzer (via GitHub)" <gi...@apache.org>.
RussellSpitzer commented on code in PR #6680:
URL: https://github.com/apache/iceberg/pull/6680#discussion_r1089813726


##########
core/src/main/java/org/apache/iceberg/ManifestFilterManager.java:
##########
@@ -510,16 +510,19 @@ private Pair<InclusiveMetricsEvaluator, StrictMetricsEvaluator> metricsEvaluator
       // in other words, ResidualEvaluator returns a part of the expression that needs to be
       // evaluated
       // for rows in the given partition using metrics
-      return metricsEvaluators.computeIfAbsent(
-          file.partition(),
-          partition -> {
-            Expression residual = residualEvaluator.residualFor(partition);
-            InclusiveMetricsEvaluator inclusive =
-                new InclusiveMetricsEvaluator(tableSchema, residual, caseSensitive);
-            StrictMetricsEvaluator strict =
-                new StrictMetricsEvaluator(tableSchema, residual, caseSensitive);
-            return Pair.of(inclusive, strict);
-          });
+      PartitionData partition = (PartitionData) file.partition();

Review Comment:
   Ok I found a bug in StructCopy as well, it doesn't copy FixedByte types and just copies array references so we actually probably need to file issue for that.
   
   It should be doing this
   
   From PartitionData.copyData
   ```java
             case FIXED:
               byte[] buffer = (byte[]) data[i];
               copy[i] = Arrays.copyOf(buffer, buffer.length);
               break;
    ```
   
   but instead the StructCopy code always does this
   ```java
         if (value instanceof StructLike) {
           values[i] = copy((StructLike) value);
         } else {
           values[i] = value;
         }
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer merged pull request #6680: Core: DeleteWithFilter fails on HashCode Collision

Posted by "RussellSpitzer (via GitHub)" <gi...@apache.org>.
RussellSpitzer merged PR #6680:
URL: https://github.com/apache/iceberg/pull/6680


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] szehon-ho commented on a diff in pull request #6680: Core: DeleteWithFilter fails on HashCode Collision

Posted by "szehon-ho (via GitHub)" <gi...@apache.org>.
szehon-ho commented on code in PR #6680:
URL: https://github.com/apache/iceberg/pull/6680#discussion_r1089385138


##########
core/src/main/java/org/apache/iceberg/ManifestFilterManager.java:
##########
@@ -510,16 +510,19 @@ private Pair<InclusiveMetricsEvaluator, StrictMetricsEvaluator> metricsEvaluator
       // in other words, ResidualEvaluator returns a part of the expression that needs to be
       // evaluated
       // for rows in the given partition using metrics
-      return metricsEvaluators.computeIfAbsent(
-          file.partition(),
-          partition -> {
-            Expression residual = residualEvaluator.residualFor(partition);
-            InclusiveMetricsEvaluator inclusive =
-                new InclusiveMetricsEvaluator(tableSchema, residual, caseSensitive);
-            StrictMetricsEvaluator strict =
-                new StrictMetricsEvaluator(tableSchema, residual, caseSensitive);
-            return Pair.of(inclusive, strict);
-          });
+      PartitionData partition = (PartitionData) file.partition();
+      if (!metricsEvaluators.containsKey(partition)) {

Review Comment:
   Will the fix still work if we just keep computeIfAbsent, but pass file.partition().copy()?



##########
core/src/test/java/org/apache/iceberg/TestDeleteFiles.java:
##########
@@ -349,6 +352,54 @@ public void testDeleteFilesOnIndependentBranches() {
         statuses(Status.EXISTING, Status.DELETED, Status.DELETED));
   }
 
+  @Test
+  public void testDeleteWithCollision() {
+    Schema schema = new Schema(Types.NestedField.of(0, false, "x", Types.StringType.get()));
+    PartitionSpec spec = PartitionSpec.builderFor(schema).identity("x").build();
+    Table collisionTable =
+        TestTables.create(tableDir, "hashcollision", schema, spec, formatVersion);
+
+    PartitionData partitionOne = new PartitionData(spec.partitionType());
+    partitionOne.set(0, "Aa");
+    PartitionData partitionTwo = new PartitionData(spec.partitionType());
+    partitionTwo.set(0, "BB");
+
+    Assert.assertEquals(
+        StructLikeWrapper.forType(spec.partitionType()).set(partitionOne).hashCode(),
+        StructLikeWrapper.forType(spec.partitionType()).set(partitionTwo).hashCode());
+
+    Metrics metrics = new Metrics(1L, null, null, null, null);
+
+    DataFile testFileOne =
+        DataFiles.builder(spec)
+            .withPartition(partitionOne)
+            .withPath("/g1.parquet")
+            .withFileSizeInBytes(100)
+            .withMetrics(metrics)

Review Comment:
   Is metrics strictly required here?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on a diff in pull request #6680: Core: DeleteWithFilter fails on HashCode Collision

Posted by "RussellSpitzer (via GitHub)" <gi...@apache.org>.
RussellSpitzer commented on code in PR #6680:
URL: https://github.com/apache/iceberg/pull/6680#discussion_r1089817726


##########
core/src/main/java/org/apache/iceberg/ManifestFilterManager.java:
##########
@@ -510,16 +510,19 @@ private Pair<InclusiveMetricsEvaluator, StrictMetricsEvaluator> metricsEvaluator
       // in other words, ResidualEvaluator returns a part of the expression that needs to be
       // evaluated
       // for rows in the given partition using metrics
-      return metricsEvaluators.computeIfAbsent(
-          file.partition(),
-          partition -> {
-            Expression residual = residualEvaluator.residualFor(partition);
-            InclusiveMetricsEvaluator inclusive =
-                new InclusiveMetricsEvaluator(tableSchema, residual, caseSensitive);
-            StrictMetricsEvaluator strict =
-                new StrictMetricsEvaluator(tableSchema, residual, caseSensitive);
-            return Pair.of(inclusive, strict);
-          });
+      PartitionData partition = (PartitionData) file.partition();

Review Comment:
   Filed #6685 to track StructCopy bug



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on pull request #6680: Core: DeleteWithFilter fails on HashCode Collision

Posted by "RussellSpitzer (via GitHub)" <gi...@apache.org>.
RussellSpitzer commented on PR #6680:
URL: https://github.com/apache/iceberg/pull/6680#issuecomment-1407483685

   > Looks correct to me, though we may want to move to `StructLikeMap` that was introduced later.
   
   `metricsEvaluators` is a struct like map which when putting copies the Wrapper but not the internal struct data. Since the underlying object reference is not copied we have the same issue. 
   
   https://github.com/apache/iceberg/blob/master/core/src/main/java/org/apache/iceberg/util/StructLikeMap.java#L86
   
   and then
   
   https://github.com/apache/iceberg/blob/a76724f7ae7ff3722fea3e1145aa0a978c721b45/core/src/main/java/org/apache/iceberg/util/StructLikeWrapper.java#L59 
   
   Since the "set" method just assigns `this.struct = newStruct` we have this issue.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on a diff in pull request #6680: Core: DeleteWithFilter fails on HashCode Collision

Posted by "RussellSpitzer (via GitHub)" <gi...@apache.org>.
RussellSpitzer commented on code in PR #6680:
URL: https://github.com/apache/iceberg/pull/6680#discussion_r1089416894


##########
core/src/test/java/org/apache/iceberg/TestDeleteFiles.java:
##########
@@ -349,6 +352,54 @@ public void testDeleteFilesOnIndependentBranches() {
         statuses(Status.EXISTING, Status.DELETED, Status.DELETED));
   }
 
+  @Test
+  public void testDeleteWithCollision() {
+    Schema schema = new Schema(Types.NestedField.of(0, false, "x", Types.StringType.get()));
+    PartitionSpec spec = PartitionSpec.builderFor(schema).identity("x").build();
+    Table collisionTable =
+        TestTables.create(tableDir, "hashcollision", schema, spec, formatVersion);
+
+    PartitionData partitionOne = new PartitionData(spec.partitionType());
+    partitionOne.set(0, "Aa");
+    PartitionData partitionTwo = new PartitionData(spec.partitionType());
+    partitionTwo.set(0, "BB");
+
+    Assert.assertEquals(
+        StructLikeWrapper.forType(spec.partitionType()).set(partitionOne).hashCode(),
+        StructLikeWrapper.forType(spec.partitionType()).set(partitionTwo).hashCode());
+
+    Metrics metrics = new Metrics(1L, null, null, null, null);
+
+    DataFile testFileOne =
+        DataFiles.builder(spec)
+            .withPartition(partitionOne)
+            .withPath("/g1.parquet")
+            .withFileSizeInBytes(100)
+            .withMetrics(metrics)

Review Comment:
   Yep (just row count really) but it's required by DeleteFileBuilder



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] dramaticlly commented on a diff in pull request #6680: Core: DeleteWithFilter fails on HashCode Collision

Posted by "dramaticlly (via GitHub)" <gi...@apache.org>.
dramaticlly commented on code in PR #6680:
URL: https://github.com/apache/iceberg/pull/6680#discussion_r1089341074


##########
core/src/test/java/org/apache/iceberg/TestDeleteFiles.java:
##########
@@ -349,6 +352,54 @@ public void testDeleteFilesOnIndependentBranches() {
         statuses(Status.EXISTING, Status.DELETED, Status.DELETED));
   }
 
+  @Test
+  public void testDeleteWithCollision() {
+    Schema schema = new Schema(Types.NestedField.of(0, false, "x", Types.StringType.get()));
+    PartitionSpec spec = PartitionSpec.builderFor(schema).identity("x").build();
+    Table collisionTable =
+        TestTables.create(tableDir, "hashcollision", schema, spec, formatVersion);
+
+    PartitionData partitionOne = new PartitionData(spec.partitionType());
+    partitionOne.set(0, "Aa");
+    PartitionData partitionTwo = new PartitionData(spec.partitionType());
+    partitionTwo.set(0, "BB");
+
+    Assert.assertEquals(
+        StructLikeWrapper.forType(spec.partitionType()).set(partitionOne).hashCode(),
+        StructLikeWrapper.forType(spec.partitionType()).set(partitionTwo).hashCode());
+
+    Metrics metrics = new Metrics(1L, null, null, null, null);
+
+    DataFile testFileOne =
+        DataFiles.builder(spec)
+            .withPartition(partitionOne)
+            .withPath("/g1.parquet")
+            .withFileSizeInBytes(100)
+            .withMetrics(metrics)
+            .build();
+    DataFile testFileTwo =
+        DataFiles.builder(spec)
+            .withPartition(partitionTwo)
+            .withMetrics(metrics)
+            .withFileSizeInBytes(100)
+            .withPath("/g2.parquet")
+            .build();
+
+    collisionTable.newFastAppend().appendFile(testFileOne).appendFile(testFileTwo).commit();
+
+    Assert.assertEquals(
+        "We should have two files",
+        2,
+        Lists.newArrayList(collisionTable.newScan().planFiles().iterator()).size());
+
+    collisionTable.newDelete().deleteFromRowFilter(Expressions.equal("x", "BB")).commit();
+
+    Assert.assertEquals(
+        "We should have deleted one of them",
+        1,
+        Lists.newArrayList(collisionTable.newScan().planFiles().iterator()).size());

Review Comment:
   nit: might be a bit redundant, but I am curious if we shall verify the remaining files is correct and we deleted the desired partition instead?
   
   I am thinking about maybe something like 
   ```java
   Assert.assertEquals(
     partitionOne,
     collisionTable.newScan().planFiles().iterator().next().file().partition());
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on a diff in pull request #6680: Core: DeleteWithFilter fails on HashCode Collision

Posted by "RussellSpitzer (via GitHub)" <gi...@apache.org>.
RussellSpitzer commented on code in PR #6680:
URL: https://github.com/apache/iceberg/pull/6680#discussion_r1089810563


##########
core/src/test/java/org/apache/iceberg/TestDeleteFiles.java:
##########
@@ -349,6 +355,56 @@ public void testDeleteFilesOnIndependentBranches() {
         statuses(Status.EXISTING, Status.DELETED, Status.DELETED));
   }
 
+  @Test
+  public void testDeleteWithCollision() {
+    Schema schema = new Schema(Types.NestedField.of(0, false, "x", Types.StringType.get()));
+    PartitionSpec spec = PartitionSpec.builderFor(schema).identity("x").build();
+    Table collisionTable =
+        TestTables.create(tableDir, "hashcollision", schema, spec, formatVersion);
+
+    PartitionData partitionOne = new PartitionData(spec.partitionType());
+    partitionOne.set(0, "Aa");
+    PartitionData partitionTwo = new PartitionData(spec.partitionType());
+    partitionTwo.set(0, "BB");
+
+    Assert.assertEquals(
+        StructLikeWrapper.forType(spec.partitionType()).set(partitionOne).hashCode(),
+        StructLikeWrapper.forType(spec.partitionType()).set(partitionTwo).hashCode());
+
+    DataFile testFileOne =
+        DataFiles.builder(spec)
+            .withPartition(partitionOne)
+            .withPath("/g1.parquet")
+            .withFileSizeInBytes(100)
+            .withRecordCount(1)
+            .build();
+
+    DataFile testFileTwo =
+        DataFiles.builder(spec)
+            .withPartition(partitionTwo)
+            .withRecordCount(1)
+            .withFileSizeInBytes(100)
+            .withPath("/g2.parquet")
+            .build();
+
+    collisionTable.newFastAppend().appendFile(testFileOne).appendFile(testFileTwo).commit();
+
+    Assert.assertEquals(
+        "We should have two files",
+        2,
+        Lists.newArrayList(collisionTable.newScan().planFiles().iterator()).size());

Review Comment:
   Changed this to explicitly check for partitionOne and Two



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on a diff in pull request #6680: Core: DeleteWithFilter fails on HashCode Collision

Posted by "RussellSpitzer (via GitHub)" <gi...@apache.org>.
RussellSpitzer commented on code in PR #6680:
URL: https://github.com/apache/iceberg/pull/6680#discussion_r1089813726


##########
core/src/main/java/org/apache/iceberg/ManifestFilterManager.java:
##########
@@ -510,16 +510,19 @@ private Pair<InclusiveMetricsEvaluator, StrictMetricsEvaluator> metricsEvaluator
       // in other words, ResidualEvaluator returns a part of the expression that needs to be
       // evaluated
       // for rows in the given partition using metrics
-      return metricsEvaluators.computeIfAbsent(
-          file.partition(),
-          partition -> {
-            Expression residual = residualEvaluator.residualFor(partition);
-            InclusiveMetricsEvaluator inclusive =
-                new InclusiveMetricsEvaluator(tableSchema, residual, caseSensitive);
-            StrictMetricsEvaluator strict =
-                new StrictMetricsEvaluator(tableSchema, residual, caseSensitive);
-            return Pair.of(inclusive, strict);
-          });
+      PartitionData partition = (PartitionData) file.partition();

Review Comment:
   Ok I found a bug in StructCopy as well, it doesn't copy FixedByte types and just copies array values so we actually probably need to file a issue for that.
   
   It should be doing this
   
   From PartitionData.copyData
   ```java
             case FIXED:
               byte[] buffer = (byte[]) data[i];
               copy[i] = Arrays.copyOf(buffer, buffer.length);
               break;
    ```
   
   but instead the StructCopy code always does this
   ```java
         if (value instanceof StructLike) {
           values[i] = copy((StructLike) value);
         } else {
           values[i] = value;
         }
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on pull request #6680: Core: DeleteWithFilter fails on HashCode Collision

Posted by "RussellSpitzer (via GitHub)" <gi...@apache.org>.
RussellSpitzer commented on PR #6680:
URL: https://github.com/apache/iceberg/pull/6680#issuecomment-1408917506

   Merged! Thanks @amogh-jahagirdar , @rdblue , @stevenzwu , @abmo-x , @szehon-ho, @dramaticlly .
   
   I would also again like to greatly credit Steven and Szehon for their hard work in debugging this issue. Once more, I would like to say Szehon you were right, it was a hashcode collision issue :)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on a diff in pull request #6680: Core: DeleteWithFilter fails on HashCode Collision

Posted by "RussellSpitzer (via GitHub)" <gi...@apache.org>.
RussellSpitzer commented on code in PR #6680:
URL: https://github.com/apache/iceberg/pull/6680#discussion_r1089806242


##########
core/src/main/java/org/apache/iceberg/ManifestFilterManager.java:
##########
@@ -510,16 +510,19 @@ private Pair<InclusiveMetricsEvaluator, StrictMetricsEvaluator> metricsEvaluator
       // in other words, ResidualEvaluator returns a part of the expression that needs to be
       // evaluated
       // for rows in the given partition using metrics
-      return metricsEvaluators.computeIfAbsent(
-          file.partition(),
-          partition -> {
-            Expression residual = residualEvaluator.residualFor(partition);
-            InclusiveMetricsEvaluator inclusive =
-                new InclusiveMetricsEvaluator(tableSchema, residual, caseSensitive);
-            StrictMetricsEvaluator strict =
-                new StrictMetricsEvaluator(tableSchema, residual, caseSensitive);
-            return Pair.of(inclusive, strict);
-          });
+      PartitionData partition = (PartitionData) file.partition();

Review Comment:
   It is a StructLikeMap, this actually causes the collisions to happen more frequently because of the structlikewrapper hashcode function



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on pull request #6680: Core: DeleteWithFilter fails on HashCode Collision

Posted by "RussellSpitzer (via GitHub)" <gi...@apache.org>.
RussellSpitzer commented on PR #6680:
URL: https://github.com/apache/iceberg/pull/6680#issuecomment-1406959020

   @aokolnychyi + @dramaticlly @szehon-ho 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on a diff in pull request #6680: Core: DeleteWithFilter fails on HashCode Collision

Posted by "RussellSpitzer (via GitHub)" <gi...@apache.org>.
RussellSpitzer commented on code in PR #6680:
URL: https://github.com/apache/iceberg/pull/6680#discussion_r1089416606


##########
core/src/main/java/org/apache/iceberg/ManifestFilterManager.java:
##########
@@ -510,16 +510,19 @@ private Pair<InclusiveMetricsEvaluator, StrictMetricsEvaluator> metricsEvaluator
       // in other words, ResidualEvaluator returns a part of the expression that needs to be
       // evaluated
       // for rows in the given partition using metrics
-      return metricsEvaluators.computeIfAbsent(
-          file.partition(),
-          partition -> {
-            Expression residual = residualEvaluator.residualFor(partition);
-            InclusiveMetricsEvaluator inclusive =
-                new InclusiveMetricsEvaluator(tableSchema, residual, caseSensitive);
-            StrictMetricsEvaluator strict =
-                new StrictMetricsEvaluator(tableSchema, residual, caseSensitive);
-            return Pair.of(inclusive, strict);
-          });
+      PartitionData partition = (PartitionData) file.partition();
+      if (!metricsEvaluators.containsKey(partition)) {

Review Comment:
   Yes but then we make a copy of every single partition object, so we basically lose all benefit from reuse containers



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on a diff in pull request #6680: Core: DeleteWithFilter fails on HashCode Collision

Posted by "RussellSpitzer (via GitHub)" <gi...@apache.org>.
RussellSpitzer commented on code in PR #6680:
URL: https://github.com/apache/iceberg/pull/6680#discussion_r1089811146


##########
core/src/main/java/org/apache/iceberg/ManifestFilterManager.java:
##########
@@ -510,16 +510,19 @@ private Pair<InclusiveMetricsEvaluator, StrictMetricsEvaluator> metricsEvaluator
       // in other words, ResidualEvaluator returns a part of the expression that needs to be
       // evaluated
       // for rows in the given partition using metrics
-      return metricsEvaluators.computeIfAbsent(
-          file.partition(),
-          partition -> {
-            Expression residual = residualEvaluator.residualFor(partition);
-            InclusiveMetricsEvaluator inclusive =
-                new InclusiveMetricsEvaluator(tableSchema, residual, caseSensitive);
-            StrictMetricsEvaluator strict =
-                new StrictMetricsEvaluator(tableSchema, residual, caseSensitive);
-            return Pair.of(inclusive, strict);
-          });
+      PartitionData partition = (PartitionData) file.partition();

Review Comment:
   Checking fanout writer, it actually does the same pattern I implemented here, but uses StructCopy.copy instead of the PartitionData.copy method. That may be a safer thing to do since technically we are not guaranteed the PartitionData type.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] stevenzwu commented on a diff in pull request #6680: Core: DeleteWithFilter fails on HashCode Collision

Posted by "stevenzwu (via GitHub)" <gi...@apache.org>.
stevenzwu commented on code in PR #6680:
URL: https://github.com/apache/iceberg/pull/6680#discussion_r1089643840


##########
core/src/test/java/org/apache/iceberg/TestDeleteFiles.java:
##########
@@ -349,6 +355,56 @@ public void testDeleteFilesOnIndependentBranches() {
         statuses(Status.EXISTING, Status.DELETED, Status.DELETED));
   }
 
+  @Test
+  public void testDeleteWithCollision() {
+    Schema schema = new Schema(Types.NestedField.of(0, false, "x", Types.StringType.get()));
+    PartitionSpec spec = PartitionSpec.builderFor(schema).identity("x").build();
+    Table collisionTable =
+        TestTables.create(tableDir, "hashcollision", schema, spec, formatVersion);
+
+    PartitionData partitionOne = new PartitionData(spec.partitionType());
+    partitionOne.set(0, "Aa");
+    PartitionData partitionTwo = new PartitionData(spec.partitionType());
+    partitionTwo.set(0, "BB");
+
+    Assert.assertEquals(
+        StructLikeWrapper.forType(spec.partitionType()).set(partitionOne).hashCode(),
+        StructLikeWrapper.forType(spec.partitionType()).set(partitionTwo).hashCode());
+
+    DataFile testFileOne =
+        DataFiles.builder(spec)
+            .withPartition(partitionOne)
+            .withPath("/g1.parquet")
+            .withFileSizeInBytes(100)
+            .withRecordCount(1)
+            .build();
+
+    DataFile testFileTwo =
+        DataFiles.builder(spec)
+            .withPartition(partitionTwo)
+            .withRecordCount(1)
+            .withFileSizeInBytes(100)
+            .withPath("/g2.parquet")
+            .build();
+
+    collisionTable.newFastAppend().appendFile(testFileOne).appendFile(testFileTwo).commit();
+
+    Assert.assertEquals(
+        "We should have two files",
+        2,
+        Lists.newArrayList(collisionTable.newScan().planFiles().iterator()).size());

Review Comment:
   nit: maybe assert on two partitions here? correspond to the assertion after deletion that there is only one partition.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org