You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@iceberg.apache.org by "abmo-x (via GitHub)" <gi...@apache.org> on 2023/02/13 06:54:06 UTC

[GitHub] [iceberg] abmo-x opened a new pull request, #6818: Fix: add_files data loss - old partitions get lost when new partition is added

abmo-x opened a new pull request, #6818:
URL: https://github.com/apache/iceberg/pull/6818

   Fix for https://github.com/apache/iceberg/issues/6817
   
   When two spark application runs end up with same manifest file name of format `"stage-%d-task-%d-manifest"` due to same stage/task planning, the add_file procedure ends up overwriting the manifest resulting in data loss


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on a diff in pull request #6818: Fix: add_files data loss - old partitions get lost when new partition is added

Posted by "RussellSpitzer (via GitHub)" <gi...@apache.org>.
RussellSpitzer commented on code in PR #6818:
URL: https://github.com/apache/iceberg/pull/6818#discussion_r1105128909


##########
spark/v3.3/spark-extensions/src/test/java/org/apache/iceberg/spark/extensions/TestAddFilesProcedure.java:
##########
@@ -394,6 +397,29 @@ public void addPartitionToPartitioned() {
         sql("SELECT id, name, dept, subdept FROM %s ORDER BY id", tableName));
   }
 
+  @Test

Review Comment:
   Or you can just see if you can make a new thread and set a new context? I think there are some ways to test this without a full integration test.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] abmo-x commented on a diff in pull request #6818: Fix: add_files data loss - old partitions get lost when new partition is added

Posted by "abmo-x (via GitHub)" <gi...@apache.org>.
abmo-x commented on code in PR #6818:
URL: https://github.com/apache/iceberg/pull/6818#discussion_r1105116672


##########
spark/v3.3/spark-extensions/src/test/java/org/apache/iceberg/spark/extensions/TestAddFilesProcedure.java:
##########
@@ -394,6 +397,29 @@ public void addPartitionToPartitioned() {
         sql("SELECT id, name, dept, subdept FROM %s ORDER BY id", tableName));
   }
 
+  @Test

Review Comment:
   I tried that and it was kinda hard to reproduce as we need spark to generate the same stageId and taskAttemptId. I did try to resetSpark before two add_file calls but due to spark it doesn't generate the same ids. 
   
   I can change it into an integration test and try to reproduce it.



##########
spark/v3.3/spark-extensions/src/test/java/org/apache/iceberg/spark/extensions/TestAddFilesProcedure.java:
##########
@@ -394,6 +397,29 @@ public void addPartitionToPartitioned() {
         sql("SELECT id, name, dept, subdept FROM %s ORDER BY id", tableName));
   }
 
+  @Test
+  public void addPartitionToPartitionedSnapshotIdInheritanceEnabled() {
+    createPartitionedFileTable("parquet");
+
+    String createIceberg =
+        "CREATE TABLE %s (id Integer, name String, dept String, subdept String) USING iceberg PARTITIONED BY (id)"
+            + "TBLPROPERTIES ('%s'='true')";
+
+    sql(createIceberg, tableName, TableProperties.SNAPSHOT_ID_INHERITANCE_ENABLED);
+
+    scalarSql(
+        "CALL %s.system.add_files('%s', '`parquet`.`%s`', map('id', 1))",
+        catalogName, tableName, fileTableDir.getAbsolutePath());
+
+    // verify manifest file name has uuid pattern
+    String manifestPath = (String) sql("select path from %s.manifests", tableName).get(0)[0];
+
+    Pattern uuidPattern = Pattern.compile("[a-f0-9]{8}(?:-[a-f0-9]{4}){4}[a-f0-9]{8}");
+
+    Matcher matcher = uuidPattern.matcher(manifestPath);
+    Assert.assertTrue("verify manifest path has uuid", matcher.find());

Review Comment:
   same as https://github.com/apache/iceberg/pull/6818#discussion_r1105116672



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] abmo-x commented on a diff in pull request #6818: Fix: add_files data loss - old partitions get lost when new partition is added

Posted by "abmo-x (via GitHub)" <gi...@apache.org>.
abmo-x commented on code in PR #6818:
URL: https://github.com/apache/iceberg/pull/6818#discussion_r1105131735


##########
spark/v3.3/spark-extensions/src/test/java/org/apache/iceberg/spark/extensions/TestAddFilesProcedure.java:
##########
@@ -394,6 +397,29 @@ public void addPartitionToPartitioned() {
         sql("SELECT id, name, dept, subdept FROM %s ORDER BY id", tableName));
   }
 
+  @Test

Review Comment:
   got it, let me try that. Thanks for the pointers



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on pull request #6818: Fix: add_files data loss - old partitions get lost when new partition is added

Posted by "aokolnychyi (via GitHub)" <gi...@apache.org>.
aokolnychyi commented on PR #6818:
URL: https://github.com/apache/iceberg/pull/6818#issuecomment-1428769078

   I should be able to take a look later today.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a diff in pull request #6818: Fix: add_files data loss - old partitions get lost when new partition is added

Posted by "aokolnychyi (via GitHub)" <gi...@apache.org>.
aokolnychyi commented on code in PR #6818:
URL: https://github.com/apache/iceberg/pull/6818#discussion_r1105189453


##########
spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/SparkTableUtil.java:
##########
@@ -372,7 +373,9 @@ private static Iterator<ManifestFile> buildManifest(
       FileIO io = new HadoopFileIO(conf.get());
       TaskContext ctx = TaskContext.get();
       String suffix =
-          String.format("stage-%d-task-%d-manifest", ctx.stageId(), ctx.taskAttemptId());
+          String.format(

Review Comment:
   We do need overwrite as it allows us to bypass if exists checks in some Hadoop FS implementations.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on a diff in pull request #6818: Fix: add_files data loss - old partitions get lost when new partition is added

Posted by "RussellSpitzer (via GitHub)" <gi...@apache.org>.
RussellSpitzer commented on code in PR #6818:
URL: https://github.com/apache/iceberg/pull/6818#discussion_r1105127479


##########
spark/v3.3/spark-extensions/src/test/java/org/apache/iceberg/spark/extensions/TestAddFilesProcedure.java:
##########
@@ -394,6 +397,29 @@ public void addPartitionToPartitioned() {
         sql("SELECT id, name, dept, subdept FROM %s ORDER BY id", tableName));
   }
 
+  @Test

Review Comment:
   Can you try doing it like Kafka does it?
   https://github.com/apache/spark/blob/cfd7ca9a06161f7622b5179a777f965c11892afa/external/kafka-0-10/src/test/scala/org/apache/spark/streaming/kafka010/KafkaDataConsumerSuite.scala#L96-L99
   No need to reset, just inject another task context?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi merged pull request #6818: Spark 3.3: add_files data loss when SNAPSHOT_ID_INHERITANCE_ENABLED

Posted by "aokolnychyi (via GitHub)" <gi...@apache.org>.
aokolnychyi merged PR #6818:
URL: https://github.com/apache/iceberg/pull/6818


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on a diff in pull request #6818: Fix: add_files data loss - old partitions get lost when new partition is added

Posted by "RussellSpitzer (via GitHub)" <gi...@apache.org>.
RussellSpitzer commented on code in PR #6818:
URL: https://github.com/apache/iceberg/pull/6818#discussion_r1105085928


##########
spark/v3.3/spark-extensions/src/test/java/org/apache/iceberg/spark/extensions/TestAddFilesProcedure.java:
##########
@@ -394,6 +397,29 @@ public void addPartitionToPartitioned() {
         sql("SELECT id, name, dept, subdept FROM %s ORDER BY id", tableName));
   }
 
+  @Test
+  public void addPartitionToPartitionedSnapshotIdInheritanceEnabled() {
+    createPartitionedFileTable("parquet");
+
+    String createIceberg =
+        "CREATE TABLE %s (id Integer, name String, dept String, subdept String) USING iceberg PARTITIONED BY (id)"
+            + "TBLPROPERTIES ('%s'='true')";
+
+    sql(createIceberg, tableName, TableProperties.SNAPSHOT_ID_INHERITANCE_ENABLED);

Review Comment:
   Is this required for the bug to trigger? It seems to me like it would be an issue even without this? 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on a diff in pull request #6818: Fix: add_files data loss - old partitions get lost when new partition is added

Posted by "RussellSpitzer (via GitHub)" <gi...@apache.org>.
RussellSpitzer commented on code in PR #6818:
URL: https://github.com/apache/iceberg/pull/6818#discussion_r1105074612


##########
spark/v3.3/spark-extensions/src/test/java/org/apache/iceberg/spark/extensions/TestAddFilesProcedure.java:
##########
@@ -394,6 +397,29 @@ public void addPartitionToPartitioned() {
         sql("SELECT id, name, dept, subdept FROM %s ORDER BY id", tableName));
   }
 
+  @Test

Review Comment:
   We should add a test where addfiles is called twice



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] abmo-x commented on a diff in pull request #6818: Fix: add_files data loss - old partitions get lost when new partition is added

Posted by "abmo-x (via GitHub)" <gi...@apache.org>.
abmo-x commented on code in PR #6818:
URL: https://github.com/apache/iceberg/pull/6818#discussion_r1106283191


##########
spark/v3.3/spark-extensions/src/test/java/org/apache/iceberg/spark/extensions/TestAddFilesProcedure.java:
##########
@@ -394,6 +397,29 @@ public void addPartitionToPartitioned() {
         sql("SELECT id, name, dept, subdept FROM %s ORDER BY id", tableName));
   }
 
+  @Test

Review Comment:
   @RussellSpitzer  updated the test to do two add_file calls and verify both partition data is available. kept the check for uuid pattern in manifest path



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] abmo-x commented on a diff in pull request #6818: Fix: add_files data loss - old partitions get lost when new partition is added

Posted by "abmo-x (via GitHub)" <gi...@apache.org>.
abmo-x commented on code in PR #6818:
URL: https://github.com/apache/iceberg/pull/6818#discussion_r1105120276


##########
spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/SparkTableUtil.java:
##########
@@ -372,7 +373,9 @@ private static Iterator<ManifestFile> buildManifest(
       FileIO io = new HadoopFileIO(conf.get());
       TaskContext ctx = TaskContext.get();
       String suffix =
-          String.format("stage-%d-task-%d-manifest", ctx.stageId(), ctx.taskAttemptId());
+          String.format(

Review Comment:
   as we use overwrite when creating Avro writer https://github.com/apache/iceberg/blob/ef7e20e12eae2638015ff91b3356eb2f32f601cc/core/src/main/java/org/apache/iceberg/ManifestListWriter.java#L93
   
   https://github.com/apache/iceberg/blob/ef7e20e12eae2638015ff91b3356eb2f32f601cc/core/src/main/java/org/apache/iceberg/ManifestListWriter.java#L99
   
   I am testing and was thinking of creating a separate PR for this change as I don't think we need overwrite here



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on pull request #6818: Spark 3.3: add_files data loss when SNAPSHOT_ID_INHERITANCE_ENABLED

Posted by "aokolnychyi (via GitHub)" <gi...@apache.org>.
aokolnychyi commented on PR #6818:
URL: https://github.com/apache/iceberg/pull/6818#issuecomment-1432517878

   Thanks, @abmo-x! Thanks for reviewing, @RussellSpitzer!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] abmo-x commented on pull request #6818: Fix: add_files data loss when SNAPSHOT_ID_INHERITANCE_ENABLED

Posted by "abmo-x (via GitHub)" <gi...@apache.org>.
abmo-x commented on PR #6818:
URL: https://github.com/apache/iceberg/pull/6818#issuecomment-1431837930

   @RussellSpitzer  Thanks, I removed the spark 3.2 changes. will create a separate pr


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] abmo-x commented on pull request #6818: Fix: add_files data loss when SNAPSHOT_ID_INHERITANCE_ENABLED

Posted by "abmo-x (via GitHub)" <gi...@apache.org>.
abmo-x commented on PR #6818:
URL: https://github.com/apache/iceberg/pull/6818#issuecomment-1431831930

   > thank you @RussellSpitzer , I didn't realize it was also related to when `SNAPSHOT_ID_INHERITANCE_ENABLED` is set to true, where manifests didn't get rewritten with explicit snapshot ids
   
   @dramaticlly 
   yes, it only affects when the flag is true. Its mentioned in the reproduce steps in https://github.com/apache/iceberg/issues/6817


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] dramaticlly commented on pull request #6818: Fix: add_files data loss when SNAPSHOT_ID_INHERITANCE_ENABLED

Posted by "dramaticlly (via GitHub)" <gi...@apache.org>.
dramaticlly commented on PR #6818:
URL: https://github.com/apache/iceberg/pull/6818#issuecomment-1431825613

   thank you @RussellSpitzer , I didn't realize it was also related to when `SNAPSHOT_ID_INHERITANCE_ENABLED` is set to true, where manifests didn't get rewritten with explicit snapshot ids  


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on a diff in pull request #6818: Fix: add_files data loss - old partitions get lost when new partition is added

Posted by "RussellSpitzer (via GitHub)" <gi...@apache.org>.
RussellSpitzer commented on code in PR #6818:
URL: https://github.com/apache/iceberg/pull/6818#discussion_r1105131647


##########
spark/v3.3/spark-extensions/src/test/java/org/apache/iceberg/spark/extensions/TestAddFilesProcedure.java:
##########
@@ -394,6 +397,29 @@ public void addPartitionToPartitioned() {
         sql("SELECT id, name, dept, subdept FROM %s ORDER BY id", tableName));
   }
 
+  @Test

Review Comment:
   Also even if we don't get a repo with two "add_files" let's add that as a test



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a diff in pull request #6818: Fix: add_files data loss - old partitions get lost when new partition is added

Posted by "aokolnychyi (via GitHub)" <gi...@apache.org>.
aokolnychyi commented on code in PR #6818:
URL: https://github.com/apache/iceberg/pull/6818#discussion_r1105189453


##########
spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/SparkTableUtil.java:
##########
@@ -372,7 +373,9 @@ private static Iterator<ManifestFile> buildManifest(
       FileIO io = new HadoopFileIO(conf.get());
       TaskContext ctx = TaskContext.get();
       String suffix =
-          String.format("stage-%d-task-%d-manifest", ctx.stageId(), ctx.taskAttemptId());
+          String.format(

Review Comment:
   We do need overwrite as it allows us to bypass some if exists checks in some Hadoop FS implementations.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] abmo-x commented on a diff in pull request #6818: Fix: add_files data loss - old partitions get lost when new partition is added

Posted by "abmo-x (via GitHub)" <gi...@apache.org>.
abmo-x commented on code in PR #6818:
URL: https://github.com/apache/iceberg/pull/6818#discussion_r1105116672


##########
spark/v3.3/spark-extensions/src/test/java/org/apache/iceberg/spark/extensions/TestAddFilesProcedure.java:
##########
@@ -394,6 +397,29 @@ public void addPartitionToPartitioned() {
         sql("SELECT id, name, dept, subdept FROM %s ORDER BY id", tableName));
   }
 
+  @Test

Review Comment:
   I tried that and it was kinda hard to reproduce as we need spark to generate the same stageId and taskAttemptId.
   
   I can change it into a integration test and try to reproduce it.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] abmo-x commented on a diff in pull request #6818: Fix: add_files data loss - old partitions get lost when new partition is added

Posted by "abmo-x (via GitHub)" <gi...@apache.org>.
abmo-x commented on code in PR #6818:
URL: https://github.com/apache/iceberg/pull/6818#discussion_r1105115220


##########
spark/v3.3/spark-extensions/src/test/java/org/apache/iceberg/spark/extensions/TestAddFilesProcedure.java:
##########
@@ -394,6 +397,29 @@ public void addPartitionToPartitioned() {
         sql("SELECT id, name, dept, subdept FROM %s ORDER BY id", tableName));
   }
 
+  @Test
+  public void addPartitionToPartitionedSnapshotIdInheritanceEnabled() {
+    createPartitionedFileTable("parquet");
+
+    String createIceberg =
+        "CREATE TABLE %s (id Integer, name String, dept String, subdept String) USING iceberg PARTITIONED BY (id)"
+            + "TBLPROPERTIES ('%s'='true')";
+
+    sql(createIceberg, tableName, TableProperties.SNAPSHOT_ID_INHERITANCE_ENABLED);

Review Comment:
   yes, this happens only when its true, when the if condition is `false` the `copyManifest` uses [`newManifestOutput()` ](https://github.com/apache/iceberg/blob/ef7e20e12eae2638015ff91b3356eb2f32f601cc/core/src/main/java/org/apache/iceberg/MergingSnapshotProducer.java#L260) which appends commitUUID to the file name making it unique.
   
   ```
   protected void add(ManifestFile manifest) {
       Preconditions.checkArgument(
           manifest.content() == ManifestContent.DATA, "Cannot append delete manifest: %s", manifest);
       if (snapshotIdInheritanceEnabled && manifest.snapshotId() == null) {
         appendedManifestsSummary.addedManifest(manifest);
         appendManifests.add(manifest);
       } else {
         // the manifest must be rewritten with this update's snapshot ID
         ManifestFile copiedManifest = copyManifest(manifest);
         rewrittenAppendManifests.add(copiedManifest);
       }
     }
   ```
   https://github.com/apache/iceberg/blob/ef7e20e12eae2638015ff91b3356eb2f32f601cc/core/src/main/java/org/apache/iceberg/MergingSnapshotProducer.java#L244



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] abmo-x commented on a diff in pull request #6818: Fix: add_files data loss - old partitions get lost when new partition is added

Posted by "abmo-x (via GitHub)" <gi...@apache.org>.
abmo-x commented on code in PR #6818:
URL: https://github.com/apache/iceberg/pull/6818#discussion_r1105115220


##########
spark/v3.3/spark-extensions/src/test/java/org/apache/iceberg/spark/extensions/TestAddFilesProcedure.java:
##########
@@ -394,6 +397,29 @@ public void addPartitionToPartitioned() {
         sql("SELECT id, name, dept, subdept FROM %s ORDER BY id", tableName));
   }
 
+  @Test
+  public void addPartitionToPartitionedSnapshotIdInheritanceEnabled() {
+    createPartitionedFileTable("parquet");
+
+    String createIceberg =
+        "CREATE TABLE %s (id Integer, name String, dept String, subdept String) USING iceberg PARTITIONED BY (id)"
+            + "TBLPROPERTIES ('%s'='true')";
+
+    sql(createIceberg, tableName, TableProperties.SNAPSHOT_ID_INHERITANCE_ENABLED);

Review Comment:
   yes, this happens only when its true, when the if condition is `false` the `copyManifest` uses [`newManifestOutput()` ](https://github.com/apache/iceberg/blob/ef7e20e12eae2638015ff91b3356eb2f32f601cc/core/src/main/java/org/apache/iceberg/MergingSnapshotProducer.java#L260) which appends commitUUID to the file name making it unique.
   
   https://github.com/apache/iceberg/blob/ef7e20e12eae2638015ff91b3356eb2f32f601cc/core/src/main/java/org/apache/iceberg/MergingSnapshotProducer.java#L244
   ```
   protected void add(ManifestFile manifest) {
       Preconditions.checkArgument(
           manifest.content() == ManifestContent.DATA, "Cannot append delete manifest: %s", manifest);
       if (snapshotIdInheritanceEnabled && manifest.snapshotId() == null) {
         appendedManifestsSummary.addedManifest(manifest);
         appendManifests.add(manifest);
       } else {
         // the manifest must be rewritten with this update's snapshot ID
         ManifestFile copiedManifest = copyManifest(manifest);
         rewrittenAppendManifests.add(copiedManifest);
       }
     }
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on a diff in pull request #6818: Fix: add_files data loss - old partitions get lost when new partition is added

Posted by "RussellSpitzer (via GitHub)" <gi...@apache.org>.
RussellSpitzer commented on code in PR #6818:
URL: https://github.com/apache/iceberg/pull/6818#discussion_r1105128374


##########
spark/v3.3/spark-extensions/src/test/java/org/apache/iceberg/spark/extensions/TestAddFilesProcedure.java:
##########
@@ -394,6 +397,29 @@ public void addPartitionToPartitioned() {
         sql("SELECT id, name, dept, subdept FROM %s ORDER BY id", tableName));
   }
 
+  @Test

Review Comment:
   I think you just make a test method which doesn't call TaskContext.get and instead passes in a TaskContext



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on a diff in pull request #6818: Fix: add_files data loss - old partitions get lost when new partition is added

Posted by "RussellSpitzer (via GitHub)" <gi...@apache.org>.
RussellSpitzer commented on code in PR #6818:
URL: https://github.com/apache/iceberg/pull/6818#discussion_r1105085598


##########
spark/v3.3/spark-extensions/src/test/java/org/apache/iceberg/spark/extensions/TestAddFilesProcedure.java:
##########
@@ -394,6 +397,29 @@ public void addPartitionToPartitioned() {
         sql("SELECT id, name, dept, subdept FROM %s ORDER BY id", tableName));
   }
 
+  @Test
+  public void addPartitionToPartitionedSnapshotIdInheritanceEnabled() {
+    createPartitionedFileTable("parquet");
+
+    String createIceberg =
+        "CREATE TABLE %s (id Integer, name String, dept String, subdept String) USING iceberg PARTITIONED BY (id)"
+            + "TBLPROPERTIES ('%s'='true')";
+
+    sql(createIceberg, tableName, TableProperties.SNAPSHOT_ID_INHERITANCE_ENABLED);
+
+    scalarSql(
+        "CALL %s.system.add_files('%s', '`parquet`.`%s`', map('id', 1))",
+        catalogName, tableName, fileTableDir.getAbsolutePath());
+
+    // verify manifest file name has uuid pattern
+    String manifestPath = (String) sql("select path from %s.manifests", tableName).get(0)[0];
+
+    Pattern uuidPattern = Pattern.compile("[a-f0-9]{8}(?:-[a-f0-9]{4}){4}[a-f0-9]{8}");
+
+    Matcher matcher = uuidPattern.matcher(manifestPath);
+    Assert.assertTrue("verify manifest path has uuid", matcher.find());

Review Comment:
   I think this tests is a bit circular, rather than testing that path has a uuid, we should test than multiple add_files do not corrupt the table



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on a diff in pull request #6818: Fix: add_files data loss - old partitions get lost when new partition is added

Posted by "RussellSpitzer (via GitHub)" <gi...@apache.org>.
RussellSpitzer commented on code in PR #6818:
URL: https://github.com/apache/iceberg/pull/6818#discussion_r1105086523


##########
spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/SparkTableUtil.java:
##########
@@ -372,7 +373,9 @@ private static Iterator<ManifestFile> buildManifest(
       FileIO io = new HadoopFileIO(conf.get());
       TaskContext ctx = TaskContext.get();
       String suffix =
-          String.format("stage-%d-task-%d-manifest", ctx.stageId(), ctx.taskAttemptId());
+          String.format(

Review Comment:
   Why doesn't this cause a "File Already Exists" exception?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on pull request #6818: Fix: add_files data loss - old partitions get lost when new partition is added

Posted by "aokolnychyi (via GitHub)" <gi...@apache.org>.
aokolnychyi commented on PR #6818:
URL: https://github.com/apache/iceberg/pull/6818#issuecomment-1428958060

   The fix seems correct to me.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org