Posted to issues@iceberg.apache.org by "dramaticlly (via GitHub)" <gi...@apache.org> on 2023/02/20 21:49:48 UTC

[GitHub] [iceberg] dramaticlly opened a new pull request, #6889: Spark 3.3: Do not check deleted file path for duplicate when import files to iceberg table

dramaticlly opened a new pull request, #6889:
URL: https://github.com/apache/iceberg/pull/6889

   close #6888 
   
   CC @szehon-ho @RussellSpitzer @abmo-x 




[GitHub] [iceberg] szehon-ho commented on a diff in pull request #6889: Spark 3.3: Skip duplicate check on deleted file path when import files

Posted by "szehon-ho (via GitHub)" <gi...@apache.org>.
szehon-ho commented on code in PR #6889:
URL: https://github.com/apache/iceberg/pull/6889#discussion_r1112474585


##########
spark/v3.3/spark-extensions/src/test/java/org/apache/iceberg/spark/extensions/TestAddFilesProcedure.java:
##########
@@ -103,6 +103,35 @@ public void addDataUnpartitioned() {
         sql("SELECT * FROM %s ORDER BY id", tableName));
   }
 
+  @Test
+  public void deleteAndAddBackAllowed() {

Review Comment:
   I guess there is another test called 'allowed' there, but that one tests the case where the checkFlag is explicitly disabled 😅 so it has some significance there.





[GitHub] [iceberg] dramaticlly commented on a diff in pull request #6889: Spark 3.3: Skip duplicate check on deleted file path when import files

Posted by "dramaticlly (via GitHub)" <gi...@apache.org>.
dramaticlly commented on code in PR #6889:
URL: https://github.com/apache/iceberg/pull/6889#discussion_r1116330343


##########
spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/SparkTableUtil.java:
##########
@@ -534,7 +534,7 @@ private static void importUnpartitionedSparkTable(
                 .createDataset(Lists.transform(files, f -> f.path().toString()), Encoders.STRING())
                 .toDF("file_path");
         Dataset<Row> existingFiles =
-            loadMetadataTable(spark, targetTable, MetadataTableType.ENTRIES);
+            loadMetadataTable(spark, targetTable, MetadataTableType.ENTRIES).filter("status!=2");

Review Comment:
   thank you @szehon-ho, added the space in between and rebased





[GitHub] [iceberg] dramaticlly closed pull request #6889: Spark 3.3: Skip duplicate check on deleted file path when import files

Posted by "dramaticlly (via GitHub)" <gi...@apache.org>.
dramaticlly closed pull request #6889: Spark 3.3: Skip duplicate check on deleted file path when import files
URL: https://github.com/apache/iceberg/pull/6889




[GitHub] [iceberg] szehon-ho commented on a diff in pull request #6889: Spark 3.3: Skip duplicate check on deleted file path when import files

Posted by "szehon-ho (via GitHub)" <gi...@apache.org>.
szehon-ho commented on code in PR #6889:
URL: https://github.com/apache/iceberg/pull/6889#discussion_r1116322789


##########
spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/SparkTableUtil.java:
##########
@@ -534,7 +534,7 @@ private static void importUnpartitionedSparkTable(
                 .createDataset(Lists.transform(files, f -> f.path().toString()), Encoders.STRING())
                 .toDF("file_path");
         Dataset<Row> existingFiles =
-            loadMetadataTable(spark, targetTable, MetadataTableType.ENTRIES);
+            loadMetadataTable(spark, targetTable, MetadataTableType.ENTRIES).filter("status!=2");

Review Comment:
   @dramaticlly sorry for all the nit-pick comments, but can we do "status != 2" for readability (spaces in between)?
   
   Ref: Javadoc for Dataset.filter(): https://spark.apache.org/docs/2.4.1/api/java/org/apache/spark/sql/Dataset.html#filter-org.apache.spark.sql.Column-
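   
   For reference, the two filter overloads express the same predicate; a minimal Java sketch, with names mirroring the diff above rather than the exact SparkTableUtil code:
   ```
   // Both forms are equivalent; the string form reads best with spaces around
   // the operator. `spark` and `targetTable` are assumed to be in scope, as in
   // the diff above.
   Dataset<Row> existingFiles =
       loadMetadataTable(spark, targetTable, MetadataTableType.ENTRIES)
           .filter("status != 2");
   // Column-based overload from the linked Javadoc:
   Dataset<Row> sameResult =
       loadMetadataTable(spark, targetTable, MetadataTableType.ENTRIES)
           .filter(org.apache.spark.sql.functions.col("status").notEqual(2));
   ```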





[GitHub] [iceberg] szehon-ho commented on a diff in pull request #6889: Spark 3.3: Skip duplicate check on deleted file path when import files

Posted by "szehon-ho (via GitHub)" <gi...@apache.org>.
szehon-ho commented on code in PR #6889:
URL: https://github.com/apache/iceberg/pull/6889#discussion_r1116232822


##########
spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/SparkTableUtil.java:
##########
@@ -605,7 +606,8 @@ public static void importSparkPartitions(
           filesToImport
               .map((MapFunction<DataFile, String>) f -> f.path().toString(), Encoders.STRING())
               .toDF("file_path");
-      Dataset<Row> existingFiles = loadMetadataTable(spark, targetTable, MetadataTableType.ENTRIES);
+      Dataset<Row> entriesFiles = loadMetadataTable(spark, targetTable, MetadataTableType.ENTRIES);

Review Comment:
   Style Nit: I think 'entriesFiles' doesn't make sense as a name.
   
   Also, to reduce code, can we use something like
   ```
   loadMetadataTable(spark, targetTable, MetadataTableType.ENTRIES).filter("status != 2")
   ```
   
   Seems the API is there, but I'm not 100% sure it works as expected in Java. That would be the best option to me. Else we could maybe just call the dataset "entries".
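   
   To make the intent concrete, a rough sketch of how the surrounding duplicate check could use that chained call (illustrative only, not the exact SparkTableUtil code; `importedFiles` is assumed to be the Dataset of incoming paths with a `file_path` column):
   ```
   // Live entries only: status 2 marks DELETED manifest entries.
   Dataset<Row> existingFiles =
       loadMetadataTable(spark, targetTable, MetadataTableType.ENTRIES)
           .filter("status != 2");
   // A match means a file that is still live in the table would be re-added.
   Column joinCond =
       existingFiles.col("data_file.file_path").equalTo(importedFiles.col("file_path"));
   boolean hasDuplicates = !importedFiles.join(existingFiles, joinCond).isEmpty();
   ```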





[GitHub] [iceberg] abmo-x commented on a diff in pull request #6889: Spark 3.3: Skip duplicate check on deleted file path when import files

Posted by "abmo-x (via GitHub)" <gi...@apache.org>.
abmo-x commented on code in PR #6889:
URL: https://github.com/apache/iceberg/pull/6889#discussion_r1112360102


##########
spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/SparkTableUtil.java:
##########
@@ -605,7 +605,8 @@ public static void importSparkPartitions(
           filesToImport
               .map((MapFunction<DataFile, String>) f -> f.path().toString(), Encoders.STRING())

Review Comment:
   you may want to add this change here as well for unpartitioned tables
   https://github.com/apache/iceberg/blob/84e390c5ca541bf85a9c29b4201443d7497fe17f/spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/SparkTableUtil.java#L536
   





[GitHub] [iceberg] szehon-ho commented on pull request #6889: Spark 3.3: Skip duplicate check on deleted file path when import files

Posted by "szehon-ho (via GitHub)" <gi...@apache.org>.
szehon-ho commented on PR #6889:
URL: https://github.com/apache/iceberg/pull/6889#issuecomment-1448716938

   Merged, thanks @dramaticlly, and @hililiwei and @abmo-x for the review




[GitHub] [iceberg] dramaticlly commented on pull request #6889: Spark 3.3: Skip duplicate check on deleted file path when import files

Posted by "dramaticlly (via GitHub)" <gi...@apache.org>.
dramaticlly commented on PR #6889:
URL: https://github.com/apache/iceberg/pull/6889#issuecomment-1439176764

   seeing some random test failure on `:iceberg-spark:iceberg-spark-3.1_2.12:test`
   ```
           java.lang.RuntimeException: The root scratch dir: /tmp/hive on HDFS should be writable. Current permissions are: rwxr-xr-x
   ```
   
   seems unrelated to my change; let me close and reopen to trigger another run




[GitHub] [iceberg] dramaticlly commented on a diff in pull request #6889: Spark 3.3: Skip duplicate check on deleted file path when import files

Posted by "dramaticlly (via GitHub)" <gi...@apache.org>.
dramaticlly commented on code in PR #6889:
URL: https://github.com/apache/iceberg/pull/6889#discussion_r1113432265


##########
spark/v3.3/spark-extensions/src/test/java/org/apache/iceberg/spark/extensions/TestAddFilesProcedure.java:
##########
@@ -103,6 +103,35 @@ public void addDataUnpartitioned() {
         sql("SELECT * FROM %s ORDER BY id", tableName));
   }
 
+  @Test
+  public void deleteAndAddBackAllowed() {
+    createUnpartitionedFileTable("parquet");
+
+    String createIceberg =
+        "CREATE TABLE %s (id Integer, name String, dept String, subdept String) USING iceberg";
+
+    sql(createIceberg, tableName);
+
+    sql(
+        "CALL %s.system.add_files('%s', '`parquet`.`%s`')",
+        catalogName, tableName, fileTableDir.getAbsolutePath());
+
+    String deleteData = "DELETE FROM %s";
+    sql(deleteData, tableName);
+
+    List<Object[]> result =
+        sql(
+            "CALL %s.system.add_files('%s', '`parquet`.`%s`')",
+            catalogName, tableName, fileTableDir.getAbsolutePath());
+
+    assertEquals("Procedure output must match", ImmutableList.of(row(2L, 1L)), result);

Review Comment:
   hey @hililiwei, I think today the duplicate check is based on file path equality. Can you elaborate a bit more on why you want to see the contents of this file change?
   
   I think my unit test is meant to cover the case where add_files followed by an immediate delete and re-add will work. Before my change, this would fail with an IllegalStateException, as described in https://github.com/apache/iceberg/issues/6888





[GitHub] [iceberg] dramaticlly commented on a diff in pull request #6889: Spark 3.3: Skip duplicate check on deleted file path when import files

Posted by "dramaticlly (via GitHub)" <gi...@apache.org>.
dramaticlly commented on code in PR #6889:
URL: https://github.com/apache/iceberg/pull/6889#discussion_r1112387718


##########
spark/v3.3/spark-extensions/src/test/java/org/apache/iceberg/spark/extensions/TestAddFilesProcedure.java:
##########
@@ -103,6 +103,35 @@ public void addDataUnpartitioned() {
         sql("SELECT * FROM %s ORDER BY id", tableName));
   }
 
+  @Test
+  public void deleteAndAddBackAllowed() {

Review Comment:
   I think this is a good callout to verify the add_files behaviour for v2 in general, which I don't see covered by existing tests.
   
   Focusing on this particular PR: whether v1 or v2, I think we should not rely on a duplicate check based on file path alone. A file path with deleted status will go away on the next snapshot, so the current workaround before this fix is simply to force a new snapshot between the DELETE and the add_files procedure call, as sketched below.
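   
   A sketch of that workaround in the style of the test above, assuming any snapshot-producing commit between the two calls suffices (the INSERT below is a hypothetical stand-in):
   ```
   sql("CALL %s.system.add_files('%s', '`parquet`.`%s`')",
       catalogName, tableName, fileTableDir.getAbsolutePath());
   sql("DELETE FROM %s", tableName);
   // Hypothetical extra commit, purely to produce a new snapshot so the
   // status=2 (deleted) entries drop out of the entries metadata table.
   sql("INSERT INTO %s VALUES (0, 'tmp', 'tmp', 'tmp')", tableName);
   sql("CALL %s.system.add_files('%s', '`parquet`.`%s`')",
       catalogName, tableName, fileTableDir.getAbsolutePath());
   ```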





[GitHub] [iceberg] szehon-ho commented on a diff in pull request #6889: Spark 3.3: Skip duplicate check on deleted file path when import files

Posted by "szehon-ho (via GitHub)" <gi...@apache.org>.
szehon-ho commented on code in PR #6889:
URL: https://github.com/apache/iceberg/pull/6889#discussion_r1112474834


##########
spark/v3.3/spark-extensions/src/test/java/org/apache/iceberg/spark/extensions/TestAddFilesProcedure.java:
##########
@@ -103,6 +103,35 @@ public void addDataUnpartitioned() {
         sql("SELECT * FROM %s ORDER BY id", tableName));
   }
 
+  @Test
+  public void deleteAndAddBackAllowed() {

Review Comment:
   Also optional: add a test for partitioned table.  





[GitHub] [iceberg] szehon-ho merged pull request #6889: Spark 3.3: Skip duplicate check on deleted file path when import files

Posted by "szehon-ho (via GitHub)" <gi...@apache.org>.
szehon-ho merged PR #6889:
URL: https://github.com/apache/iceberg/pull/6889




[GitHub] [iceberg] dramaticlly commented on a diff in pull request #6889: Spark 3.3: Skip duplicate check on deleted file path when import files

Posted by "dramaticlly (via GitHub)" <gi...@apache.org>.
dramaticlly commented on code in PR #6889:
URL: https://github.com/apache/iceberg/pull/6889#discussion_r1116258324


##########
spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/SparkTableUtil.java:
##########
@@ -605,7 +606,8 @@ public static void importSparkPartitions(
           filesToImport
               .map((MapFunction<DataFile, String>) f -> f.path().toString(), Encoders.STRING())
               .toDF("file_path");
-      Dataset<Row> existingFiles = loadMetadataTable(spark, targetTable, MetadataTableType.ENTRIES);
+      Dataset<Row> entriesFiles = loadMetadataTable(spark, targetTable, MetadataTableType.ENTRIES);

Review Comment:
   thank you @szehon-ho, updated to chain the filter in one line per your feedback





[GitHub] [iceberg] szehon-ho commented on a diff in pull request #6889: Spark 3.3: Skip duplicate check on deleted file path when import files

Posted by "szehon-ho (via GitHub)" <gi...@apache.org>.
szehon-ho commented on code in PR #6889:
URL: https://github.com/apache/iceberg/pull/6889#discussion_r1112472215


##########
spark/v3.3/spark-extensions/src/test/java/org/apache/iceberg/spark/extensions/TestAddFilesProcedure.java:
##########
@@ -103,6 +103,35 @@ public void addDataUnpartitioned() {
         sql("SELECT * FROM %s ORDER BY id", tableName));
   }
 
+  @Test
+  public void deleteAndAddBackAllowed() {

Review Comment:
   Nit: on the test name, I think 'allowed' is fairly redundant and we could remove it? I think all of the tests check that what they test is allowed.





[GitHub] [iceberg] dramaticlly commented on a diff in pull request #6889: Spark 3.3: Skip duplicate check on deleted file path when import files

Posted by "dramaticlly (via GitHub)" <gi...@apache.org>.
dramaticlly commented on code in PR #6889:
URL: https://github.com/apache/iceberg/pull/6889#discussion_r1112349858


##########
spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/SparkTableUtil.java:
##########
@@ -605,7 +605,8 @@ public static void importSparkPartitions(
           filesToImport
               .map((MapFunction<DataFile, String>) f -> f.path().toString(), Encoders.STRING())
               .toDF("file_path");
-      Dataset<Row> existingFiles = loadMetadataTable(spark, targetTable, MetadataTableType.ENTRIES);
+      Dataset<Row> entriesFiles = loadMetadataTable(spark, targetTable, MetadataTableType.ENTRIES);
+      Dataset<Row> existingFiles = entriesFiles.filter(entriesFiles.col("status").notEqual(2));

Review Comment:
   I wanted to use `ManifestEntry.Status.DELETED` instead, but it looks like the class is not public, per https://github.com/apache/iceberg/blob/master/core/src/main/java/org/apache/iceberg/ManifestEntry.java#L31
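   
   One way to avoid the bare magic number while the enum stays package-private, sketched with a hypothetical local constant:
   ```
   // Hypothetical constant, since ManifestEntry.Status is not public. In the
   // entries metadata table: 0 = EXISTING, 1 = ADDED, 2 = DELETED.
   private static final int STATUS_DELETED = 2;
   
   Dataset<Row> existingFiles =
       loadMetadataTable(spark, targetTable, MetadataTableType.ENTRIES)
           .filter("status != " + STATUS_DELETED);
   ```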





[GitHub] [iceberg] dramaticlly commented on a diff in pull request #6889: Spark 3.3: Skip duplicate check on deleted file path when import files

Posted by "dramaticlly (via GitHub)" <gi...@apache.org>.
dramaticlly commented on code in PR #6889:
URL: https://github.com/apache/iceberg/pull/6889#discussion_r1113588429


##########
spark/v3.3/spark-extensions/src/test/java/org/apache/iceberg/spark/extensions/TestAddFilesProcedure.java:
##########
@@ -103,6 +103,35 @@ public void addDataUnpartitioned() {
         sql("SELECT * FROM %s ORDER BY id", tableName));
   }
 
+  @Test
+  public void deleteAndAddBackAllowed() {

Review Comment:
   thank you @szehon-ho, updated per your feedback and added a test for the partitioned table






[GitHub] [iceberg] hililiwei commented on a diff in pull request #6889: Spark 3.3: Skip duplicate check on deleted file path when import files

Posted by "hililiwei (via GitHub)" <gi...@apache.org>.
hililiwei commented on code in PR #6889:
URL: https://github.com/apache/iceberg/pull/6889#discussion_r1112590628


##########
spark/v3.3/spark-extensions/src/test/java/org/apache/iceberg/spark/extensions/TestAddFilesProcedure.java:
##########
@@ -103,6 +103,35 @@ public void addDataUnpartitioned() {
         sql("SELECT * FROM %s ORDER BY id", tableName));
   }
 
+  @Test
+  public void deleteAndAddBackAllowed() {
+    createUnpartitionedFileTable("parquet");
+
+    String createIceberg =
+        "CREATE TABLE %s (id Integer, name String, dept String, subdept String) USING iceberg";
+
+    sql(createIceberg, tableName);
+
+    sql(
+        "CALL %s.system.add_files('%s', '`parquet`.`%s`')",
+        catalogName, tableName, fileTableDir.getAbsolutePath());
+
+    String deleteData = "DELETE FROM %s";
+    sql(deleteData, tableName);
+
+    List<Object[]> result =
+        sql(
+            "CALL %s.system.add_files('%s', '`parquet`.`%s`')",
+            catalogName, tableName, fileTableDir.getAbsolutePath());
+
+    assertEquals("Procedure output must match", ImmutableList.of(row(2L, 1L)), result);

Review Comment:
   Should we also test cases where the contents of this file have changed?





[GitHub] [iceberg] abmo-x commented on a diff in pull request #6889: Spark 3.3: Skip duplicate check on deleted file path when import files

Posted by "abmo-x (via GitHub)" <gi...@apache.org>.
abmo-x commented on code in PR #6889:
URL: https://github.com/apache/iceberg/pull/6889#discussion_r1112362213


##########
spark/v3.3/spark-extensions/src/test/java/org/apache/iceberg/spark/extensions/TestAddFilesProcedure.java:
##########
@@ -103,6 +103,35 @@ public void addDataUnpartitioned() {
         sql("SELECT * FROM %s ORDER BY id", tableName));
   }
 
+  @Test
+  public void deleteAndAddBackAllowed() {

Review Comment:
   wondering if there are any additional scenarios we want to test for, especially how this would behave w.r.t. merge-on-read, as paths will be duplicates




