You are viewing a plain text version of this content. The canonical link for it is here.

Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2023/01/12 07:48:38 UTC

[GitHub] [iceberg] nastra opened a new pull request, #6569: Spark: Add the query ID to file names

nastra opened a new pull request, #6569:
URL: https://github.com/apache/iceberg/pull/6569

   Co-authored-by: Ryan Blue <bl...@apache.org>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] nastra commented on a diff in pull request #6569: Spark: Add the query ID to file names

Posted by GitBox <gi...@apache.org>.

nastra commented on code in PR #6569:
URL: https://github.com/apache/iceberg/pull/6569#discussion_r1068373480


##########
spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/source/SparkPositionDeltaWrite.java:
##########
@@ -335,6 +335,7 @@ public DeltaWriter<InternalRow> createWriter(int partitionId, long taskId) {
       OutputFileFactory dataFileFactory =
           OutputFileFactory.builderFor(table, partitionId, taskId)
               .format(context.dataFileFormat())
+              .operationId(context.queryId())

Review Comment:
   https://github.com/apache/iceberg/blob/8c6adf6e5e17603025d23b2012aa576c071ff269/core/src/main/java/org/apache/iceberg/io/OutputFileFactory.java#L90 shows how the file name is being determined, and in the cases where the data file was overwritten, `partitionId / taskId / operationId` were all the same (since we manually set the same `operationId` as for data files - previously the `operationId` for data+delete files was randomly generated).
   
   Maybe we could add a different suffix into the name generation to indicate that it's a data/delete file (although I'm not sure if there are any implications to this)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] nastra commented on a diff in pull request #6569: Spark: Add the query ID to file names

Posted by GitBox <gi...@apache.org>.

nastra commented on code in PR #6569:
URL: https://github.com/apache/iceberg/pull/6569#discussion_r1071924681


##########
core/src/main/java/org/apache/iceberg/io/OutputFileFactory.java:
##########
@@ -143,12 +153,17 @@ public Builder format(FileFormat newFormat) {
       return this;
     }
 
+    public Builder suffix(String newSuffix) {
+      this.suffix = newSuffix;
+      return this;
+    }
+
     public OutputFileFactory build() {
       LocationProvider locations = table.locationProvider();
       FileIO io = table.io();
       EncryptionManager encryption = table.encryption();
       return new OutputFileFactory(
-          defaultSpec, format, locations, io, encryption, partitionId, taskId, operationId);
+          defaultSpec, format, locations, io, encryption, partitionId, taskId, operationId, suffix);

Review Comment:
   most/all of the tests that extend `SparkRowLevelOperationsTestBase` were failing when reading data (after delete files have been written)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] nastra commented on a diff in pull request #6569: Spark: Add the query ID to file names

Posted by GitBox <gi...@apache.org>.

nastra commented on code in PR #6569:
URL: https://github.com/apache/iceberg/pull/6569#discussion_r1071925326


##########
spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/source/SparkPositionDeltaWrite.java:
##########
@@ -335,6 +335,7 @@ public DeltaWriter<InternalRow> createWriter(int partitionId, long taskId) {
       OutputFileFactory dataFileFactory =
           OutputFileFactory.builderFor(table, partitionId, taskId)
               .format(context.dataFileFormat())
+              .operationId(context.queryId())

Review Comment:
   yes correct, they were using different output file factories



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] rdblue commented on a diff in pull request #6569: Spark: Add the query ID to file names

Posted by GitBox <gi...@apache.org>.

rdblue commented on code in PR #6569:
URL: https://github.com/apache/iceberg/pull/6569#discussion_r1071586466


##########
spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/source/SparkPositionDeltaWrite.java:
##########
@@ -335,6 +335,7 @@ public DeltaWriter<InternalRow> createWriter(int partitionId, long taskId) {
       OutputFileFactory dataFileFactory =
           OutputFileFactory.builderFor(table, partitionId, taskId)
               .format(context.dataFileFormat())
+              .operationId(context.queryId())

Review Comment:
   @nastra, why wasn't the `fileCount.incrementAndGet()` taking care of the problem? Were the two paths using different output file factories or the same one?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] nastra commented on a diff in pull request #6569: Spark: Add the query ID to file names

Posted by GitBox <gi...@apache.org>.

nastra commented on code in PR #6569:
URL: https://github.com/apache/iceberg/pull/6569#discussion_r1067894522


##########
spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/source/SparkPositionDeltaWrite.java:
##########
@@ -335,6 +335,7 @@ public DeltaWriter<InternalRow> createWriter(int partitionId, long taskId) {
       OutputFileFactory dataFileFactory =
           OutputFileFactory.builderFor(table, partitionId, taskId)
               .format(context.dataFileFormat())
+              .operationId(context.queryId())

Review Comment:
   @rdblue just FYI since you had this in your initial version. I have not added this to the `deleteFileFactory` below because otherwise delete writers are overwriting the data files (due to having the exact same file name)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] rdblue commented on pull request #6569: Spark: Add the query ID to file names

Posted by GitBox <gi...@apache.org>.

rdblue commented on PR #6569:
URL: https://github.com/apache/iceberg/pull/6569#issuecomment-1385719707

   Thanks for picking this up, @nastra! Looks great.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] nastra commented on a diff in pull request #6569: Spark: Add the query ID to file names

Posted by GitBox <gi...@apache.org>.

nastra commented on code in PR #6569:
URL: https://github.com/apache/iceberg/pull/6569#discussion_r1071926775


##########
spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/source/SparkPositionDeltaWrite.java:
##########
@@ -335,10 +335,14 @@ public DeltaWriter<InternalRow> createWriter(int partitionId, long taskId) {
       OutputFileFactory dataFileFactory =
           OutputFileFactory.builderFor(table, partitionId, taskId)
               .format(context.dataFileFormat())
+              .operationId(context.queryId())
+              .suffix("data")

Review Comment:
   sure, I've removed all `.suffix("data")` from the code so that we only have `.suffix("deletes")`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] rdblue commented on a diff in pull request #6569: Spark: Add the query ID to file names

Posted by GitBox <gi...@apache.org>.

rdblue commented on code in PR #6569:
URL: https://github.com/apache/iceberg/pull/6569#discussion_r1071587694


##########
core/src/main/java/org/apache/iceberg/io/OutputFileFactory.java:
##########
@@ -143,12 +153,17 @@ public Builder format(FileFormat newFormat) {
       return this;
     }
 
+    public Builder suffix(String newSuffix) {
+      this.suffix = newSuffix;
+      return this;
+    }
+
     public OutputFileFactory build() {
       LocationProvider locations = table.locationProvider();
       FileIO io = table.io();
       EncryptionManager encryption = table.encryption();
       return new OutputFileFactory(
-          defaultSpec, format, locations, io, encryption, partitionId, taskId, operationId);
+          defaultSpec, format, locations, io, encryption, partitionId, taskId, operationId, suffix);

Review Comment:
   Which tests caught the need for the suffix?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] rdblue merged pull request #6569: Spark: Add the query ID to file names

Posted by GitBox <gi...@apache.org>.

rdblue merged PR #6569:
URL: https://github.com/apache/iceberg/pull/6569


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] rdblue commented on a diff in pull request #6569: Spark: Add the query ID to file names

Posted by GitBox <gi...@apache.org>.

rdblue commented on code in PR #6569:
URL: https://github.com/apache/iceberg/pull/6569#discussion_r1068328104


##########
spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/source/SparkPositionDeltaWrite.java:
##########
@@ -335,6 +335,7 @@ public DeltaWriter<InternalRow> createWriter(int partitionId, long taskId) {
       OutputFileFactory dataFileFactory =
           OutputFileFactory.builderFor(table, partitionId, taskId)
               .format(context.dataFileFormat())
+              .operationId(context.queryId())

Review Comment:
   This seems like an issue. The delete file writers must use a different `OutputFileFactory` and that's why the file counter is not unique?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] rdblue commented on a diff in pull request #6569: Spark: Add the query ID to file names

Posted by GitBox <gi...@apache.org>.

rdblue commented on code in PR #6569:
URL: https://github.com/apache/iceberg/pull/6569#discussion_r1071587465


##########
spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/source/SparkPositionDeltaWrite.java:
##########
@@ -335,10 +335,14 @@ public DeltaWriter<InternalRow> createWriter(int partitionId, long taskId) {
       OutputFileFactory dataFileFactory =
           OutputFileFactory.builderFor(table, partitionId, taskId)
               .format(context.dataFileFormat())
+              .operationId(context.queryId())
+              .suffix("data")

Review Comment:
   Can we leave the suffix out for data files? That way we don't make names any longer and don't change the format for most of the names in the table.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] rdblue commented on a diff in pull request #6569: Spark: Add the query ID to file names

Posted by GitBox <gi...@apache.org>.

rdblue commented on code in PR #6569:
URL: https://github.com/apache/iceberg/pull/6569#discussion_r1071586945


##########
spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/source/SparkPositionDeltaWrite.java:
##########
@@ -335,6 +335,7 @@ public DeltaWriter<InternalRow> createWriter(int partitionId, long taskId) {
       OutputFileFactory dataFileFactory =
           OutputFileFactory.builderFor(table, partitionId, taskId)
               .format(context.dataFileFormat())
+              .operationId(context.queryId())

Review Comment:
   Okay, I see below that they were using a different factory.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] nastra commented on pull request #6569: Spark: Add the query ID to file names

Posted by GitBox <gi...@apache.org>.

nastra commented on PR #6569:
URL: https://github.com/apache/iceberg/pull/6569#issuecomment-1385046987

   @rdblue thanks for the review. I've removed all `.suffix("data")` calls so that we only have `.suffix("deletes")` now.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org