You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@iceberg.apache.org by "bryanck (via GitHub)" <gi...@apache.org> on 2023/04/01 23:59:58 UTC

[GitHub] [iceberg] bryanck opened a new pull request, #7263: Spark: broadcast table instead of file IO in rewrite manifests

bryanck opened a new pull request, #7263:
URL: https://github.com/apache/iceberg/pull/7263

   This PR changes the Spark rewrite manifest action to broadcast a `Table` instead of a `FileIO`. This issue with `FileIO` is that it implements `Closeable`, so when Spark is removing the broadcast, it automatically calls `close()` on it. This happens in Spark's MemoryStore class [here](https://github.com/apache/spark/blob/309c65a353489b70bf5135a01d1f159328eaa3f3/core/src/main/scala/org/apache/spark/storage/memory/MemoryStore.scala#L397).
   
   With `S3FileIO` and using the Apache HTTP client, that has the consequence of closing the connection pool, and can cause subsequent request failures if there is an attempt to use the FileIO object again. This currently only impacts the rewrite manifests procedure, as it is the only action that broadcasts a FileIO rather than a Table.
   
   The `SparkUtil.serializableFileIO()`is not longer used by anything, but was left in place as it is public.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] bryanck commented on a diff in pull request #7263: Spark: broadcast table instead of file IO in rewrite manifests

Posted by "bryanck (via GitHub)" <gi...@apache.org>.
bryanck commented on code in PR #7263:
URL: https://github.com/apache/iceberg/pull/7263#discussion_r1156493961


##########
spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/actions/RewriteManifestsSparkAction.java:
##########
@@ -360,25 +358,26 @@ private void deleteFiles(Iterable<String> locations) {
         .noRetry()
         .suppressFailureWhenFinished()
         .onFailure((location, exc) -> LOG.warn("Failed to delete: {}", location, exc))
-        .run(fileIO::deleteFile);
+        .run(location -> table.io().deleteFile(location));
   }
 
   private static ManifestFile writeManifest(
       List<Row> rows,
       int startIndex,
       int endIndex,
-      Broadcast<FileIO> io,
+      Broadcast<Table> tableBroadcast,
       String location,
       int format,
       Types.StructType combinedPartitionType,
       PartitionSpec spec,
       StructType sparkType)
       throws IOException {
 
+    FileIO io = tableBroadcast.value().io();

Review Comment:
   Yes, sounds good, I made this change.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] jackye1995 commented on pull request #7263: Spark: broadcast table instead of file IO in rewrite manifests

Posted by "jackye1995 (via GitHub)" <gi...@apache.org>.
jackye1995 commented on PR #7263:
URL: https://github.com/apache/iceberg/pull/7263#issuecomment-1495299124

   Thanks Bryan, merging


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] bryanck commented on pull request #7263: Spark: broadcast table instead of file IO in rewrite manifests

Posted by "bryanck (via GitHub)" <gi...@apache.org>.
bryanck commented on PR #7263:
URL: https://github.com/apache/iceberg/pull/7263#issuecomment-1496390534

   > Nice find, wondering, do you guys think we should deprecate the API, (and/or make note in javadoc)?
   
   Yes I think so


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] jackye1995 merged pull request #7263: Spark: broadcast table instead of file IO in rewrite manifests

Posted by "jackye1995 (via GitHub)" <gi...@apache.org>.
jackye1995 merged PR #7263:
URL: https://github.com/apache/iceberg/pull/7263


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] szehon-ho commented on pull request #7263: Spark: broadcast table instead of file IO in rewrite manifests

Posted by "szehon-ho (via GitHub)" <gi...@apache.org>.
szehon-ho commented on PR #7263:
URL: https://github.com/apache/iceberg/pull/7263#issuecomment-1496313557

   Nice find, wondering, do you guys think we should deprecate the API, (and make note in javadoc)?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] jackye1995 commented on a diff in pull request #7263: Spark: broadcast table instead of file IO in rewrite manifests

Posted by "jackye1995 (via GitHub)" <gi...@apache.org>.
jackye1995 commented on code in PR #7263:
URL: https://github.com/apache/iceberg/pull/7263#discussion_r1156491757


##########
spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/actions/RewriteManifestsSparkAction.java:
##########
@@ -360,25 +358,26 @@ private void deleteFiles(Iterable<String> locations) {
         .noRetry()
         .suppressFailureWhenFinished()
         .onFailure((location, exc) -> LOG.warn("Failed to delete: {}", location, exc))
-        .run(fileIO::deleteFile);
+        .run(location -> table.io().deleteFile(location));
   }
 
   private static ManifestFile writeManifest(
       List<Row> rows,
       int startIndex,
       int endIndex,
-      Broadcast<FileIO> io,
+      Broadcast<Table> tableBroadcast,
       String location,
       int format,
       Types.StructType combinedPartitionType,
       PartitionSpec spec,
       StructType sparkType)
       throws IOException {
 
+    FileIO io = tableBroadcast.value().io();

Review Comment:
   since `io` is only used once, I think we can just put it inline at L380?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org