You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@iceberg.apache.org by "amogh-jahagirdar (via GitHub)" <gi...@apache.org> on 2024/04/26 03:39:08 UTC

[PR] Spark: Use radix trees when collecting files to driver [iceberg]

amogh-jahagirdar opened a new pull request, #10229:
URL: https://github.com/apache/iceberg/pull/10229

   Still testing to see the efficacy, but I think for orphan file removal if there's a large list of files which will have common prefixes on the driver it will be more memory efficient to use a compressed trie (radix tree). This can help prevent driver OOMs.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


Re: [PR] Spark: Use compressed trie for storing set of files to remove on driver for orphan files [iceberg]

Posted by "amogh-jahagirdar (via GitHub)" <gi...@apache.org>.
amogh-jahagirdar commented on code in PR #10229:
URL: https://github.com/apache/iceberg/pull/10229#discussion_r1581262173


##########
spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/actions/DeleteOrphanFilesSparkAction.java:
##########
@@ -398,13 +399,13 @@ static List<String> findOrphanFiles(
     spark.sparkContext().register(conflicts);
 
     Column joinCond = actualFileIdentDS.col("path").equalTo(validFileIdentDS.col("path"));
-
-    List<String> orphanFiles =
+    PatriciaTrie<String> files = new PatriciaTrie<>();

Review Comment:
   We don't really leverage the efficient search provided by tries for orphan files; rather I'm looking into the memory saving aspects of this structure when prefixes are repeated many times. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


Re: [PR] Spark: Use compressed trie for storing set of files to remove on driver for orphan files [iceberg]

Posted by "amogh-jahagirdar (via GitHub)" <gi...@apache.org>.
amogh-jahagirdar commented on code in PR #10229:
URL: https://github.com/apache/iceberg/pull/10229#discussion_r1581079152


##########
spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/actions/DeleteOrphanFilesSparkAction.java:
##########
@@ -398,12 +399,14 @@ static List<String> findOrphanFiles(
     spark.sparkContext().register(conflicts);
 
     Column joinCond = actualFileIdentDS.col("path").equalTo(validFileIdentDS.col("path"));
-
-    List<String> orphanFiles =
-        actualFileIdentDS
-            .joinWith(validFileIdentDS, joinCond, "leftouter")
-            .mapPartitions(new FindOrphanFiles(prefixMismatchMode, conflicts), Encoders.STRING())
-            .collectAsList();
+    PatriciaTrie<String> files = new PatriciaTrie<>();
+    actualFileIdentDS
+        .joinWith(validFileIdentDS, joinCond, "leftouter")
+        .mapPartitions(new FindOrphanFiles(prefixMismatchMode, conflicts), Encoders.STRING())
+        .foreach(
+            path -> {
+              files.put(path, null);

Review Comment:
   Foreach on this local reference won't work, in the end we have to collect it at which point all the string paths are materialized in memory.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


Re: [PR] Spark: Use compressed trie for storing set of files to remove on driver for orphan files [iceberg]

Posted by "amogh-jahagirdar (via GitHub)" <gi...@apache.org>.
amogh-jahagirdar commented on code in PR #10229:
URL: https://github.com/apache/iceberg/pull/10229#discussion_r1581081749


##########
spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/actions/DeleteOrphanFilesSparkAction.java:
##########
@@ -398,12 +399,14 @@ static List<String> findOrphanFiles(
     spark.sparkContext().register(conflicts);
 
     Column joinCond = actualFileIdentDS.col("path").equalTo(validFileIdentDS.col("path"));
-
-    List<String> orphanFiles =
-        actualFileIdentDS
-            .joinWith(validFileIdentDS, joinCond, "leftouter")
-            .mapPartitions(new FindOrphanFiles(prefixMismatchMode, conflicts), Encoders.STRING())
-            .collectAsList();
+    PatriciaTrie<String> files = new PatriciaTrie<>();
+    actualFileIdentDS
+        .joinWith(validFileIdentDS, joinCond, "leftouter")
+        .mapPartitions(new FindOrphanFiles(prefixMismatchMode, conflicts), Encoders.STRING())
+        .foreach(
+            path -> {
+              files.put(path, null);

Review Comment:
   I think there's an API for an iterator on the results, may be worth trying that and adding to the structure



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


Re: [PR] Spark: Use compressed trie for storing set of files to remove on driver for orphan files [iceberg]

Posted by "amogh-jahagirdar (via GitHub)" <gi...@apache.org>.
amogh-jahagirdar commented on code in PR #10229:
URL: https://github.com/apache/iceberg/pull/10229#discussion_r1581496884


##########
spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/actions/DeleteOrphanFilesSparkAction.java:
##########
@@ -398,13 +399,13 @@ static List<String> findOrphanFiles(
     spark.sparkContext().register(conflicts);
 
     Column joinCond = actualFileIdentDS.col("path").equalTo(validFileIdentDS.col("path"));
-
-    List<String> orphanFiles =
+    PatriciaTrie<String> files = new PatriciaTrie<>();

Review Comment:
   We also do a listing of files on the driver, I'll explore using the structure there as well.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org