You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2022/12/15 10:42:58 UTC

[GitHub] [iceberg] rbalamohan opened a new pull request, #6432: Consider moving to ParallelIterable in Deletes::toPositionIndex

rbalamohan opened a new pull request, #6432:
URL: https://github.com/apache/iceberg/pull/6432

   Issue: https://github.com/apache/iceberg/issues/6387
   
   When tables are updated in "merge-on-read" mode, it creates positional delete files. Performance of reads degrades quite a bit, even with 4+ positional delete files (I tried with tpcds queries).
   
   Depending on workload, data file may have to read multiple "positional delete" files to construct delete positions. This does not sound expensive, but when large number of medium sized files are present in a partition, combinedfiletask ends up with many files. So a task has to process the data files in sequential fashion and every data file reads multiple positional delete file causing slowness.
   
   PR uses "ParallelIterable" in "Deletes::toPositionIndex". RoaringBitMap isn't threadsafe and hence added sync in BitmapPositionDeleteIndex. Tried out in local cluster and confirmed that this is not expensive.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


Re: [PR] Consider moving to ParallelIterable in Deletes::toPositionIndex [iceberg]

Posted by "aokolnychyi (via GitHub)" <gi...@apache.org>.
aokolnychyi commented on PR #6432:
URL: https://github.com/apache/iceberg/pull/6432#issuecomment-1758777424

   I think we should use a thread pool but I think the implementation should be changed a bit. I explain [here](https://docs.google.com/document/d/1M4L6o-qnGRwGhbhkW8BnravoTwvCrJV8VvzVQDRJO5I).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a diff in pull request #6432: Consider moving to ParallelIterable in Deletes::toPositionIndex

Posted by GitBox <gi...@apache.org>.
rdblue commented on code in PR #6432:
URL: https://github.com/apache/iceberg/pull/6432#discussion_r1055834074


##########
core/src/main/java/org/apache/iceberg/deletes/Deletes.java:
##########
@@ -144,7 +162,17 @@ public static <T extends StructLike> PositionDeleteIndex toPositionIndex(
             deletes ->
                 CloseableIterable.transform(
                     locationFilter.filter(deletes), row -> (Long) POSITION_ACCESSOR.get(row)));
-    return toPositionIndex(CloseableIterable.concat(positions));
+    return toPositionIndex(positions);
+  }
+
+  public static PositionDeleteIndex toPositionIndex(List<CloseableIterable<Long>> positions) {
+    PositionDeleteIndex positionDeleteIndex = new BitmapPositionDeleteIndex();

Review Comment:
   I don't think this should duplicate the logic of the other method. It should still concatenate into a `CloseableIterable`. But if there is more than one `CloseableIterable` it should use a `ParallelIterable` to do it.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rbalamohan commented on a diff in pull request #6432: Consider moving to ParallelIterable in Deletes::toPositionIndex

Posted by GitBox <gi...@apache.org>.
rbalamohan commented on code in PR #6432:
URL: https://github.com/apache/iceberg/pull/6432#discussion_r1059916781


##########
core/src/main/java/org/apache/iceberg/SystemProperties.java:
##########
@@ -41,7 +41,10 @@ private SystemProperties() {}
 
   public static final int IO_MANIFEST_CACHE_MAX_FILEIO_DEFAULT = 8;
 
-  static boolean getBoolean(String systemProperty, boolean defaultValue) {
+  public static final String DELETE_POS_FILES_THREADS_ENABLED =

Review Comment:
   Using sys properties similar to worker threads in ThreadPools codebase. Making engine option wasn't straight forward via Deletes codebase. It may be easier to start with sys properties (similar to worker threads) and move to other mode if needed later?.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rbalamohan commented on pull request #6432: Consider moving to ParallelIterable in Deletes::toPositionIndex

Posted by GitBox <gi...@apache.org>.
rbalamohan commented on PR #6432:
URL: https://github.com/apache/iceberg/pull/6432#issuecomment-1386579067

   Addressed review comments in latest PR.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on pull request #6432: Consider moving to ParallelIterable in Deletes::toPositionIndex

Posted by "rdblue (via GitHub)" <gi...@apache.org>.
rdblue commented on PR #6432:
URL: https://github.com/apache/iceberg/pull/6432#issuecomment-1424698512

   @rbalamohan, looks like this is causing tests to fail consistently. Can you look into the memory issue?
   
   FYI @nastra.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a diff in pull request #6432: Consider moving to ParallelIterable in Deletes::toPositionIndex

Posted by GitBox <gi...@apache.org>.
rdblue commented on code in PR #6432:
URL: https://github.com/apache/iceberg/pull/6432#discussion_r1051684443


##########
core/src/main/java/org/apache/iceberg/deletes/Deletes.java:
##########
@@ -144,7 +146,18 @@ public static <T extends StructLike> PositionDeleteIndex toPositionIndex(
             deletes ->
                 CloseableIterable.transform(
                     locationFilter.filter(deletes), row -> (Long) POSITION_ACCESSOR.get(row)));
-    return toPositionIndex(CloseableIterable.concat(positions));
+    return toPositionIndex(positions);
+  }
+
+  public static PositionDeleteIndex toPositionIndex(List<CloseableIterable<Long>> positions) {

Review Comment:
   Yes, I understand the purpose of the PR. Here, there should be a check so that ParallelIterable is only used when there is more than one delete file to read.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a diff in pull request #6432: Consider moving to ParallelIterable in Deletes::toPositionIndex

Posted by GitBox <gi...@apache.org>.
rdblue commented on code in PR #6432:
URL: https://github.com/apache/iceberg/pull/6432#discussion_r1060863547


##########
core/src/main/java/org/apache/iceberg/util/ThreadPools.java:
##########
@@ -56,6 +56,22 @@ public static ExecutorService getWorkerPool() {
     return WORKER_POOL;
   }
 
+  /**
+   * Returns delete worker pool for Positional Deletes.
+   *
+   * <p>Size of this thread-pool is controlled by System Property {@code
+   * iceberg.delete.pos.read-worker-pool}.
+   *
+   * @return {@link ExecutorService} that is delete worker pool
+   */
+  public static ExecutorService newDeleteWorkerPool() {

Review Comment:
   If this is going to behave like the worker pool, then it should also be a static, shared pool. Otherwise, this will create multiple pools that have the same name.
   
   It should either behave like the existing `getWorkerPool` or like the existing `newWorkerPool` that requires a name.



##########
core/src/main/java/org/apache/iceberg/util/ThreadPools.java:
##########
@@ -56,6 +56,22 @@ public static ExecutorService getWorkerPool() {
     return WORKER_POOL;
   }
 
+  /**
+   * Returns delete worker pool for Positional Deletes.
+   *
+   * <p>Size of this thread-pool is controlled by System Property {@code
+   * iceberg.delete.pos.read-worker-pool}.
+   *
+   * @return {@link ExecutorService} that is delete worker pool
+   */
+  public static ExecutorService newDeleteWorkerPool() {

Review Comment:
   If this is going to behave like the worker pool, then it should also be a static, shared pool. Otherwise, this will create multiple pools that have the same name.
   
   It should either behave like the existing `getWorkerPool` or like the existing `newWorkerPool` that requires a name and a size.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rbalamohan commented on a diff in pull request #6432: Consider moving to ParallelIterable in Deletes::toPositionIndex

Posted by GitBox <gi...@apache.org>.
rbalamohan commented on code in PR #6432:
URL: https://github.com/apache/iceberg/pull/6432#discussion_r1059915065


##########
core/src/main/java/org/apache/iceberg/deletes/Deletes.java:
##########
@@ -135,6 +142,17 @@ public static StructLikeSet toEqualitySet(
     }
   }
 
+  static ExecutorService getDeletePosThreadPool() {

Review Comment:
   Addressed this in recent PR. Moved it to ThreadPools.newDeleteWorkerPool().



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a diff in pull request #6432: Consider moving to ParallelIterable in Deletes::toPositionIndex

Posted by GitBox <gi...@apache.org>.
rdblue commented on code in PR #6432:
URL: https://github.com/apache/iceberg/pull/6432#discussion_r1060864920


##########
core/src/main/java/org/apache/iceberg/deletes/Deletes.java:
##########
@@ -49,6 +53,8 @@ public class Deletes {
   private static final Accessor<StructLike> POSITION_ACCESSOR =
       POSITION_DELETE_SCHEMA.accessorForField(MetadataColumns.DELETE_FILE_POS.fieldId());
 
+  private static ExecutorService deletePosThreadPool = ThreadPools.newDeleteWorkerPool();

Review Comment:
   I think that this default pool should be managed by ThreadPools like the worker pool, rather than using a ThreadPools factory method to create them. See my comments in that class.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a diff in pull request #6432: Consider moving to ParallelIterable in Deletes::toPositionIndex

Posted by GitBox <gi...@apache.org>.
rdblue commented on code in PR #6432:
URL: https://github.com/apache/iceberg/pull/6432#discussion_r1060864469


##########
core/src/main/java/org/apache/iceberg/deletes/Deletes.java:
##########
@@ -135,6 +141,11 @@ public static StructLikeSet toEqualitySet(
     }
   }
 
+  @VisibleForTesting
+  static void resetDeletePosThreadPool() {
+    deletePosThreadPool = ThreadPools.newDeleteWorkerPool();

Review Comment:
   I think it is better to allow tests to pass in a custom pool that is used and cleaned up rather than adding something that replaces the static pool.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rbalamohan commented on pull request #6432: Consider moving to ParallelIterable in Deletes::toPositionIndex

Posted by GitBox <gi...@apache.org>.
rbalamohan commented on PR #6432:
URL: https://github.com/apache/iceberg/pull/6432#issuecomment-1375173330

   Changes:
   
   1. "Delete threadpool" is managed in "ThreadPools" class. This is very similar to worker thread pool. Controlled by "iceberg.delete-pos.worker.num-threads" system property. Again this is very similar to worker threadpool.
   
   2. When delete files are > 1 & delete threadpool is available, "Deletes::toPositionIndex" makes use of parallelIterable. 
   
   3. Testcase explicitly sets threadpool for test purposes alone. It does not interfere with ThreadPools.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on pull request #6432: Consider moving to ParallelIterable in Deletes::toPositionIndex

Posted by "rdblue (via GitHub)" <gi...@apache.org>.
rdblue commented on PR #6432:
URL: https://github.com/apache/iceberg/pull/6432#issuecomment-1410946484

   The changes look good, but Spark OOMed. I kicked off the tests to run again. This could be something to fix since the parallelism will require copying delete rows.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rbalamohan commented on a diff in pull request #6432: Consider moving to ParallelIterable in Deletes::toPositionIndex

Posted by GitBox <gi...@apache.org>.
rbalamohan commented on code in PR #6432:
URL: https://github.com/apache/iceberg/pull/6432#discussion_r1051680991


##########
core/src/main/java/org/apache/iceberg/deletes/Deletes.java:
##########
@@ -144,7 +146,18 @@ public static <T extends StructLike> PositionDeleteIndex toPositionIndex(
             deletes ->
                 CloseableIterable.transform(
                     locationFilter.filter(deletes), row -> (Long) POSITION_ACCESSOR.get(row)));
-    return toPositionIndex(CloseableIterable.concat(positions));
+    return toPositionIndex(positions);
+  }
+
+  public static PositionDeleteIndex toPositionIndex(List<CloseableIterable<Long>> positions) {

Review Comment:
   Thanks @rdblue. Yes, this happens when there are more than one "delete positional file" that qualifies for the data file. E.g Assume a trickle feed job ingests data into the partition. Due to late arriving data, another job updates the dataset for certain dataset in the partition & creates "positional files (POS)".  For update jobs with different criteria, same data file may get qualified and creates additional POS files.  Essentially during scanning, one data file may have to scan multiple POS files (e.g 4 pos files) and causes slowness. ParallelIterable helps in this case. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a diff in pull request #6432: Consider moving to ParallelIterable in Deletes::toPositionIndex

Posted by "aokolnychyi (via GitHub)" <gi...@apache.org>.
aokolnychyi commented on code in PR #6432:
URL: https://github.com/apache/iceberg/pull/6432#discussion_r1102290538


##########
core/src/main/java/org/apache/iceberg/SystemProperties.java:
##########
@@ -41,6 +41,9 @@ private SystemProperties() {}
 
   public static final int IO_MANIFEST_CACHE_MAX_FILEIO_DEFAULT = 8;
 
+  public static final String DELETE_POS_FILES_THREAD_POOL_SIZE =

Review Comment:
   The name of the variable seems to be specific to position deletes while the rest of the names are not. Is that deliberate?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


Re: [PR] Consider moving to ParallelIterable in Deletes::toPositionIndex [iceberg]

Posted by "rdblue (via GitHub)" <gi...@apache.org>.
rdblue commented on PR #6432:
URL: https://github.com/apache/iceberg/pull/6432#issuecomment-1793809960

   #8805 was merged so I'll close this. I should also note that @aokolnychyi raised some concerns about this approach instead of a more comprehensive fix. This is probably a good start if we don't add further caching.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rbalamohan commented on pull request #6432: Consider moving to ParallelIterable in Deletes::toPositionIndex

Posted by "rbalamohan (via GitHub)" <gi...@apache.org>.
rbalamohan commented on PR #6432:
URL: https://github.com/apache/iceberg/pull/6432#issuecomment-1425090000

   I will check making the default threads to 4 instead of "number of processors".  In larger systems, users have the flexibility to change it via system property.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a diff in pull request #6432: Consider moving to ParallelIterable in Deletes::toPositionIndex

Posted by GitBox <gi...@apache.org>.
rdblue commented on code in PR #6432:
URL: https://github.com/apache/iceberg/pull/6432#discussion_r1072965803


##########
core/src/main/java/org/apache/iceberg/SystemProperties.java:
##########
@@ -41,7 +41,10 @@ private SystemProperties() {}
 
   public static final int IO_MANIFEST_CACHE_MAX_FILEIO_DEFAULT = 8;
 
-  static boolean getBoolean(String systemProperty, boolean defaultValue) {
+  public static final String DELETE_POS_FILES_THREAD_POOL_SIZE =
+      "iceberg.delete-pos.worker.num-threads";

Review Comment:
   How about staying with the `iceberg.worker` prefix and adding `iceberg.worker.delete-num-threads` or something? Is that good?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on pull request #6432: Consider moving to ParallelIterable in Deletes::toPositionIndex

Posted by GitBox <gi...@apache.org>.
rdblue commented on PR #6432:
URL: https://github.com/apache/iceberg/pull/6432#issuecomment-1386282157

   Thanks, @rbalamohan! It's close, just a couple more things to fix.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


Re: [PR] Consider moving to ParallelIterable in Deletes::toPositionIndex [iceberg]

Posted by "rdblue (via GitHub)" <gi...@apache.org>.
rdblue closed pull request #6432: Consider moving to ParallelIterable in Deletes::toPositionIndex 
URL: https://github.com/apache/iceberg/pull/6432


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a diff in pull request #6432: Consider moving to ParallelIterable in Deletes::toPositionIndex

Posted by GitBox <gi...@apache.org>.
rdblue commented on code in PR #6432:
URL: https://github.com/apache/iceberg/pull/6432#discussion_r1051677178


##########
core/src/main/java/org/apache/iceberg/deletes/Deletes.java:
##########
@@ -144,7 +146,18 @@ public static <T extends StructLike> PositionDeleteIndex toPositionIndex(
             deletes ->
                 CloseableIterable.transform(
                     locationFilter.filter(deletes), row -> (Long) POSITION_ACCESSOR.get(row)));
-    return toPositionIndex(CloseableIterable.concat(positions));
+    return toPositionIndex(positions);
+  }
+
+  public static PositionDeleteIndex toPositionIndex(List<CloseableIterable<Long>> positions) {

Review Comment:
   Can you have this check whether there is more than one iterable? There's no need to use ParallelIterable if there's only one.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a diff in pull request #6432: Consider moving to ParallelIterable in Deletes::toPositionIndex

Posted by GitBox <gi...@apache.org>.
rdblue commented on code in PR #6432:
URL: https://github.com/apache/iceberg/pull/6432#discussion_r1055831616


##########
core/src/main/java/org/apache/iceberg/SystemProperties.java:
##########
@@ -41,7 +41,10 @@ private SystemProperties() {}
 
   public static final int IO_MANIFEST_CACHE_MAX_FILEIO_DEFAULT = 8;
 
-  static boolean getBoolean(String systemProperty, boolean defaultValue) {
+  public static final String DELETE_POS_FILES_THREADS_ENABLED =

Review Comment:
   I think this should be an engine option rather than a System property. @aokolnychyi what do you think?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on pull request #6432: Consider moving to ParallelIterable in Deletes::toPositionIndex

Posted by GitBox <gi...@apache.org>.
rdblue commented on PR #6432:
URL: https://github.com/apache/iceberg/pull/6432#issuecomment-1356891084

   I think this looks good. I like how small the change is. Do you think it would be easy to also add a config flag to enable/disable this? I think it technically violates Spark's threading model (where tasks use a single thread) but it does seem useful as an option that can be controlled by admins.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rbalamohan commented on a diff in pull request #6432: Consider moving to ParallelIterable in Deletes::toPositionIndex

Posted by GitBox <gi...@apache.org>.
rbalamohan commented on code in PR #6432:
URL: https://github.com/apache/iceberg/pull/6432#discussion_r1059915385


##########
core/src/main/java/org/apache/iceberg/deletes/Deletes.java:
##########
@@ -135,6 +142,17 @@ public static StructLikeSet toEqualitySet(
     }
   }
 
+  static ExecutorService getDeletePosThreadPool() {
+    return (SystemProperties.getBoolean(SystemProperties.DELETE_POS_FILES_THREADS_ENABLED, false))
+        ? ThreadPools.newWorkerPool(SystemProperties.DELETE_POS_FILES_THREADS_ENABLED)
+        : null;

Review Comment:
   Addressed this in recent PR. Moved it to ThreadPools.newDeleteWorkerPool



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on pull request #6432: Consider moving to ParallelIterable in Deletes::toPositionIndex

Posted by "rdblue (via GitHub)" <gi...@apache.org>.
rdblue commented on PR #6432:
URL: https://github.com/apache/iceberg/pull/6432#issuecomment-1712958775

   @deniskuzZ, you don't necessarily need to turn off container reuse. You can also clone matching items, which is what we do for `DataFile` instances during job planning.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rbalamohan commented on a diff in pull request #6432: Consider moving to ParallelIterable in Deletes::toPositionIndex

Posted by GitBox <gi...@apache.org>.
rbalamohan commented on code in PR #6432:
URL: https://github.com/apache/iceberg/pull/6432#discussion_r1052031104


##########
core/src/main/java/org/apache/iceberg/deletes/Deletes.java:
##########
@@ -144,7 +146,18 @@ public static <T extends StructLike> PositionDeleteIndex toPositionIndex(
             deletes ->
                 CloseableIterable.transform(
                     locationFilter.filter(deletes), row -> (Long) POSITION_ACCESSOR.get(row)));
-    return toPositionIndex(CloseableIterable.concat(positions));
+    return toPositionIndex(positions);
+  }
+
+  public static PositionDeleteIndex toPositionIndex(List<CloseableIterable<Long>> positions) {

Review Comment:
   Addressed it in the recent PR. Also, made it as a configurable option via sys properties. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a diff in pull request #6432: Consider moving to ParallelIterable in Deletes::toPositionIndex

Posted by GitBox <gi...@apache.org>.
rdblue commented on code in PR #6432:
URL: https://github.com/apache/iceberg/pull/6432#discussion_r1072965362


##########
core/src/main/java/org/apache/iceberg/SystemProperties.java:
##########
@@ -41,7 +41,10 @@ private SystemProperties() {}
 
   public static final int IO_MANIFEST_CACHE_MAX_FILEIO_DEFAULT = 8;
 
-  static boolean getBoolean(String systemProperty, boolean defaultValue) {
+  public static final String DELETE_POS_FILES_THREAD_POOL_SIZE =
+      "iceberg.delete-pos.worker.num-threads";
+
+  public static boolean getBoolean(String systemProperty, boolean defaultValue) {

Review Comment:
   It doesn't look like this needs to be exposed as public by this PR.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on pull request #6432: Consider moving to ParallelIterable in Deletes::toPositionIndex

Posted by "aokolnychyi (via GitHub)" <gi...@apache.org>.
aokolnychyi commented on PR #6432:
URL: https://github.com/apache/iceberg/pull/6432#issuecomment-1425211289

   @rbalamohan, do you have the same position files that are read over and over again for different data files in a combined scan task? Or is it mostly unique delete files per each data file?
   
   I am a bit concerned about using a thread pool on executors. Let me take a look with fresh eyes tomorrow. I wonder whether we can cache the result of reading a particular delete file (like bitmap or whatever is constructed) and reuse that instead of parallelizing reading deletes.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] deniskuzZ commented on pull request #6432: Consider moving to ParallelIterable in Deletes::toPositionIndex

Posted by "deniskuzZ (via GitHub)" <gi...@apache.org>.
deniskuzZ commented on PR #6432:
URL: https://github.com/apache/iceberg/pull/6432#issuecomment-1519993615

   > @rbalamohan, do you have the same position files that are read over and over again for different data files in a combined scan task? Or is it mostly unique delete files per each data file?
   > 
   > I am a bit concerned about using a thread pool on executors. Let me take a look with fresh eyes tomorrow. I wonder whether we can cache the result of reading a particular delete file (like bitmap or whatever is constructed) and reuse that when the same delete file must be read for different data files instead of parallelizing reading deletes.
   
   @aokolnychyi , if you'll have time please take a look at https://github.com/apache/iceberg/pull/6527 which introduces the caching for positional deletes.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rbalamohan commented on pull request #6432: Consider moving to ParallelIterable in Deletes::toPositionIndex

Posted by "rbalamohan (via GitHub)" <gi...@apache.org>.
rbalamohan commented on PR #6432:
URL: https://github.com/apache/iceberg/pull/6432#issuecomment-1704410065

   It will be helpful, if someone can take it up.  Fixing this will help to some extent on merge on read, but for production usecases MOR very easily degrades in performance.  I am not working on this anymore @rdblue . 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on pull request #6432: Consider moving to ParallelIterable in Deletes::toPositionIndex

Posted by "rdblue (via GitHub)" <gi...@apache.org>.
rdblue commented on PR #6432:
URL: https://github.com/apache/iceberg/pull/6432#issuecomment-1704352562

   @rbalamohan, can you get tests working? I think this is a good thing to get into the next release if we can.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a diff in pull request #6432: Consider moving to ParallelIterable in Deletes::toPositionIndex

Posted by GitBox <gi...@apache.org>.
rdblue commented on code in PR #6432:
URL: https://github.com/apache/iceberg/pull/6432#discussion_r1055831895


##########
core/src/main/java/org/apache/iceberg/deletes/Deletes.java:
##########
@@ -135,6 +142,17 @@ public static StructLikeSet toEqualitySet(
     }
   }
 
+  static ExecutorService getDeletePosThreadPool() {

Review Comment:
   Iceberg doesn't use `get` in method names. You can either omit it or use a more specific verb.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a diff in pull request #6432: Consider moving to ParallelIterable in Deletes::toPositionIndex

Posted by GitBox <gi...@apache.org>.
rdblue commented on code in PR #6432:
URL: https://github.com/apache/iceberg/pull/6432#discussion_r1055832837


##########
core/src/main/java/org/apache/iceberg/deletes/Deletes.java:
##########
@@ -135,6 +142,17 @@ public static StructLikeSet toEqualitySet(
     }
   }
 
+  static ExecutorService getDeletePosThreadPool() {
+    return (SystemProperties.getBoolean(SystemProperties.DELETE_POS_FILES_THREADS_ENABLED, false))
+        ? ThreadPools.newWorkerPool(SystemProperties.DELETE_POS_FILES_THREADS_ENABLED)
+        : null;

Review Comment:
   Should this be in ThreadPools like the worker pool? I think setting pool size there like we do for the worker pool makes sense.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a diff in pull request #6432: Consider moving to ParallelIterable in Deletes::toPositionIndex

Posted by GitBox <gi...@apache.org>.
rdblue commented on code in PR #6432:
URL: https://github.com/apache/iceberg/pull/6432#discussion_r1072967562


##########
core/src/main/java/org/apache/iceberg/deletes/Deletes.java:
##########
@@ -135,6 +141,11 @@ public static StructLikeSet toEqualitySet(
     }
   }
 
+  @VisibleForTesting
+  static void setDeletePosThreadPool(ExecutorService threadPool) {

Review Comment:
   This is okay, but I think it is better to be able to set the threadpool when running an operation that uses one instead of modifying shared state.
   
   How about changing `toPositionIndex` like this?
   
   ```java
     public static <T extends StructLike> PositionDeleteIndex toPositionIndex(
         CharSequence dataLocation, List<CloseableIterable<T>> deleteFiles) {
       return toPositionIndex(dataLocation, deleteFiles, ThreadPools.getDeleteWorkerPool());
     }
   
     public static <T extends StructLike> PositionDeleteIndex toPositionIndex(
         CharSequence dataLocation, List<CloseableIterable<T>> deleteFiles, ExecutorService threadPool) {
       ...
       if (positions.size() > 1 && threadPool != null) {
         return toPositionIndex(new ParallelIterable<>(positions, deletePosThreadPool));
       } else {
         return toPositionIndex(CloseableIterable.concat(positions)));
       }
     }
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rbalamohan commented on a diff in pull request #6432: Consider moving to ParallelIterable in Deletes::toPositionIndex

Posted by GitBox <gi...@apache.org>.
rbalamohan commented on code in PR #6432:
URL: https://github.com/apache/iceberg/pull/6432#discussion_r1059915558


##########
core/src/main/java/org/apache/iceberg/deletes/Deletes.java:
##########
@@ -144,7 +162,17 @@ public static <T extends StructLike> PositionDeleteIndex toPositionIndex(
             deletes ->
                 CloseableIterable.transform(
                     locationFilter.filter(deletes), row -> (Long) POSITION_ACCESSOR.get(row)));
-    return toPositionIndex(CloseableIterable.concat(positions));
+    return toPositionIndex(positions);
+  }
+
+  public static PositionDeleteIndex toPositionIndex(List<CloseableIterable<Long>> positions) {
+    PositionDeleteIndex positionDeleteIndex = new BitmapPositionDeleteIndex();

Review Comment:
   Refactored this in the recent PR.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a diff in pull request #6432: Consider moving to ParallelIterable in Deletes::toPositionIndex

Posted by "aokolnychyi (via GitHub)" <gi...@apache.org>.
aokolnychyi commented on code in PR #6432:
URL: https://github.com/apache/iceberg/pull/6432#discussion_r1102289279


##########
core/src/main/java/org/apache/iceberg/deletes/Deletes.java:
##########
@@ -137,14 +140,24 @@ public static StructLikeSet toEqualitySet(
 
   public static <T extends StructLike> PositionDeleteIndex toPositionIndex(
       CharSequence dataLocation, List<CloseableIterable<T>> deleteFiles) {
+    return toPositionIndex(dataLocation, deleteFiles, ThreadPools.getDeleteWorkerPool());
+  }
+
+  public static <T extends StructLike> PositionDeleteIndex toPositionIndex(
+      CharSequence dataLocation,
+      List<CloseableIterable<T>> deleteFiles,
+      ExecutorService deletePosThreadPool) {
     DataFileFilter<T> locationFilter = new DataFileFilter<>(dataLocation);
     List<CloseableIterable<Long>> positions =
         Lists.transform(
             deleteFiles,
             deletes ->
                 CloseableIterable.transform(
                     locationFilter.filter(deletes), row -> (Long) POSITION_ACCESSOR.get(row)));
-    return toPositionIndex(CloseableIterable.concat(positions));
+    return toPositionIndex(
+        (positions.size() > 1 && (deletePosThreadPool != null))

Review Comment:
   I find this ternary operator hard to read. Can we switch to an explicit if else statement?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] deniskuzZ commented on pull request #6432: Consider moving to ParallelIterable in Deletes::toPositionIndex

Posted by "deniskuzZ (via GitHub)" <gi...@apache.org>.
deniskuzZ commented on PR #6432:
URL: https://github.com/apache/iceberg/pull/6432#issuecomment-1706404942

   ParallelIterable as it's currently designed can't be used with any of the record readers when reuseContainers is set to true. Got inconsistent data during reading when incorporating this patch downstream.
   
   This is the problematic place:
   ````
   for (T item : iterable) {
       queue.add(item);
   }
   ````
   As containers are reused we'll get just duplicate entries in a queue. We should either clone them in ParallelIterable or disable the reuseContainers.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on pull request #6432: Consider moving to ParallelIterable in Deletes::toPositionIndex

Posted by "rdblue (via GitHub)" <gi...@apache.org>.
rdblue commented on PR #6432:
URL: https://github.com/apache/iceberg/pull/6432#issuecomment-1712958858

   @aokolnychyi do you think this is something that will help or are your fixes going to alleviate the need for it?
   
   FYI @amogh-jahagirdar


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org