Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2022/06/27 15:02:31 UTC

[GitHub] [iceberg] RussellSpitzer opened a new issue, #5140: Add Automatic Concurrency Controls for Rewrite Datafiles Action

RussellSpitzer opened a new issue, #5140:
URL: https://github.com/apache/iceberg/issues/5140

   Currently we can only control the concurrency of a RewriteDatafiles Action by setting a static option:
   
   > By default the actions are executed serially, but can be run concurrently by increasing the value of **max-concurrent-file-group-rewrites**. This parameter controls the number of actions which will be run simultaneously.
   
   This makes it difficult to set the parameter properly when some partitions require multiple tasks to complete and others require only a single task. For example, a file group that ends up writing 10 files will require 10 tasks and 10 Spark cores, while another file group may have only a single output file and require only a single task.
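
To make the mismatch concrete, here is a small illustrative sketch. It assumes, purely for simplicity, that file groups run in consecutive batches of `max_concurrent` groups and that each task needs one core; the real action uses a thread pool rather than strict batches, and the function name `batch_core_demand` is hypothetical.

```python
def batch_core_demand(group_tasks, max_concurrent):
    """Cores requested by each consecutive batch of file groups.

    Assumption: one core per task, and groups are started in fixed
    batches of `max_concurrent` (a simplification of a thread pool).
    """
    return [
        sum(group_tasks[i:i + max_concurrent])
        for i in range(0, len(group_tasks), max_concurrent)
    ]


# Two large file groups (10 output files each) and four small ones.
group_tasks = [10, 10, 1, 1, 1, 1]
demand = batch_core_demand(group_tasks, max_concurrent=2)
print(demand)
```

With a static setting of 2, the first batch asks for 20 cores while the later batches ask for only 2 each, so no single value suits both kinds of file group.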
   
   Instead of using a static option, I think we should provide the option of allowing the Action to attempt to determine the number of open cores and schedule new jobs accordingly. This "auto" option would basically implement the following logic:
   
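   ```
   while (fileGroupsLeft is not empty) {
     scheduledAny = false
     for (group in fileGroupsLeft) {
       // Schedule every group that fits in the cores that are currently free
       if (group.tasks <= coresFree) {
         schedule(group)
         scheduledAny = true
       }
     }
     // Nothing was scheduled because every remaining group has group.tasks > totalCores
     if (!scheduledAny && all remaining groups have group.tasks > totalCores) {
       schedule(largestGroup)
     }
   }
   ```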
   Basically, we always schedule jobs whenever we have open cores; if all of the remaining jobs are too large, we just schedule the largest.
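
The loop above could be simulated roughly as follows. This is only a sketch under stated assumptions, not the Action's implementation: it treats each task as needing one core, models scheduling in rounds where all cores are free again at the start of each round (a real implementation would instead poll the cluster for free cores), and the name `auto_schedule` is hypothetical.

```python
def auto_schedule(group_tasks, total_cores):
    """Simulate the proposed "auto" scheduling in rounds.

    Returns a list of rounds; each round lists the task counts of the
    file groups started together. Assumption: every round begins with
    `total_cores` free cores, i.e. the previous round has finished.
    """
    pending = sorted(group_tasks, reverse=True)  # largest groups first
    rounds = []
    while pending:
        cores_free = total_cores
        started = []
        for tasks in list(pending):  # iterate over a copy while removing
            if tasks <= cores_free:
                started.append(tasks)
                cores_free -= tasks
                pending.remove(tasks)
        if not started:
            # Every remaining group needs more than total_cores:
            # fall back to running the largest group on its own.
            started.append(pending.pop(0))
        rounds.append(started)
    return rounds


print(auto_schedule([10, 10, 1, 1, 1, 1], total_cores=12))
print(auto_schedule([10, 10, 1, 1, 1, 1], total_cores=8))
```

With 12 cores, each round packs one large group together with two small ones. With 8 cores, the small groups run together first, and each large group is then scheduled alone through the fallback branch.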


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] Zhangg7723 commented on issue #5140: Add Automatic Concurrency Controls for Rewrite Datafiles Action

Posted by GitBox <gi...@apache.org>.
Zhangg7723 commented on issue #5140:
URL: https://github.com/apache/iceberg/issues/5140#issuecomment-1173429882

   Yes, the current design is coarse-grained. We are working on this point, and I can work on this issue.




[GitHub] [iceberg] github-actions[bot] commented on issue #5140: Add Automatic Concurrency Controls for Rewrite Datafiles Action

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on issue #5140:
URL: https://github.com/apache/iceberg/issues/5140#issuecomment-1396265364

   This issue has been closed because it has not received any activity in the last 14 days since being marked as 'stale'




[GitHub] [iceberg] RussellSpitzer commented on issue #5140: Add Automatic Concurrency Controls for Rewrite Datafiles Action

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on issue #5140:
URL: https://github.com/apache/iceberg/issues/5140#issuecomment-1179181076

   Glad to hear it. Let us know what you plan to do, or if you have a PR for review.




[GitHub] [iceberg] github-actions[bot] commented on issue #5140: Add Automatic Concurrency Controls for Rewrite Datafiles Action

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on issue #5140:
URL: https://github.com/apache/iceberg/issues/5140#issuecomment-1371568389

   This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in the next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the label 'not-stale', but commenting on the issue is preferred when possible.




[GitHub] [iceberg] github-actions[bot] closed issue #5140: Add Automatic Concurrency Controls for Rewrite Datafiles Action

Posted by GitBox <gi...@apache.org>.
github-actions[bot] closed issue #5140: Add Automatic Concurrency Controls for Rewrite Datafiles Action
URL: https://github.com/apache/iceberg/issues/5140




[GitHub] [iceberg] ricardopereira33 commented on issue #5140: Add Automatic Concurrency Controls for Rewrite Datafiles Action

Posted by "ricardopereira33 (via GitHub)" <gi...@apache.org>.
ricardopereira33 commented on issue #5140:
URL: https://github.com/apache/iceberg/issues/5140#issuecomment-1507037365

   Hi @Zhangg7723!
   
   Do you have any updates on this issue?

