Posted to issues@iceberg.apache.org by "mgorsk1 (via GitHub)" <gi...@apache.org> on 2023/02/13 07:57:13 UTC

[GitHub] [iceberg] mgorsk1 opened a new issue, #6821: Iceberg rewrite files parallel execution

mgorsk1 opened a new issue, #6821:
URL: https://github.com/apache/iceberg/issues/6821

   ### Feature Request / Improvement
   
   We are using the `SparkActions.get().rewriteDataFiles` action to compact files in partitioned tables. Since the table is partitioned and files are rewritten per partition, it should be possible to parallelize this task, but that does not seem to be happening. From what I've observed, the action is handled sequentially by a single executor (my dynamicAllocation settings allow up to 10) and takes a very long time: 6300 files across 100 partitions are compacted in nearly 30 minutes (27 to be precise).
   
   This issue proposes introducing parallelization for this action to speed the process up, as already happens, for example, with the action that removes orphan files.
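
   As a rough illustration of the potential gain, the numbers above can be plugged into a back-of-the-envelope estimate. This is only a sketch under loud assumptions: each partition is treated as one equally expensive unit of work, and units run in "waves" of a fixed concurrency. The function name and the concurrency values are hypothetical, not part of Iceberg.

```python
import math


def estimate_wall_minutes(total_minutes, groups, concurrency):
    """Idealized wall time when `groups` equal-cost units run `concurrency` at a time.

    Assumption: every group costs total_minutes / groups and there is no
    scheduling or commit overhead. Real compaction jobs will deviate from this.
    """
    waves = math.ceil(groups / concurrency)  # number of sequential rounds
    return total_minutes * waves / groups


# Figures reported in this issue: 6300 files, 100 partitions, 27 minutes.
files_per_partition = 6300 // 100

# One group at a time reproduces the observed duration.
sequential_minutes = estimate_wall_minutes(27, 100, 1)

# Ten groups at a time (matching the 10 executors allowed by dynamicAllocation).
parallel_minutes = estimate_wall_minutes(27, 100, 10)

print(files_per_partition, sequential_minutes, parallel_minutes)
```

   Under these assumptions, running ten units at a time would bring the job from 27 minutes to roughly 2.7 minutes; actual results depend on file sizes, cluster resources, and commit overhead.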
   
   ### Query engine
   
   None


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] mgorsk1 commented on issue #6821: Iceberg rewrite files parallel execution

Posted by "mgorsk1 (via GitHub)" <gi...@apache.org>.
mgorsk1 commented on issue #6821:
URL: https://github.com/apache/iceberg/issues/6821#issuecomment-1428071909

   Thanks @RussellSpitzer, indeed I must've missed it. I used this option, and after increasing it the number of executors increased. One possible improvement I see is that currently:
   - when a table has thousands of very small files across a few partitions (let's assume each file is < 10000 bytes)
   - and MAX_FILE_GROUP_SIZE_BYTES_DEFAULT is set to > 10000 bytes
   - Spark will spawn many executors (up to MAX_CONCURRENT_FILE_GROUP_REWRITES) but no compaction will be done
   
   so some job that runs before the executors are spawned and takes this into account would be useful, wdyt?
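
   For readers looking for the setting discussed here, a minimal sketch of how the concurrency option might be passed through the Spark SQL procedure is shown below. It assumes the `rewrite_data_files` procedure is available in your Iceberg version; the catalog name `my_catalog`, the table `db.events`, and the helper `build_rewrite_call` are hypothetical placeholders. The helper only builds the statement text and does not escape quotes.

```python
def build_rewrite_call(catalog, table, options):
    """Build a CALL statement for Iceberg's rewrite_data_files procedure.

    Hypothetical helper for illustration: it formats the options as a
    Spark SQL map(...) literal and performs no quoting or validation.
    """
    opts = ", ".join(f"'{key}', '{value}'" for key, value in options.items())
    return (
        f"CALL {catalog}.system.rewrite_data_files("
        f"table => '{table}', options => map({opts}))"
    )


# Allow up to 10 file groups to be rewritten at the same time.
stmt = build_rewrite_call(
    "my_catalog",  # assumed catalog name
    "db.events",   # assumed table identifier
    {"max-concurrent-file-group-rewrites": "10"},
)
print(stmt)
```

   The resulting string could then be handed to `spark.sql(stmt)` in a session where the Iceberg catalog is configured.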


[GitHub] [iceberg] github-actions[bot] commented on issue #6821: Iceberg rewrite files parallel execution

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on issue #6821:
URL: https://github.com/apache/iceberg/issues/6821#issuecomment-1676155298

   This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in the next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the label 'not-stale', but commenting on the issue is preferred when possible.


[GitHub] [iceberg] RussellSpitzer commented on issue #6821: Iceberg rewrite files parallel execution

Posted by "RussellSpitzer (via GitHub)" <gi...@apache.org>.
RussellSpitzer commented on issue #6821:
URL: https://github.com/apache/iceberg/issues/6821#issuecomment-1427956046

   This feature already exists.


[GitHub] [iceberg] RussellSpitzer commented on issue #6821: Iceberg rewrite files parallel execution

Posted by "RussellSpitzer (via GitHub)" <gi...@apache.org>.
RussellSpitzer commented on issue #6821:
URL: https://github.com/apache/iceberg/issues/6821#issuecomment-1428074307

   Sorry I don't know what you mean


Re: [I] Iceberg rewrite files parallel execution [iceberg]

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on issue #6821:
URL: https://github.com/apache/iceberg/issues/6821#issuecomment-1880238820

   This issue has been closed because it has not received any activity in the last 14 days since being marked as 'stale'


Re: [I] Iceberg rewrite files parallel execution [iceberg]

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] closed issue #6821: Iceberg rewrite files parallel execution
URL: https://github.com/apache/iceberg/issues/6821

