You are viewing a plain text version of this content. The canonical link for it is here.

Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2022/10/17 09:42:02 UTC

[GitHub] [iceberg] vshel opened a new issue, #5997: Iceberg table maintenance/compaction within AWS

vshel opened a new issue, #5997:
URL: https://github.com/apache/iceberg/issues/5997

   ### Query engine
   
   Spark3
   
   ### Question
   
   Hello, I have a ~6TB iceberg table with ~10,000 partitions within S3 and I am using Glue catalog, what is the correct way of running compaction on such a table?
   
   From documentation: https://iceberg.apache.org/docs/latest/maintenance/ I can run:
   ```
   SparkActions
       .get()
       .rewriteDataFiles(table)
       .filter(Expressions.equal("date", "2020-08-18"))
       .option("target-file-size-bytes", Long.toString(500 * 1024 * 1024)) // 500 MB
       .execute();
   ```
   This is going to execute on a single aws instance, how do I scale this to many instances for the compaction process to run in parallel on many partitions at once, is there an out of the box support for this?
   Additionally, the table is constantly updated, am I supposed to pause all updates until compaction finishes?
   
   Thank you.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] ismailsimsek commented on issue #5997: Iceberg table maintenance/compaction within AWS

Posted by GitBox <gi...@apache.org>.

ismailsimsek commented on issue #5997:
URL: https://github.com/apache/iceberg/issues/5997#issuecomment-1289283783

   @vshel any reason you are not [using Athena to do compaction](https://docs.aws.amazon.com/athena/latest/ug/querying-iceberg-data-optimization.html)? 
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] vshel commented on issue #5997: Iceberg table maintenance/compaction within AWS

Posted by GitBox <gi...@apache.org>.

vshel commented on issue #5997:
URL: https://github.com/apache/iceberg/issues/5997#issuecomment-1290277795

   @ismailsimsek I tried running OPTIMIZE with athena on a partition with ~25 000 files totalling 2.6GB (so pretty small dataset), it failed with an internal error after 8 minutes, I created a support ticket for AWS to investigate, but it's not looking promising now, considering whole table dataset is 6TB.
   
   Additionally, after experimenting, Athena read performance is horrible unless I do a compaction, I tested a small 25MB dataset, it takes athena 50 seconds to get 100 000 records out of this iceberg 25MB table or to do a COUNT(*), and after I do compaction it takes 8 seconds for athena to do retrival and count operations.
   All files in the dataset have a corresponding delete, because I am doing upserts of streaming data. So, it looks like upserting (delete + write) slows down athena read performance, compaction fixes it as it removes deletes. I tested performances without deletes by doing just writes during streaming of this 25MB dataset and read performance was 8 seconds even without running compaction.
   
   So, Iceberg athena read performace is looking to be very slow, considering non-iceberg athena tables that span 60GB of data can run COUNT(*) in just 4 seconds, compared to Iceberg's 8 seconds for 25MB.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] github-actions[bot] closed issue #5997: Iceberg table maintenance/compaction within AWS

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.

github-actions[bot] closed issue #5997: Iceberg table maintenance/compaction within AWS
URL: https://github.com/apache/iceberg/issues/5997


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] Samrose-Ahmed commented on issue #5997: Iceberg table maintenance/compaction within AWS

Posted by GitBox <gi...@apache.org>.

Samrose-Ahmed commented on issue #5997:
URL: https://github.com/apache/iceberg/issues/5997#issuecomment-1312796748

   I would recommend running a Spark job. An AWS Glue job is the easiest to get started but considering you're running this once, it'll likely be cheaper to run on EMR (serverless or provisioned).  Also, Spark/EMR doesn't run on a single instance, it parallelizes across nodes.
   
   In the future, since you're doing streaming appends/inserts I would recommend doing regular table maintenance, so you don't end up in this situation. You can check this blog post : [Automated Iceberg table maintenance on AWS](https://www.matano.dev/blog/2022/11/04/automated-iceberg-table-maintenance) for how we do it in [Matano](https://github.com/matanolabs/matano), but its fairly simple you need to regularly run compaction.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] github-actions[bot] commented on issue #5997: Iceberg table maintenance/compaction within AWS

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.

github-actions[bot] commented on issue #5997:
URL: https://github.com/apache/iceberg/issues/5997#issuecomment-1626379341

   This issue has been closed because it has not received any activity in the last 14 days since being marked as 'stale'


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] github-actions[bot] commented on issue #5997: Iceberg table maintenance/compaction within AWS

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.

github-actions[bot] commented on issue #5997:
URL: https://github.com/apache/iceberg/issues/5997#issuecomment-1603464485

   This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the label 'not-stale', but commenting on the issue is preferred when possible.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] maswin commented on issue #5997: Iceberg table maintenance/compaction within AWS

Posted by GitBox <gi...@apache.org>.

maswin commented on issue #5997:
URL: https://github.com/apache/iceberg/issues/5997#issuecomment-1343898096

   You have to set this setting to a higher value (default is 1) to run re-writes in multiple instances in parallel:
   
   `max-concurrent-file-group-rewrites`
   
   [https://iceberg.apache.org/javadoc/1.1.0/org/apache/iceberg/actions/RewriteDataFiles.html#MAX_CONCURRENT_FILE_GROUP_REWRITES](https://iceberg.apache.org/javadoc/1.1.0/org/apache/iceberg/actions/RewriteDataFiles.html#MAX_CONCURRENT_FILE_GROUP_REWRITES)
   
   Not sure why this is not documented. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org