Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2022/07/11 08:54:04 UTC

[GitHub] [iceberg] shidayang opened a new issue, #5245: Optimize the performance of MOR on Trino

shidayang opened a new issue, #5245:
URL: https://github.com/apache/iceberg/issues/5245

   I ran chbenchmark against Iceberg on Trino and found that merge-on-read (MOR) performance is very poor when there are many delete files. The data scale is 10 warehouses. The average query duration is under 10 seconds when there are no delete files, but after I added some delete files to every table, some queries took over an hour.
   
   
   1. #5195 Trino calls DeleteFilter#filter for every page, and every call to DeleteFilter#filter initializes the delete files again.
   2. #5244 #5242 We found that the cost of creating StructLikeWrapper and InternalRecordWrapper is high.
   Here is the flame graph:
   <img width="1410" alt="image" src="https://user-images.githubusercontent.com/26699250/178226456-9e953b2b-5154-4693-9b74-2ec9f277fd97.png">
   
   
   
   Query performance improved after we made these optimizations. For example, the query "select count(*) from stock" took 8 minutes before the optimizations and only 20 seconds after.
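
To illustrate point 1, here is a minimal sketch of loading the deletes once and reusing them across pages. It is not the actual Iceberg or Trino code: the class `CachedDeleteFilterSketch`, the `deleteLoader` supplier, and the use of row positions as plain `Long` values are all hypothetical stand-ins for DeleteFilter and the delete-file readers.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Supplier;

/** Hypothetical sketch only; this is NOT Iceberg's DeleteFilter. */
public class CachedDeleteFilterSketch {
    // Assumption: stands in for the expensive step of reading the delete files.
    private final Supplier<Set<Long>> deleteLoader;
    // Loaded on first use, then reused by every later filter() call.
    private Set<Long> deletedPositions;
    private int loadCount = 0;

    public CachedDeleteFilterSketch(Supplier<Set<Long>> deleteLoader) {
        this.deleteLoader = deleteLoader;
    }

    private Set<Long> deletedPositions() {
        if (deletedPositions == null) {
            deletedPositions = deleteLoader.get();
            loadCount++;
        }
        return deletedPositions;
    }

    /** Filters one page of row positions, keeping the rows that are not deleted. */
    public List<Long> filter(List<Long> pagePositions) {
        Set<Long> deleted = deletedPositions();
        List<Long> live = new ArrayList<>();
        for (Long pos : pagePositions) {
            if (!deleted.contains(pos)) {
                live.add(pos);
            }
        }
        return live;
    }

    /** Number of times the delete set was actually loaded. */
    public int loadCount() {
        return loadCount;
    }

    public static void main(String[] args) {
        CachedDeleteFilterSketch filter = new CachedDeleteFilterSketch(
                () -> new HashSet<>(Arrays.asList(1L, 4L)));
        // Two pages are filtered, but the loader runs only once.
        System.out.println(filter.filter(Arrays.asList(0L, 1L, 2L)));
        System.out.println(filter.filter(Arrays.asList(3L, 4L, 5L)));
        System.out.println("loads: " + filter.loadCount());
    }
}
```

The trade-off in this sketch is that the loaded delete set stays in memory for the lifetime of the filter object, so it only pays off when the same filter is reused across many pages.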
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue closed issue #5245: Optimize the performance of MOR on Trino

Posted by GitBox <gi...@apache.org>.
rdblue closed issue #5245: Optimize the performance of MOR on Trino
URL: https://github.com/apache/iceberg/issues/5245




[GitHub] [iceberg] findinpath commented on issue #5245: Optimize the performance of MOR on Trino

Posted by GitBox <gi...@apache.org>.
findinpath commented on issue #5245:
URL: https://github.com/apache/iceberg/issues/5245#issuecomment-1180161027

   Please see https://github.com/trinodb/trino/pull/13112 which is very likely related to your problem.




[GitHub] [iceberg] shidayang commented on issue #5245: Optimize the performance of MOR on Trino

Posted by GitBox <gi...@apache.org>.
shidayang commented on issue #5245:
URL: https://github.com/apache/iceberg/issues/5245#issuecomment-1180147464

   @jackye1995 @chenjunjiedada @rdblue cc




[GitHub] [iceberg] shidayang commented on issue #5245: Optimize the performance of MOR on Trino

Posted by GitBox <gi...@apache.org>.
shidayang commented on issue #5245:
URL: https://github.com/apache/iceberg/issues/5245#issuecomment-1180303897

   @findinpath I think this problem is not specific to Trino. Iceberg core should be responsible for loading the delete files only once.




[GitHub] [iceberg] lhofhansl commented on issue #5245: Optimize the performance of MOR on Trino

Posted by GitBox <gi...@apache.org>.
lhofhansl commented on issue #5245:
URL: https://github.com/apache/iceberg/issues/5245#issuecomment-1180501391

   I think Trino is special here, as it operates on a page at a time. Spark does not have this problem. And, as stated in the PR, unconditionally caching the filters could be detrimental on Spark (increased memory usage).




[GitHub] [iceberg] rdblue commented on issue #5245: Optimize the performance of MOR on Trino

Posted by GitBox <gi...@apache.org>.
rdblue commented on issue #5245:
URL: https://github.com/apache/iceberg/issues/5245#issuecomment-1182011357

   Thanks for working on this, @shidayang! The PRs are all resolved so I'm going to close this now.




[GitHub] [iceberg] shidayang commented on issue #5245: Optimize the performance of MOR on Trino

Posted by GitBox <gi...@apache.org>.
shidayang commented on issue #5245:
URL: https://github.com/apache/iceberg/issues/5245#issuecomment-1181376110

   @flyrain The delete files came from running TPC-C for about 5 minutes on the 10-warehouse chbenchmark dataset.




[GitHub] [iceberg] flyrain commented on issue #5245: Optimize the performance of MOR on Trino

Posted by GitBox <gi...@apache.org>.
flyrain commented on issue #5245:
URL: https://github.com/apache/iceberg/issues/5245#issuecomment-1180706273

   Hi @shidayang, how many delete files were there in your test?
   
   I did benchmark multiple delete files; you can see the results here: https://github.com/apache/iceberg/pull/3287#issuecomment-960433304.
   ```
   with 25% rows are deleted and distribute these deletes to 1, 2, 5, 10 delete files
   ```
   The perf doesn’t degrade much with more delete files. Please be aware that the non-vectorized read uses the path that does not cache the filter. I am guessing Trino could be different from Spark in terms of read pattern.

