Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2022/12/20 17:58:56 UTC

[GitHub] [iceberg] parasj commented on issue #6456: Performance regression on the TPC-DS refresh benchmark during merges with Iceberg 1.1.0 MoR versus Iceberg 0.14.0 MoR

parasj commented on issue #6456:
URL: https://github.com/apache/iceberg/issues/6456#issuecomment-1359913918

   Thanks for looking into this, @singhpk234. The benchmark is Section 5 of the [TPC-DS spec](https://www.tpc.org/tpc_documents_current_versions/pdf/tpc-ds_v3.2.0.pdf). You most likely don't need to review it, since I can share the specific query that causes the issue (MERGE INTO, aka MergeIntoIcebergTable).
   
   With the default `fs.s3.maxConnections` value, I get the `Timeout waiting for connection from pool` error. Following the [EMR documentation](https://aws.amazon.com/premiumsupport/knowledge-center/emr-timeout-connection-wait/), I increased that value to at least 400, which resolves the error. However, task runtime then increases substantially on the 9th or 10th MERGE INTO iteration.
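
   For reference, a minimal sketch of how the connection pool setting could be expressed as an EMR cluster configuration. It assumes the property is applied through the `emrfs-site` classification, as the linked EMR article describes; the helper name `emrfs_max_connections_config` is hypothetical and only builds the JSON, it does not touch a cluster.

```python
import json


def emrfs_max_connections_config(max_connections: int = 400) -> list:
    """Build an EMR configuration list that sets fs.s3.maxConnections.

    Assumption: the value is applied via the `emrfs-site` classification.
    EMR configuration properties are passed as strings.
    """
    if max_connections <= 0:
        raise ValueError("max_connections must be positive")
    return [
        {
            "Classification": "emrfs-site",
            "Properties": {"fs.s3.maxConnections": str(max_connections)},
        }
    ]


if __name__ == "__main__":
    # Print JSON that could be supplied as the cluster's configurations.
    print(json.dumps(emrfs_max_connections_config(400), indent=2))
```

   The printed JSON is the kind of document that can be supplied as the configurations when creating a cluster. 400 is simply the lower bound mentioned above, not a tuned value.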
   
   This is the query plan for the slow MERGE operation:
   ![screencapture-p-1q6rmnav5mkct-emrappui-prod-us-west-2-amazonaws-shs-history-application-1670734820778-0002-SQL-execution-2022-12-20-09_52_35](https://user-images.githubusercontent.com/453850/208733531-d8a4967c-a1aa-40d2-af4f-eb7966466972.png)
   
   Looking at the relevant job, we can see that a single worker is causing the problem. However, the issue occurs consistently across many different EMR clusters, so it is not caused by one bad worker.
   
   ![screencapture-p-1q6rmnav5mkct-emrappui-prod-us-west-2-amazonaws-shs-history-application-1670734820778-0002-stages-stage-2022-12-20-09_54_28](https://user-images.githubusercontent.com/453850/208734202-0e051612-ac44-4347-a5dc-b1b63161c420.png)
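
   One way to quantify "a single slow worker" from a stage's task durations is to flag tasks that run far longer than the median. This is only a rough sketch: the function name `find_stragglers`, the threshold factor, and the sample durations are all made up for illustration and are not measurements from this benchmark.

```python
from statistics import median


def find_stragglers(durations_s, factor=5.0):
    """Return (task_index, duration) pairs slower than factor * median.

    `durations_s` is a list of per-task durations in seconds, e.g. copied
    from the stage page of the Spark UI. `factor` is an arbitrary cutoff.
    """
    if not durations_s:
        return []
    cutoff = factor * median(durations_s)
    return [(i, d) for i, d in enumerate(durations_s) if d > cutoff]


if __name__ == "__main__":
    # Hypothetical durations: five similar tasks and one long-running task.
    example = [12.0, 11.5, 13.2, 12.8, 640.0, 12.1]
    print(find_stragglers(example))
```

   If the same task index or partition shows up as the outlier on every cluster, that points toward the data or the plan rather than the hardware.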
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

