Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2022/12/19 18:49:41 UTC

[GitHub] [iceberg] parasj opened a new issue, #6456: Performance regression on the TPC-DS refresh benchmark during merges with Iceberg 1.1.0 MoR versus Iceberg 0.14.0 MoR

parasj opened a new issue, #6456:
URL: https://github.com/apache/iceberg/issues/6456

   ### Apache Iceberg version
   
   1.1.0 (latest release)
   
   ### Query engine
   
   Spark
   
   ### Please describe the bug 🐞
   
   We are seeing substantially slower performance with Iceberg 1.1.0 MoR than with Iceberg 0.14.0 MoR on the TPC-DS refresh benchmark. Ideally, we would expect MERGE latency to be lower for MoR tables than for CoW tables.
   
   A typical TPC-DS refresh merge takes:
   * Iceberg 0.14.0 MoR: all merges take an average of 128s
   * Iceberg 1.1.0 MoR: merges 1-9 take an average of 564s, merge 10 takes 12,151s
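
As a rough sanity check on the numbers above, the slowdown factors can be worked out directly. This is a minimal sketch that uses only the three timings quoted in the list; the variable and function names are illustrative, and it assumes the 564s figure is a plain mean over merges 1-9.

```python
# Timings quoted in the issue, in seconds.
BASELINE_0_14_S = 128       # Iceberg 0.14.0 MoR, average over all merges
MERGES_1_TO_9_S = 564       # Iceberg 1.1.0 MoR, average over merges 1-9
MERGE_10_S = 12_151         # Iceberg 1.1.0 MoR, merge 10


def slowdown(new_seconds, old_seconds):
    """Return how many times slower the new timing is than the old one."""
    return new_seconds / old_seconds


typical_slowdown = slowdown(MERGES_1_TO_9_S, BASELINE_0_14_S)
worst_slowdown = slowdown(MERGE_10_S, BASELINE_0_14_S)

# Mean over all ten 1.1.0 merges, assuming 564s is the mean of merges 1-9.
mean_all_ten_s = (9 * MERGES_1_TO_9_S + MERGE_10_S) / 10

print(f"merges 1-9: {typical_slowdown:.1f}x slower")
print(f"merge 10:   {worst_slowdown:.1f}x slower")
print(f"1.1.0 mean over ten merges: {mean_all_ten_s:.1f}s")
```

Under that assumption, merges 1-9 are roughly 4.4x slower than the 0.14.0 baseline, while merge 10 alone is roughly 95x slower.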
   
   It seems like we are encountering [this issue](https://aws.amazon.com/premiumsupport/knowledge-center/emr-timeout-connection-wait/) with S3 connection pools, which leads to significant delays due to retries. Applying EMR's recommended fix avoids the exception but leads to a significant slowdown.
   
   We are using Spark 3.3 with EMR 6.9.0 across 16x i3.2xlarge workers and 1 i3.2xlarge head node. We are using the following Spark flags as recommended by EMR:
   ```
   ["spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
   "spark.sql.catalog.ice=org.apache.iceberg.spark.SparkCatalog",
   "spark.sql.catalog.spark_catalog=org.apache.iceberg.spark.SparkSessionCatalog",
   "spark.sql.catalog.spark_catalog.type=hive",
   "spark.sql.catalog.ice.io-impl=org.apache.iceberg.aws.s3.S3FileIO"]
   ```
   
   Why might 1.1.0 be so much slower?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] singhpk234 commented on issue #6456: Performance regression on the TPC-DS refresh benchmark during merges with Iceberg 1.1.0 MoR versus Iceberg 0.14.0 MoR

Posted by GitBox <gi...@apache.org>.
singhpk234 commented on issue #6456:
URL: https://github.com/apache/iceberg/issues/6456#issuecomment-1360077027

   It looks like the splits created for the left-side source are, for some reason, very skewed, and as I understand it this skew is the main reason for the slowdown. Please refer to the task metrics below: the min / 25th percentile / median / 75th percentile tasks read only KBs of data, whereas the max task reads 100s of MB, and that task is also spilling.
   
   P.S.: It would be really helpful to see what the distribution looked like prior to the 1.1 release. Can you please attach that as well?
   
   ![Screen Shot 2022-12-20 at 11 45 18 AM](https://user-images.githubusercontent.com/35593236/208753629-28e0c31a-8f14-4d1d-8f98-5cdbd4d552b2.png)
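
To compare task-level distributions between the two releases, the five-number summary from the Spark UI can be reproduced from per-task input sizes. The sketch below is illustrative only: the sizes are made-up placeholders, not measurements from this issue, and the function names are hypothetical.

```python
import statistics

KB = 1024
MB = 1024 * KB


def summarize_task_input(sizes_bytes):
    """Return min / 25th percentile / median / 75th percentile / max."""
    ordered = sorted(sizes_bytes)
    p25, median, p75 = statistics.quantiles(ordered, n=4, method="inclusive")
    return {
        "min": ordered[0],
        "p25": p25,
        "median": median,
        "p75": p75,
        "max": ordered[-1],
    }


def skew_ratio(sizes_bytes):
    """Ratio of the largest task input to the median task input."""
    summary = summarize_task_input(sizes_bytes)
    return summary["max"] / summary["median"]


# Placeholder data (an assumption for illustration): most tasks read a few KB,
# one task reads hundreds of MB.
example_sizes = [4 * KB, 8 * KB, 8 * KB, 16 * KB, 300 * MB]

summary = summarize_task_input(example_sizes)
ratio = skew_ratio(example_sizes)
print(summary)
print(f"max / median = {ratio:.0f}x")
```

A max-to-median ratio in the thousands, as in this placeholder data, means one task dominates the stage runtime regardless of how many executors are available.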
   




[GitHub] [iceberg] github-actions[bot] commented on issue #6456: Performance regression on the TPC-DS refresh benchmark during merges with Iceberg 1.1.0 MoR versus Iceberg 0.14.0 MoR

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on issue #6456:
URL: https://github.com/apache/iceberg/issues/6456#issuecomment-1596318570

   This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the label 'not-stale', but commenting on the issue is preferred when possible.




[GitHub] [iceberg] github-actions[bot] commented on issue #6456: Performance regression on the TPC-DS refresh benchmark during merges with Iceberg 1.1.0 MoR versus Iceberg 0.14.0 MoR

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on issue #6456:
URL: https://github.com/apache/iceberg/issues/6456#issuecomment-1617024341

   This issue has been closed because it has not received any activity in the last 14 days since being marked as 'stale'




[GitHub] [iceberg] github-actions[bot] closed issue #6456: Performance regression on the TPC-DS refresh benchmark during merges with Iceberg 1.1.0 MoR versus Iceberg 0.14.0 MoR

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] closed issue #6456: Performance regression on the TPC-DS refresh benchmark during merges with Iceberg 1.1.0 MoR versus Iceberg 0.14.0 MoR
URL: https://github.com/apache/iceberg/issues/6456




[GitHub] [iceberg] parasj commented on issue #6456: Performance regression on the TPC-DS refresh benchmark during merges with Iceberg 1.1.0 MoR versus Iceberg 0.14.0 MoR

Posted by GitBox <gi...@apache.org>.
parasj commented on issue #6456:
URL: https://github.com/apache/iceberg/issues/6456#issuecomment-1359913918

   Thanks for looking into this @singhpk234. The benchmark is Section 5 of the [TPC-DS spec](https://www.tpc.org/tpc_documents_current_versions/pdf/tpc-ds_v3.2.0.pdf). There is most likely no need to review the whole spec, since I can share the specific query that causes the issue (MERGE INTO, aka MergeIntoIcebergTable).
   
   If I use the default `fs.s3.maxConnections` value, I receive the `Timeout waiting for connection from pool` error. Following [EMR documentation](https://aws.amazon.com/premiumsupport/knowledge-center/emr-timeout-connection-wait/), I increase that value to at least 400 which resolves the error. However, task runtime increases substantially on the 9th or 10th MERGE INTO iteration.
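
For reference, here is a minimal sketch of an EMR configuration object that raises `fs.s3.maxConnections`, following the `emrfs-site` classification used in the linked EMR article. The value 400 comes from the comment above; the helper name is hypothetical, and the thread does not state how the setting was actually applied on these clusters.

```python
import json


def emrfs_max_connections_config(max_connections):
    """Build an EMR configuration list that sets fs.s3.maxConnections."""
    if max_connections <= 0:
        raise ValueError("max_connections must be positive")
    return [
        {
            "Classification": "emrfs-site",
            # EMR expects property values as strings.
            "Properties": {"fs.s3.maxConnections": str(max_connections)},
        }
    ]


config = emrfs_max_connections_config(400)
print(json.dumps(config, indent=2))
```

The printed JSON can be supplied as the cluster's software configuration when the cluster is created.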
   
   This is the query plan for the slow MERGE operation:
   ![screencapture-p-1q6rmnav5mkct-emrappui-prod-us-west-2-amazonaws-shs-history-application-1670734820778-0002-SQL-execution-2022-12-20-09_52_35](https://user-images.githubusercontent.com/453850/208733531-d8a4967c-a1aa-40d2-af4f-eb7966466972.png)
   
   Looking at the relevant job, we can see that a single worker is causing the issue. However, the same behavior occurs consistently across many different EMR clusters, so it is not caused by one bad worker.
   
   ![screencapture-p-1q6rmnav5mkct-emrappui-prod-us-west-2-amazonaws-shs-history-application-1670734820778-0002-stages-stage-2022-12-20-09_54_28](https://user-images.githubusercontent.com/453850/208734202-0e051612-ac44-4347-a5dc-b1b63161c420.png)
   




[GitHub] [iceberg] singhpk234 commented on issue #6456: Performance regression on the TPC-DS refresh benchmark during merges with Iceberg 1.1.0 MoR versus Iceberg 0.14.0 MoR

Posted by GitBox <gi...@apache.org>.
singhpk234 commented on issue #6456:
URL: https://github.com/apache/iceberg/issues/6456#issuecomment-1358858648

   Can you please elaborate on the `TPC-DS refresh benchmark`: which queries are used, and is there any literature on it? Apologies, I didn't find much.
   
   Also, can you please attach the before vs. after Spark plans for the merge query? Happy to take a look and debug.

