Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2022/08/08 01:35:45 UTC

[GitHub] [iceberg] kbendick commented on issue #5453: Issue after migrating to Spark 3.3.0 and Iceberg 14.0

kbendick commented on issue #5453:
URL: https://github.com/apache/iceberg/issues/5453#issuecomment-1207554438

The stack trace reads like there's an S3 request timeout.

Can you provide the following information?

1. The exact Iceberg runtime jar dependency used (please make sure that you're using the Spark 3.3 Iceberg bundle).
2. The catalog you are using (e.g. Hadoop, Hive, DynamoDB, etc.).
3. The Spark configuration used, including configuration settings for initializing the Iceberg catalog plus any non-default Spark configs applied.
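If it helps to gather items 1 to 3, here is a rough sketch of a filter that keeps only the settings relevant to this kind of report. The helper name, the prefix list, and the sample values (including the catalog name `my_catalog`) are my own assumptions for illustration, not anything taken from your setup. Adjust the prefixes to whatever you actually configure.

```python
# Hypothetical helper: picks out the Spark settings worth pasting into the issue.
# The prefix list is an assumption; extend it to match your own deployment.
RELEVANT_PREFIXES = (
    "spark.jars",                    # matches spark.jars and spark.jars.packages
    "spark.sql.extensions",
    "spark.sql.catalog.",            # Iceberg catalog definitions
    "spark.sql.adaptive.",
    "spark.sql.shuffle.partitions",
    "spark.hadoop.fs.s3a.",
)

# Keys containing any of these substrings have their values masked before sharing.
SENSITIVE_MARKERS = ("secret", "password", "token", "access.key")


def relevant_settings(conf_pairs):
    """Return the relevant (key, value) pairs, sorted by key, with secrets masked."""
    picked = []
    for key, value in conf_pairs:
        if not key.startswith(RELEVANT_PREFIXES):
            continue
        if any(marker in key.lower() for marker in SENSITIVE_MARKERS):
            value = "<redacted>"
        picked.append((key, value))
    return sorted(picked)


# Made-up sample input, only to show the shape of the data.
sample = [
    ("spark.app.name", "nightly-load"),
    ("spark.jars.packages", "org.apache.iceberg:iceberg-spark-runtime-3.3_2.12:0.14.0"),
    ("spark.sql.catalog.my_catalog", "org.apache.iceberg.spark.SparkCatalog"),
    ("spark.sql.catalog.my_catalog.type", "hive"),
    ("spark.hadoop.fs.s3a.secret.key", "not-a-real-secret"),
]

for key, value in relevant_settings(sample):
    print(f"{key}={value}")
```

In a live PySpark session, `spark.sparkContext.getConf().getAll()` returns pairs in this shape, so its result can be passed straight to the helper. The masking step is there so that credentials don't end up in a public issue by accident.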

Have you confirmed that this same code over _the same input data_ works with your previous Spark 3.2 settings? Not just previous runs, but this _same_ input data? Given that there seems to be an S3 upload timeout, it would be really helpful to run the previous setup and the new setup over the exact same dataset so the two can be compared directly. Otherwise it's hard to be sure that the problem isn't simply a difference in your input data (e.g. the input dataset is much larger than normal, or it's much more skewed on the columns being sorted on and thus takes much more time to sort).

It would also be _very_ helpful to provide the Query plan from your old setup and from the new setup (either the output of `EXPLAIN EXTENDED` or a screenshot of the whole DAG for the Query from the SQL tab for both the old Spark 3.2 with Iceberg 0.13 and for the new Spark 3.3 with Iceberg 0.14).

Assuming that you're using the correct Spark iceberg runtime JAR for Spark 3.3.0, I'm wondering if maybe Spark's adaptive execution is lowering the parallelism of the write stage, which is then increasing the size of the data upload to S3 and leading to the timeout.
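One way to test that hypothesis would be to rerun the same job with adaptive execution, or just its partition coalescing, turned off and see whether the timeout goes away. Below is a minimal sketch that turns such overrides into `spark-submit` flags. The two config keys are standard Spark SQL settings; the dictionary names and the helper are mine and purely illustrative.

```python
# Overrides to try one at a time when comparing runs (names are hypothetical).
AQE_OFF = {
    "spark.sql.adaptive.enabled": "false",
}
AQE_NO_COALESCE = {
    "spark.sql.adaptive.coalescePartitions.enabled": "false",
}


def to_submit_flags(overrides):
    """Render a dict of Spark settings as a flat list of spark-submit arguments."""
    flags = []
    for key, value in sorted(overrides.items()):
        flags += ["--conf", f"{key}={value}"]
    return flags


print(" ".join(to_submit_flags(AQE_NO_COALESCE)))
# prints: --conf spark.sql.adaptive.coalescePartitions.enabled=false
```

If the write succeeds with coalescing disabled, that would point toward the write stage running with fewer, larger tasks under the new setup. It wouldn't prove it, though, so comparing the query plans is still worthwhile.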

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

---------------------------------------------------------------------
For additional commands, e-mail: issues-help@iceberg.apache.org