Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2020/08/07 16:13:43 UTC

[GitHub] [iceberg] rdblue commented on issue #1286: Slow parallel operations fail to commit

rdblue commented on issue #1286:
URL: https://github.com/apache/iceberg/issues/1286#issuecomment-670594342

I want to clarify here that the amount of data rewritten during a compaction should not matter very much. In many cases, the initial commit will fail because the operation can take a long time. What matters is how long a retry takes, and retries are metadata-only operations.

For operations like compaction, Iceberg needs to rewrite existing manifests to remove files that were compacted. Any filtered and rewritten manifest is cached so that a retry never needs to rewrite the same manifest file twice. Iceberg will also use manifest file metadata to avoid even scanning manifests that cannot contain the files it is replacing. In most cases with a Spark streaming job appending to a table, we would expect the new manifests to not require scanning and the old manifests to be unchanged by the appends. Then all of the initial manifest rewrite work can be reused in a retry, and the retry should proceed quickly enough to commit within the 10s interval. (The minimum amount of work to commit is writing a manifest list and a metadata JSON file, which should take well under 10s, even with S3.)
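
To make the reuse across retries concrete, here is a minimal Python sketch of the idea, not Iceberg's actual implementation. The `Manifest` and `CompactionCommit` names are hypothetical, and the check on each manifest's file set stands in for the manifest file metadata that lets Iceberg skip manifests without scanning them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Manifest:
    """Hypothetical stand-in for a manifest file: a path plus the data files it lists."""
    path: str
    data_files: frozenset


class CompactionCommit:
    """Toy model of a commit that removes compacted files and may be retried."""

    def __init__(self, files_to_remove):
        self.files_to_remove = frozenset(files_to_remove)
        self.rewritten = {}     # cache: original manifest path -> filtered manifest
        self.rewrite_count = 0  # number of manifests actually rewritten so far

    def _filter(self, manifest):
        # Assumption: this check models the metadata-based skip; manifests that
        # cannot contain any replaced file are passed through untouched.
        if manifest.data_files.isdisjoint(self.files_to_remove):
            return manifest
        # Reuse the result of an earlier attempt instead of rewriting again.
        if manifest.path not in self.rewritten:
            self.rewrite_count += 1
            self.rewritten[manifest.path] = Manifest(
                path=manifest.path + ".rewritten",
                data_files=manifest.data_files - self.files_to_remove,
            )
        return self.rewritten[manifest.path]

    def apply(self, current_manifests):
        """Build the manifest list for one attempt against the current table state."""
        return [self._filter(m) for m in current_manifests]


old = Manifest("m1.avro", frozenset({"a.parquet", "b.parquet", "x.parquet"}))
commit = CompactionCommit({"a.parquet", "b.parquet"})
first_attempt = commit.apply([old])

# A streaming append lands before the commit succeeds, so the commit is retried.
appended = Manifest("m2.avro", frozenset({"c.parquet"}))
retry = commit.apply([old, appended])
```

In this sketch the retry rewrites nothing: the old manifest comes from the cache and the appended manifest is skipped, so `commit.rewrite_count` stays at 1 across both attempts.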

I think what we need to do to debug this case is to find out what work the retries are doing. In our environment, we can log file system operations, so we can see what files are being created in each attempt and how long these are taking. Can someone try to reproduce the issue and attach the log from the compaction so we can see what is happening that causes the retry to take so long?
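
For anyone without file system logging available, the kind of per-attempt record described above could be approximated with something like the following sketch. It is purely illustrative: `AttemptLog` and its methods are hypothetical helpers, not part of Iceberg or any logging library, and the caller is assumed to wrap each file write manually.

```python
import time
from contextlib import contextmanager


class AttemptLog:
    """Hypothetical helper that records which files each commit attempt creates."""

    def __init__(self):
        self.records = []  # tuples of (attempt number, path, elapsed seconds)

    @contextmanager
    def creating(self, attempt, path):
        # Time the body of the `with` block, which is assumed to write `path`.
        start = time.monotonic()
        try:
            yield
        finally:
            self.records.append((attempt, path, time.monotonic() - start))

    def summary(self):
        """Return {attempt: (files created, total seconds)}."""
        totals = {}
        for attempt, _path, seconds in self.records:
            count, total = totals.get(attempt, (0, 0.0))
            totals[attempt] = (count + 1, total + seconds)
        return totals
```

Wrapping each write as `with log.creating(attempt, path): ...` and printing `log.summary()` after the job would show whether a retry is creating more than a manifest list and a metadata file.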
