Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2020/08/03 12:39:45 UTC

[GitHub] [iceberg] davseitsev opened a new issue #1286: Slow parallel operations fail to commit

davseitsev opened a new issue #1286:
URL: https://github.com/apache/iceberg/issues/1286


   I have a situation where two parallel processes modify a single table.
   
   The first process is a Spark Structured Streaming query which reads from Kafka and continuously appends to the table with a 10s trigger period.
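   
   As a minimal sketch (the topic, servers, and paths below are hypothetical placeholders, not my actual job), the streaming side looks roughly like this:
   ```
   import org.apache.spark.sql.Dataset;
   import org.apache.spark.sql.Row;
   import org.apache.spark.sql.streaming.Trigger;
   
   // spark is an existing SparkSession; read from Kafka...
   Dataset<Row> events = spark.readStream()
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker:9092")  // hypothetical
       .option("subscribe", "events")                     // hypothetical topic
       .load()
       .selectExpr("CAST(value AS STRING) AS payload");   // real job parses further
   
   // ...and append to the Iceberg table every 10 seconds.
   events.writeStream()
       .format("iceberg")
       .outputMode("append")
       .trigger(Trigger.ProcessingTime("10 seconds"))
       .option("path", "db_path.the_table")
       .option("checkpointLocation", "/tmp/checkpoints/the_table")  // hypothetical
       .start();
   ```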
   
   The other process is continuous compaction, which works in the following way:
   1. List files in the latest partition and take **N** GB of data. Smaller files get higher priority.
   2. Run a Spark job which reads the collected files and produces one big file in the partition path.
   3. Atomically replace the small files with the big one like this:
   ```
   table.newRewrite()                             // RewriteFiles operation
     .rewriteFiles(compactingFiles, resultFiles)  // swap the small files for the compacted one
     .commit()
   ```
   
   This approach stopped working once the partition grew and the `RewriteFiles` operation became slow. When `SnapshotProducer` tries to commit the rewrite operation, it fails with this exception:
   
   > Base metadata location 'db_path/the_table/metadata/01334-c9c69f57-eb55-4e34-bd5e-beeab380c10c.metadata.json' is not same as the current table metadata location 'db_path/the_table/metadata/metadata/01335-1070b870-5ea3-4dc6-9708-493b724ee8f1.metadata.json' for db_path.the_table
   
   Then it fetches the latest state of the table and tries the commit again. But the streaming process has already appended to the table, so the commit fails again. After 3 failed attempts the compaction job fails.
   
   In my opinion the problem is that `HiveTableOperations` releases the table lock between successive attempts. This allows the concurrent streaming process to append new data between the attempts, which is what causes the conflict. Keeping the table locked across commit attempts would work.
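   
   To illustrate the race, here is a rough pseudocode sketch of the commit/retry loop as I understand it (`lockTable`, `unlockTable`, `refresh`, `applyRewrite`, and `commitIfBaseUnchanged` are illustrative stand-ins, not the actual `HiveTableOperations` internals; `TableMetadata` is org.apache.iceberg.TableMetadata):
   ```
   for (int attempt = 0; attempt < maxAttempts; attempt++) {
     lockTable();                                   // metastore lock taken per attempt
     try {
       TableMetadata base = refresh();              // re-read the current metadata
       TableMetadata updated = applyRewrite(base);  // slow when the partition is big
       if (commitIfBaseUnchanged(base, updated)) {
         return;                                    // success
       }
     } finally {
       unlockTable();  // lock released between attempts: the streaming job can
                       // commit an append here and invalidate the next attempt
     }
   }
   throw new CommitFailedException("Ran out of retries");
   ```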
   




[GitHub] [iceberg] davseitsev commented on issue #1286: Slow parallel operations fail to commit

Posted by GitBox <gi...@apache.org>.
davseitsev commented on issue #1286:
URL: https://github.com/apache/iceberg/issues/1286#issuecomment-668071699


   > You mean the Spark streaming job will commit to the Iceberg table every 10 seconds? If so, then it seems we commit to Iceberg too frequently, and it's easy to cause transaction conflicts.
   
   Because of the specifics of Spark Structured Streaming in micro-batch mode, infrequent triggering (say, once every 5 minutes) would cause spikes in resource usage. I'd like to keep small batches if possible.
   
   > Or just write the data files without committing the transaction, and commit the transaction after some longer interval.
   
   I'm not sure how to implement this with Spark Streaming. I would appreciate it if you could point me to the piece of documentation or source code where I can read about it.




[GitHub] [iceberg] HeartSaVioR commented on issue #1286: Slow parallel operations fail to commit

Posted by GitBox <gi...@apache.org>.
HeartSaVioR commented on issue #1286:
URL: https://github.com/apache/iceberg/issues/1286#issuecomment-672279713


   https://drive.google.com/file/d/1BEgSY2xbYMgmQBgL7SYCLe2pW-gkOPq3/view?usp=drivesdk
   
   Sorry, please retry with this link for the log.






[GitHub] [iceberg] HeartSaVioR edited a comment on issue #1286: Slow parallel operations fail to commit

Posted by GitBox <gi...@apache.org>.
HeartSaVioR edited a comment on issue #1286:
URL: https://github.com/apache/iceberg/issues/1286#issuecomment-671884656


   So I added debug log messages - I don't fully understand the details, so I roughly added timing checks at entry points, as well as log messages on opening a file for read or write.
   
   https://github.com/apache/iceberg/commit/5a0a131a76c98d417c8d6cb70947219010171420
   
   Here's the log I got from making the rewrite data action retry and finally fail.
   
   https://drive.google.com/file/d/1BEgSY2xbYMgmQBgL7SYCLe2pW-gkOPq3/view?usp=sharing
   
   So you're right that manifest files are not re-read on every retry - there's a significant difference in elapsed time between the first attempt and further attempts. But some of the output manifest files (looks to be 100+) from the first attempt still seem to be read in further attempts (no cache seems to be in play here), which still makes further attempts take around 20 seconds. (Note that this is on a local filesystem; I'd expect higher latency in practice.)
   
   Btw, my suggestion was focused on the characteristics of "fast append". If I understand correctly, a fast append only adds new manifest and data files and doesn't touch existing manifest and data files, which is how its commit phase can be done in hundreds of milliseconds.
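   
   For reference, the operation I mean (`table` and `dataFile` assumed to already exist):
   ```
   // Fast append: writes a new manifest for the added files and adds it to the
   // manifest list without rewriting or merging existing manifests.
   table.newFastAppend()
       .appendFile(dataFile)
       .commit();
   
   // Regular append: may merge existing manifests as part of the commit.
   table.newAppend()
       .appendFile(dataFile)
       .commit();
   ```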
   
   The situation we encountered is that the rewrite data action conflicted with other commits and needed to retry, and all the commits in the meantime were fast appends. Given that there shouldn't be many commits between retries, would it cost much more to look at which manifest files those commits added, when they were all fast appends? Based on the characteristics of "fast append", they shouldn't conflict with the changes the rewrite data action has made, so adding them to the manifest list seems to be OK. (I'm not familiar with the file spec, so I may be imagining something unrealistic - please correct me if I'm wrong.)




[GitHub] [iceberg] rdblue commented on issue #1286: Slow parallel operations fail to commit

Posted by GitBox <gi...@apache.org>.
rdblue commented on issue #1286:
URL: https://github.com/apache/iceberg/issues/1286#issuecomment-670972927


   > looks like applying the changes onto the current base snapshot is executed on every retry
   
   Yes, but work that has already been done is reused. If a table has 2 manifests, A and B, when a rewrite starts and another commit adds manifest C, then the retry won't need to filter A and B a second time. It will only filter C, and it will use metadata to determine if C needs to be rewritten. By reusing this work, reattempts should only require writing the manifest list and root metadata file.
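   
   Roughly, as a sketch (this is not the actual code; `filterManifest` is a stand-in for the real filtering logic):
   ```
   import java.util.ArrayList;
   import java.util.HashMap;
   import java.util.List;
   import java.util.Map;
   import org.apache.iceberg.ManifestFile;
   
   // Filtered manifests are cached by path, so a retry reuses the first
   // attempt's work: A and B come back from the cache, only C is processed.
   Map<String, ManifestFile> filteredCache = new HashMap<>();
   
   List<ManifestFile> filterManifests(List<ManifestFile> current) {
     List<ManifestFile> result = new ArrayList<>();
     for (ManifestFile manifest : current) {
       result.add(filteredCache.computeIfAbsent(manifest.path(), p -> filterManifest(manifest)));
     }
     return result;
   }
   ```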
   
   > Probably the cheapest approach would be to allow reordering of snapshots
   
   Reordering snapshots is not allowed because it changes history.
   
   A reattempt should be a cheap operation; we just need the logs to know why it isn't in this case.
   
   > Alternatively, we can list the manifests written by the snapshots after the base snapshot . . .
   
   I don't quite understand what you're suggesting here, but it sounds very similar to what is already done. Most of the time, an append results in a new manifest being added, so the situation I described above is how the commit is reattempted.
   
   A couple things can go wrong:
   1. Manifests are compacted: A, B, and C might be rewritten into D. If that happens, then D must be scanned and rewritten because A and B had to be. That takes time, which could result in the retry failing. The next retry would probably get manifests D and E, and the original situation (D is already done, E doesn't match) would apply for a quick retry.
   2. New manifests must be scanned: If the metadata for C shows that it might contain files that were rewritten, then the commit must scan C to check for them. The metadata that we use is the range of partitions in a manifest. So if you're partitioning by hour, for example, then a compaction that rewrites data in the hour currently being written will need to scan all new manifests each retry.
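   
   As a sketch of that check (`ManifestEvaluator` is the mechanism; treat the exact usage as approximate):
   ```
   import org.apache.iceberg.ManifestFile;
   import org.apache.iceberg.PartitionSpec;
   import org.apache.iceberg.expressions.Expression;
   import org.apache.iceberg.expressions.ManifestEvaluator;
   
   // A manifest must be scanned only if its partition summaries overlap the
   // partitions of the files being replaced. With hourly partitioning, a
   // compaction of the current hour overlaps every newly appended manifest.
   boolean mustScan(ManifestFile manifest, Expression rewrittenPartitions, PartitionSpec spec) {
     return ManifestEvaluator.forRowFilter(rewrittenPartitions, spec, true).eval(manifest);
   }
   ```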
   
   The most likely situation is the second one, but logs that show what files are getting created will tell us what is happening. I think we should find out what is taking so long in the reattempt in order to plan how to fix it.






[GitHub] [iceberg] openinx commented on issue #1286: Slow parallel operations fail to commit

Posted by GitBox <gi...@apache.org>.
openinx commented on issue #1286:
URL: https://github.com/apache/iceberg/issues/1286#issuecomment-668018068


   > continuously appends the table with trigger period 10s.
   
   You mean the Spark streaming job will commit to the Iceberg table every 10 seconds? If so, then it seems we commit to Iceberg too frequently, and it's easy to cause transaction conflicts.
   Or just write the data files without committing the transaction, and commit the transaction after some longer interval.
   
   






[GitHub] [iceberg] HeartSaVioR edited a comment on issue #1286: Slow parallel operations fail to commit

Posted by GitBox <gi...@apache.org>.
HeartSaVioR edited a comment on issue #1286:
URL: https://github.com/apache/iceberg/issues/1286#issuecomment-668021708


   This is pretty similar to what I posted on the dev@ mailing list.
   
   https://lists.apache.org/thread.html/rf6c800ee3ea580a6c4dce377c292e9bc47a74ff5b27ccb79323e6426%40%3Cdev.iceberg.apache.org%3E
   
   There's a reasonable answer from @rdblue there, so you may want to read through it.
   
   https://lists.apache.org/thread.html/r6d321b83baebe33e07f1632b176316aeb354c5a4f0d6397f056460ba%40%3Cdev.iceberg.apache.org%3E
   
   And the answer was similar to what @openinx provided: 10s is a high commit rate for Iceberg, so you might want to use a longer interval, or implement your own logic for metadata R/W which accelerates the metadata write.
   
   Btw, I'd agree there might be another valid idea: "prioritize" the commit and give the prioritized commit a chance to retry a couple of times without losing the lock. (The problem here is that this is only available with the Hive catalog.) We may want maintenance operations to be prioritized so they succeed whenever possible, or exactly the opposite.






[GitHub] [iceberg] rdblue commented on issue #1286: Slow parallel operations fail to commit

Posted by GitBox <gi...@apache.org>.
rdblue commented on issue #1286:
URL: https://github.com/apache/iceberg/issues/1286#issuecomment-670594342


   I want to clarify here that the amount of data rewritten during a compaction should not matter very much. In many cases, the initial commit will fail simply because the operation takes a long time. What matters is how long a retry takes, because retries are metadata-only operations.
   
   For operations like compaction, Iceberg needs to rewrite existing manifests to remove files that were compacted. Any filtered and rewritten manifest is cached so that a retry never needs to rewrite the same manifest file twice. Iceberg will also use manifest file metadata to avoid even scanning manifests that cannot contain the files it is replacing. In most cases with a Spark streaming job appending to a table, we would expect the new manifests not to require scanning and the old manifests to be unchanged by the appends. Then all of the initial manifest rewrite work can be reused in a retry, and it should proceed quickly enough to commit within the 10s interval. (The minimum amount of work to commit is writing a manifest list and a metadata JSON file, which should be well under 10s, even with S3.)
   
   I think what we need to do to debug this case is find out what work the retries are doing. In our environment we can log file system operations, so we can see what files are being created in each attempt and how long they take. Can someone try to reproduce the issue and attach the log from the compaction so we can see what is happening that causes the retry to take so long?




[GitHub] [iceberg] HeartSaVioR edited a comment on issue #1286: Slow parallel operations fail to commit

Posted by GitBox <gi...@apache.org>.
HeartSaVioR edited a comment on issue #1286:
URL: https://github.com/apache/iceberg/issues/1286#issuecomment-670816890


   At a glance at the codebase, it looks like applying the changes onto the current base snapshot is executed on every retry. Building the manifests/snapshot always seems to be based on the operation, even though Iceberg could also leverage information about the delta between the previous base snapshot and the new base snapshot when retrying.
   
   We could do this differently when the snapshots after the base snapshot all came from "fast append" operations.
   (I'm assuming we trust the "operation" field in the snapshot information. If we have to read through manifest list files & manifest files, that would probably introduce more latency.)
   
   Probably the cheapest approach would be to allow reordering of snapshots - insert the new snapshot between the base snapshot and the snapshot that has the base snapshot as its parent. We currently only add a new snapshot at the tail (append), so this is only viable if we are OK with breaking that policy.
   
   Alternatively, we can list the manifests written by the snapshots after the base snapshot and add just those files to the manifest list file of the snapshot created in the previous attempt (it would be nice if we could simply append; if that's not possible, read and merge into a new file), then write metadata for the modified snapshot and commit.
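   
   Roughly, as a sketch (`snapshotsSince`, `newManifestsOf`, and `writeManifestList` are hypothetical helpers, not existing Iceberg APIs; `previousAttemptManifests` and `baseSnapshot` are assumed to be in scope):
   ```
   import java.util.ArrayList;
   import java.util.List;
   import org.apache.iceberg.DataOperations;
   import org.apache.iceberg.ManifestFile;
   import org.apache.iceberg.Snapshot;
   import org.apache.iceberg.exceptions.CommitFailedException;
   
   // Splice the manifests added by intervening fast appends into the manifest
   // list produced by the previous attempt, instead of redoing the operation.
   List<ManifestFile> manifests = new ArrayList<>(previousAttemptManifests);
   for (Snapshot s : snapshotsSince(baseSnapshot)) {
     if (!DataOperations.APPEND.equals(s.operation())) {
       throw new CommitFailedException("Non-append snapshot, full retry needed");
     }
     manifests.addAll(newManifestsOf(s));  // manifests the fast append added
   }
   writeManifestList(manifests);           // then write metadata and commit
   ```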
   
   Does this make sense? If it does, I'll try a POC.




[GitHub] [iceberg] davseitsev commented on issue #1286: Slow parallel operations fail to commit

Posted by GitBox <gi...@apache.org>.
davseitsev commented on issue #1286:
URL: https://github.com/apache/iceberg/issues/1286#issuecomment-668641386


   I looked through RewriteDataFilesAction from the latest release. It's really awesome that Iceberg has a built-in compaction process; I didn't see it in the previous release. But when I ran it a few times, I got the same issue as in #1159.




[GitHub] [iceberg] HeartSaVioR commented on issue #1286: Slow parallel operations fail to commit

Posted by GitBox <gi...@apache.org>.
HeartSaVioR commented on issue #1286:
URL: https://github.com/apache/iceberg/issues/1286#issuecomment-668030858


   Btw, there's a RewriteDataFilesAction among the Spark actions which does data file compaction, though there are some points to improve (https://github.com/apache/iceberg/issues/1159), like prioritizing small files as you do, and only picking N GB of files.
   
   Restricting the size is probably a valid strategy for bounding the compaction time - if we can keep it to a fairly reasonable time, like a couple of minutes, this could be enabled as part of the streaming write, as auto compaction.
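   
   For reference, invoking it looks roughly like this (the filter and target size are illustrative; check the Actions API of your release for the exact options):
   ```
   import org.apache.iceberg.actions.Actions;
   import org.apache.iceberg.expressions.Expressions;
   
   // Compact one partition's data files toward a target size.
   Actions.forTable(table)
       .rewriteDataFiles()
       .filter(Expressions.equal("event_hour", "2020-08-03-12"))  // hypothetical partition column
       .targetSizeInBytes(512L * 1024 * 1024)                     // ~512 MB output files
       .execute();
   ```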
   




[GitHub] [iceberg] davseitsev commented on issue #1286: Slow parallel operations fail to commit

Posted by GitBox <gi...@apache.org>.
davseitsev commented on issue #1286:
URL: https://github.com/apache/iceberg/issues/1286#issuecomment-668078507


   Thanks @HeartSaVioR for the answer. I've looked through the mailing list, and I understand all the concerns @rdblue describes. I configured metadata and manifest file compaction to keep an acceptable number of files, so that should not be an issue. I also apply snapshot expiration to keep only a few of the latest snapshots and clean up old files.
   In my opinion, the cause of the slow commit operation is that I keep my table on Amazon S3. I like the idea of implementing my own logic for metadata operations; I'll try to do that.
   




[GitHub] [iceberg] HeartSaVioR edited a comment on issue #1286: Slow parallel operations fail to commit

Posted by GitBox <gi...@apache.org>.
HeartSaVioR edited a comment on issue #1286:
URL: https://github.com/apache/iceberg/issues/1286#issuecomment-668358945


   > Or just write the data files without committing the transaction, and commit the transaction after some longer interval.
   
   If I understand correctly, the first part is already what we do while rewriting data files (at least the Spark action does; a custom implementation would be similar). The latter is something Iceberg already handles when retrying a commit - you can configure the max number of retries and the retry interval to adjust the time spent committing.
   
   The problem occurs when the commit can't complete within the micro-batch interval - in that case it always loses the race. Increasing the max number of retries or the retry interval wouldn't help.
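   
   (For reference, those knobs are table properties; a sketch, with defaults that may differ by version:)
   ```
   // Raise the commit retry count and backoff for the table.
   table.updateProperties()
       .set("commit.retry.num-retries", "10")     // default is 4
       .set("commit.retry.min-wait-ms", "1000")   // backoff floor
       .set("commit.retry.max-wait-ms", "30000")  // backoff ceiling
       .commit();
   ```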

