Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2020/10/14 20:45:01 UTC

[GitHub] [iceberg] fbocse commented on issue #1496: Update version-hint.txt atomically

fbocse commented on issue #1496:
URL: https://github.com/apache/iceberg/issues/1496#issuecomment-708650031


   @pvary thank you for following up on this - I think the solution you are considering reproduces the behaviour we see with the server-side implementation of `create(overwrite=true)` that Iceberg uses to overwrite the `version-hint.txt` file. I don't think that, in our particular case, this necessarily turns into a benefit at the "expense" of a couple more HDFS requests.
   
   However, your thoughts got me thinking too, especially with the new fallback behaviour in place. I wouldn't necessarily advertise using more APIs than necessary to implement optimistic locking, because we're also exposing more edge cases. I was thinking about this one in particular, accounting for your proposal (a rough code sketch of these steps follows the list):
   
   -1. writer reads version-hint.txt (with content 17)
   0. writer calls create(overwrite=false) for v18.metadata.json
   1. writer calls create(overwrite=false) for the file version-hint-18.txt (with content 18)
   2. writer deletes version-hint.txt (with content 17)
   3. writer moves version-hint-18.txt to version-hint.txt (with content 18)
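
   For concreteness, here is a rough sketch of steps 1-3 against the Hadoop `FileSystem` API - the class, method and variable names here are mine for illustration, not Iceberg's actual implementation:

   ```java
   import java.io.IOException;
   import java.nio.charset.StandardCharsets;

   import org.apache.hadoop.fs.FSDataOutputStream;
   import org.apache.hadoop.fs.FileSystem;
   import org.apache.hadoop.fs.Path;

   class VersionHintSketch {
     void commitHint(FileSystem fs, Path metadataDir, int nextVersion) throws IOException {
       Path hint = new Path(metadataDir, "version-hint.txt");
       Path tempHint = new Path(metadataDir, "version-hint-" + nextVersion + ".txt");

       // step 1: create(overwrite=false) the per-version hint file with the new version
       try (FSDataOutputStream out = fs.create(tempHint, false /* overwrite */)) {
         out.write(String.valueOf(nextVersion).getBytes(StandardCharsets.UTF_8));
       }

       // step 2: delete the old version-hint.txt (content: nextVersion - 1)
       fs.delete(hint, false /* recursive */);

       // step 3: move the per-version hint file into place as version-hint.txt
       // NOTE: between steps 2 and 3 no version-hint.txt exists - that window is
       // exactly what the scenario below is about
       if (!fs.rename(tempHint, hint)) {
         throw new IOException("Failed to rename " + tempHint + " to " + hint);
       }
     }
   }
   ```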
   
   If between 2 and 3 a NEW writer attempts to commit, it will fail to load `version-hint.txt` and will fall back to directory listing instead - it then finds v18.metadata.json as the greatest version, writes v19.metadata.json and commits its new version to the version-hint.txt file.
   If it so happens that the NEW writer also manages to complete its own steps 2 and 3 (unlikely but not impossible, think GC pause?) before the initial writer does, we might end up with version 19 being overwritten by version 18. This is a classic deadlock.
   This basically "locks" the table for any subsequent writers, because new writers will find version-hint.txt, load value 18 and increment it to resolve the version they should write a new metadata file for, but fail to do so since there already is a v19.metadata.json file (illustrated below).
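
   To spell out what every subsequent writer would hit once the stale hint lands, here is a simplified, hypothetical sketch (the helper and class names are made up, not Iceberg code):

   ```java
   import java.io.BufferedReader;
   import java.io.IOException;
   import java.io.InputStreamReader;
   import java.nio.charset.StandardCharsets;

   import org.apache.hadoop.fs.FileSystem;
   import org.apache.hadoop.fs.Path;

   class StuckTableIllustration {
     // hypothetical helper: read the integer stored in version-hint.txt
     static int readHint(FileSystem fs, Path metadataDir) throws IOException {
       Path hint = new Path(metadataDir, "version-hint.txt");
       try (BufferedReader in = new BufferedReader(
           new InputStreamReader(fs.open(hint), StandardCharsets.UTF_8))) {
         return Integer.parseInt(in.readLine().trim());
       }
     }

     static void attemptCommit(FileSystem fs, Path metadataDir) throws IOException {
       int hinted = readHint(fs, metadataDir);   // stale hint: 18
       Path next = new Path(metadataDir, "v" + (hinted + 1) + ".metadata.json");
       // v19.metadata.json already exists (written by the other writer), so
       // create(overwrite=false) throws - and every new writer fails the same way
       fs.create(next, false /* overwrite */).close();
     }
   }
   ```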
   
   I think this behaviour is present today should the client exit unexpectedly right before replacing version-hint.txt.
   But on top of that edge case this implementation adds an edge case of a different category: it relates to the distributed nature of writers attempting to replace the same resource.
   
   Also, looking at your suggestion, I think that if any writer exits unexpectedly before step 3, the version-hint.txt file is never going to be materialized and all subsequent writers will fall back to directory listing for version resolution, right? I assume that resolving the version this way doesn't also involve writing it back to the version-hint.txt file.
   
   My suggestion is that we also extract the current implementation of `HadoopTableOperations:writeVersionHint` and `HadoopTableOperations:readVersionHint` into something we can override with APIs that provide the right atomicity and consistency guarantees. Say directory listing gives us read-after-write guarantees and we also get atomic create(overwrite=false) guarantees - then we can implement version resolution on top of those: each new version is a new version file, and every writer attempts to delete versions older than, say, the last 10, so directory listing stays roughly constant time (see the sketch below).
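
   A minimal sketch of what I mean, assuming we rely only on atomic create(overwrite=false) plus a listing with read-after-write visibility - the interface and class names are hypothetical, not existing Iceberg APIs:

   ```java
   import java.io.IOException;

   import org.apache.hadoop.fs.FileStatus;
   import org.apache.hadoop.fs.FileSystem;
   import org.apache.hadoop.fs.Path;

   interface VersionHintStore {
     int readVersion() throws IOException;
     void writeVersion(int version) throws IOException;
   }

   class ListingVersionHintStore implements VersionHintStore {
     private static final int KEEP_LAST = 10; // keeps the listing roughly constant-sized

     private final FileSystem fs;
     private final Path hintDir;

     ListingVersionHintStore(FileSystem fs, Path hintDir) {
       this.fs = fs;
       this.hintDir = hintDir;
     }

     @Override
     public int readVersion() throws IOException {
       // the version is encoded in the file name, so listing alone resolves it
       int max = 0;
       for (FileStatus status : fs.listStatus(hintDir)) {
         max = Math.max(max, parseVersion(status.getPath().getName()));
       }
       return max;
     }

     @Override
     public void writeVersion(int version) throws IOException {
       // atomic create(overwrite=false): only one writer can claim a given version
       fs.create(new Path(hintDir, "version-hint-" + version + ".txt"), false).close();
       prune(version);
     }

     private void prune(int latest) throws IOException {
       // best-effort cleanup of hint files older than the last KEEP_LAST versions
       for (FileStatus status : fs.listStatus(hintDir)) {
         int v = parseVersion(status.getPath().getName());
         if (v > 0 && v <= latest - KEEP_LAST) {
           fs.delete(status.getPath(), false);
         }
       }
     }

     private static int parseVersion(String name) {
       if (name.startsWith("version-hint-") && name.endsWith(".txt")) {
         try {
           return Integer.parseInt(
               name.substring("version-hint-".length(), name.length() - ".txt".length()));
         } catch (NumberFormatException e) {
           return -1;
         }
       }
       return -1;
     }
   }
   ```

   The idea being that the hint is never deleted and re-created in place, so the window between steps 2 and 3 above goes away entirely.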


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org