Posted to users@hudi.apache.org by Ryan Murray <ry...@gmail.com> on 2020/08/12 17:34:35 UTC

S3 and eventual consistency

Hey all,

I've been playing around with Hudi for a little while now. Really like it!
Thanks for all the work :-)

I do have a question about S3 and consistency: How does Hudi get around
eventual consistency in S3? Particularly in the case of metadata files.

I can see there is a ConsistencyGuard[1] which ensures that the JVM thread
it's run in can see a path; however, it isn't clear to me that this would be
valid across a system.
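
A rough sketch of the kind of check being described, for readers following
along: poll until a path becomes visible to the caller, backing off between
attempts. This is not Hudi's ConsistencyGuard code. The function name, the
parameters, and the injected `exists` and `sleep` callables are all
hypothetical and only illustrate the pattern.

```python
import time
from typing import Callable


def wait_till_visible(
    path: str,
    exists: Callable[[str], bool],
    max_checks: int = 5,
    initial_delay_s: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll exists(path) with exponential backoff.

    Returns True as soon as the path is visible to this caller, or False
    once max_checks attempts have been made. `exists` is a hypothetical
    stand-in for whatever storage client call is actually used.
    """
    delay = initial_delay_s
    for attempt in range(max_checks):
        if exists(path):
            return True
        # Do not sleep after the final failed attempt.
        if attempt < max_checks - 1:
            sleep(delay)
            delay *= 2
    return False
```

Note that a check like this only establishes what the calling process can
see, which is exactly the concern here.
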

If a writer 'A' performs an action which requires a rename, for example, how
can we ensure that readers B and C see the newly renamed file? Or even that
nodes across reader B (e.g. a Spark cluster) see the same file content?

To me this is checking if an object is visible from a particular thread
rather than checking the eventual consistency restrictions of S3[2]. People
have gone to great lengths to get around S3's consistency issues as well
[3][4].

Apologies if this is a naive question, I am still grappling with the Hudi
commit model.

Best,
Ryan

[1]
https://github.com/apache/hudi/blob/master/hudi-common/src/main/java/org/apache/hudi/common/fs/ConsistencyGuard.java
[2]
https://docs.aws.amazon.com/AmazonS3/latest/dev/Introduction.html#ConsistencyModel
[3] https://github.com/Netflix/s3mper
[4] https://docs.delta.io/latest/delta-storage.html#amazon-s3

Re: S3 and eventual consistency

Posted by Ryan Murray <ry...@gmail.com>.
Thanks Balaji,

That makes a lot of sense. I haven't seen any issues in my testing; I am
just trying to understand all the edge cases.

I suppose the only theoretical issue is that a reader may not see the most
recent update from the writer, but that would be a rare and transient
occurrence in real life.

Best,
Ryan

On Thu, Aug 13, 2020 at 5:38 PM Balaji Varadarajan
<v....@ymail.com.invalid> wrote:

> Hey Ryan,
>
> Thanks for the detailed writeup and great job explaining the question and
> the links :)
>
> W.r.t. renaming, Hudi avoids renaming metadata files altogether and creates
> immutable metadata filenames encoded with the state of the commit.
>
> Generally, we believe some of the consistency solutions out there were
> written in the early days of S3, when the guarantees were not well
> established/understood.
>
> The S3 consistency guard in Hudi has been fairly battle-tested for a while by
> the community now in their production clusters. Are you seeing any specific
> issues in your setup?
>
> Once again thanks for your interest in Hudi
>
> Balaji.V

Re: S3 and eventual consistency

Posted by Balaji Varadarajan <v....@ymail.com.INVALID>.
Hey Ryan,
Thanks for the detailed writeup and great job explaining the question and the links :)
W.r.t. renaming, Hudi avoids renaming metadata files altogether and creates immutable metadata filenames encoded with the state of the commit.
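
To make the idea of state-encoded, immutable filenames concrete, here is a
minimal sketch. It assumes a made-up naming scheme,
`<instant>.commit.<state>`, which is not necessarily Hudi's actual file
layout; the function names and state names are hypothetical too. Each state
transition writes a new file instead of renaming an old one, and a reader
picks the newest instant that has a completed marker.

```python
from typing import Iterable, Optional

# Hypothetical state names; the real Hudi timeline uses its own scheme.
STATES = ("requested", "inflight", "completed")


def marker_name(instant: str, state: str) -> str:
    """Build an immutable marker filename that encodes the commit state."""
    if state not in STATES:
        raise ValueError(f"unknown state: {state}")
    return f"{instant}.commit.{state}"


def latest_completed(listing: Iterable[str]) -> Optional[str]:
    """Return the newest instant that has a 'completed' marker, if any.

    Instants are assumed to be fixed-width timestamp strings, so the
    lexicographic maximum is also the most recent one.
    """
    suffix = ".commit.completed"
    done = [name[: -len(suffix)] for name in listing if name.endswith(suffix)]
    return max(done) if done else None
```

With a listing that holds all three markers for instant 20200812173435 but
only the requested and inflight markers for 20200813090000,
`latest_completed` returns "20200812173435": the later commit is ignored
until its completed marker exists.
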
Generally, we believe some of the consistency solutions out there were written in the early days of S3, when the guarantees were not well established/understood.
The S3 consistency guard in Hudi has been fairly battle-tested for a while by the community now in their production clusters. Are you seeing any specific issues in your setup?
Once again, thanks for your interest in Hudi.
Balaji.V