Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2020/07/23 23:15:37 UTC

[GitHub] [hudi] tsolanki95 opened a new issue #1867: [SUPPORT] hudi is incurring emrfs eTag inconsistency issue with s3 and emrfs consistent view

tsolanki95 opened a new issue #1867:
URL: https://github.com/apache/hudi/issues/1867


   Received the following error using the default installation of Hudi in EMR 5.29.0 (Hudi version 0.5.0):
   `RetryInvocationHandler: Exception while invoking ConsistencyCheckerS3FileSystem.open over null. Retrying after sleeping for 35000ms. com.amazon.ws.emr.hadoop.fs.consistency.exception.ConsistencyException: eTag in metadata for File '<s3 path>/.hoodie_partition_metadata' does not match eTag from S3!`
   
   This typically happens because of eTag verification in EMRFS consistent view, which checks that for a file on S3 we are reading the latest version of the file (based on the eTag stored in the DynamoDB table). We posed this question on [stack overflow](https://stackoverflow.com/questions/63052142/error-while-emrfs-consistency-view-enabled-along-with-hudi), and someone commented that this happens when files are written with the standard AWS SDK rather than through EMRFS. Does the current Hudi implementation work with EMRFS consistent view (a solution we put in place earlier to overcome S3 eventual consistency issues in Spark)? If so, do we need to disable `fs.s3.consistent.metadata.etag.verification.enabled`?
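
   As a rough illustration of the eTag check described above (a conceptual sketch only, not EMRFS's actual implementation): the eTag tracked in a metadata table is compared with the one the object store currently reports, and a mismatch raises an error. The `verify_etag` function, the `ConsistencyError` class, and the dictionaries standing in for the DynamoDB table and S3 are all hypothetical names used for illustration.

```python
class ConsistencyError(Exception):
    """Hypothetical stand-in for EMRFS's ConsistencyException."""


def verify_etag(path, metadata_store, object_store):
    """Return the object's eTag if it matches the tracked one, else raise.

    metadata_store: dict standing in for the DynamoDB metadata table (assumption).
    object_store:   dict standing in for S3 (assumption).
    """
    tracked = metadata_store.get(path)  # eTag recorded by the tracked writer
    actual = object_store.get(path)     # eTag the object store reports now
    if tracked is not None and tracked != actual:
        raise ConsistencyError(
            f"eTag in metadata for File '{path}' does not match eTag from S3!"
        )
    return actual


# Both stores start out in agreement.
metadata = {"tbl/.hoodie_partition_metadata": "etag-v1"}
store = {"tbl/.hoodie_partition_metadata": "etag-v1"}

# A write that bypasses the tracked client changes the object store only,
# so the metadata table is left holding a stale eTag.
store["tbl/.hoodie_partition_metadata"] = "etag-v2"
```

   With the stores in this state, calling `verify_etag("tbl/.hoodie_partition_metadata", metadata, store)` raises `ConsistencyError`, which mirrors the situation suggested in the Stack Overflow comment: a file rewritten outside the tracked path.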


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] bvaradar commented on issue #1867: [SUPPORT] hudi is incurring emrfs eTag inconsistency issue with s3 and emrfs consistent view

Posted by GitBox <gi...@apache.org>.
bvaradar commented on issue #1867:
URL: https://github.com/apache/hudi/issues/1867#issuecomment-663414676


   @umehrot2 : Can you help answer this question? Thanks.
   Balaji.V





[GitHub] [hudi] absognety commented on issue #1867: [SUPPORT] hudi is incurring emrfs eTag inconsistency issue with s3 and emrfs consistent view

Posted by GitBox <gi...@apache.org>.
absognety commented on issue #1867:
URL: https://github.com/apache/hudi/issues/1867#issuecomment-931199668


   @tsolanki95 What resolved this issue? I am facing the same issue when reading data written in Hudi format from S3.





[GitHub] [hudi] bschell commented on issue #1867: [SUPPORT] hudi is incurring emrfs eTag inconsistency issue with s3 and emrfs consistent view

Posted by GitBox <gi...@apache.org>.
bschell commented on issue #1867:
URL: https://github.com/apache/hudi/issues/1867#issuecomment-666775983


   @tsolanki95 As mentioned, it would be good to know the steps that you take to encounter this issue. Is this consistently reproducible? Does it resolve on retry? Otherwise it might be best to open a ticket with AWS EMR Support.





[GitHub] [hudi] tsolanki95 edited a comment on issue #1867: [SUPPORT] hudi is incurring emrfs eTag inconsistency issue with s3 and emrfs consistent view

Posted by GitBox <gi...@apache.org>.
tsolanki95 edited a comment on issue #1867:
URL: https://github.com/apache/hudi/issues/1867#issuecomment-663688633


   @luffyd We put consistent view in place earlier, on the advice of AWS support, to solve issues with Spark and the S3 eventual consistency model that were causing duplicates in our data. We are now looking to move some of our datasets to Hudi, but our compute resources still use EMRFS consistent view. During the transition, when some of our datasets use Hudi and some do not, it would be good to be able to run Spark with Hudi on EMRFS consistent view.





[GitHub] [hudi] tsolanki95 commented on issue #1867: [SUPPORT] hudi is incurring emrfs eTag inconsistency issue with s3 and emrfs consistent view

Posted by GitBox <gi...@apache.org>.
tsolanki95 commented on issue #1867:
URL: https://github.com/apache/hudi/issues/1867#issuecomment-663688633


   We put consistent view in place earlier, on the advice of AWS support, to solve issues with Spark and the S3 eventual consistency model. We are now looking to move some of our datasets to Hudi, but our compute resources still use EMRFS consistent view. During the transition, when some of our datasets use Hudi and some do not, it would be good to be able to run Spark with Hudi on EMRFS consistent view.





[GitHub] [hudi] bvaradar commented on issue #1867: [SUPPORT] hudi is incurring emrfs eTag inconsistency issue with s3 and emrfs consistent view

Posted by GitBox <gi...@apache.org>.
bvaradar commented on issue #1867:
URL: https://github.com/apache/hudi/issues/1867#issuecomment-668061087


   @tsolanki95 : This would be best addressed by opening a ticket with EMR support. Closing this ticket. Please reopen if this is specific to hudi.





[GitHub] [hudi] tsolanki95 commented on issue #1867: [SUPPORT] hudi is incurring emrfs eTag inconsistency issue with s3 and emrfs consistent view

Posted by GitBox <gi...@apache.org>.
tsolanki95 commented on issue #1867:
URL: https://github.com/apache/hudi/issues/1867#issuecomment-663698781


   This is also a field where data quality, precision, and accuracy are important. EMRFS consistent view helps us avoid issues with S3 consistency, and some of the features that Hudi provides, such as rollback capabilities and auditing and tracking of changes made to our tables, are incredibly powerful for finding and isolating data quality errors, then rolling back and rerunning with fixed input data or code.





[GitHub] [hudi] luffyd commented on issue #1867: [SUPPORT] hudi is incurring emrfs eTag inconsistency issue with s3 and emrfs consistent view

Posted by GitBox <gi...@apache.org>.
luffyd commented on issue #1867:
URL: https://github.com/apache/hudi/issues/1867#issuecomment-663659622


   @tsolanki95 Does this happen at read time? In my tests, I noticed that eTags are not in sync for the .hoodie folder.
   Also, what are your reasons for enabling consistent view when using Hudi?





[GitHub] [hudi] umehrot2 commented on issue #1867: [SUPPORT] hudi is incurring emrfs eTag inconsistency issue with s3 and emrfs consistent view

Posted by GitBox <gi...@apache.org>.
umehrot2 commented on issue #1867:
URL: https://github.com/apache/hudi/issues/1867#issuecomment-663732545


   @tsolanki95 Have you tried using `hoodie.consistency.check.enabled` instead? It is Hudi's built-in mechanism for avoiding `eventual consistency` issues.
   
   As for this particular issue with `EmrFS consistent view`: are these temporary errors that resolve on retrying, or are they causing the job to fail? Yes, disabling `fs.s3.consistent.metadata.etag.verification.enabled` could be a way forward if this is blocking you, while the EMR team investigates the issue.
   
   cc @bschell, who actually worked on the eTag feature in EmrFS. Do you see any obvious cause for this? Otherwise, we can have them open a ticket with AWS EMR support and investigate from there.
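
   A minimal sketch of how the two settings mentioned above might be expressed, assuming the Hudi option is passed as a Spark datasource write option and the EMRFS property is set through the `emrfs-site` configuration classification at cluster creation. The helper names `hudi_write_options` and `emrfs_site_classification` are hypothetical; only the two property names come from this thread, and where exactly the EMRFS property belongs should be confirmed against the EMR documentation.

```python
import json


def hudi_write_options(table_name):
    """Build Hudi write options with Hudi's own consistency check turned on."""
    return {
        "hoodie.table.name": table_name,
        # Hudi's built-in guard against eventual-consistency issues.
        "hoodie.consistency.check.enabled": "true",
    }


def emrfs_site_classification(etag_verification_enabled):
    """Build an EMR configuration list toggling EMRFS eTag verification.

    Assumption: the property is applied via the emrfs-site classification.
    """
    return [
        {
            "Classification": "emrfs-site",
            "Properties": {
                "fs.s3.consistent.metadata.etag.verification.enabled":
                    str(etag_verification_enabled).lower(),
            },
        }
    ]


# Print the JSON that would be supplied as cluster configuration.
print(json.dumps(emrfs_site_classification(False), indent=2))
```

   The dictionary from `hudi_write_options` could then be passed to a Spark writer, for example `df.write.format("hudi").options(**hudi_write_options("my_table"))`, where `df` and `my_table` are placeholders. Disabling eTag verification removes a safety check, so it is better treated as a temporary workaround than as a fix.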
   
   
   





[GitHub] [hudi] umehrot2 commented on issue #1867: [SUPPORT] hudi is incurring emrfs eTag inconsistency issue with s3 and emrfs consistent view

Posted by GitBox <gi...@apache.org>.
umehrot2 commented on issue #1867:
URL: https://github.com/apache/hudi/issues/1867#issuecomment-663734319


   Also, on a side note, we always recommend using the latest EMR releases, as they have the latest fixes and application versions. So you may want to use `emr-5.30.1` instead.





[GitHub] [hudi] bvaradar closed issue #1867: [SUPPORT] hudi is incurring emrfs eTag inconsistency issue with s3 and emrfs consistent view

Posted by GitBox <gi...@apache.org>.
bvaradar closed issue #1867:
URL: https://github.com/apache/hudi/issues/1867


   





[GitHub] [hudi] tsolanki95 edited a comment on issue #1867: [SUPPORT] hudi is incurring emrfs eTag inconsistency issue with s3 and emrfs consistent view

Posted by GitBox <gi...@apache.org>.
tsolanki95 edited a comment on issue #1867:
URL: https://github.com/apache/hudi/issues/1867#issuecomment-663688633


   @luffyd We put consistent view in place earlier, on the advice of AWS support, to solve issues with Spark and the S3 eventual consistency model. We are now looking to move some of our datasets to Hudi, but our compute resources still use EMRFS consistent view. During the transition, when some of our datasets use Hudi and some do not, it would be good to be able to run Spark with Hudi on EMRFS consistent view.

