Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2020/07/14 15:34:59 UTC

[GitHub] [hudi] kirkuz opened a new issue #1828: [SUPPORT] Cannot force hudi to retain only last commit

kirkuz opened a new issue #1828:
URL: https://github.com/apache/hudi/issues/1828


   Hi Guys, 
   
   Is it possible to retain only the last commit? When I set 'hoodie.cleaner.commits.retained': 1 in hudi_options, I still see the last two commits: the one being written and the previous one. What I want is to keep only the latest change and the latest parquet file.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] kirkuz commented on issue #1828: [SUPPORT] Cannot force hudi to retain only last commit

Posted by GitBox <gi...@apache.org>.
kirkuz commented on issue #1828:
URL: https://github.com/apache/hudi/issues/1828#issuecomment-659376123


   Hi @umehrot2, I can see that it works as it should for now. I'm just wondering whether it's possible to create Hudi tables in Athena via an AWS Glue crawler, and not only by running a CREATE TABLE statement with the Hudi input format (mentioned here: https://docs.aws.amazon.com/athena/latest/ug/querying-hudi.html)?





[GitHub] [hudi] kirkuz commented on issue #1828: [SUPPORT] Cannot force hudi to retain only last commit

Posted by GitBox <gi...@apache.org>.
kirkuz commented on issue #1828:
URL: https://github.com/apache/hudi/issues/1828#issuecomment-658808083


   Thanks guys! I'll test it out.





[GitHub] [hudi] kirkuz edited a comment on issue #1828: [SUPPORT] Cannot force hudi to retain only last commit

Posted by GitBox <gi...@apache.org>.
kirkuz edited a comment on issue #1828:
URL: https://github.com/apache/hudi/issues/1828#issuecomment-659959113


   @umehrot2 is there any chance that it will be supported soon? Is it planned? Can I help with that somehow?





[GitHub] [hudi] bvaradar commented on issue #1828: [SUPPORT] Cannot force hudi to retain only last commit

Posted by GitBox <gi...@apache.org>.
bvaradar commented on issue #1828:
URL: https://github.com/apache/hudi/issues/1828#issuecomment-668664859


   @kirkuz : Kindly reach out to AWS support directly. I am closing this ticket for now.





[GitHub] [hudi] bvaradar commented on issue #1828: [SUPPORT] Cannot force hudi to retain only last commit

Posted by GitBox <gi...@apache.org>.
bvaradar commented on issue #1828:
URL: https://github.com/apache/hudi/issues/1828#issuecomment-659501450


   > btw. do you have any community slack channel?
   
   Please add your id to https://github.com/apache/hudi/issues/143 and we will add you to slack 





[GitHub] [hudi] kirkuz commented on issue #1828: [SUPPORT] Cannot force hudi to retain only last commit

Posted by GitBox <gi...@apache.org>.
kirkuz commented on issue #1828:
URL: https://github.com/apache/hudi/issues/1828#issuecomment-659959113


   @umehrot2 is there any chance that it will be supported soon? Is it planned?





[GitHub] [hudi] umehrot2 commented on issue #1828: [SUPPORT] Cannot force hudi to retain only last commit

Posted by GitBox <gi...@apache.org>.
umehrot2 commented on issue #1828:
URL: https://github.com/apache/hudi/issues/1828#issuecomment-659631425


   > Hi @umehrot2, I can see that it works as it should for now. I'm just wondering if it's possible to create hudi tables in Athena via AWS Glue crawler not only by running CREATE TABLE statement with hudi input format (mentioned here: https://docs.aws.amazon.com/athena/latest/ug/querying-hudi.html)
   
   @kirkuz AWS Glue does not officially support Hudi. So this may not be possible right now.





[GitHub] [hudi] kirkuz commented on issue #1828: [SUPPORT] Cannot force hudi to retain only last commit

Posted by GitBox <gi...@apache.org>.
kirkuz commented on issue #1828:
URL: https://github.com/apache/hudi/issues/1828#issuecomment-658790661


   Hi @bvaradar thanks for that. Does it mean that it was released on AWS yesterday? Should I use the latest EMR cluster release?





[GitHub] [hudi] kirkuz commented on issue #1828: [SUPPORT] Cannot force hudi to retain only last commit

Posted by GitBox <gi...@apache.org>.
kirkuz commented on issue #1828:
URL: https://github.com/apache/hudi/issues/1828#issuecomment-658587504


   Hi @bhasudha, thanks for this information. Now it's clear how it works. My use case is as follows: I want to have only the last change in the parquet files, because when I read them with AWS Athena it sees duplicate records with different _hoodie_commit values. In an ideal world, I would have one S3 bucket with only the last change (so that users are not forced to deduplicate in their AWS Athena queries) and a second bucket with all commits, to keep the whole history.
   
   Can you recommend something?





[GitHub] [hudi] bhasudha commented on issue #1828: [SUPPORT] Cannot force hudi to retain only last commit

Posted by GitBox <gi...@apache.org>.
bhasudha commented on issue #1828:
URL: https://github.com/apache/hudi/issues/1828#issuecomment-658282320


   > Hi Guys,
   > 
   > Is it possible to retain only last commit? When I put 'hoodie.cleaner.commits.retained': 1 in hudi_options I still have two last commits. One that is being written and the previous one. What I want to achieve is to have only last change and last parquet file.
   
   @kirkuz, providing some context: cleaning and compaction happen in the background (asynchronously to ingestion itself). When the cleaner kicks in, it gets rid of the older commit. If there is an ongoing write, there are generally two possibilities:
   1. The write succeeds, in which case the cleaner gets rid of the old version when it triggers, based on `hoodie.cleaner.commits.retained`.
   2. The write fails for some reason, in which case the cleaner later gets rid of the failed commit and retains the other version (the last successful one).
   
   This is why you are seeing two commits. This should not affect queries. Can you please elaborate on what you are looking for, in terms of use case, performance concerns, etc., to help us understand better?
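
For reference, the writer options under discussion can be sketched as a plain Python dictionary. This is only a minimal sketch: the table name and field names are hypothetical placeholders, and the helper `build_hudi_options` is invented for illustration. As explained above, even with `hoodie.cleaner.commits.retained` set to 1, a previous version can still be present on storage until the cleaner runs.

```python
def build_hudi_options(table_name, record_key, precombine_field, commits_retained=1):
    """Return a dict of Hudi write options (hypothetical helper, for illustration only)."""
    if commits_retained < 1:
        raise ValueError("commits_retained must be at least 1")
    return {
        # Placeholder table and field names; replace with real ones.
        "hoodie.table.name": table_name,
        "hoodie.datasource.write.recordkey.field": record_key,
        "hoodie.datasource.write.precombine.field": precombine_field,
        "hoodie.datasource.write.operation": "upsert",
        # Cleaner setting from the original question.
        "hoodie.cleaner.commits.retained": commits_retained,
    }


hudi_options = build_hudi_options("example_table", "id", "updated_at")
```

With PySpark, such a dictionary would typically be passed as `df.write.format("hudi").options(**hudi_options).mode("append").save(path)`, where `df` and `path` are assumed to exist already.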





[GitHub] [hudi] bvaradar closed issue #1828: [SUPPORT] Cannot force hudi to retain only last commit

Posted by GitBox <gi...@apache.org>.
bvaradar closed issue #1828:
URL: https://github.com/apache/hudi/issues/1828


   





[GitHub] [hudi] umehrot2 commented on issue #1828: [SUPPORT] Cannot force hudi to retain only last commit

Posted by GitBox <gi...@apache.org>.
umehrot2 commented on issue #1828:
URL: https://github.com/apache/hudi/issues/1828#issuecomment-659028374


   @kirkuz yes, the AWS Athena support was just released yesterday. Please try out the official support, and if you still face this issue, open a support case with AWS Support and ping this thread as well. I will try to get someone from Athena to check it out.





[GitHub] [hudi] bvaradar commented on issue #1828: [SUPPORT] Cannot force hudi to retain only last commit

Posted by GitBox <gi...@apache.org>.
bvaradar commented on issue #1828:
URL: https://github.com/apache/hudi/issues/1828#issuecomment-658794018


   Yes, @kirkuz. CCing @umehrot2, who can also chime in.





[GitHub] [hudi] kirkuz commented on issue #1828: [SUPPORT] Cannot force hudi to retain only last commit

Posted by GitBox <gi...@apache.org>.
kirkuz commented on issue #1828:
URL: https://github.com/apache/hudi/issues/1828#issuecomment-659443232


   btw. do you have any community slack channel?





[GitHub] [hudi] bvaradar commented on issue #1828: [SUPPORT] Cannot force hudi to retain only last commit

Posted by GitBox <gi...@apache.org>.
bvaradar commented on issue #1828:
URL: https://github.com/apache/hudi/issues/1828#issuecomment-658764479


   @kirkuz: AWS Athena support for Hudi is just out: https://aws.amazon.com/about-aws/whats-new/2020/07/amazon-athena-adds-support-querying-apache-hudi-datasets-amazon-s3-based-data-lake/
   With this, your query should not see any duplicate records. Duplicate records can only appear if the table is not defined with the correct input format. The reason for keeping at least one previous version is to prevent queries from failing while a concurrent write is happening.
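
To illustrate why a reader that ignores the Hudi input format can see duplicates, here is a simplified Python model; it is not Hudi's actual implementation. It assumes base files are named `<fileId>_<writeToken>_<instantTime>.parquet` and that a Hudi-aware reader keeps only the newest base file per file ID. The file names and the helper `latest_base_files` are made up for the example.

```python
def latest_base_files(file_names):
    """Pick the newest base file per file ID (simplified model of a Hudi-aware reader)."""
    latest = {}
    for name in file_names:
        stem = name[: -len(".parquet")]
        file_id, _write_token, instant_time = stem.split("_")
        # Instant times are fixed-width timestamps, so string comparison orders them.
        if file_id not in latest or instant_time > latest[file_id][0]:
            latest[file_id] = (instant_time, name)
    return sorted(name for _, name in latest.values())


# Hypothetical listing: file group "f1" has two versions, "f2" has one.
files = [
    "f1_1-0-1_20200714101500.parquet",
    "f1_1-0-1_20200714153000.parquet",
    "f2_1-0-1_20200714101500.parquet",
]
selected = latest_base_files(files)
print(selected)
```

A plain parquet reader would scan all three files, so records stored in both versions of `f1` would show up twice, while the filtered listing contains one file per file group.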
   

