Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2020/05/02 16:15:54 UTC

[GitHub] [incubator-hudi] jvaesteves opened a new issue #1585: [SUPPORT] Delete Hudi commit history

jvaesteves opened a new issue #1585:
URL: https://github.com/apache/incubator-hudi/issues/1585


   Hello everyone, I am currently testing Hudi as a deduplication mechanism for a streaming project, and it is working pretty well. But since I never update any rows, keeping previous versions of the same row just wastes S3 space.
   
   I want to know whether it is possible to keep only the most recent version of my table, or to schedule a deletion of this history (and if so, how I would do that).
   
   **Environment Description**
   
   - Hudi version: 0.5.2
   - Spark version : 2.4.4
   - Hive version : 2.3.6
   - Hadoop version : 2.8.5
   - Storage (HDFS/S3/GCS..) : S3
   - Running on Docker? (yes/no) : No


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-hudi] bvaradar commented on issue #1585: [SUPPORT] Delete Hudi commit history

Posted by GitBox <gi...@apache.org>.
bvaradar commented on issue #1585:
URL: https://github.com/apache/incubator-hudi/issues/1585#issuecomment-623965327


   Try setting hoodie.cleaner.commits.retained=1 to keep the number of retained versions to a minimum.
   
   Hudi also has an option to filter out duplicate rows. For DeltaStreamer, use the flags "--filter-dupes --op INSERT". For Spark DataSource based writes, set the options hoodie.datasource.write.insert.drop.duplicates=true and hoodie.datasource.write.operation=insert.
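
   As a rough illustration of the Spark DataSource settings mentioned above, here is a minimal PySpark sketch. It is not taken from the thread: the helper names (hudi_dedup_options, write_batch) and the table, key, and path values are hypothetical placeholders, and it assumes the Hudi Spark bundle is on the classpath. Only the three options named in this comment come from the reply itself; the remaining keys are the usual table-identity settings a Hudi write needs.

```python
def hudi_dedup_options(table_name, record_key, precombine_field, partition_field):
    """Build Hudi write options for insert-only, deduplicated writes.

    Hypothetical helper for illustration; the argument values are
    placeholders for your own table and column names.
    """
    return {
        # Table identity and key fields (adjust to your schema).
        "hoodie.table.name": table_name,
        "hoodie.datasource.write.recordkey.field": record_key,
        "hoodie.datasource.write.precombine.field": precombine_field,
        "hoodie.datasource.write.partitionpath.field": partition_field,
        # Settings suggested in this thread:
        "hoodie.datasource.write.operation": "insert",
        "hoodie.datasource.write.insert.drop.duplicates": "true",
        "hoodie.cleaner.commits.retained": "1",
    }


def write_batch(df, base_path, options):
    """Append one Spark DataFrame batch to the Hudi table at base_path."""
    (
        df.write.format("org.apache.hudi")
        .options(**options)
        .mode("append")
        .save(base_path)
    )


# Example option set; "events", "event_id", "ts" and "dt" are made-up names.
opts = hudi_dedup_options("events", "event_id", "ts", "dt")
```

   A call such as write_batch(df, "s3://my-bucket/hudi/events", opts) would then write each batch, with the bucket path again being a placeholder.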
   





[GitHub] [incubator-hudi] jvaesteves commented on issue #1585: [SUPPORT] Delete Hudi commit history

Posted by GitBox <gi...@apache.org>.
jvaesteves commented on issue #1585:
URL: https://github.com/apache/incubator-hudi/issues/1585#issuecomment-625296439


   Thanks for the tip @bvaradar, it worked perfectly.

