Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2020/04/27 16:33:04 UTC

[GitHub] [incubator-hudi] nandini57 opened a new issue #1569: [SUPPORT]

nandini57 opened a new issue #1569:
URL: https://github.com/apache/incubator-hudi/issues/1569


   Is there a way (or an example) to build audit queries within a partition path in COW/MOR mode? Sure, there are the commit times, but due to inline compaction and cleanup (only the last commit retained), I don't get a view of the data as it was before a delete or upsert back in time.
   
   I am trying to do select * from hoodie_ro where _hoodie_commit_time in (<last 10 commits>) and get the view of the data at each commit.
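
   A minimal sketch (not part of the original report) of what that per-commit audit query could look like with Spark SQL in Java, assuming the table is registered as hoodie_ro, spark is a SparkSession, and the cleaner is configured to retain at least those 10 commits:
   
   // List the last 10 commit times present in the table, then show the rows
   // written by each of those commits.
   import org.apache.spark.sql.Dataset;
   import org.apache.spark.sql.Row;
   
   Dataset<Row> commits = spark.sql(
       "select distinct _hoodie_commit_time from hoodie_ro "
           + "order by _hoodie_commit_time desc limit 10");
   
   for (Row r : commits.collectAsList()) {
       String commitTime = r.getString(0);
       // Rows as they were written at this particular commit
       spark.sql("select * from hoodie_ro where _hoodie_commit_time = '" + commitTime + "'").show();
   }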


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-hudi] nandini57 commented on issue #1569: [SUPPORT] Audit Feature In A PartitionPath

Posted by GitBox <gi...@apache.org>.
nandini57 commented on issue #1569:
URL: https://github.com/apache/incubator-hudi/issues/1569#issuecomment-620786548


   Great, thanks Vinoth. Is a Murmur hash of my business keys a good choice then?



[GitHub] [incubator-hudi] vinothchandar commented on issue #1569: [SUPPORT] Audit Feature In A PartitionPath

Posted by GitBox <gi...@apache.org>.
vinothchandar commented on issue #1569:
URL: https://github.com/apache/incubator-hudi/issues/1569#issuecomment-620776648


   @nandini57 It just has to be ordered; increasing/decreasing does not matter, and it can be non-contiguous.



[GitHub] [incubator-hudi] nandini57 edited a comment on issue #1569: [SUPPORT] Audit Feature In A PartitionPath

Posted by GitBox <gi...@apache.org>.
nandini57 edited a comment on issue #1569:
URL: https://github.com/apache/incubator-hudi/issues/1569#issuecomment-620749380


   Thanks Balaji. Yesterday I changed the parameter to retain 40 commits and changed the _hoodie_record_key to include my business batch id column along with one of the other columns. Instead of EmptyRecordPayload, I am using a custom payload which just adds the records in each commit instead of removing them from disk. The business batch id increments with every ingestion, and I can audit based on commit time to get a view of the data at a particular point in the past.
   spark.sql("select * from hoodie_ro where cast(_hoodie_commit_time as long) <= " + Long.valueOf(commitTime)).show();
   
   Is it a good idea to conceive _hoodie_record_key as 123_1, 123_4, ... or does it have to be monotonically increasing to help indexing?



[GitHub] [incubator-hudi] bvaradar commented on issue #1569: [SUPPORT] Audit Feature In A PartitionPath

Posted by GitBox <gi...@apache.org>.
bvaradar commented on issue #1569:
URL: https://github.com/apache/incubator-hudi/issues/1569#issuecomment-620319666


   With cleaning, you can set up the cleaning parameters so that enough versions are retained (see https://hudi.apache.org/docs/configurations.html#retainCommits). You can also align the compaction runs with your audit schedule by keeping "hoodie.compact.inline.max.delta.commits" the same as the number of delta commits between audits.
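
   A minimal sketch (assumptions, not from the thread) of how those knobs could be set on a Hudi write from Spark in Java; the table name, field names, df and basePath below are placeholders for illustration:
   
   import org.apache.spark.sql.SaveMode;
   
   // Retain enough commits for the audit window and compact in step with it.
   df.write().format("org.apache.hudi")
       .option("hoodie.table.name", "hoodie_audit_tbl")
       .option("hoodie.datasource.write.recordkey.field", "business_key")
       .option("hoodie.datasource.write.partitionpath.field", "partition_path")
       .option("hoodie.datasource.write.precombine.field", "batch_id")
       .option("hoodie.cleaner.commits.retained", "40")          // keep 40 commits of history
       .option("hoodie.compact.inline", "true")
       .option("hoodie.compact.inline.max.delta.commits", "40")  // compact once per audit cycle
       .mode(SaveMode.Append)
       .save(basePath);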



[GitHub] [incubator-hudi] bvaradar commented on issue #1569: [SUPPORT] Audit Feature In A PartitionPath

Posted by GitBox <gi...@apache.org>.
bvaradar commented on issue #1569:
URL: https://github.com/apache/incubator-hudi/issues/1569#issuecomment-620806109


   @nandini57, you can prefix with a timestamp, e.g. "<batch_timestamp>_<hash_of_business_key>", to get the ordering benefits.
   From your description, it looks like you essentially want the table to be a log of all record changes, and you are simply inserting new records with no updates possible. Right? In that case, you can simply use the bulk-insert/insert APIs, which avoid record-key index lookups in the first place.
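
   An illustrative sketch (assumed column names batch_ts and business_key, placeholder df/basePath) of building such a key with Spark's Murmur3-based hash() and writing with bulk_insert so no record-key index lookup is needed:
   
   import static org.apache.spark.sql.functions.*;
   import org.apache.spark.sql.Dataset;
   import org.apache.spark.sql.Row;
   import org.apache.spark.sql.SaveMode;
   
   // Key layout: <batch_timestamp>_<murmur_hash_of_business_key>
   Dataset<Row> keyed = df.withColumn("record_key",
       concat(col("batch_ts"), lit("_"), hash(col("business_key")).cast("string")));
   
   keyed.write().format("org.apache.hudi")
       .option("hoodie.table.name", "hoodie_audit_tbl")
       .option("hoodie.datasource.write.recordkey.field", "record_key")
       .option("hoodie.datasource.write.partitionpath.field", "partition_path")
       .option("hoodie.datasource.write.precombine.field", "batch_ts")
       .option("hoodie.datasource.write.operation", "bulk_insert") // or "insert"
       .mode(SaveMode.Append)
       .save(basePath);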



[GitHub] [incubator-hudi] nandini57 commented on issue #1569: [SUPPORT] Audit Feature In A PartitionPath

Posted by GitBox <gi...@apache.org>.
nandini57 commented on issue #1569:
URL: https://github.com/apache/incubator-hudi/issues/1569#issuecomment-620824966


   Yes, so far the requirement is to keep all record changes. In the future we may need to upsert as well. Thanks guys for the help!



[GitHub] [incubator-hudi] bvaradar commented on issue #1569: [SUPPORT] Audit Feature In A PartitionPath

Posted by GitBox <gi...@apache.org>.
bvaradar commented on issue #1569:
URL: https://github.com/apache/incubator-hudi/issues/1569#issuecomment-621600324


   Great. Thanks for using Hudi. 

