Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2021/10/06 09:28:04 UTC

[GitHub] [hudi] govorunov opened a new issue #3756: [SUPPORT] Can we use Hudi to build Temporal Datastore?

govorunov opened a new issue #3756:
URL: https://github.com/apache/hudi/issues/3756


   Hi,
   
   I have read all the documentation and the FAQ and got the feeling that Hudi is (almost) the right tool for what I'm trying to build, but I'm still unable to design the right solution:
   
   We need to build a temporal representation of data stored in some database, i.e. a snapshot of a database table that also stores the history of all changes to that table and provides the means to query the table state at different points in time. Hudi answers almost all of these questions:
   
-  we can do 'point in time' queries using option("as.of.instant", ...)
-  we can do incremental queries and fetch changes for a certain 'time span' only (see the sketch below)
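   
   For reference, this is roughly how I expect to run those two kinds of queries from Spark (a minimal sketch; the table path and instant values are placeholders):
   
   ```scala
   // Point-in-time ("time travel") read, driven by Hudi's commit time
   val asOf = spark.read.format("hudi")
     .option("as.of.instant", "2021-10-01 00:00:00.000")
     .load("s3://my-bucket/hudi/my_table")            // placeholder path

   // Incremental read: only records committed within the given instant range
   val changes = spark.read.format("hudi")
     .option("hoodie.datasource.query.type", "incremental")
     .option("hoodie.datasource.read.begin.instanttime", "20211001000000")
     .option("hoodie.datasource.read.end.instanttime", "20211006000000")
     .load("s3://my-bucket/hudi/my_table")
   ```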
   
   However, it seems like this mechanism is based on the '_hoodie_commit_time' column of the table, which represents the moment in time when the data was written into the Hudi table. But in our case, not all changes are happening now - there are older versions of the database (backups) that we need to insert into the datastore at the proper point in time - months or years in the past - and be able to query those versions with 'point in time' queries, as well as see their data in the current snapshot. The temporal component in this case is not 'now', but rather part of the payload itself (a DataFrame column or even an option value). Is there a way of bulk-inserting records into a Hudi table at some 'point in time' other than 'now'? Ideally, while real-time changes are also ingested with proper timestamps?
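   
   To make the question concrete, the kind of bulk write I have in mind looks roughly like this (just a sketch; the table, columns and path are invented, and as far as I understand Hudi would still stamp _hoodie_commit_time with the write time rather than with event_ts):
   
   ```scala
   import org.apache.spark.sql.functions.{col, to_date}

   // The event timestamp travels as part of the payload; the open question is
   // whether the 'point in time' Hudi tracks can be driven by it instead of 'now'.
   backupDf
     .withColumn("event_date", to_date(col("event_ts")))
     .write.format("hudi")
     .option("hoodie.table.name", "orders_history")
     .option("hoodie.datasource.write.operation", "bulk_insert")
     .option("hoodie.datasource.write.recordkey.field", "order_id")
     .option("hoodie.datasource.write.precombine.field", "event_ts")
     .option("hoodie.datasource.write.partitionpath.field", "event_date")
     .mode("append")
     .save("s3://my-bucket/hudi/orders_history")
   ```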
   
   Thank you!
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] govorunov commented on issue #3756: [SUPPORT] Can we use Hudi to build Temporal Datastore?

Posted by GitBox <gi...@apache.org>.
govorunov commented on issue #3756:
URL: https://github.com/apache/hudi/issues/3756#issuecomment-940832604


   Sorry, I'm quite new to big data, so I may ask some stupid questions. Let's forget about temporal storage, database backups etc. for a minute. Can we use Hudi to store all database events without significant write amplification and without making assumptions about the nature of the data itself?
   
   I mean, imagine a raw stream of data change events from CDC or something else - a transaction log - a long append-only table. Can we have this with Hudi efficiently? Because what I've seen so far while experimenting is that Hudi creates a complete copy of an entire partition (gigabytes to terabytes, depending on how we partition) every time a few new rows are added or modified. It does not matter whether it is COW or MOR - the former creates the copy instantly, while the latter does it every few minutes on the compaction step. What I need is the ability to append records to the table indefinitely, without write amplification, partition the data by creation date, and schedule compaction only after all the data for the current day has been ingested. Historical querying is not needed here, as the data is append-only. Can we schedule MOR compactions to run once a day instead of every few minutes, as it is now, to reduce write amplification? Once we have tables storing the complete transaction log, we may think about derived tables.
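   
   To put it in configuration terms, what I'm hoping is possible looks roughly like this (a sketch only; the option keys are taken from the Hudi write configs, everything else is made up):
   
   ```scala
   // Append-only MOR table; compaction does not run inline on every commit,
   // so it can instead be scheduled once a day by a separate job
   // (e.g. `compaction schedule` / `compaction run` from hudi-cli).
   cdcEventsDf
     .write.format("hudi")
     .option("hoodie.table.name", "cdc_events_log")
     .option("hoodie.datasource.write.table.type", "MERGE_ON_READ")
     .option("hoodie.datasource.write.operation", "insert")
     .option("hoodie.datasource.write.recordkey.field", "event_id")
     .option("hoodie.datasource.write.precombine.field", "event_ts")
     .option("hoodie.datasource.write.partitionpath.field", "event_date")
     .option("hoodie.compact.inline", "false")   // do not compact on every delta commit
     .mode("append")
     .save("s3://my-bucket/hudi/cdc_events_log")
   ```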
   
   Thanks!





[GitHub] [hudi] govorunov closed issue #3756: [SUPPORT] Can we use Hudi to build Temporal Datastore?

Posted by GitBox <gi...@apache.org>.
govorunov closed issue #3756:
URL: https://github.com/apache/hudi/issues/3756


   





[GitHub] [hudi] refset commented on issue #3756: [SUPPORT] Can we use Hudi to build Temporal Datastore?

Posted by GitBox <gi...@apache.org>.
refset commented on issue #3756:
URL: https://github.com/apache/hudi/issues/3756#issuecomment-940232664


   Hi @govorunov, I was just taking a look at Hudi myself, so I'm certainly no expert, but I think you are looking for "bitemporal" as-of queries, where `commit time` (aka `transaction time`) and `valid time` are indexed and queried independently - e.g. see [this documentation page from XTDB](https://xtdb.com/articles/bitemporality.html). Although - full disclosure - I work on XTDB and can say that XT's current architecture is not designed to handle PBs of data efficiently/cheaply without significant userspace sharding. However, we have a new storage & query architecture in the works that could get us a lot closer to that PB level... but perhaps still not at the same scale at which Hudi already operates.
   
   From my very brief look at Hudi's design, I would be surprised if there isn't some way to index your own valid-time construct using derived tables and query it efficiently (if inelegantly) via Spark / Presto etc. - something along the lines of the sketch below, perhaps.
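   
   This is only a rough idea and I haven't actually tried it against Hudi; the column names are placeholders, and `valid_from` would be a valid-time column you maintain yourself:
   
   ```scala
   import org.apache.spark.sql.expressions.Window
   import org.apache.spark.sql.functions.{col, row_number}

   // Bitemporal "as of": Hudi's commit-time travel covers the transaction-time
   // dimension; a user-managed valid_from column covers the valid-time one.
   val txInstant    = "2021-10-01 00:00:00.000"   // transaction-time instant
   val validInstant = "2020-06-30"                // valid-time instant

   val asOfTx = spark.read.format("hudi")
     .option("as.of.instant", txInstant)
     .load("s3://my-bucket/hudi/my_table")        // placeholder path

   val latestPerKey = Window.partitionBy("record_key").orderBy(col("valid_from").desc)

   val asOfBoth = asOfTx
     .filter(col("valid_from") <= validInstant)   // keep only facts valid at that time
     .withColumn("rn", row_number().over(latestPerKey))
     .filter(col("rn") === 1)                     // latest version of each key in valid time
     .drop("rn")
   ```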





[GitHub] [hudi] govorunov commented on issue #3756: [SUPPORT] Can we use Hudi to build Temporal Datastore?

Posted by GitBox <gi...@apache.org>.
govorunov commented on issue #3756:
URL: https://github.com/apache/hudi/issues/3756#issuecomment-937267572


   I think I need to elaborate a little further:
   
   1. If we write all database backups into the Hudi table in their historical order, then take the live database snapshot, and only then start consuming new changes, all the events will end up in the Hudi table in their proper chronological order. This is still useless, though, as all the dates will be off - events will appear at the time they were written into the Hudi table, not at the time of the event itself.
   2. If we partition the Hudi table by the date of the event, then we can query time ranges properly, but then we are simply getting all the events. To do a 'point in time' query we'd have to scan all historical data and then combine duplicate events by their 'event time' (see the sketch right after this list). That is possible, although slow, and then what is the reason for using Hudi at all, when we can do the same with bare Parquet?
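   
   Concretely, the reconstruction I mean in point 2 would look something like this (names invented) - and the scan of every partition up to the target date is exactly what I'd like to avoid:
   
   ```scala
   import org.apache.spark.sql.expressions.Window
   import org.apache.spark.sql.functions.{col, row_number}

   // 'Point in time' over an event-date-partitioned table: read every partition
   // up to the target date, then keep only the latest event per entity.
   val pointInTime = "2019-12-31"

   val stateAsOf = spark.read.format("hudi")
     .load("s3://my-bucket/hudi/events")                  // placeholder path
     .filter(col("event_date") <= pointInTime)            // partition pruning helps, but
     .withColumn("rn", row_number().over(                 // we still touch all prior history
       Window.partitionBy("entity_id").orderBy(col("event_ts").desc)))
     .filter(col("rn") === 1)
     .drop("rn")
   ```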
   
   If I am asking about a use case Hudi was not intended to handle, could someone maybe suggest the right tool for me? I've been looking into temporal databases for quite some time already and still cannot find a solution capable of organizing and querying data in historical order while storing large volumes of data (petabytes of it).
   
   Thanks!

