Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2020/05/28 07:02:12 UTC

[GitHub] [hudi] abhibhat98 opened a new issue #1675: [SUPPORT] Get all changed records from an incremental query rather than the latest one

abhibhat98 opened a new issue #1675:
URL: https://github.com/apache/hudi/issues/1675


   **Describe the problem you faced**
   When I run an incremental query, I only get the latest event per key. I want to get all the events as a log.
   e.g.
   at time T1, key value is K1-V1
   at time T2, updated key value is K1-V2
   at time T3, updated key value is K1-V3
   
   When I run an incremental query from time 0 (start) to T3, I only get K1-V3. Is there a way to set maxCommits so that I can stream all of these events back from a certain time? (I see that HiveIncrementalPuller has such an option: setting fromCommitTime=0 and maxCommits=-1 will fetch the entire source table.)
   
   As an example, if I ask for incremental updates after T1+1, I'd get:
   K1-V2
   K1-V3
   
   I am able to get it using spark.read.parquet ... Is there a way I can get it from Hudi?
   
   The environment I am on is EMR 6.0.0 on AWS with Hudi.
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [hudi] vinothchandar closed issue #1675: [SUPPORT] Get all changed records from an incremental query rather than the latest one

Posted by GitBox <gi...@apache.org>.

vinothchandar closed issue #1675:
URL: https://github.com/apache/hudi/issues/1675


   



[GitHub] [hudi] vinothchandar commented on issue #1675: [SUPPORT] Get all changed records from an incremental query rather than the latest one

Posted by GitBox <gi...@apache.org>.

vinothchandar commented on issue #1675:
URL: https://github.com/apache/hudi/issues/1675#issuecomment-637258553


   Are you able to use `listCommitsSince` and use the commit times it returns as begin and end, pairwise? I.e., if you get c1, c2, c3, then run incremental queries for (c1, c2) and (c2, c3)?
   
   https://github.com/apache/hudi/blob/742c2040995167871db976fc0eb280347401ffc4/hudi-spark/src/main/java/org/apache/hudi/HoodieDataSourceHelpers.java#L49
   
   Btw, if you make this work, please consider writing a short blog post on the site. It will also help others.
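
A minimal sketch of the pairwise idea, assuming a Spark application with the Hudi Spark bundle on the classpath and the `listCommitsSince(fs, basePath, instantTimestamp)` helper from the file linked above. The object name `CommitWindows`, the `pairwise` helper, and the use of `"000"` as a "before every commit" start instant are illustrative assumptions, not something confirmed in this thread. An incremental query returns records committed after its begin instant, so the start instant is prepended to give the first commit its own window.

```scala
import org.apache.hadoop.fs.Path
import org.apache.hudi.HoodieDataSourceHelpers
import org.apache.spark.sql.SparkSession
import scala.collection.JavaConverters._

object CommitWindows {
  /** Turn a start instant plus an ordered commit list into (begin, end) pairs. */
  def pairwise(start: String, commits: Seq[String]): Seq[(String, String)] =
    (start +: commits).zip(commits)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("commit-windows").getOrCreate()
    val basePath = args(0) // hypothetical argument: base path of the Hudi table
    val fs = new Path(basePath).getFileSystem(spark.sparkContext.hadoopConfiguration)

    // Commits strictly after the start instant, converted from java.util.List.
    val start = "000" // assumption: sorts before every real commit time
    val commits = HoodieDataSourceHelpers.listCommitsSince(fs, basePath, start).asScala.toList

    // Each pair can be used as the begin/end of one incremental query.
    pairwise(start, commits).foreach { case (begin, end) => println(s"$begin -> $end") }
    spark.stop()
  }
}
```

For commits c1, c2, c3 this yields the windows (start, c1), (c1, c2) and (c2, c3).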



[GitHub] [hudi] bhasudha commented on issue #1675: [SUPPORT] Get all changed records from an incremental query rather than the latest one

Posted by GitBox <gi...@apache.org>.

bhasudha commented on issue #1675:
URL: https://github.com/apache/hudi/issues/1675#issuecomment-647912047


   @abhibhat98 were you able to make progress on this?




[GitHub] [hudi] vinothchandar edited a comment on issue #1675: [SUPPORT] Get all changed records from an incremental query rather than the latest one

Posted by GitBox <gi...@apache.org>.

vinothchandar edited a comment on issue #1675:
URL: https://github.com/apache/hudi/issues/1675#issuecomment-635634106


   @abhibhat98 Thanks for the thought-provoking questions. Table history is something we already support via the CLI tool.
   
   >> Hudi has the history of everything, it can look up by times, why can't it look up by the key? Or, is it something by design that Hudi doesn't intend to do.
   
   Typically, key-value stores (HBase, Bigtable) are able to do this because they have an effective index to fetch keys out. Hudi is slowly getting there (see RFC-08/RFC-15 if interested), and when we do have such means, we can start providing such lookups. Today, if you don't care about performance, you can just do what you did above with a where clause such as `_hoodie_record_key in (<list_of_keys_interested_in>)`.
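
As a rough illustration of that where clause, the sketch below filters an already-loaded DataFrame of record versions down to a set of keys and orders them by commit time. `KeyHistory`, `historyFor` and the `changes` parameter are hypothetical names; how `changes` is obtained is left open here. The only Hudi-specific assumption is that the frame carries the `_hoodie_record_key` and `_hoodie_commit_time` metadata columns.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

object KeyHistory {
  /** Keep only the requested keys and order each key's versions by commit time. */
  def historyFor(changes: DataFrame, keys: Seq[String]): DataFrame =
    changes
      .filter(col("_hoodie_record_key").isin(keys: _*))
      .orderBy(col("_hoodie_record_key"), col("_hoodie_commit_time"))
}
```

Without an index this is a full scan plus filter, which is why it only suits cases where performance does not matter.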
   
   



[GitHub] [hudi] bhasudha commented on issue #1675: [SUPPORT] Get all changed records from an incremental query rather than the latest one

Posted by GitBox <gi...@apache.org>.

bhasudha commented on issue #1675:
URL: https://github.com/apache/hudi/issues/1675#issuecomment-635598789


   @abhibhat98 Thanks for reaching out. In short, there is currently no direct API in Hudi to support that use case. It usually fits a K-V storage system that can return versions of a record when queried. Hudi provides the most recent version of a record within the time bounds specified in the query (if incremental), or the latest value if no time bound is specified.
   
   However, you can work around this by querying the individual commits covered by the original incremental query and unioning the results on the application side. In your example above, if the original query specified 0 - T3 as time bounds, you could get the list of all commits that happened in that time and split the query by those individual commits. In this case that would be three queries: 0 - T1, T1 - T2 and T2 - T3. These will return V1, V2 and V3 for K1, respectively. I also created a JIRA, https://jira.apache.org/jira/browse/HUDI-976, to provide a utility tool that can do this. Would you be interested in taking that up?
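
The workaround could look roughly like the sketch below, which runs one incremental read per window and unions the results. It reuses the `DataSourceReadOptions` constants that appear elsewhere in this thread, so it assumes a Hudi release that still exposes them. `ChangeLog`, `readWindow` and `allChanges` are hypothetical names, and the sketch assumes the file versions for the older commits are still available to read.

```scala
import org.apache.hudi.DataSourceReadOptions
import org.apache.spark.sql.{DataFrame, SparkSession}

object ChangeLog {
  /** Incremental read of one window: commits after `begin`, up to and including `end`. */
  def readWindow(spark: SparkSession, basePath: String, begin: String, end: String): DataFrame =
    spark.read.format("org.apache.hudi")
      .option(DataSourceReadOptions.VIEW_TYPE_OPT_KEY, DataSourceReadOptions.VIEW_TYPE_INCREMENTAL_OPT_VAL)
      .option(DataSourceReadOptions.BEGIN_INSTANTTIME_OPT_KEY, begin)
      .option(DataSourceReadOptions.END_INSTANTTIME_OPT_KEY, end)
      .load(basePath)

  /** Union of the per-window reads; None when no windows are given. */
  def allChanges(spark: SparkSession, basePath: String, windows: Seq[(String, String)]): Option[DataFrame] =
    windows
      .map { case (begin, end) => readWindow(spark, basePath, begin, end) }
      .reduceOption(_ union _)
}
```

With windows such as `Seq(("0", t1), (t1, t2), (t2, t3))`, the result would contain K1-V1, K1-V2 and K1-V3 as separate rows. The number of Spark reads grows with the number of commits, so this fits one-off jobs better than frequent queries.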
   



[GitHub] [hudi] vinothchandar commented on issue #1675: [SUPPORT] Get all changed records from an incremental query rather than the latest one

Posted by GitBox <gi...@apache.org>.

vinothchandar commented on issue #1675:
URL: https://github.com/apache/hudi/issues/1675#issuecomment-653256745


   Filed https://issues.apache.org/jira/browse/HUDI-1066 for the future




[GitHub] [hudi] abhibhat98 edited a comment on issue #1675: [SUPPORT] Get all changed records from an incremental query rather than the latest one

Posted by GitBox <gi...@apache.org>.

abhibhat98 edited a comment on issue #1675:
URL: https://github.com/apache/hudi/issues/1675#issuecomment-635675628


   Thanks @vinothchandar for a detailed peek into the design. I did this:
   
   ` spark.sql("select * from test_123 where _hoodie_record_key = 'L1'").show`
   
   However, I only got the latest commit. But when I do this:
   
   `
   spark.read.format("org.apache.hudi").
     option(DataSourceReadOptions.VIEW_TYPE_OPT_KEY, DataSourceReadOptions.VIEW_TYPE_INCREMENTAL_OPT_VAL).
     option(BEGIN_INSTANTTIME_OPT_KEY, beginTime).
     option(END_INSTANTTIME_OPT_KEY, endTime).
     load("s3://dip-abhatia-test/hudi_test1/data")
   `
   
   I get the earlier records. But I need a begin and/or end time. If I don't care about performance (as it's a one-off job that fixes things or gets all the data), is there a way to get it? I see that your CLI has this (fromCommitTime=0 and maxCommits=-1), as you mentioned, but is it possible via Spark?



[GitHub] [hudi] abhibhat98 edited a comment on issue #1675: [SUPPORT] Get all changed records from an incremental query rather than the latest one

Posted by GitBox <gi...@apache.org>.

abhibhat98 edited a comment on issue #1675:
URL: https://github.com/apache/hudi/issues/1675#issuecomment-635613375

Understood! Thanks Bhavani! I'd love to take this up. I'll research and get back to you on how to proceed.
A follow-up question on the above: how would a consumer know the timings of the changed records? It is asking a simple question: what changed for key K over a period of time (to recompute or even roll back something by looking at the history)?

Your explanation is absolutely perfect, and I tried it too with the different times (T1/T2/T3); it works seamlessly. So, should I consider it a feature that can be added (looking at the history of a record), as it's completely doable? Hudi has the history of everything, it can look up by times, why can't it look up by the key? Or, is it something by design that Hudi doesn't intend to do.

Or even table history would be good. As an example, Databricks Delta has something called 'Describe history', which gives you the full history of the table and what changed over time; you can then roll back/time travel from those specific commit ids.

Thanks again, just trying to understand what questions are right in terms of Hudi design.

