Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2021/06/09 01:04:30 UTC

[GitHub] [hudi] Vikas-kum opened a new issue #3054: [QUESTION] Point query at hudi tables

Vikas-kum opened a new issue #3054:
URL: https://github.com/apache/hudi/issues/3054


   
   Does Hudi support point-in-time queries? I want to know if I can query the value of a specific row from a table at a certain time instant.
   
   For example:
   select * from ABC where key='p1' and event_time='t1'
   
   Also, if I have many queries like this, is there an efficient, recommended way to achieve them?
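   To make the intent concrete, here is an illustrative (non-Hudi) Python sketch of the lookup semantics being asked about: for each key, return the latest value committed at or before a given instant. The data and names are hypothetical, purely to pin down the semantics.

```
# Illustrative only: point-in-time lookup semantics over a tiny
# multi-version table (hypothetical data, not Hudi APIs).
from bisect import bisect_right

# key -> list of (commit_time, value), sorted by commit_time
versions = {
    "p1": [("t1", "v1"), ("t3", "v2")],
    "p2": [("t2", "x1")],
}

def value_as_of(key, instant):
    """Return the latest value of `key` committed at or before `instant`."""
    history = versions.get(key, [])
    times = [t for t, _ in history]
    idx = bisect_right(times, instant)  # number of commits at or before `instant`
    return history[idx - 1][1] if idx > 0 else None
```

   So `value_as_of("p1", "t2")` would return `"v1"`: the version committed at `t1` is still the current one at `t2`.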
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] n3nash commented on issue #3054: [SUPPORT] Point query at hudi tables

Posted by GitBox <gi...@apache.org>.
n3nash commented on issue #3054:
URL: https://github.com/apache/hudi/issues/3054#issuecomment-865424978


   @FelixKJose You can do time travel in the following way: 
   
   **Using Spark**
   
   ```
   Dataset<Row> hudiIncQueryDF = spark.read()
        .format("org.apache.hudi")
        .option(DataSourceReadOptions.QUERY_TYPE_OPT_KEY(), DataSourceReadOptions.QUERY_TYPE_INCREMENTAL_OPT_VAL())
        .option(DataSourceReadOptions.BEGIN_INSTANTTIME_OPT_KEY(), <beginInstantTime>)
        .option(DataSourceReadOptions.END_INSTANTTIME_OPT_KEY(), <endInstantTime>)
        .option(DataSourceReadOptions.INCR_PATH_GLOB_OPT_KEY(), "/year=2020/month=*/day=*") // Optional, use glob pattern if querying certain partitions
        .load(tablePath); // For incremental query, pass in the root/base path of table
        
   hudiIncQueryDF.createOrReplaceTempView("hudi_trips_incremental");
   spark.sql("select `_hoodie_commit_time`, fare, begin_lon, begin_lat, ts from hudi_trips_incremental where fare > 20.0").show();
   ```
   
   **Using Hive**
   
   ```
   hive_shell> set hoodie.source_table_name.consume.mode=incremental
   hive_shell> set hoodie.source_table_name.consume.start.timestamp=<beginInstantTime>
   -- there is no end-timestamp setting; convert <endInstantTime> into a number of commits to read, e.g. 5
   hive_shell> set hoodie.source_table_name.consume.max.commits=5
   hive_shell> select `_hoodie_commit_time`, fare, begin_lon, begin_lat, ts from source_table_name where fare > 20.0
   ```
   
   Ideally, we should add a `hoodie.table_name.consume.end.timestamp` to support the same behavior in Hive. 
   
   @fengjian428 For the incremental pull using Spark, the INCR_PATH_GLOB_OPT_KEY is only used to incrementally pull data based on commit ranges; it works at the file level. If you want to query data within a commit range based on other columns and then use that as an "incremental pull" - yes, that's where the data skipping index will be helpful.
   





[GitHub] [hudi] nsivabalan commented on issue #3054: [SUPPORT] Point query at hudi tables

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on issue #3054:
URL: https://github.com/apache/hudi/issues/3054#issuecomment-905617746


   The latest [docs](https://hudi.apache.org/docs/next/quick-start-guide) already have examples of time travel queries. Search for "Time Travel Query" on the page.
   Closing this out. Feel free to reopen or create a new issue if you still have requirements to be met. Thanks!
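   For reference, the "Time Travel Query" example in those docs takes roughly this shape with the DataFrame API (a sketch, not verbatim from the docs; `as.of.instant` is the read option documented there, and `tablePath` is a placeholder):

   ```
   Dataset<Row> asOfDF = spark.read()
        .format("org.apache.hudi")
        // snapshot of the table as of this commit instant (yyyyMMddHHmmss);
        // per the docs, a date string like "2021-07-28" is also accepted
        .option("as.of.instant", "20210728141108")
        .load(tablePath);
   ```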





[GitHub] [hudi] n3nash commented on issue #3054: [SUPPORT] Point query at hudi tables

Posted by GitBox <gi...@apache.org>.
n3nash commented on issue #3054:
URL: https://github.com/apache/hudi/issues/3054#issuecomment-859303754


   @Vikas-kum We are working on column-level and record-level indexes which will make this kind of query really fast; read here -> https://cwiki.apache.org/confluence/display/HUDI/RFC-27+Data+skipping+index+to+improve+query+performance
   
   In the meantime, to answer your question: yes, Hudi does support point-in-time queries in Presto and Spark, but not in Hive. Let me know if you need any other information.






[GitHub] [hudi] FelixKJose commented on issue #3054: [SUPPORT] Point query at hudi tables

Posted by GitBox <gi...@apache.org>.
FelixKJose commented on issue #3054:
URL: https://github.com/apache/hudi/issues/3054#issuecomment-860685598


   @n3nash Could you please give more details on how it is supported in Presto and Spark? I mean, do I have to provide some specific configurations, and is it supported for both MOR and COW table types? The reason I ask is that RFC-07 (https://cwiki.apache.org/confluence/display/HUDI/RFC+-+07+%3A+Point+in+time+Time-Travel+queries+on+Hudi+table) seems inactive, and I haven't seen any documentation regarding point-in-time query support in Hudi.





[GitHub] [hudi] codope commented on issue #3054: [SUPPORT] Point query at hudi tables

Posted by GitBox <gi...@apache.org>.
codope commented on issue #3054:
URL: https://github.com/apache/hudi/issues/3054#issuecomment-905439795


   @Vikas-kum Can we close this issue? Are there any pending questions based on the above comments?
   FYI, point queries using Spark SQL should be doable in the upcoming release. See #3360





[GitHub] [hudi] nsivabalan closed issue #3054: [SUPPORT] Point query at hudi tables

Posted by GitBox <gi...@apache.org>.
nsivabalan closed issue #3054:
URL: https://github.com/apache/hudi/issues/3054


   





[GitHub] [hudi] fengjian428 commented on issue #3054: [SUPPORT] Point query at hudi tables

Posted by GitBox <gi...@apache.org>.
fengjian428 commented on issue #3054:
URL: https://github.com/apache/hudi/issues/3054#issuecomment-861154782


   @n3nash Can this new data skipping index improve incremental and point query performance? It seems that when using an incremental/point query, one needs to use INCR_PATH_GLOB_OPT_KEY to set a glob pattern that filters on path; otherwise the query will pull all the data in the commit time range.

