You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2020/10/26 15:16:53 UTC

[GitHub] [hudi] n3nash commented on issue #2201: hoodie_commit_timestamp time travel queries - can we implement operational analytics

n3nash commented on issue #2201:
URL: https://github.com/apache/hudi/issues/2201#issuecomment-716615631


   @getniz You can perform time-travel queries using _hoodie_commit_time and achieve your use case of Q1 inventory vs Q2 inventory. It would look something like this -> 
   1) ingest_data_for_Q1 with some commit_time
   2) ingest_data_for_Q2 with some new commit_time
   3) Query against registered tables with begin & end commit instants to limit the data against the commit time
   
   Using Spark
   
   ```
   Dataset<Row> hudiIncQueryDF = spark.read()
        .format("org.apache.hudi")
        .option(DataSourceReadOptions.QUERY_TYPE_OPT_KEY(), DataSourceReadOptions.QUERY_TYPE_INCREMENTAL_OPT_VAL())
        .option(DataSourceReadOptions.BEGIN_INSTANTTIME_OPT_KEY(), <beginInstantTime>)
        .option(DataSourceReadOptions.END_INSTANTTIME_OPT_KEY(), <endInstantTime>)
        .option(DataSourceReadOptions.INCR_PATH_GLOB_OPT_KEY(), "/year=2020/month=*/day=*") // Optional, use glob pattern if querying certain partitions
        .load(tablePath); // For incremental query, pass in the root/base path of table
        
   hudiIncQueryDF.createOrReplaceTempView("hudi_trips_incremental")
   spark.sql("select `_hoodie_commit_time`, fare, begin_lon, begin_lat, ts from  hudi_trips_incremental where fare > 20.0").show()
   ```
   Using Hive
   
   ```
   hive_shell> set hoodie.source_table_name.consume.mode=incremental
   hive_shell> set hoodie.table_name.consume.start.timestamp=<beginInstantTime>
   convert_endInstantTime_to_num_commits_to_read=5
   hive_shell> set hoodie.table_name.consume.max.commits=5
   hive_shell> select `_hoodie_commit_time`, fare, begin_lon, begin_lat, ts from  source_table_name where fare > 20.0
   ```
   
   Now, if you have deletes, currently HUDI will only apply those changes to the latest "view" of this data. So, it will look like this
   4) ingest_data_for_deletes with some new commit_time
   Now, using this new commit_time you can time travel to Q1 or Q2 or (Q1+Q2+deletes). 
   
   Is that what you're looking for ?
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org