Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2019/11/27 20:02:37 UTC

[GitHub] [incubator-hudi] pushpavanthar edited a comment on issue #969: [HUDI-251] JDBC incremental load to HUDI DeltaStreamer

URL: https://github.com/apache/incubator-hudi/pull/969#issuecomment-559230742
 
 
   I would like to suggest two additions to make this feature more generic:
   
   - [ ] We might need support for a combination of incrementing columns. Incrementing columns can be of the following types:
   1. Timestamp columns
   2. Auto-incrementing column
   3. Timestamp + auto-incrementing
   Instead of the code figuring out the incremental pull strategy, it would be better if the user provided it as config for each table.
   With timestamp incrementing columns, more than one column can contribute to the strategy. For example, when a row is created, only `created_at` is set and `updated_at` is null by default. When the same row is updated, `updated_at` is assigned a timestamp. In such cases it is wise to consider both columns when forming the query.
   
   - [ ] We need to sort rows by the incrementing columns mentioned above so that rows can be fetched in chunks (you can make use of `defaultFetchSize` in MySQL). I'm aware that sorting adds load on the database, but it helps track the last pulled timestamp or auto-incrementing id, and makes it possible to retry or resume from the last recorded point. This will be a lifesaver during failures.
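
   To illustrate the chunked, resumable pull described in the second point, here is a minimal sketch. It is not taken from the PR: it uses Python with an in-memory SQLite table as a stand-in for MySQL, integer timestamps for brevity, and hypothetical names (`fetch_chunk`, `pull_all`, a `customers` table). It also assumes the "Timestamp + auto-incrementing" strategy, using the id as a tie-breaker so that rows sharing a timestamp at a chunk boundary are not skipped on resume.

   ```python
   import sqlite3

   # Hypothetical expression: the effective "last touched" timestamp of a row.
   TS_EXPR = "COALESCE(updated_at, created_at)"


   def make_demo_db():
       """Build an in-memory table standing in for a MySQL source table."""
       conn = sqlite3.connect(":memory:")
       conn.execute(
           "CREATE TABLE customers ("
           "id INTEGER PRIMARY KEY, name TEXT, "
           "created_at INTEGER NOT NULL, updated_at INTEGER)"
       )
       conn.executemany(
           "INSERT INTO customers VALUES (?, ?, ?, ?)",
           [
               (1, "a", 100, None),  # never updated, effective ts = 100
               (2, "b", 100, 150),   # updated, effective ts = 150
               (3, "c", 150, None),  # same effective ts as row 2
               (4, "d", 200, None),
               (5, "e", 120, 300),
           ],
       )
       return conn


   def fetch_chunk(conn, checkpoint, upper_bound, chunk_size):
       """Fetch the next chunk strictly after checkpoint = (ts, id)."""
       last_ts, last_id = checkpoint
       sql = (
           f"SELECT id, name, {TS_EXPR} AS ts FROM customers "
           f"WHERE ({TS_EXPR} > ? OR ({TS_EXPR} = ? AND id > ?)) "
           f"AND {TS_EXPR} < ? "
           f"ORDER BY {TS_EXPR} ASC, id ASC LIMIT ?"
       )
       params = (last_ts, last_ts, last_id, upper_bound, chunk_size)
       return conn.execute(sql, params).fetchall()


   def pull_all(conn, checkpoint, upper_bound, chunk_size):
       """Pull chunks until exhausted; return the rows and the new checkpoint."""
       pulled = []
       while True:
           chunk = fetch_chunk(conn, checkpoint, upper_bound, chunk_size)
           if not chunk:
               break
           pulled.extend(chunk)
           # Record (ts, id) of the last row; a real job would persist this
           # so that a failed run can resume from here.
           checkpoint = (chunk[-1][2], chunk[-1][0])
       return pulled, checkpoint


   if __name__ == "__main__":
       conn = make_demo_db()
       rows, checkpoint = pull_all(conn, (0, 0), 250, 2)
       print([r[0] for r in rows], checkpoint)
   ```

   Because the checkpoint is advanced only after a chunk has been read, a failure mid-run loses at most the chunk in flight. How the checkpoint would be persisted alongside the DeltaStreamer commit is left open here.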
   
   A sample MySQL query for the incrementing timestamp columns `created_at` and `updated_at` might look like:
   `SELECT * FROM inventory.customers WHERE COALESCE(inventory.customers.updated_at, inventory.customers.created_at) > $last_recorded_time AND COALESCE(inventory.customers.updated_at, inventory.customers.created_at) < $current_time ORDER BY COALESCE(inventory.customers.updated_at, inventory.customers.created_at) ASC`
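
   As a rough sketch of how a per-table config could drive this query formation, the helper below builds the statement from a list of timestamp columns and an optional auto-incrementing column. The function name, its parameters, and the `:last_ts` / `:current_ts` / `:last_id` placeholders are all hypothetical and not part of the PR; table and column names are assumed to come from trusted configuration, since they are interpolated directly into the SQL.

   ```python
   def build_incremental_query(table, timestamp_columns=(), id_column=None):
       """Build an incremental-pull SELECT from a per-table strategy.

       timestamp_columns: columns in priority order, e.g.
           ("updated_at", "created_at"); the first non-null one wins.
       id_column: optional auto-incrementing column.
       """
       if not timestamp_columns and id_column is None:
           raise ValueError("at least one incrementing column is required")

       predicates = []
       order_by = []

       if timestamp_columns:
           cols = [f"{table}.{c}" for c in timestamp_columns]
           # A single column needs no COALESCE; several are combined.
           if len(cols) == 1:
               ts_expr = cols[0]
           else:
               ts_expr = "COALESCE(" + ", ".join(cols) + ")"
           predicates.append(f"{ts_expr} > :last_ts")
           predicates.append(f"{ts_expr} < :current_ts")
           order_by.append(f"{ts_expr} ASC")

       if id_column is not None:
           id_expr = f"{table}.{id_column}"
           if not timestamp_columns:
               # Pure auto-incrementing strategy: filter on the id itself.
               predicates.append(f"{id_expr} > :last_id")
           # With timestamps present the id only breaks ties in the ordering.
           order_by.append(f"{id_expr} ASC")

       return (
           f"SELECT * FROM {table} "
           f"WHERE {' AND '.join(predicates)} "
           f"ORDER BY {', '.join(order_by)}"
       )


   if __name__ == "__main__":
       print(build_incremental_query(
           "inventory.customers", ("updated_at", "created_at")))
   ```

   Called with `("updated_at", "created_at")`, this produces the same shape as the sample query above, with named placeholders in place of the literal bounds.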
