Posted to commits@hudi.apache.org by "Vinoth Govindarajan (Jira)" <ji...@apache.org> on 2022/04/25 04:03:00 UTC

[jira] [Reopened] (HUDI-1790) Add SqlSource for DeltaStreamer to support backfill use cases

     [ https://issues.apache.org/jira/browse/HUDI-1790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vinoth Govindarajan reopened HUDI-1790:
---------------------------------------

> Add SqlSource for DeltaStreamer to support backfill use cases
> -------------------------------------------------------------
>
>                 Key: HUDI-1790
>                 URL: https://issues.apache.org/jira/browse/HUDI-1790
>             Project: Apache Hudi
>          Issue Type: New Feature
>          Components: deltastreamer
>            Reporter: Vinoth Govindarajan
>            Assignee: Vinoth Govindarajan
>            Priority: Major
>              Labels: pull-request-available, sev:normal
>
> Delta Streamer works well for incremental workloads, but we also need to support backfills. Two example use cases: adding a new column and backfilling only that column for the last 6 months, or reprocessing a couple of older partitions after a bug in our transformation logic.
>  
> If SqlSource were available as one of the input sources to Delta Streamer, I could pass any custom Spark SQL query that selects specific partitions and backfill them.
>  
> A backfill should not advance the last processed commit checkpoint. Instead, the backfill commit has to copy over the checkpoint that was last processed before the backfill, so that incremental ingestion resumes from the same point afterwards.
>  
> cc [~nishith29]
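
For illustration only, the backfill flow described in the issue could be sketched roughly as follows. This is a toy model under stated assumptions, not Hudi code: the Commit class, the CHECKPOINT_KEY name, and the functions backfill_query, last_checkpoint, and commit_backfill are all hypothetical and exist only to show the idea of selecting specific partitions with a SQL query and carrying the previous checkpoint over to the backfill commit unchanged.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

# Hypothetical metadata key, chosen for this sketch only (not a Hudi constant).
CHECKPOINT_KEY = "checkpoint"


@dataclass
class Commit:
    """Toy stand-in for a commit on the table timeline."""
    instant_time: str
    metadata: Dict[str, str] = field(default_factory=dict)


def backfill_query(table: str, partition_col: str, partitions: List[str]) -> str:
    """Build a SQL query that selects only the partitions to backfill."""
    values = ", ".join(f"'{p}'" for p in partitions)
    return f"SELECT * FROM {table} WHERE {partition_col} IN ({values})"


def last_checkpoint(timeline: List[Commit]) -> Optional[str]:
    """Return the checkpoint of the most recent commit that recorded one."""
    for commit in reversed(timeline):
        if CHECKPOINT_KEY in commit.metadata:
            return commit.metadata[CHECKPOINT_KEY]
    return None


def commit_backfill(timeline: List[Commit], instant_time: str) -> Commit:
    """Append a backfill commit that carries the previous checkpoint forward.

    The checkpoint is copied as-is rather than advanced, so a later
    incremental run would resume from where it left off before the backfill.
    """
    metadata: Dict[str, str] = {}
    checkpoint = last_checkpoint(timeline)
    if checkpoint is not None:
        metadata[CHECKPOINT_KEY] = checkpoint
    commit = Commit(instant_time, metadata)
    timeline.append(commit)
    return commit
```

With a timeline whose latest incremental commit recorded some checkpoint value, calling commit_backfill appends a new commit holding that same value, and last_checkpoint returns it unchanged afterwards. The query string from backfill_query is the kind of statement that would be handed to the proposed SqlSource; note that the sketch does no escaping of partition values, so it is only suitable for trusted input.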



--
This message was sent by Atlassian Jira
(v8.20.7#820007)