Posted to commits@hudi.apache.org by "sivabalan narayanan (Jira)" <ji...@apache.org> on 2023/02/08 08:37:00 UTC

[jira] [Commented] (HUDI-5730) Support for SCD2 with hudi

    [ https://issues.apache.org/jira/browse/HUDI-5730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17685757#comment-17685757 ] 

sivabalan narayanan commented on HUDI-5730:
-------------------------------------------

With a deltastreamer transformer, here is one way to achieve this.

Pseudo code for the transformer: 

// The input batch can contain a mix of inserts and updates. We need to split it into 3 categories: insert-only records, records that are being updated and will become active, and existing records that are being overridden, which go into the inactive batch.

1. Read all records from the target Hudi table.

2. Find the inserts by doing a left-anti join of the input df with the target Hudi table. Let's call this insertDf.

3. Inner join inputDf with the target df to find the set of records that are being updated. Split this into two dataframes. a: the to-be-updated df (end time is infinite/max, active_index = 1). b: the to-be-overwritten df (end time = new start time - 1, active_index = 0).

4. Once we have all 3 dfs, union them.
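
The steps above can be sketched in plain Python, with lists of dicts standing in for Spark dataframes. This is only an illustration of the split, not the actual transformer: the column names "key", "value", "start_time" and "end_time", the integer timestamps, the MAX_END_TIME constant and the function name scd2_split are all assumptions made for the example. It also assumes that only currently active target rows take part in the joins.

```python
# Assumed stand-in for an "infinite/max" end time.
MAX_END_TIME = 2**63 - 1


def scd2_split(input_rows, target_rows):
    """Return the rows to write for one batch (hypothetical helper)."""
    # Step 1: read the target and index its active rows by business key.
    active = {r["key"]: r for r in target_rows if r["active_index"] == 1}

    # Step 2: left-anti join, i.e. input rows whose key is not in the target.
    inserts = [
        {**r, "end_time": MAX_END_TIME, "active_index": 1}
        for r in input_rows
        if r["key"] not in active
    ]

    # Step 3: inner join, i.e. input rows whose key already exists.
    matched = [r for r in input_rows if r["key"] in active]
    # 3a: the new versions become the active rows.
    updated = [
        {**r, "end_time": MAX_END_TIME, "active_index": 1} for r in matched
    ]
    # 3b: the old versions are closed out just before the new start time.
    overwritten = [
        {**active[r["key"]], "end_time": r["start_time"] - 1, "active_index": 0}
        for r in matched
    ]

    # Step 4: union the three sets.
    return inserts + updated + overwritten


# Example data: one existing active row, one update and one new key.
target = [
    {"key": "a", "value": "old", "start_time": 100,
     "end_time": MAX_END_TIME, "active_index": 1},
]
batch = [
    {"key": "a", "value": "new", "start_time": 200},
    {"key": "b", "value": "first", "start_time": 200},
]
rows = scd2_split(batch, target)
```

In a real transformer the same three sets would come from dataframe joins rather than Python loops, and the unit subtracted from the new start time would depend on the timestamp type in use.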

 

Notes:
 * The record key should be a complex (multi-field) key and should include "active_index" as one of the record key fields
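
As a rough illustration of the note above, the key-related write options might look like the following. The two option names and the ComplexKeyGenerator class are existing Hudi write configs; the business key field "uuid" is an assumed field name, so substitute whatever your schema uses.

```python
# Hypothetical key settings for an SCD2 table; "uuid" is an assumed field name.
scd2_key_options = {
    # Comma-separated list of record key fields, including active_index.
    "hoodie.datasource.write.recordkey.field": "uuid,active_index",
    # Key generator that supports multiple record key fields.
    "hoodie.datasource.write.keygenerator.class":
        "org.apache.hudi.keygen.ComplexKeyGenerator",
}
```

With deltastreamer, the same key/value pairs would typically go into the properties file passed to the job.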

 

I have a runbook which achieves SCD2. [https://gist.github.com/nsivabalan/e0153f07f972d02e111761243190e7d1] 

This uses our quick start guide as an example. Feel free to adjust the schema, partitioning columns, etc. as you see fit.

 

> Support for SCD2 with hudi
> --------------------------
>
>                 Key: HUDI-5730
>                 URL: https://issues.apache.org/jira/browse/HUDI-5730
>             Project: Apache Hudi
>          Issue Type: Improvement
>          Components: writer-core
>            Reporter: sivabalan narayanan
>            Priority: Major
>
> Sometimes users have requirements to support SCD2 with hudi. Let's see if we need a custom payload or if we can write a transformer in deltastreamer to achieve this. 
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)