You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@hive.apache.org by "Eugene Koifman (JIRA)" <ji...@apache.org> on 2019/01/09 23:49:00 UTC

[jira] [Updated] (HIVE-16258) Suggestion: simplify type 2 SCDs with this non-standard extension to MERGE

     [ https://issues.apache.org/jira/browse/HIVE-16258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Eugene Koifman updated HIVE-16258:
----------------------------------
    Issue Type: New Feature  (was: Improvement)

> Suggestion: simplify type 2 SCDs with this non-standard extension to MERGE
> --------------------------------------------------------------------------
>
>                 Key: HIVE-16258
>                 URL: https://issues.apache.org/jira/browse/HIVE-16258
>             Project: Hive
>          Issue Type: New Feature
>          Components: Transactions
>    Affects Versions: 2.2.0
>            Reporter: Carter Shanklin
>            Priority: Major
>
> Some common data maintenance strategies, especially the Type 2 SCD update, would become substantially easier with a small extension to the SQL standard for MERGE, specifically the ability to say "when matched then insert". Per the standard, matched records can only be updated or deleted.
> In the Type 2 SCD, when a new record comes in you update the old version of the record and insert the new version of the same record. If this extension were supported, sample Type 2 SCD code would look as follows:
> {code}
> merge into customer
> using new_customer_stage stage
> on stage.source_pk = customer.source_pk
> when not matched then insert values    /* Insert a net new record */
>   (stage.source_pk, upper(substr(stage.name, 0, 3)), stage.name, stage.state, true, null)
> when matched then update set           /* Update an old record to mark it as out-of-date */
>   is_current = false, end_date = current_date()
> when matched then insert values        /* Insert a new current record */
>   (stage.source_pk, upper(substr(stage.name, 0, 3)), stage.name, stage.state, true, null);
> {code}
> Without this support, the user needs to devise some sort of workaround. A common approach is to first left join the staging table against the table to be updated, then to join these results to a helper table that will spit out two records for each match and one record for each miss. One of the matching records needs to have a join key that can never occur in the source data so this requires precise knowledge of the source dataset.
> An example of this:
> {code}
> merge into customer
> using (
>   select
>     *,
>     coalesce(invalid_key, source_pk) as join_key
>   from (
>     select
>       stage.source_pk, stage.name, stage.state,
>       case when customer.source_pk is null then 1
>       when stage.name <> customer.name or stage.state <> customer.state then 2
>       else 0 end as scd_row_type
>     from
>       new_customer_stage stage
>     left join
>       customer
>     on (stage.source_pk = customer.source_pk and customer.is_current = true)
>   ) updates
>   join scd_types on scd_types.type = scd_row_type
> ) sub
> on sub.join_key = customer.source_pk
> when matched then update set
>   is_current = false,
>   end_date = current_date()
> when not matched then insert values
>   (sub.source_pk, upper(substr(sub.name, 0, 3)), sub.name, sub.state, true, null);
> select * from customer order by source_pk;
> {code}
> This code is very complicated and will fail if the "invalid" key ever shows up in the source dataset. This simple extension provides a lot of value and likely very little maintenance overhead.
> /cc [~ekoifman]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)