You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@hive.apache.org by "Eugene Koifman (JIRA)" <ji...@apache.org> on 2019/01/09 23:49:00 UTC
[jira] [Updated] (HIVE-16258) Suggestion: simplify type 2 SCDs with
this non-standard extension to MERGE
[ https://issues.apache.org/jira/browse/HIVE-16258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Eugene Koifman updated HIVE-16258:
----------------------------------
Issue Type: New Feature (was: Improvement)
> Suggestion: simplify type 2 SCDs with this non-standard extension to MERGE
> --------------------------------------------------------------------------
>
> Key: HIVE-16258
> URL: https://issues.apache.org/jira/browse/HIVE-16258
> Project: Hive
> Issue Type: New Feature
> Components: Transactions
> Affects Versions: 2.2.0
> Reporter: Carter Shanklin
> Priority: Major
>
> Some common data maintenance strategies, especially the Type 2 SCD update, would become substantially easier with a small extension to the SQL standard for MERGE, specifically the ability to say "when matched then insert". Per the standard, matched records can only be updated or deleted.
> In the Type 2 SCD, when a new record comes in you update the old version of the record and insert the new version of the same record. If this extension were supported, sample Type 2 SCD code would look as follows:
> {code}
> merge into customer
> using new_customer_stage stage
> on stage.source_pk = customer.source_pk
> when not matched then insert values /* Insert a net new record */
> (stage.source_pk, upper(substr(stage.name, 0, 3)), stage.name, stage.state, true, null)
> when matched then update set /* Update an old record to mark it as out-of-date */
> is_current = false, end_date = current_date()
> when matched then insert values /* Insert a new current record */
> (stage.source_pk, upper(substr(stage.name, 0, 3)), stage.name, stage.state, true, null);
> {code}
> Without this support, the user needs to devise some sort of workaround. A common approach is to first left join the staging table against the table to be updated, then to join these results to a helper table that will spit out two records for each match and one record for each miss. One of the matching records needs to have a join key that can never occur in the source data so this requires precise knowledge of the source dataset.
> An example of this:
> {code}
> merge into customer
> using (
> select
> *,
> coalesce(invalid_key, source_pk) as join_key
> from (
> select
> stage.source_pk, stage.name, stage.state,
> case when customer.source_pk is null then 1
> when stage.name <> customer.name or stage.state <> customer.state then 2
> else 0 end as scd_row_type
> from
> new_customer_stage stage
> left join
> customer
> on (stage.source_pk = customer.source_pk and customer.is_current = true)
> ) updates
> join scd_types on scd_types.type = scd_row_type
> ) sub
> on sub.join_key = customer.source_pk
> when matched then update set
> is_current = false,
> end_date = current_date()
> when not matched then insert values
> (sub.source_pk, upper(substr(sub.name, 0, 3)), sub.name, sub.state, true, null);
> select * from customer order by source_pk;
> {code}
> This code is very complicated and will fail if the "invalid" key ever shows up in the source dataset. This simple extension provides a lot of value and likely very little maintenance overhead.
> /cc [~ekoifman]
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)