You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@hive.apache.org by "Lefty Leverenz (JIRA)" <ji...@apache.org> on 2015/08/30 05:37:45 UTC

[jira] [Comment Edited] (HIVE-10924) add support for MERGE statement

    [ https://issues.apache.org/jira/browse/HIVE-10924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14720334#comment-14720334 ] 

Lefty Leverenz edited comment on HIVE-10924 at 8/30/15 3:37 AM:
----------------------------------------------------------------

h3. Feature design notes
Hive supports [multi-insert statement|https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML#LanguageManualDML-InsertingdataintoHiveTablesfromqueries].  The idea is that you can execute a select statement and split the result stream into several to write to multiple targets.

This matches very closely to what MERGE statement needs to do.
When modeling MERGE as multi-insert, we'd split the stream into 2 stream, 1 for the insert part, 1 for update part but write both results to the same table.  
Section 14.12 of  ISO/IEC 9075-2:2011(E) (SQL 2011) defines MERGE statement.   

Suppose we have tables 
{code:SQL}
CREATE TABLE target(a int, b int, c int);
CREATE TABLE source(x int, y int, z int);
{code}

Then an example that covers most possibilities might look like this
{code:SQL}
MERGE INTO target 
USING source ON b = y
WHEN MATCHED AND c + 1 + z > 0
THEN UPDATE SET a = 1, c = z
WHEN NOT MATCHED AND z IS NULL
THEN INSERT(a,b) VALUES(z, 7)
{code}
\\
\\
And is interpreted as follows
\\
\\
|| Line || Statement Part || Notes ||
| 1 | {code:SQL} MERGE INTO target {code} | Specifies the table being modified |
| 2 | {code:SQL} USING source {code} | specifies the source of the data which may be a table or expression such as SELECT … FROM … |
| 3 | {code:SQL}     ON b = y {code} | is interpreted like exactly like an ON clause of a JOIN between source and target. |
| 4 | {code:SQL} WHEN MATCHED {code} | Applies if expr in ON is true |
| 5 | {code:SQL}     AND c + 1 + z > 0 {code} | Additional predicate to test before performing the action. |
| 6 | {code:SQL}     THEN UPDATE SET a = 1, c = z {code} | May be UPDATE or DELETE.  The later deletes the row from target.  SET clause is exactly like in regular UPDATE stmt. |
| 7 | {code:SQL} WHEN NOT MATCHED {code} | Applies if expr in ON is false |
| 8 | {code:SQL}     AND z IS NULL {code} | Additional predicate to test before performing the action. |
| 9 | {code:SQL} THEN INSERT(a,b) VALUES(z, 7){code} | Insert to perform on target. |
\\
\\

Then the "equivalent" _multi-insert statement_ looks like this:
\\
\\
|| Statement Part ||  Refernce to previous table ||
| {code:SQL} FROM (SELECT * FROM target RIGHT OUTER JOIN SOURCE ON b = y) {code} | Lines 1 - 3 | 
| {code:SQL} INSERT INTO target(a,c) SELECT 1, z {code} | This represents the update part of merge; Line 6 |
| {code:SQL} WHERE c + 1 + z > 0 {code} | Line 5 |
| {code:SQL} AND b = y {code} | Only include ‘matched’ rows; Line 4 |
| {code:SQL} INSERT INTO target(a,b) SELECT z, 7 {code} | This represents the ‘insert’ part of merge; Line 9 |
| {code:SQL} WHERE z IS NULL {code} | Line 8 |
| {code:SQL} AND a = null AND b = null AND c = null; {code} | Only include ‘not matched’ rows; Line 7 |

h4. Some caveats
# Current multi-insert doesn’t support writing to the same table more than once.  Can we fix this?
# This requires the same change as for multi-statement txn, that is to support multiple delta files per transaction. (HIVE-11030)
# Requires annotating each insert (of multi-insert) with whether it’s doing update/delete or insert


Since Hive can already compile an operator pipeline for such a _multi-insert statement_ (almost) support for MERGE doesn't require additional operators.
Also, Update/Delete are actually compiled int Insert statements.


was (Author: ekoifman):
h3. Feature design notes
Hive supports [multi-insert statement|https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML#LanguageManualDML-InsertingdataintoHiveTablesfromqueries].  The idea is that you can execute a select statement and split the result stream into several to write to multiple targets.

This matches very closely to what MERGE statement needs to do.
When modeling MERGE as multi-insert, we'd split the stream into 2 stream, 1 for the insert part, 1 for update part but write both results to the same table.  
Section 14.12 of  ISO/IEC 9075-2:2011(E) (SQL 2011) defines MERGE statement.   

Suppose we have tables 
{code:SQL}
CREATE TABLE target(a int, b int, c int);
CREATE TABLE source(x int, y int, z int);
{code}

Then an example that covers most possibilities might look like this
{code:SQL}
MERGE INTO target 
USING source ON b = y
WHEN MATCHED AND c + 1 + z > 0
THEN THEN UPDATE SET a = 1, c = z
WHEN NOT MATCHED AND z IS NULL
THEN INSERT(a,b) VALUES(z, 7)
{code}
\\
\\
And is interpreted as follows
\\
\\
|| Line || Statement Part || Notes ||
| 1 | {code:SQL} MERGE INTO target {code} | Specifies the table being modified |
| 2 | {code:SQL} USING source {code} | specifies the source of the data which may be a table or expression such as SELECT … FROM … |
| 3 | {code:SQL}     ON b = y {code} | is interpreted like exactly like an ON clause of a JOIN between source and target. |
| 4 | {code:SQL} WHEN MATCHED {code} | Applies if expr in ON is true |
| 5 | {code:SQL}     AND c + 1 + z > 0 {code} | Additional predicate to test before performing the action. |
| 6 | {code:SQL}     THEN UPDATE SET a = 1, c = z {code} | May be UPDATE or DELETE.  The later deletes the row from target.  SET clause is exactly like in regular UPDATE stmt. |
| 7 | {code:SQL} WHEN NOT MATCHED {code} | Applies if expr in ON is false |
| 8 | {code:SQL}     AND z IS NULL {code} | Additional predicate to test before performing the action. |
| 9 | {code:SQL} THEN INSERT(a,b) VALUES(z, 7){code} | Insert to perform on target. |
\\
\\

Then the "equivalent" _multi-insert statement_ looks like this:
\\
\\
|| Statement Part ||  Refernce to previous table ||
| {code:SQL} FROM (SELECT * FROM target RIGHT OUTER JOIN SOURCE ON b = y) {code} | Lines 1 - 3 | 
| {code:SQL} INSERT INTO target(a,c) SELECT 1, z {code} | This represents the update part of merge; Line 6 |
| {code:SQL} WHERE c + 1 + z > 0 {code} | Line 5 |
| {code:SQL} AND b = y {code} | Only include ‘matched’ rows; Line 4 |
| {code:SQL} INSERT INTO target(a,b) SELECT z, 7 {code} | This represents the ‘insert’ part of merge; Line 9 |
| {code:SQL} WHERE z IS NULL {code} | Line 8 |
| {code:SQL} AND a = null AND b = null AND c = null; {code} | Only include ‘not matched’ rows; Line 7 |

h4. Some caveats
# Current multi-insert doesn’t support writing to the same table more than once.  Can we fix this?
# This requires the same change as for multi-statement txn, that is to support multiple delta files per transaction. (HIVE-11030)
# Requires annotating each insert (of multi-insert) with whether it’s doing update/delete or insert


Since Hive can already compile an operator pipeline for such a _multi-insert statement_ (almost) support for MERGE doesn't require additional operators.
Also, Update/Delete are actually compiled int Insert statements.

> add support for MERGE statement
> -------------------------------
>
>                 Key: HIVE-10924
>                 URL: https://issues.apache.org/jira/browse/HIVE-10924
>             Project: Hive
>          Issue Type: New Feature
>          Components: Query Planning, Query Processor, Transactions
>    Affects Versions: 1.2.0
>            Reporter: Eugene Koifman
>            Assignee: Eugene Koifman
>
> add support for 
> MERGE INTO tbl USING src ON … WHEN MATCHED THEN ... WHEN NOT MATCHED THEN ...



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)