You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@hive.apache.org by "Lefty Leverenz (JIRA)" <ji...@apache.org> on 2015/08/30 05:37:45 UTC
[jira] [Comment Edited] (HIVE-10924) add support for MERGE
statement
[ https://issues.apache.org/jira/browse/HIVE-10924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14720334#comment-14720334 ]
Lefty Leverenz edited comment on HIVE-10924 at 8/30/15 3:37 AM:
----------------------------------------------------------------
h3. Feature design notes
Hive supports [multi-insert statement|https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML#LanguageManualDML-InsertingdataintoHiveTablesfromqueries]. The idea is that you can execute a select statement and split the result stream into several to write to multiple targets.
This matches very closely to what MERGE statement needs to do.
When modeling MERGE as multi-insert, we'd split the stream into 2 stream, 1 for the insert part, 1 for update part but write both results to the same table.
Section 14.12 of ISO/IEC 9075-2:2011(E) (SQL 2011) defines MERGE statement.
Suppose we have tables
{code:SQL}
CREATE TABLE target(a int, b int, c int);
CREATE TABLE source(x int, y int, z int);
{code}
Then an example that covers most possibilities might look like this
{code:SQL}
MERGE INTO target
USING source ON b = y
WHEN MATCHED AND c + 1 + z > 0
THEN UPDATE SET a = 1, c = z
WHEN NOT MATCHED AND z IS NULL
THEN INSERT(a,b) VALUES(z, 7)
{code}
\\
\\
And is interpreted as follows
\\
\\
|| Line || Statement Part || Notes ||
| 1 | {code:SQL} MERGE INTO target {code} | Specifies the table being modified |
| 2 | {code:SQL} USING source {code} | specifies the source of the data which may be a table or expression such as SELECT … FROM … |
| 3 | {code:SQL} ON b = y {code} | is interpreted like exactly like an ON clause of a JOIN between source and target. |
| 4 | {code:SQL} WHEN MATCHED {code} | Applies if expr in ON is true |
| 5 | {code:SQL} AND c + 1 + z > 0 {code} | Additional predicate to test before performing the action. |
| 6 | {code:SQL} THEN UPDATE SET a = 1, c = z {code} | May be UPDATE or DELETE. The later deletes the row from target. SET clause is exactly like in regular UPDATE stmt. |
| 7 | {code:SQL} WHEN NOT MATCHED {code} | Applies if expr in ON is false |
| 8 | {code:SQL} AND z IS NULL {code} | Additional predicate to test before performing the action. |
| 9 | {code:SQL} THEN INSERT(a,b) VALUES(z, 7){code} | Insert to perform on target. |
\\
\\
Then the "equivalent" _multi-insert statement_ looks like this:
\\
\\
|| Statement Part || Refernce to previous table ||
| {code:SQL} FROM (SELECT * FROM target RIGHT OUTER JOIN SOURCE ON b = y) {code} | Lines 1 - 3 |
| {code:SQL} INSERT INTO target(a,c) SELECT 1, z {code} | This represents the update part of merge; Line 6 |
| {code:SQL} WHERE c + 1 + z > 0 {code} | Line 5 |
| {code:SQL} AND b = y {code} | Only include ‘matched’ rows; Line 4 |
| {code:SQL} INSERT INTO target(a,b) SELECT z, 7 {code} | This represents the ‘insert’ part of merge; Line 9 |
| {code:SQL} WHERE z IS NULL {code} | Line 8 |
| {code:SQL} AND a = null AND b = null AND c = null; {code} | Only include ‘not matched’ rows; Line 7 |
h4. Some caveats
# Current multi-insert doesn’t support writing to the same table more than once. Can we fix this?
# This requires the same change as for multi-statement txn, that is to support multiple delta files per transaction. (HIVE-11030)
# Requires annotating each insert (of multi-insert) with whether it’s doing update/delete or insert
Since Hive can already compile an operator pipeline for such a _multi-insert statement_ (almost) support for MERGE doesn't require additional operators.
Also, Update/Delete are actually compiled int Insert statements.
was (Author: ekoifman):
h3. Feature design notes
Hive supports [multi-insert statement|https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML#LanguageManualDML-InsertingdataintoHiveTablesfromqueries]. The idea is that you can execute a select statement and split the result stream into several to write to multiple targets.
This matches very closely to what MERGE statement needs to do.
When modeling MERGE as multi-insert, we'd split the stream into 2 stream, 1 for the insert part, 1 for update part but write both results to the same table.
Section 14.12 of ISO/IEC 9075-2:2011(E) (SQL 2011) defines MERGE statement.
Suppose we have tables
{code:SQL}
CREATE TABLE target(a int, b int, c int);
CREATE TABLE source(x int, y int, z int);
{code}
Then an example that covers most possibilities might look like this
{code:SQL}
MERGE INTO target
USING source ON b = y
WHEN MATCHED AND c + 1 + z > 0
THEN THEN UPDATE SET a = 1, c = z
WHEN NOT MATCHED AND z IS NULL
THEN INSERT(a,b) VALUES(z, 7)
{code}
\\
\\
And is interpreted as follows
\\
\\
|| Line || Statement Part || Notes ||
| 1 | {code:SQL} MERGE INTO target {code} | Specifies the table being modified |
| 2 | {code:SQL} USING source {code} | specifies the source of the data which may be a table or expression such as SELECT … FROM … |
| 3 | {code:SQL} ON b = y {code} | is interpreted like exactly like an ON clause of a JOIN between source and target. |
| 4 | {code:SQL} WHEN MATCHED {code} | Applies if expr in ON is true |
| 5 | {code:SQL} AND c + 1 + z > 0 {code} | Additional predicate to test before performing the action. |
| 6 | {code:SQL} THEN UPDATE SET a = 1, c = z {code} | May be UPDATE or DELETE. The later deletes the row from target. SET clause is exactly like in regular UPDATE stmt. |
| 7 | {code:SQL} WHEN NOT MATCHED {code} | Applies if expr in ON is false |
| 8 | {code:SQL} AND z IS NULL {code} | Additional predicate to test before performing the action. |
| 9 | {code:SQL} THEN INSERT(a,b) VALUES(z, 7){code} | Insert to perform on target. |
\\
\\
Then the "equivalent" _multi-insert statement_ looks like this:
\\
\\
|| Statement Part || Refernce to previous table ||
| {code:SQL} FROM (SELECT * FROM target RIGHT OUTER JOIN SOURCE ON b = y) {code} | Lines 1 - 3 |
| {code:SQL} INSERT INTO target(a,c) SELECT 1, z {code} | This represents the update part of merge; Line 6 |
| {code:SQL} WHERE c + 1 + z > 0 {code} | Line 5 |
| {code:SQL} AND b = y {code} | Only include ‘matched’ rows; Line 4 |
| {code:SQL} INSERT INTO target(a,b) SELECT z, 7 {code} | This represents the ‘insert’ part of merge; Line 9 |
| {code:SQL} WHERE z IS NULL {code} | Line 8 |
| {code:SQL} AND a = null AND b = null AND c = null; {code} | Only include ‘not matched’ rows; Line 7 |
h4. Some caveats
# Current multi-insert doesn’t support writing to the same table more than once. Can we fix this?
# This requires the same change as for multi-statement txn, that is to support multiple delta files per transaction. (HIVE-11030)
# Requires annotating each insert (of multi-insert) with whether it’s doing update/delete or insert
Since Hive can already compile an operator pipeline for such a _multi-insert statement_ (almost) support for MERGE doesn't require additional operators.
Also, Update/Delete are actually compiled int Insert statements.
> add support for MERGE statement
> -------------------------------
>
> Key: HIVE-10924
> URL: https://issues.apache.org/jira/browse/HIVE-10924
> Project: Hive
> Issue Type: New Feature
> Components: Query Planning, Query Processor, Transactions
> Affects Versions: 1.2.0
> Reporter: Eugene Koifman
> Assignee: Eugene Koifman
>
> add support for
> MERGE INTO tbl USING src ON … WHEN MATCHED THEN ... WHEN NOT MATCHED THEN ...
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)