You are viewing a plain text version of this content. The canonical link for it is here.
Posted to gitbox@hive.apache.org by GitBox <gi...@apache.org> on 2021/12/08 15:33:02 UTC

[GitHub] [hive] kasakrisz opened a new pull request #2857: HIVE-21172: DEFAULT keyword handling in MERGE UPDATE clause issues

kasakrisz opened a new pull request #2857:
URL: https://github.com/apache/hive/pull/2857


   <!--
   Thanks for sending a pull request!  Here are some tips for you:
     1. If this is your first time, please read our contributor guidelines: https://cwiki.apache.org/confluence/display/Hive/HowToContribute
     2. Ensure that you have created an issue on the Hive project JIRA: https://issues.apache.org/jira/projects/HIVE/summary
     3. Ensure you have added or run the appropriate tests for your PR: 
     4. If the PR is unfinished, add '[WIP]' in your PR title, e.g., '[WIP]HIVE-XXXXX:  Your PR title ...'.
     5. Be sure to keep the PR description updated to reflect all changes.
     6. Please write your PR title to summarize what this PR proposes.
     7. If possible, provide a concise example to reproduce the issue for a faster review.
   
   -->
   
   ### What changes were proposed in this pull request?
   When generating the  insert part of an update clause of a merge statement replace the default keywords to the corresponding default values defined in default constraints.
   
   ### Why are the changes needed?
   To support splitting update clauses of merge statements into two insert branches: one for inserting new values and one for inserting into delete deltas. The first insert doesn't have to be sorted which can give a performance boost. See [HIVE-21158](https://issues.apache.org/jira/browse/HIVE-21158) and [HIVE-21159](https://issues.apache.org/jira/browse/HIVE-21159) for more details.
   When 'default' keyword is used the optimization can not be triggered because the compilation of the rewritten statement failed.
   
   ### Does this PR introduce _any_ user-facing change?
   No.
   
   ### How was this patch tested?
   ```
   mvn test -Dtest.output.overwrite -DskipSparkTests -Dtest=TestMiniLlapLocalCliDriver -Dqfile=sqlmerge_stats.q -pl itests/qtest -Pitests
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] kasakrisz commented on a change in pull request #2857: HIVE-21172: DEFAULT keyword handling in MERGE UPDATE clause issues

Posted by GitBox <gi...@apache.org>.
kasakrisz commented on a change in pull request #2857:
URL: https://github.com/apache/hive/pull/2857#discussion_r769380412



##########
File path: ql/src/test/results/clientpositive/llap/acid_direct_update_delete_with_merge.q.out
##########
@@ -112,11 +110,13 @@ POSTHOOK: Input: default@transactions@tran_date=20170413
 POSTHOOK: Output: default@merge_tmp_table
 POSTHOOK: Output: default@transactions
 POSTHOOK: Output: default@transactions@tran_date=20170410
-POSTHOOK: Output: default@transactions@tran_date=20170410
 POSTHOOK: Output: default@transactions@tran_date=20170413
 POSTHOOK: Output: default@transactions@tran_date=20170413
 POSTHOOK: Output: default@transactions@tran_date=20170415
 POSTHOOK: Lineage: merge_tmp_table.val EXPRESSION [(transactions)t.FieldSchema(name:ROW__ID, type:struct<writeId:bigint,bucketId:int,rowId:bigint>, comment:), (transactions)t.FieldSchema(name:tran_date, type:string, comment:null), ]
+POSTHOOK: Lineage: transactions PARTITION(tran_date=20170413).id SIMPLE [(merge_source)s.FieldSchema(name:id, type:int, comment:null), ]

Review comment:
       same as above




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] kasakrisz commented on a change in pull request #2857: HIVE-21172: DEFAULT keyword handling in MERGE UPDATE clause issues

Posted by GitBox <gi...@apache.org>.
kasakrisz commented on a change in pull request #2857:
URL: https://github.com/apache/hive/pull/2857#discussion_r768692003



##########
File path: itests/hive-unit/src/test/java/org/apache/hadoop/hive/ql/txn/compactor/TestCrudCompactorOnTez.java
##########
@@ -1711,13 +1713,13 @@ public void testMajorCompactionAfterTwoMergeStatements() throws Exception {
 
     // Verify contents of bucket files.
     List<String> expectedRsBucket0 = Arrays.asList("{\"writeid\":1,\"bucketid\":536870912,\"rowid\":3}\t4\tvalue_4",
-        "{\"writeid\":2,\"bucketid\":536870912,\"rowid\":0}\t6\tvalue_6",
-        "{\"writeid\":2,\"bucketid\":536870913,\"rowid\":2}\t3\tnewvalue_3",
-        "{\"writeid\":3,\"bucketid\":536870912,\"rowid\":0}\t8\tvalue_8",
-        "{\"writeid\":3,\"bucketid\":536870913,\"rowid\":0}\t5\tnewestvalue_5",
-        "{\"writeid\":3,\"bucketid\":536870913,\"rowid\":1}\t7\tnewestvalue_7",
-        "{\"writeid\":3,\"bucketid\":536870913,\"rowid\":2}\t1\tnewestvalue_1",
-        "{\"writeid\":3,\"bucketid\":536870913,\"rowid\":3}\t2\tnewestvalue_2");
+    "{\"writeid\":2,\"bucketid\":536870913,\"rowid\":2}\t3\tnewvalue_3",

Review comment:
       iiuc this test was created to verify the order of records in the bucket files after major compaction.
   Based on description of https://issues.apache.org/jira/browse/HIVE-25257
   It should be ordered by originalTransactionId, bucketProperty, rowId.
   Unfortunately originalTransactionId can not be queried so I debugged the test and stop the execution before this assert. Then I dumped the orc file created on my local fs:
   ```
   java -jar orc-tools-1.6.5/orc-tools-1.6.5-uber.jar data ./itests/hive-unit/target/tmp/org.apache.hadoop.hive.ql.txn.compactor.TestCrudCompactorOnTez-1639489721524_1338883398/warehouse/comp_and_merge_test/base_0000003_v0000014/bucket_00000
   Processing data file itests/hive-unit/target/tmp/org.apache.hadoop.hive.ql.txn.compactor.TestCrudCompactorOnTez-1639489721524_1338883398/warehouse/comp_and_merge_test/base_0000003_v0000014/bucket_00000 [length: 808]
   {"operation":0,"originaltransaction":1,"bucket":536870912,"rowid":3,"currenttransaction":1,"row":{"id":4,"value":"value_4"}}
   {"operation":0,"originaltransaction":2,"bucket":536870913,"rowid":2,"currenttransaction":2,"row":{"id":3,"value":"newvalue_3"}}
   {"operation":0,"originaltransaction":2,"bucket":536870914,"rowid":0,"currenttransaction":2,"row":{"id":6,"value":"value_6"}}
   {"operation":0,"originaltransaction":3,"bucket":536870913,"rowid":0,"currenttransaction":3,"row":{"id":1,"value":"newestvalue_1"}}
   {"operation":0,"originaltransaction":3,"bucket":536870913,"rowid":1,"currenttransaction":3,"row":{"id":2,"value":"newestvalue_2"}}
   {"operation":0,"originaltransaction":3,"bucket":536870913,"rowid":2,"currenttransaction":3,"row":{"id":5,"value":"newestvalue_5"}}
   {"operation":0,"originaltransaction":3,"bucket":536870913,"rowid":3,"currenttransaction":3,"row":{"id":7,"value":"newestvalue_7"}}
   {"operation":0,"originaltransaction":3,"bucket":536870914,"rowid":0,"currenttransaction":3,"row":{"id":8,"value":"value_8"}}
   ________________________________________________________________________________________________________________________
   ```
   Order seems to be valid.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] kasakrisz merged pull request #2857: HIVE-21172: DEFAULT keyword handling in MERGE UPDATE clause issues

Posted by GitBox <gi...@apache.org>.
kasakrisz merged pull request #2857:
URL: https://github.com/apache/hive/pull/2857


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] kasakrisz commented on a change in pull request #2857: HIVE-21172: DEFAULT keyword handling in MERGE UPDATE clause issues

Posted by GitBox <gi...@apache.org>.
kasakrisz commented on a change in pull request #2857:
URL: https://github.com/apache/hive/pull/2857#discussion_r769526713



##########
File path: ql/src/test/results/clientpositive/llap/explain_locks.q.out
##########
@@ -233,20 +221,14 @@ POSTHOOK: Input: default@target@p=2/q=2
 POSTHOOK: Output: default@merge_tmp_table
 POSTHOOK: Output: default@target
 POSTHOOK: Output: default@target@p=1/q=2
-POSTHOOK: Output: default@target@p=1/q=2
-POSTHOOK: Output: default@target@p=1/q=3
 POSTHOOK: Output: default@target@p=1/q=3
 POSTHOOK: Output: default@target@p=2/q=2
-POSTHOOK: Output: default@target@p=2/q=2
 LOCK INFORMATION:
 default.source -> SHARED_READ
 default.target.p=1/q=2 -> SHARED_READ
 default.target.p=1/q=3 -> SHARED_READ
 default.target.p=2/q=2 -> SHARED_READ
 default.target.p=2/q=2 -> SHARED_WRITE
-default.target.p=2/q=2 -> SHARED_WRITE

Review comment:
       If `hive.merge.split.update` is `false` the merge statement has three type of outputs: `insert`, `delete`, `update` the same partitioned table.
   
   `RewriteSemanticAnalyzer.updateOutputs()` adds the affected partitions twice: one set for `delete` and one for `update`
   https://github.com/apache/hive/blob/6f29c1acf04fa1f1b84a6fb86f92bdc65b584567/ql/src/java/org/apache/hadoop/hive/ql/parse/RewriteSemanticAnalyzer.java#L339
   
   When `hive.merge.split.update` is `true` only `insert` and `delete` outputs are present so partitions are listed only once.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] kgyrtkirk commented on a change in pull request #2857: HIVE-21172: DEFAULT keyword handling in MERGE UPDATE clause issues

Posted by GitBox <gi...@apache.org>.
kgyrtkirk commented on a change in pull request #2857:
URL: https://github.com/apache/hive/pull/2857#discussion_r768476035



##########
File path: ql/src/test/results/clientpositive/llap/masking_acid_no_masking.q.out
##########
@@ -54,8 +53,9 @@ POSTHOOK: Input: default@masking_acid_no_masking
 POSTHOOK: Input: default@nonacid_n0
 POSTHOOK: Output: default@masking_acid_no_masking
 POSTHOOK: Output: default@masking_acid_no_masking
-POSTHOOK: Output: default@masking_acid_no_masking
 POSTHOOK: Output: default@merge_tmp_table
 POSTHOOK: Lineage: masking_acid_no_masking.key SIMPLE [(nonacid_n0)s.FieldSchema(name:key, type:int, comment:null), ]
+POSTHOOK: Lineage: masking_acid_no_masking.key SIMPLE [(nonacid_n0)s.FieldSchema(name:key, type:int, comment:null), ]

Review comment:
       is this a duplicate?

##########
File path: itests/hive-unit/src/test/java/org/apache/hadoop/hive/ql/txn/compactor/TestCrudCompactorOnTez.java
##########
@@ -1711,13 +1713,13 @@ public void testMajorCompactionAfterTwoMergeStatements() throws Exception {
 
     // Verify contents of bucket files.
     List<String> expectedRsBucket0 = Arrays.asList("{\"writeid\":1,\"bucketid\":536870912,\"rowid\":3}\t4\tvalue_4",
-        "{\"writeid\":2,\"bucketid\":536870912,\"rowid\":0}\t6\tvalue_6",
-        "{\"writeid\":2,\"bucketid\":536870913,\"rowid\":2}\t3\tnewvalue_3",
-        "{\"writeid\":3,\"bucketid\":536870912,\"rowid\":0}\t8\tvalue_8",
-        "{\"writeid\":3,\"bucketid\":536870913,\"rowid\":0}\t5\tnewestvalue_5",
-        "{\"writeid\":3,\"bucketid\":536870913,\"rowid\":1}\t7\tnewestvalue_7",
-        "{\"writeid\":3,\"bucketid\":536870913,\"rowid\":2}\t1\tnewestvalue_1",
-        "{\"writeid\":3,\"bucketid\":536870913,\"rowid\":3}\t2\tnewestvalue_2");
+    "{\"writeid\":2,\"bucketid\":536870913,\"rowid\":2}\t3\tnewvalue_3",

Review comment:
       seeing a change like this I wonder how much value this test adds...

##########
File path: ql/src/java/org/apache/hadoop/hive/ql/parse/MergeSemanticAnalyzer.java
##########
@@ -441,6 +442,11 @@ private String handleUpdate(ASTNode whenMatchedUpdateClause, StringBuilder rewri
         default:
           //do nothing
         }
+
+        if ("`default`".equalsIgnoreCase(rhsExp.trim())) {
+          rhsExp = MapUtils.getString(colNameToDefaultConstraint, name, "null");

Review comment:
       iiuc this makes changes the column value to the default if we see the "default" in the query
   
   I see that this also work for a plain insert:
   ```
   create table q2(a string default 'asd');
   insert into q2 values(`default`)
   select * from q2;
   ```
   
   however I think the standard suggest to use the `DEFAULT` keyword and not as a string literal; in Hive we seem to also "support" the "default" as a string literal to be interpreted as default.
   
   I think because of various reasons - the default keyword becomes the string default at some point and it works like that right now.
   
   could you open a follow-up to fix the `default` literal's handling?
   

##########
File path: ql/src/test/results/clientpositive/llap/explain_locks.q.out
##########
@@ -233,20 +221,14 @@ POSTHOOK: Input: default@target@p=2/q=2
 POSTHOOK: Output: default@merge_tmp_table
 POSTHOOK: Output: default@target
 POSTHOOK: Output: default@target@p=1/q=2
-POSTHOOK: Output: default@target@p=1/q=2
-POSTHOOK: Output: default@target@p=1/q=3
 POSTHOOK: Output: default@target@p=1/q=3
 POSTHOOK: Output: default@target@p=2/q=2
-POSTHOOK: Output: default@target@p=2/q=2
 LOCK INFORMATION:
 default.source -> SHARED_READ
 default.target.p=1/q=2 -> SHARED_READ
 default.target.p=1/q=3 -> SHARED_READ
 default.target.p=2/q=2 -> SHARED_READ
 default.target.p=2/q=2 -> SHARED_WRITE
-default.target.p=2/q=2 -> SHARED_WRITE

Review comment:
       I wonder why were these lock duplicated? 

##########
File path: ql/src/test/results/clientpositive/llap/acid_direct_update_delete_with_merge.q.out
##########
@@ -112,11 +110,13 @@ POSTHOOK: Input: default@transactions@tran_date=20170413
 POSTHOOK: Output: default@merge_tmp_table
 POSTHOOK: Output: default@transactions
 POSTHOOK: Output: default@transactions@tran_date=20170410
-POSTHOOK: Output: default@transactions@tran_date=20170410
 POSTHOOK: Output: default@transactions@tran_date=20170413
 POSTHOOK: Output: default@transactions@tran_date=20170413
 POSTHOOK: Output: default@transactions@tran_date=20170415
 POSTHOOK: Lineage: merge_tmp_table.val EXPRESSION [(transactions)t.FieldSchema(name:ROW__ID, type:struct<writeId:bigint,bucketId:int,rowId:bigint>, comment:), (transactions)t.FieldSchema(name:tran_date, type:string, comment:null), ]
+POSTHOOK: Lineage: transactions PARTITION(tran_date=20170413).id SIMPLE [(merge_source)s.FieldSchema(name:id, type:int, comment:null), ]

Review comment:
       is this a duplicate lineage entry?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] kgyrtkirk commented on a change in pull request #2857: HIVE-21172: DEFAULT keyword handling in MERGE UPDATE clause issues

Posted by GitBox <gi...@apache.org>.
kgyrtkirk commented on a change in pull request #2857:
URL: https://github.com/apache/hive/pull/2857#discussion_r769529649



##########
File path: ql/src/test/results/clientpositive/llap/explain_locks.q.out
##########
@@ -233,20 +221,14 @@ POSTHOOK: Input: default@target@p=2/q=2
 POSTHOOK: Output: default@merge_tmp_table
 POSTHOOK: Output: default@target
 POSTHOOK: Output: default@target@p=1/q=2
-POSTHOOK: Output: default@target@p=1/q=2
-POSTHOOK: Output: default@target@p=1/q=3
 POSTHOOK: Output: default@target@p=1/q=3
 POSTHOOK: Output: default@target@p=2/q=2
-POSTHOOK: Output: default@target@p=2/q=2
 LOCK INFORMATION:
 default.source -> SHARED_READ
 default.target.p=1/q=2 -> SHARED_READ
 default.target.p=1/q=3 -> SHARED_READ
 default.target.p=2/q=2 -> SHARED_READ
 default.target.p=2/q=2 -> SHARED_WRITE
-default.target.p=2/q=2 -> SHARED_WRITE

Review comment:
       thank you for the explanation




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] kasakrisz commented on a change in pull request #2857: HIVE-21172: DEFAULT keyword handling in MERGE UPDATE clause issues

Posted by GitBox <gi...@apache.org>.
kasakrisz commented on a change in pull request #2857:
URL: https://github.com/apache/hive/pull/2857#discussion_r768652584



##########
File path: ql/src/test/results/clientpositive/llap/masking_acid_no_masking.q.out
##########
@@ -54,8 +53,9 @@ POSTHOOK: Input: default@masking_acid_no_masking
 POSTHOOK: Input: default@nonacid_n0
 POSTHOOK: Output: default@masking_acid_no_masking
 POSTHOOK: Output: default@masking_acid_no_masking
-POSTHOOK: Output: default@masking_acid_no_masking
 POSTHOOK: Output: default@merge_tmp_table
 POSTHOOK: Lineage: masking_acid_no_masking.key SIMPLE [(nonacid_n0)s.FieldSchema(name:key, type:int, comment:null), ]
+POSTHOOK: Lineage: masking_acid_no_masking.key SIMPLE [(nonacid_n0)s.FieldSchema(name:key, type:int, comment:null), ]

Review comment:
       These lineages are generated by the MoveTask when inserting.
   
   By turning on `hive.merge.split.update` update branch of merge statements are splitted into a insert and a delete branch.
   
   Originally this merge had only one insert branch but now it has two to the same table same columns:
   - one for the insert branch
   - one for the update branch
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org