You are viewing a plain text version of this content. The canonical link for it is here.
Posted to gitbox@hive.apache.org by GitBox <gi...@apache.org> on 2021/03/08 07:44:39 UTC

[GitHub] [hive] kasakrisz opened a new pull request #2046: HIVE-24855: Introduce virtual colum ROW__IS__DELETED

kasakrisz opened a new pull request #2046:
URL: https://github.com/apache/hive/pull/2046


   ### What changes were proposed in this pull request?
   <!--
   Please clarify what changes you are proposing. The purpose of this section is to outline the changes and how this PR fixes the issue. 
   If possible, please consider writing useful notes for better and faster reviews in your PR. See the examples below.
     1. If you refactor some codes with changing classes, showing the class hierarchy will help reviewers.
     2. If you fix some SQL features, you can provide some references of other DBMSes.
     3. If there is design documentation, please add the link.
     4. If there is a discussion in the mailing list, please add the link.
   -->
   
   
   ### Why are the changes needed?
   <!--
   Please clarify why the changes are needed. For instance,
     1. If you propose a new API, clarify the use case for a new API.
     2. If you fix a bug, you can clarify why it is a bug.
   -->
   
   
   ### Does this PR introduce _any_ user-facing change?
   <!--
   Note that it means *any* user-facing change including all aspects such as the documentation fix.
   If yes, please clarify the previous behavior and the change this PR proposes - provide the console output, description, screenshot and/or a reproducable example to show the behavior difference if possible.
   If possible, please also clarify if this is a user-facing change compared to the released Hive versions or within the unreleased branches such as master.
   If no, write 'No'.
   -->
   
   
   ### How was this patch tested?
   <!--
   If tests were added, say they were added here. Please make sure to add some test cases that check the changes thoroughly including negative and positive cases if possible.
   If it was tested in a way different from regular unit tests, please clarify how you tested step by step, ideally copy and paste-able, so that other reviewers can test and check, and descendants can verify in the future.
   If tests were not added, please describe why they were not added and/or why it was difficult to add.
   -->
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] kasakrisz commented on a change in pull request #2046: HIVE-24855: Introduce virtual colum ROW__IS__DELETED

Posted by GitBox <gi...@apache.org>.
kasakrisz commented on a change in pull request #2046:
URL: https://github.com/apache/hive/pull/2046#discussion_r600346818



##########
File path: ql/src/test/queries/clientpositive/fetch_deleted_rows.q
##########
@@ -0,0 +1,48 @@
+set hive.support.concurrency=true;
+set hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;
+SET hive.vectorized.execution.enabled=false;
+
+create table t1(a int, b varchar(128)) stored as orc tblproperties ('transactional'='true');
+
+insert into t1(a,b) values (1, 'one'), (2, 'two');
+
+delete from t1 where a = 1;
+
+insert into t1(a,b) values (3, 'three'), (4, 'four'), (4, 'four again'), (5, 'five');
+
+alter table t1 set tblproperties ('acid.fetch.deleted.rows'='true');
+select t1.ROW__IS__DELETED, * from t1;

Review comment:
       When the logical plan is built from the AST and `HiveTableScans` are instantiated table properties defined in the AST are lost.
   When the optimized logical plan is converted back to AST only properties coming from Druid and JDBC scans are populated. We could add additional properties from `HiveTableScans` in the same way.
   
   I think passing `acid.fetch.deleted.rows` as a table property in the query is a good idea and we can benefit from it  when building an Incremental MV rebuild plan. I will try to include this into [HIVE-24854](https://issues.apache.org/jira/browse/HIVE-24854)




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] kasakrisz merged pull request #2046: HIVE-24855: Introduce virtual colum ROW__IS__DELETED

Posted by GitBox <gi...@apache.org>.
kasakrisz merged pull request #2046:
URL: https://github.com/apache/hive/pull/2046


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] kasakrisz commented on a change in pull request #2046: HIVE-24855: Introduce virtual colum ROW__IS__DELETED

Posted by GitBox <gi...@apache.org>.
kasakrisz commented on a change in pull request #2046:
URL: https://github.com/apache/hive/pull/2046#discussion_r599393622



##########
File path: common/src/java/org/apache/hadoop/hive/conf/HiveConf.java
##########
@@ -4450,6 +4450,8 @@ private static void populateLlapDaemonVarsSet(Set<String> llapDaemonVarsSetLocal
     HIVE_ACID_DIRECT_INSERT_ENABLED("hive.acid.direct.insert.enabled", true,
         "Enable writing the data files directly to the table's final destination instead of the staging directory."
         + "This optimization only applies on INSERT operations on ACID tables."),
+    HIVE_ACID_FETCH_DELETED_ROWS("hive.acid.fetch.deleted.rows", false,

Review comment:
       moved to constants




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] jcamachor commented on a change in pull request #2046: HIVE-24855: Introduce virtual colum ROW__IS__DELETED

Posted by GitBox <gi...@apache.org>.
jcamachor commented on a change in pull request #2046:
URL: https://github.com/apache/hive/pull/2046#discussion_r599778851



##########
File path: ql/src/test/queries/clientpositive/fetch_deleted_rows.q
##########
@@ -0,0 +1,48 @@
+set hive.support.concurrency=true;
+set hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;
+SET hive.vectorized.execution.enabled=false;
+
+create table t1(a int, b varchar(128)) stored as orc tblproperties ('transactional'='true');
+
+insert into t1(a,b) values (1, 'one'), (2, 'two');
+
+delete from t1 where a = 1;
+
+insert into t1(a,b) values (3, 'three'), (4, 'four'), (4, 'four again'), (5, 'five');
+
+alter table t1 set tblproperties ('acid.fetch.deleted.rows'='true');
+select t1.ROW__IS__DELETED, * from t1;

Review comment:
       Just leaving a note. I think this syntax passing the table properties in the query should also work (if that's the case, maybe we can add a query with it):
   ```
   select t1.ROW__IS__DELETED, *
   from t1('acid.fetch.deleted.rows'='true');
   ```




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] kasakrisz commented on a change in pull request #2046: HIVE-24855: Introduce virtual colum ROW__IS__DELETED

Posted by GitBox <gi...@apache.org>.
kasakrisz commented on a change in pull request #2046:
URL: https://github.com/apache/hive/pull/2046#discussion_r599394949



##########
File path: ql/src/java/org/apache/hadoop/hive/ql/io/BatchToRowReader.java
##########
@@ -108,25 +108,41 @@ public BatchToRowReader(RecordReader<NullWritable, VectorizedRowBatch> vrbReader
     } else {
       Arrays.fill(included, true);
     }
-    // Create struct for ROW__ID virtual column and extract index
-    this.rowIdIdx = vrbCtx.findVirtualColumnNum(VirtualColumn.ROWID);
-    if (this.rowIdIdx >= 0) {
-      included[rowIdIdx] = true;
+
+    virtualColumnHandlers = requestedVirtualColumns();
+    for (VirtualColumnHandler handler : virtualColumnHandlers) {
+      int idx = vrbCtx.findVirtualColumnNum(handler.virtualColumn);
+      if (idx >= 0) {
+        included[idx] = true;
+        handler.indexInSchema = idx;
+      }
     }
+
     if (LOG.isDebugEnabled()) {
       LOG.debug("Including the columns " + DebugUtils.toString(included));
     }
     this.included = included;
   }
 
+  public static class VirtualColumnHandler {

Review comment:
       Added.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] jcamachor commented on a change in pull request #2046: HIVE-24855: Introduce virtual colum ROW__IS__DELETED

Posted by GitBox <gi...@apache.org>.
jcamachor commented on a change in pull request #2046:
URL: https://github.com/apache/hive/pull/2046#discussion_r599094572



##########
File path: llap-server/src/java/org/apache/hadoop/hive/llap/io/api/impl/LlapInputFormat.java
##########
@@ -182,11 +190,11 @@ static VectorizedRowBatchCtx createFakeVrbCtx(MapWork mapWork) throws HiveExcept
     RowSchema rowSchema = findTsOp(mapWork).getSchema();
     final List<String> colNames = new ArrayList<String>(rowSchema.getSignature().size());
     final List<TypeInfo> colTypes = new ArrayList<TypeInfo>(rowSchema.getSignature().size());
-    boolean hasRowId = false;
+    ArrayList<VirtualColumn> virtualColumnList = new ArrayList<>(2);
     for (ColumnInfo c : rowSchema.getSignature()) {
       String columnName = c.getInternalName();
-      if (VirtualColumn.ROWID.getName().equals(columnName)) {
-        hasRowId = true;
+      if (ALLOWED_VIRTUAL_COLUMNS.containsKey(columnName)) {
+        virtualColumnList.add(ALLOWED_VIRTUAL_COLUMNS.get(columnName));
       } else {
         if (VirtualColumn.VIRTUAL_COLUMN_NAMES.contains(columnName)) continue;

Review comment:
       Not really part of this patch, but this line can be merge with previous one.

##########
File path: common/src/java/org/apache/hadoop/hive/conf/HiveConf.java
##########
@@ -4450,6 +4450,8 @@ private static void populateLlapDaemonVarsSet(Set<String> llapDaemonVarsSetLocal
     HIVE_ACID_DIRECT_INSERT_ENABLED("hive.acid.direct.insert.enabled", true,
         "Enable writing the data files directly to the table's final destination instead of the staging directory."
         + "This optimization only applies on INSERT operations on ACID tables."),
+    HIVE_ACID_FETCH_DELETED_ROWS("hive.acid.fetch.deleted.rows", false,

Review comment:
       I am not sure this should be part of HiveConf.java since it is not a global/session config per se? Have you considered introducing it in `org.apache.hadoop.hive.conf.Constants`? We have introduced some of these config variables there in the past.

##########
File path: ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java
##########
@@ -1970,8 +1970,22 @@ private int processQueryHint(ASTNode ast, QBParseInfo qbp, int posn) throws Sema
     String queryHintStr = ast.getText();
     LOG.debug("QUERY HINT: {} ", queryHintStr);
     try {
-      ASTNode hintNode = pd.parseHint(queryHintStr);
-      qbp.setHints(hintNode);
+      ASTNode hintListNode = pd.parseHint(queryHintStr);
+      qbp.setHints(hintListNode);
+      for (int i = 0; i < hintListNode.getChildCount(); ++i) {
+        ASTNode hintNode = (ASTNode) hintListNode.getChild(i);
+        if (hintNode.getChild(0).getType() != HintParser.TOK_FETCH_DELETED_ROWS) {

Review comment:
       Should this be a table property rather than a hint, i.e., value for `hive.acid.fetch.deleted.rows` passed as a property for the table scan? That would use a similar pattern to what we do in other code paths in Hive, e.g., to push computation for Druid/JDBC. It also means we do not need to introduce a new hint so it will simplify the patch.

##########
File path: ql/src/java/org/apache/hadoop/hive/ql/io/AcidUtils.java
##########
@@ -1943,15 +1943,17 @@ public static boolean isFullAcidScan(Configuration conf) {
    *                   we assume this is a full transactional table.
    */
   public static void setAcidOperationalProperties(
-      Configuration conf, boolean isTxnTable, AcidOperationalProperties properties) {
+      Configuration conf, boolean isTxnTable, AcidOperationalProperties properties, boolean fetchDeletedRows) {

Review comment:
       Since in most cases, we would pass false for the last parameter, should we keep the old method signature too, which would call this method with `false` value for the last parameter?

##########
File path: ql/src/java/org/apache/hadoop/hive/ql/io/BatchToRowReader.java
##########
@@ -108,25 +108,41 @@ public BatchToRowReader(RecordReader<NullWritable, VectorizedRowBatch> vrbReader
     } else {
       Arrays.fill(included, true);
     }
-    // Create struct for ROW__ID virtual column and extract index
-    this.rowIdIdx = vrbCtx.findVirtualColumnNum(VirtualColumn.ROWID);
-    if (this.rowIdIdx >= 0) {
-      included[rowIdIdx] = true;
+
+    virtualColumnHandlers = requestedVirtualColumns();
+    for (VirtualColumnHandler handler : virtualColumnHandlers) {
+      int idx = vrbCtx.findVirtualColumnNum(handler.virtualColumn);
+      if (idx >= 0) {
+        included[idx] = true;
+        handler.indexInSchema = idx;
+      }
     }
+
     if (LOG.isDebugEnabled()) {
       LOG.debug("Including the columns " + DebugUtils.toString(included));
     }
     this.included = included;
   }
 
+  public static class VirtualColumnHandler {

Review comment:
       Can you add a short comment for the inner class?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] kasakrisz commented on a change in pull request #2046: HIVE-24855: Introduce virtual colum ROW__IS__DELETED

Posted by GitBox <gi...@apache.org>.
kasakrisz commented on a change in pull request #2046:
URL: https://github.com/apache/hive/pull/2046#discussion_r599733588



##########
File path: llap-server/src/java/org/apache/hadoop/hive/llap/io/api/impl/LlapInputFormat.java
##########
@@ -182,11 +190,11 @@ static VectorizedRowBatchCtx createFakeVrbCtx(MapWork mapWork) throws HiveExcept
     RowSchema rowSchema = findTsOp(mapWork).getSchema();
     final List<String> colNames = new ArrayList<String>(rowSchema.getSignature().size());
     final List<TypeInfo> colTypes = new ArrayList<TypeInfo>(rowSchema.getSignature().size());
-    boolean hasRowId = false;
+    ArrayList<VirtualColumn> virtualColumnList = new ArrayList<>(2);
     for (ColumnInfo c : rowSchema.getSignature()) {
       String columnName = c.getInternalName();
-      if (VirtualColumn.ROWID.getName().equals(columnName)) {
-        hasRowId = true;
+      if (ALLOWED_VIRTUAL_COLUMNS.containsKey(columnName)) {
+        virtualColumnList.add(ALLOWED_VIRTUAL_COLUMNS.get(columnName));
       } else {
         if (VirtualColumn.VIRTUAL_COLUMN_NAMES.contains(columnName)) continue;

Review comment:
       merged




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] kasakrisz commented on a change in pull request #2046: HIVE-24855: Introduce virtual colum ROW__IS__DELETED

Posted by GitBox <gi...@apache.org>.
kasakrisz commented on a change in pull request #2046:
URL: https://github.com/apache/hive/pull/2046#discussion_r599394814



##########
File path: ql/src/java/org/apache/hadoop/hive/ql/io/AcidUtils.java
##########
@@ -1943,15 +1943,17 @@ public static boolean isFullAcidScan(Configuration conf) {
    *                   we assume this is a full transactional table.
    */
   public static void setAcidOperationalProperties(
-      Configuration conf, boolean isTxnTable, AcidOperationalProperties properties) {
+      Configuration conf, boolean isTxnTable, AcidOperationalProperties properties, boolean fetchDeletedRows) {

Review comment:
       Added back the old signature.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] kasakrisz commented on a change in pull request #2046: HIVE-24855: Introduce virtual colum ROW__IS__DELETED

Posted by GitBox <gi...@apache.org>.
kasakrisz commented on a change in pull request #2046:
URL: https://github.com/apache/hive/pull/2046#discussion_r599394313



##########
File path: ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java
##########
@@ -1970,8 +1970,22 @@ private int processQueryHint(ASTNode ast, QBParseInfo qbp, int posn) throws Sema
     String queryHintStr = ast.getText();
     LOG.debug("QUERY HINT: {} ", queryHintStr);
     try {
-      ASTNode hintNode = pd.parseHint(queryHintStr);
-      qbp.setHints(hintNode);
+      ASTNode hintListNode = pd.parseHint(queryHintStr);
+      qbp.setHints(hintListNode);
+      for (int i = 0; i < hintListNode.getChildCount(); ++i) {
+        ASTNode hintNode = (ASTNode) hintListNode.getChild(i);
+        if (hintNode.getChild(0).getType() != HintParser.TOK_FETCH_DELETED_ROWS) {

Review comment:
       Removed hint. Using `acid.fetch.deleted.rows` as table property.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org