You are viewing a plain text version of this content. The canonical link for it is here.
Posted to gitbox@hive.apache.org by GitBox <gi...@apache.org> on 2022/05/13 10:51:45 UTC

[GitHub] [hive] veghlaci05 opened a new pull request, #3289: HIVE-25976: Cleaner may remove files being accessed from a fetch-task-converted reader

veghlaci05 opened a new pull request, #3289:
URL: https://github.com/apache/hive/pull/3289

   
   ### What changes were proposed in this pull request?
   This PR changes the commit time of the Fetch tasks. From now on these tasks are committed only upon driver close.
   
   ### Why are the changes needed?
   Fetch tasks were committed inside the org.apache.hadoop.hive.ql.Driver#run(java.lang.String) call which was too early. The reading can occur only after this point, which can cause issues, if the table changes during the read. 
   
   
   ### Does this PR introduce _any_ user-facing change?
   No
   
   ### How was this patch tested?
   Manually, and through unit tests


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] veghlaci05 commented on a diff in pull request #3289: HIVE-25976: Cleaner may remove files being accessed from a fetch-task-converted reader

Posted by GitBox <gi...@apache.org>.
veghlaci05 commented on code in PR #3289:
URL: https://github.com/apache/hive/pull/3289#discussion_r908139462


##########
ql/src/java/org/apache/hadoop/hive/ql/optimizer/SimpleFetchOptimizer.java:
##########
@@ -152,28 +152,32 @@ private FetchTask optimize(ParseContext pctx, String alias, TableScanOperator so
   }
 
   private boolean checkThreshold(FetchData data, int limit, ParseContext pctx) throws Exception {
-    if (limit > 0) {
-      if (data.hasOnlyPruningFilter()) {
-        /* partitioned table + query has only pruning filters */
-        return true;
-      } else if (data.isPartitioned() == false && data.isFiltered() == false) {
-        /* unpartitioned table + no filters */
-        return true;
+    boolean cachingEnabled = HiveConf.getBoolVar(pctx.getConf(), HiveConf.ConfVars.HIVEFETCHTASKCACHING);
+    if (!cachingEnabled) {
+      /* if caching is enabled we apply the treshold in all cases */

Review Comment:
   No, the if construct is fine, I moved this comment outside of the if block. It was  placed poorly.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] kasakrisz merged pull request #3289: HIVE-25976: Cleaner may remove files being accessed from a fetch-task-converted reader

Posted by GitBox <gi...@apache.org>.
kasakrisz merged PR #3289:
URL: https://github.com/apache/hive/pull/3289


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] pvary commented on a diff in pull request #3289: HIVE-25976: Cleaner may remove files being accessed from a fetch-task-converted reader

Posted by GitBox <gi...@apache.org>.
pvary commented on code in PR #3289:
URL: https://github.com/apache/hive/pull/3289#discussion_r904754762


##########
ql/src/java/org/apache/hadoop/hive/ql/exec/FetchTask.java:
##########
@@ -134,39 +121,18 @@ public void setMaxRows(int maxRows) {
     this.maxRows = maxRows;
   }
 
-  public boolean fetch(List res) throws IOException {
-    sink.reset(res);
-    int rowsRet = work.getLeastNumRows();
-    if (rowsRet <= 0) {
-      rowsRet = work.getLimit() >= 0 ? Math.min(work.getLimit() - totalRows, maxRows) : maxRows;
-    }
-    try {
-      if (rowsRet <= 0 || work.getLimit() == totalRows) {
-        fetch.clearFetchContext();
+  public boolean fetch(List res) {
+    if (cachingEnabled) {
+      if (currentRow >= fetchedData.size()) {
         return false;
       }
-      boolean fetched = false;
-      while (sink.getNumRows() < rowsRet) {
-        if (!fetch.pushRow()) {
-          if (work.getLeastNumRows() > 0) {
-            throw new HiveException("leastNumRows check failed");
-          }
-
-          // Closing the operator can sometimes yield more rows (HIVE-11892)
-          fetch.closeOperator();
-
-          return fetched;
-        }
-        fetched = true;
-      }
-      return true;
-    } catch (IOException e) {
-      throw e;
-    } catch (Exception e) {
-      throw new IOException(e);
-    } finally {
-      totalRows += sink.getNumRows();
+      int toIndex = Math.min(fetchedData.size(), currentRow + maxRows);
+      res.addAll(fetchedData.subList(currentRow, toIndex));
+      currentRow = toIndex;
+    } else {
+      executeInner(res);

Review Comment:
   Wasn't this planned as:
   ```
   return executeInner(res);
   ```
   ?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] pvary commented on a diff in pull request #3289: HIVE-25976: Cleaner may remove files being accessed from a fetch-task-converted reader

Posted by GitBox <gi...@apache.org>.
pvary commented on code in PR #3289:
URL: https://github.com/apache/hive/pull/3289#discussion_r904745124


##########
iceberg/iceberg-handler/src/test/java/org/apache/iceberg/mr/hive/TestHiveIcebergTimeTravel.java:
##########
@@ -50,10 +49,14 @@ public void testSelectAsOfTimestamp() throws IOException, InterruptedException {
 
     Assert.assertEquals(4, rows.size());
 
-    AssertHelpers.assertThrows("should throw exception", IllegalArgumentException.class,
-        "java.lang.IllegalArgumentException: Cannot find a snapshot older than 1970-01-01", () -> {
-        shell.executeStatement("SELECT * FROM customers FOR SYSTEM_TIME AS OF '1970-01-01 00:00:00'");
-        });
+    try {

Review Comment:
   Why is this change?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] veghlaci05 commented on a diff in pull request #3289: HIVE-25976: Cleaner may remove files being accessed from a fetch-task-converted reader

Posted by GitBox <gi...@apache.org>.
veghlaci05 commented on code in PR #3289:
URL: https://github.com/apache/hive/pull/3289#discussion_r904762574


##########
ql/src/java/org/apache/hadoop/hive/ql/exec/FetchTask.java:
##########
@@ -134,39 +121,18 @@ public void setMaxRows(int maxRows) {
     this.maxRows = maxRows;
   }
 
-  public boolean fetch(List res) throws IOException {
-    sink.reset(res);
-    int rowsRet = work.getLeastNumRows();
-    if (rowsRet <= 0) {
-      rowsRet = work.getLimit() >= 0 ? Math.min(work.getLimit() - totalRows, maxRows) : maxRows;
-    }
-    try {
-      if (rowsRet <= 0 || work.getLimit() == totalRows) {
-        fetch.clearFetchContext();
+  public boolean fetch(List res) {
+    if (cachingEnabled) {
+      if (currentRow >= fetchedData.size()) {
         return false;
       }
-      boolean fetched = false;
-      while (sink.getNumRows() < rowsRet) {
-        if (!fetch.pushRow()) {
-          if (work.getLeastNumRows() > 0) {
-            throw new HiveException("leastNumRows check failed");
-          }
-
-          // Closing the operator can sometimes yield more rows (HIVE-11892)
-          fetch.closeOperator();
-
-          return fetched;
-        }
-        fetched = true;
-      }
-      return true;
-    } catch (IOException e) {
-      throw e;
-    } catch (Exception e) {
-      throw new IOException(e);
-    } finally {
-      totalRows += sink.getNumRows();
+      int toIndex = Math.min(fetchedData.size(), currentRow + maxRows);
+      res.addAll(fetchedData.subList(currentRow, toIndex));
+      currentRow = toIndex;
+    } else {
+      executeInner(res);

Review Comment:
   good catch, I'll fix it



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] pvary commented on a diff in pull request #3289: HIVE-25976: Cleaner may remove files being accessed from a fetch-task-converted reader

Posted by GitBox <gi...@apache.org>.
pvary commented on code in PR #3289:
URL: https://github.com/apache/hive/pull/3289#discussion_r904752539


##########
ql/src/java/org/apache/hadoop/hive/ql/exec/FetchTask.java:
##########
@@ -66,32 +70,13 @@ public void initialize(QueryState queryState, QueryPlan queryPlan, TaskQueue tas
     super.initialize(queryState, queryPlan, taskQueue, context);
     work.initializeForFetch(context.getOpContext());
 
+    cachingEnabled = HiveConf.getBoolVar(conf, HiveConf.ConfVars.HIVEFETCHTASKCACHING);
+    fetchedData = new ArrayList<>();
+
     try {
       // Create a file system handle
-      if (job == null) {
-        // The job config should be initilaized once per fetch task. In case of refetch, we should use the
-        // same config.
-        job = new JobConf(conf);
-      }
-
-      Operator<?> source = work.getSource();
-      if (source instanceof TableScanOperator) {
-        TableScanOperator ts = (TableScanOperator) source;
-        // push down projections
-        ColumnProjectionUtils.appendReadColumns(job, ts.getNeededColumnIDs(), ts.getNeededColumns(),
-                ts.getNeededNestedColumnPaths(), ts.getConf().hasVirtualCols());
-        // push down filters and as of information
-        HiveInputFormat.pushFiltersAndAsOf(job, ts, null);
-
-        AcidUtils.setAcidOperationalProperties(job, ts.getConf().isTranscationalTable(),
-            ts.getConf().getAcidOperationalProperties());
-      }
-      sink = work.getSink();
-      fetch = new FetchOperator(work, job, source, getVirtualColumns(source));
-      source.initialize(conf, new ObjectInspector[]{fetch.getOutputObjectInspector()});
-      totalRows = 0;
-      ExecMapper.setDone(false);
-
+      job = new JobConf(conf);

Review Comment:
   What happend with the comment?
   ```
   The job config should be initilaized once per fetch task. In case of refetch, we should use the same config.
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] veghlaci05 commented on a diff in pull request #3289: HIVE-25976: Cleaner may remove files being accessed from a fetch-task-converted reader

Posted by GitBox <gi...@apache.org>.
veghlaci05 commented on code in PR #3289:
URL: https://github.com/apache/hive/pull/3289#discussion_r904753607


##########
iceberg/iceberg-handler/src/test/java/org/apache/iceberg/mr/hive/TestHiveIcebergTimeTravel.java:
##########
@@ -50,10 +49,14 @@ public void testSelectAsOfTimestamp() throws IOException, InterruptedException {
 
     Assert.assertEquals(4, rows.size());
 
-    AssertHelpers.assertThrows("should throw exception", IllegalArgumentException.class,
-        "java.lang.IllegalArgumentException: Cannot find a snapshot older than 1970-01-01", () -> {
-        shell.executeStatement("SELECT * FROM customers FOR SYSTEM_TIME AS OF '1970-01-01 00:00:00'");
-        });
+    try {

Review Comment:
   With the change the FetchTask now executed inside the transaction in the Driver, and as a result, if an exception is thrown, it is wrapped by CommandProcessorException. There is a follow-up task: [HIVE-26348](https://issues.apache.org/jira/browse/HIVE-26348) to improve the error message, because now the original reason is not shown to the user



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] kasakrisz commented on a diff in pull request #3289: HIVE-25976: Cleaner may remove files being accessed from a fetch-task-converted reader

Posted by GitBox <gi...@apache.org>.
kasakrisz commented on code in PR #3289:
URL: https://github.com/apache/hive/pull/3289#discussion_r907347860


##########
ql/src/java/org/apache/hadoop/hive/ql/optimizer/SimpleFetchOptimizer.java:
##########
@@ -152,28 +152,32 @@ private FetchTask optimize(ParseContext pctx, String alias, TableScanOperator so
   }
 
   private boolean checkThreshold(FetchData data, int limit, ParseContext pctx) throws Exception {
-    if (limit > 0) {
-      if (data.hasOnlyPruningFilter()) {
-        /* partitioned table + query has only pruning filters */
-        return true;
-      } else if (data.isPartitioned() == false && data.isFiltered() == false) {
-        /* unpartitioned table + no filters */
-        return true;
+    boolean cachingEnabled = HiveConf.getBoolVar(pctx.getConf(), HiveConf.ConfVars.HIVEFETCHTASKCACHING);
+    if (!cachingEnabled) {
+      /* if caching is enabled we apply the treshold in all cases */

Review Comment:
   The `cachingEnabled` variable name and the comment suggest that true means caching is enabled but the `if` predicate is negated. Does is means that this branch should run when caching is disabled?



##########
ql/src/test/results/clientpositive/llap/nonmr_fetch_threshold.q.out:
##########
@@ -115,12 +115,15 @@ Plan optimized by CBO.
 Stage-0
   Fetch Operator
     limit:10
-    Select Operator [SEL_2]
-      Output:["_col0","_col1","_col2","_col3"]
-      Limit [LIM_3]
-        Number of rows:10
-        TableScan [TS_0]
-          Output:["key","value"]
+    Stage-1
+      Map 1 vectorized, llap
+      File Output Operator [FS_8]

Review Comment:
   It seems that this plan has a Map phase. Does this expected?



##########
ql/src/java/org/apache/hadoop/hive/ql/optimizer/SimpleFetchOptimizer.java:
##########
@@ -152,28 +152,32 @@ private FetchTask optimize(ParseContext pctx, String alias, TableScanOperator so
   }
 
   private boolean checkThreshold(FetchData data, int limit, ParseContext pctx) throws Exception {
-    if (limit > 0) {
-      if (data.hasOnlyPruningFilter()) {
-        /* partitioned table + query has only pruning filters */
-        return true;
-      } else if (data.isPartitioned() == false && data.isFiltered() == false) {
-        /* unpartitioned table + no filters */
-        return true;
+    boolean cachingEnabled = HiveConf.getBoolVar(pctx.getConf(), HiveConf.ConfVars.HIVEFETCHTASKCACHING);
+    if (!cachingEnabled) {
+      /* if caching is enabled we apply the treshold in all cases */
+      if (limit > 0) {
+        if (data.hasOnlyPruningFilter()) {
+          /* partitioned table + query has only pruning filters */

Review Comment:
   nit: I thinks we should use `//` for one line comments.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] veghlaci05 commented on a diff in pull request #3289: HIVE-25976: Cleaner may remove files being accessed from a fetch-task-converted reader

Posted by GitBox <gi...@apache.org>.
veghlaci05 commented on code in PR #3289:
URL: https://github.com/apache/hive/pull/3289#discussion_r904756053


##########
ql/src/java/org/apache/hadoop/hive/ql/exec/FetchTask.java:
##########
@@ -66,32 +70,13 @@ public void initialize(QueryState queryState, QueryPlan queryPlan, TaskQueue tas
     super.initialize(queryState, queryPlan, taskQueue, context);
     work.initializeForFetch(context.getOpContext());
 
+    cachingEnabled = HiveConf.getBoolVar(conf, HiveConf.ConfVars.HIVEFETCHTASKCACHING);
+    fetchedData = new ArrayList<>();
+
     try {
       // Create a file system handle
-      if (job == null) {
-        // The job config should be initilaized once per fetch task. In case of refetch, we should use the
-        // same config.
-        job = new JobConf(conf);
-      }
-
-      Operator<?> source = work.getSource();
-      if (source instanceof TableScanOperator) {
-        TableScanOperator ts = (TableScanOperator) source;
-        // push down projections
-        ColumnProjectionUtils.appendReadColumns(job, ts.getNeededColumnIDs(), ts.getNeededColumns(),
-                ts.getNeededNestedColumnPaths(), ts.getConf().hasVirtualCols());
-        // push down filters and as of information
-        HiveInputFormat.pushFiltersAndAsOf(job, ts, null);
-
-        AcidUtils.setAcidOperationalProperties(job, ts.getConf().isTranscationalTable(),
-            ts.getConf().getAcidOperationalProperties());
-      }
-      sink = work.getSink();
-      fetch = new FetchOperator(work, job, source, getVirtualColumns(source));
-      source.initialize(conf, new ObjectInspector[]{fetch.getOutputObjectInspector()});
-      totalRows = 0;
-      ExecMapper.setDone(false);
-
+      job = new JobConf(conf);

Review Comment:
   Initialize now called only once, FetchTask reinitalization moved to the private initFetch() method which is called from resetFetch(). So from now on there is no need to check if the field has already initialized or not.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org