You are viewing a plain text version of this content. The canonical link for it is here.
Posted to gitbox@hive.apache.org by GitBox <gi...@apache.org> on 2020/12/01 17:52:50 UTC

[GitHub] [hive] pvargacl commented on a change in pull request #1716: HIVE-24444: compactor.Cleaner should not set state 'mark cleaned' if there are obsolete files in the FS

pvargacl commented on a change in pull request #1716:
URL: https://github.com/apache/hive/pull/1716#discussion_r533608167



##########
File path: ql/src/java/org/apache/hadoop/hive/ql/txn/compactor/Cleaner.java
##########
@@ -316,6 +314,30 @@ private boolean removeFiles(String location, ValidWriteIdList writeIdList, Compa
       }
       fs.delete(dead, true);
     }
-    return true;
+    // Check if there will be more obsolete directories to clean when possible. We will only mark cleaned when this
+    // number reaches 0.
+    return getNumEventuallyObsoleteDirs(location, dirSnapshots) == 0;
+  }
+
+  /**
+   * Get the number of base/delta directories the Cleaner should remove eventually. If we check this after cleaning
+   * we can see if the Cleaner has further work to do in this table/partition directory that it hasn't been able to
+   * finish, e.g. because of an open transaction at the time of compaction.
+   * We do this by assuming that there are no open transactions anywhere and then calling getAcidState. If there are
+   * obsolete directories, then the Cleaner has more work to do.
+   * @param location location of table
+   * @return number of dirs left for the cleaner to clean – eventually
+   * @throws IOException
+   */
+  private int getNumEventuallyObsoleteDirs(String location, Map<Path, AcidUtils.HdfsDirSnapshot> dirSnapshots)
+      throws IOException {
+    ValidTxnList validTxnList = new ValidReadTxnList();
+    //save it so that getAcidState() sees it
+    conf.set(ValidTxnList.VALID_TXNS_KEY, validTxnList.writeToString());
+    ValidReaderWriteIdList validWriteIdList = new ValidReaderWriteIdList();
+    Path locPath = new Path(location);
+    AcidUtils.Directory dir = AcidUtils.getAcidState(locPath.getFileSystem(conf), locPath, conf, validWriteIdList,
+        Ref.from(false), false, dirSnapshots);
+    return dir.getObsolete().size();

Review comment:
       I think checking the aborted is a very problematic in upstream, if you have transaction coming in continuously and some of them aborting cleaner won't be able to mark any compaction cleaned. Example:
   txn1-10 coming in some of them writing, txnid=8 is aborting
   compaction is kicking in, compacts the deltas, will have the next_txn_id for example txnId=13
   other transactions are coming in txnid=18 is aborting
   cleaner will start when everything is done under 13, it can clean everything. but since txnId=14 is open it will not see txnId=18, so when it checks if can mark itself cleaned, it will not do it, because the check sees txnId=18 aborted.
   Second compaction kicks in, will have next_txn_id for example txnId=23
   other transactions are coming in txnid=28 is aborting.
   Again cleaner can not mark compaction 1 or compaction 2 cleaned, as none of them see 28 and can not clean it. this will pile up and Cleaner only mark them cleaned, if finally there is an iteration, when there was no new aborted transaction.
   If you check only the obsoleted files, you only miss marking the compaction cleaned, when you should have, if you have two overlapping compaction, but in that case the second compaction will get marked as cleaned and the next cleaner run, compaction 1 will be marked as cleaned too. This too can cause problems if you enabled delayed cleaner, but probably less




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org