You are viewing a plain text version of this content. The canonical link for it is here.
Posted to gitbox@hive.apache.org by GitBox <gi...@apache.org> on 2022/01/26 11:40:00 UTC

[GitHub] [hive] kgyrtkirk commented on a change in pull request #2971: HIVE-25883: cleaner skip addendum

kgyrtkirk commented on a change in pull request #2971:
URL: https://github.com/apache/hive/pull/2971#discussion_r792552279



##########
File path: ql/src/java/org/apache/hadoop/hive/ql/txn/compactor/Cleaner.java
##########
@@ -434,8 +437,18 @@ private boolean removeFiles(String location, ValidWriteIdList writeIdList, Compa
     return success;
   }
 
-  private boolean hasDataBelowWatermark(FileSystem fs, Path path, long highWatermark) throws IOException {
-    FileStatus[] children = fs.listStatus(path);
+  private boolean hasDataBelowWatermark(AcidDirectory acidDir, FileSystem fs, Path path, long highWatermark)
+      throws IOException {
+    Set<Path> acidPaths = new HashSet<>();
+    for (ParsedDelta delta : acidDir.getCurrentDirectories()) {
+      acidPaths.add(delta.getPath());
+    }
+    if (acidDir.getBaseDirectory() != null) {
+      acidPaths.add(acidDir.getBaseDirectory());
+    }
+    FileStatus[] children = fs.listStatus(path, p -> {
+      return !acidPaths.contains(p);
+    });
     for (FileStatus child : children) {
       if (isFileBelowWatermark(child, highWatermark)) {

Review comment:
       1. I believe that in case there are files in the dir they already should be in the `obsolete` list; I just wanted to be conservative in this method - but I think returning true there would be correct as well
   2. the `highWatermark` is inclusive; but this method's name is isBelowWatermark - so it only looks for files which are below the watermark
     w.r.t to `delta_1_5` ; I think its not below `5` because it contains data from `writeId` 5.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org