You are viewing a plain text version of this content. The canonical link for it is here.
Posted to gitbox@hive.apache.org by GitBox <gi...@apache.org> on 2022/01/24 19:38:39 UTC

[GitHub] [hive] kgyrtkirk opened a new pull request #2971: HIVE-25883: cleaner skip addendum

kgyrtkirk opened a new pull request #2971:
URL: https://github.com/apache/hive/pull/2971


   (cherry picked from commit 4c742cd5b0c523e1c102faa109b66bd92aac2ece)
   
   <!--
   Thanks for sending a pull request!  Here are some tips for you:
     1. If this is your first time, please read our contributor guidelines: https://cwiki.apache.org/confluence/display/Hive/HowToContribute
     2. Ensure that you have created an issue on the Hive project JIRA: https://issues.apache.org/jira/projects/HIVE/summary
     3. Ensure you have added or run the appropriate tests for your PR: 
     4. If the PR is unfinished, add '[WIP]' in your PR title, e.g., '[WIP]HIVE-XXXXX:  Your PR title ...'.
     5. Be sure to keep the PR description updated to reflect all changes.
     6. Please write your PR title to summarize what this PR proposes.
     7. If possible, provide a concise example to reproduce the issue for a faster review.
   
   -->
   
   ### What changes were proposed in this pull request?
   <!--
   Please clarify what changes you are proposing. The purpose of this section is to outline the changes and how this PR fixes the issue. 
   If possible, please consider writing useful notes for better and faster reviews in your PR. See the examples below.
     1. If you refactor some codes with changing classes, showing the class hierarchy will help reviewers.
     2. If you fix some SQL features, you can provide some references of other DBMSes.
     3. If there is design documentation, please add the link.
     4. If there is a discussion in the mailing list, please add the link.
   -->
   
   
   ### Why are the changes needed?
   <!--
   Please clarify why the changes are needed. For instance,
     1. If you propose a new API, clarify the use case for a new API.
     2. If you fix a bug, you can clarify why it is a bug.
   -->
   
   
   ### Does this PR introduce _any_ user-facing change?
   <!--
   Note that it means *any* user-facing change including all aspects such as the documentation fix.
   If yes, please clarify the previous behavior and the change this PR proposes - provide the console output, description, screenshot and/or a reproducable example to show the behavior difference if possible.
   If possible, please also clarify if this is a user-facing change compared to the released Hive versions or within the unreleased branches such as master.
   If no, write 'No'.
   -->
   
   
   ### How was this patch tested?
   <!--
   If tests were added, say they were added here. Please make sure to add some test cases that check the changes thoroughly including negative and positive cases if possible.
   If it was tested in a way different from regular unit tests, please clarify how you tested step by step, ideally copy and paste-able, so that other reviewers can test and check, and descendants can verify in the future.
   If tests were not added, please describe why they were not added and/or why it was difficult to add.
   -->
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] klcopp commented on a change in pull request #2971: HIVE-25977: Enhance Compaction Cleaner to skip when there is nothing to do #2

Posted by GitBox <gi...@apache.org>.
klcopp commented on a change in pull request #2971:
URL: https://github.com/apache/hive/pull/2971#discussion_r817789462



##########
File path: ql/src/java/org/apache/hadoop/hive/ql/txn/compactor/Cleaner.java
##########
@@ -439,29 +440,40 @@ private boolean removeFiles(String location, ValidWriteIdList writeIdList, Compa
     return success;
   }
 
-  private boolean hasDataBelowWatermark(FileSystem fs, Path path, long highWatermark) throws IOException {
-    FileStatus[] children = fs.listStatus(path);
+  private boolean hasDataBelowWatermark(AcidDirectory acidDir, FileSystem fs, Path path, long highWatermark,
+      long minOpenTxn)
+      throws IOException {
+    Set<Path> acidPaths = new HashSet<>();
+    for (ParsedDelta delta : acidDir.getCurrentDirectories()) {
+      acidPaths.add(delta.getPath());
+    }
+    if (acidDir.getBaseDirectory() != null) {
+      acidPaths.add(acidDir.getBaseDirectory());
+    }
+    FileStatus[] children = fs.listStatus(path, p -> {
+      return !acidPaths.contains(p);
+    });
     for (FileStatus child : children) {
-      if (isFileBelowWatermark(child, highWatermark)) {
+      if (isFileBelowWatermark(child, highWatermark, minOpenTxn)) {
         return true;
       }
     }
     return false;
   }
 
-  private boolean isFileBelowWatermark(FileStatus child, long highWatermark) {
+  private boolean isFileBelowWatermark(FileStatus child, long highWatermark, long minOpenTxn) {
     Path p = child.getPath();
     String fn = p.getName();
     if (!child.isDirectory()) {
-      return false;
+      return true;
     }
     if (fn.startsWith(AcidUtils.BASE_PREFIX)) {
       ParsedBaseLight b = ParsedBaseLight.parseBase(p);
-      return b.getWriteId() < highWatermark;
+      return b.getWriteId() <= highWatermark && b.getVisibilityTxnId() <= minOpenTxn;

Review comment:
       I can't remember if we discussed this. What if the table has files:
   ```
   base_2_v5
   delta_3_3
   delta_4_4
   base_4_v16 (this compaction)
   ``` 
   
   
   hwm=4
   
   Say the worker is older and did not populate CQ_NEXT_TXN_ID, so the cleaner will pick up any compaction in "ready for cleaning"
   Say minTxnId=13
   The AcidDirectory will contain:
   ```
   base_2_v5
   delta_3_3
   delta_4_4
   ```
   
   So isFileBelowWatermark will investigate:
   ```
   base_4_v16
   ```
   And return false because 4=b.getWriteId() <= highWatermark=4 is true and 16=b.getVisibilityTxnId() <= minOpenTxn=13 is false.
   
   
   Also I probably have amnesia but can you remind me of the rationale behind `b.getVisibilityTxnId() <= minOpenTxn` here?
   Also keep in mind that the visibilityTxnId is always 0 unless the delta/base is a product of compaction.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] klcopp commented on a change in pull request #2971: HIVE-25977: Enhance Compaction Cleaner to skip when there is nothing to do #2

Posted by GitBox <gi...@apache.org>.
klcopp commented on a change in pull request #2971:
URL: https://github.com/apache/hive/pull/2971#discussion_r820588551



##########
File path: ql/src/java/org/apache/hadoop/hive/ql/txn/compactor/Cleaner.java
##########
@@ -469,11 +469,11 @@ private boolean isFileBelowWatermark(FileStatus child, long highWatermark, long
     }
     if (fn.startsWith(AcidUtils.BASE_PREFIX)) {
       ParsedBaseLight b = ParsedBaseLight.parseBase(p);
-      return b.getWriteId() <= highWatermark;
+      return b.getWriteId() < highWatermark;
     }
     if (fn.startsWith(AcidUtils.DELTA_PREFIX) || fn.startsWith(AcidUtils.DELETE_DELTA_PREFIX)) {
       ParsedDeltaLight d = ParsedDeltaLight.parse(p);
-      return d.getMaxWriteId() <= highWatermark;

Review comment:
       I think you still need the <=, especially because CQ_COMMIT_TIME might be NULL




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] klcopp commented on a change in pull request #2971: HIVE-25977: Enhance Compaction Cleaner to skip when there is nothing to do #2

Posted by GitBox <gi...@apache.org>.
klcopp commented on a change in pull request #2971:
URL: https://github.com/apache/hive/pull/2971#discussion_r817789462



##########
File path: ql/src/java/org/apache/hadoop/hive/ql/txn/compactor/Cleaner.java
##########
@@ -439,29 +440,40 @@ private boolean removeFiles(String location, ValidWriteIdList writeIdList, Compa
     return success;
   }
 
-  private boolean hasDataBelowWatermark(FileSystem fs, Path path, long highWatermark) throws IOException {
-    FileStatus[] children = fs.listStatus(path);
+  private boolean hasDataBelowWatermark(AcidDirectory acidDir, FileSystem fs, Path path, long highWatermark,
+      long minOpenTxn)
+      throws IOException {
+    Set<Path> acidPaths = new HashSet<>();
+    for (ParsedDelta delta : acidDir.getCurrentDirectories()) {
+      acidPaths.add(delta.getPath());
+    }
+    if (acidDir.getBaseDirectory() != null) {
+      acidPaths.add(acidDir.getBaseDirectory());
+    }
+    FileStatus[] children = fs.listStatus(path, p -> {
+      return !acidPaths.contains(p);
+    });
     for (FileStatus child : children) {
-      if (isFileBelowWatermark(child, highWatermark)) {
+      if (isFileBelowWatermark(child, highWatermark, minOpenTxn)) {
         return true;
       }
     }
     return false;
   }
 
-  private boolean isFileBelowWatermark(FileStatus child, long highWatermark) {
+  private boolean isFileBelowWatermark(FileStatus child, long highWatermark, long minOpenTxn) {
     Path p = child.getPath();
     String fn = p.getName();
     if (!child.isDirectory()) {
-      return false;
+      return true;
     }
     if (fn.startsWith(AcidUtils.BASE_PREFIX)) {
       ParsedBaseLight b = ParsedBaseLight.parseBase(p);
-      return b.getWriteId() < highWatermark;
+      return b.getWriteId() <= highWatermark && b.getVisibilityTxnId() <= minOpenTxn;

Review comment:
       I can't remember if we discussed this. What if the table has files:
   ```
   base_2_v5
   delta_3_3
   delta_4_4
   base_4_v16
   ``` (this compaction)
   
   hwm=4
   
   Say the worker is older and did not populate CQ_NEXT_TXN_ID, so the cleaner will pick up any compaction in "ready for cleaning"
   Say minTxnId=13
   The AcidDirectory will contain:
   base_2_v5
   delta_3_3
   delta_4_4
   
   So isFileBelowWatermark will investigate:
   base_4_v16
   And return false because 4=b.getWriteId() <= highWatermark=4 is true and 16=b.getVisibilityTxnId() <= minOpenTxn=13 is false.
   
   
   Also I probably have amnesia but can you remind me of the rationale behind `b.getVisibilityTxnId() <= minOpenTxn` here?
   Also keep in mind that the visibilityTxnId is always 0 unless the delta/base is a product of compaction.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] kgyrtkirk closed pull request #2971: HIVE-25977: Enhance Compaction Cleaner to skip when there is nothing to do #2

Posted by GitBox <gi...@apache.org>.
kgyrtkirk closed pull request #2971:
URL: https://github.com/apache/hive/pull/2971


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] kgyrtkirk commented on a change in pull request #2971: HIVE-25883: cleaner skip addendum

Posted by GitBox <gi...@apache.org>.
kgyrtkirk commented on a change in pull request #2971:
URL: https://github.com/apache/hive/pull/2971#discussion_r793491772



##########
File path: ql/src/java/org/apache/hadoop/hive/ql/txn/compactor/Cleaner.java
##########
@@ -448,15 +461,15 @@ private boolean isFileBelowWatermark(FileStatus child, long highWatermark) {
     Path p = child.getPath();
     String fn = p.getName();
     if (!child.isDirectory()) {
-      return false;
+      return true;
     }
     if (fn.startsWith(AcidUtils.BASE_PREFIX)) {
       ParsedBaseLight b = ParsedBaseLight.parseBase(p);
-      return b.getWriteId() < highWatermark;
+      return b.getWriteId() <= highWatermark;
     }
     if (fn.startsWith(AcidUtils.DELTA_PREFIX) || fn.startsWith(AcidUtils.DELETE_DELTA_PREFIX)) {
       ParsedDeltaLight d = ParsedDeltaLight.parse(p);
-      return d.getMaxWriteId() < highWatermark;
+      return d.getMaxWriteId() <= highWatermark;

Review comment:
       I think at some point
   * the delta/base will become visible 
   * then the real acid dir will be using these files
   * and the current one will become candidates for the cleaner
   
   so I think the current logic makes sense




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] klcopp commented on a change in pull request #2971: HIVE-25883: cleaner skip addendum

Posted by GitBox <gi...@apache.org>.
klcopp commented on a change in pull request #2971:
URL: https://github.com/apache/hive/pull/2971#discussion_r793559890



##########
File path: ql/src/java/org/apache/hadoop/hive/ql/txn/compactor/Cleaner.java
##########
@@ -448,15 +461,15 @@ private boolean isFileBelowWatermark(FileStatus child, long highWatermark) {
     Path p = child.getPath();
     String fn = p.getName();
     if (!child.isDirectory()) {
-      return false;
+      return true;
     }
     if (fn.startsWith(AcidUtils.BASE_PREFIX)) {
       ParsedBaseLight b = ParsedBaseLight.parseBase(p);
-      return b.getWriteId() < highWatermark;
+      return b.getWriteId() <= highWatermark;

Review comment:
       The AcidDirectory could be based off of some older txn id (depending on how the Cleaner selects which compaction to pick up – if CQ_NEXT_TXN_ID is filled in the cleaner will only attempt to clean when it can, however if CQ_NEXT_TXN_ID==NULL then the cleaner will try to clean even if it's blocked by an obsolete dir). It could be that txnId=99 was opened, table a was created, table a is compacted many times but the cleaner will never clean anything from table a because txid=99 sees table a as totally empty.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] deniskuzZ commented on a change in pull request #2971: HIVE-25883: cleaner skip addendum

Posted by GitBox <gi...@apache.org>.
deniskuzZ commented on a change in pull request #2971:
URL: https://github.com/apache/hive/pull/2971#discussion_r793757325



##########
File path: ql/src/java/org/apache/hadoop/hive/ql/txn/compactor/Cleaner.java
##########
@@ -434,8 +437,18 @@ private boolean removeFiles(String location, ValidWriteIdList writeIdList, Compa
     return success;
   }
 
-  private boolean hasDataBelowWatermark(FileSystem fs, Path path, long highWatermark) throws IOException {
-    FileStatus[] children = fs.listStatus(path);
+  private boolean hasDataBelowWatermark(AcidDirectory acidDir, FileSystem fs, Path path, long highWatermark)
+      throws IOException {
+    Set<Path> acidPaths = new HashSet<>();
+    for (ParsedDelta delta : acidDir.getCurrentDirectories()) {
+      acidPaths.add(delta.getPath());
+    }
+    if (acidDir.getBaseDirectory() != null) {
+      acidPaths.add(acidDir.getBaseDirectory());
+    }
+    FileStatus[] children = fs.listStatus(path, p -> {
+      return !acidPaths.contains(p);
+    });
     for (FileStatus child : children) {
       if (isFileBelowWatermark(child, highWatermark)) {

Review comment:
       @kgyrtkirk, could you please try the following test case (simulate legacy behavior when CQ_NEXT_TXN_ID=NULL by commenting out updateWSCommitIdAndCleanUpMetadata in CompactionTxnHandler):
   ````
         @Test
     public void testReady() throws Exception {
       String dbName = "default";
       String tblName = "trfcp";
       String partName = "ds=today";
       Table t = newTable(dbName, tblName, true);
       Partition p = newPartition(t, "today");
   
       // minor compaction
       addBaseFile(t, p, 19L, 19);
       addDeltaFile(t, p, 20L, 20L, 1);
       addDeltaFile(t, p, 21L, 21L, 1);
       addDeltaFile(t, p, 22L, 22L, 1);
       burnThroughTransactions(dbName, tblName, 22);
       
       // block cleaner with an open txn
       long blockingTxn = openTxn();
       
       CompactionRequest rqst = new CompactionRequest(dbName, tblName, CompactionType.MINOR);
       rqst.setPartitionname(partName);
       long ctxnid = compactInTxn(rqst);
       addDeltaFile(t, p, 20, 22, 2, ctxnid);
       startCleaner();
   
       // make sure cleaner didn't remove anything, and cleaning is still queued
       List<Path> paths = getDirectories(conf, t, p);
       Assert.assertEquals("Expected 5 files after minor compaction, instead these files were present " + paths,
         5, paths.size());
       ShowCompactResponse rsp = txnHandler.showCompact(new ShowCompactRequest());
       Assert.assertEquals("Expected 1 compaction in queue, got: " + rsp.getCompacts(), 1, rsp.getCompactsSize());
       Assert.assertEquals(TxnStore.CLEANING_RESPONSE, rsp.getCompacts().get(0).getState());
   }
   ```` 
   We cannot simply filter out current and base dirs cleanup of which is blocked by openTxn. This is breaking the idea implemented in HIVE-24314. If there is a blocking open txn, retry until it's gone and we could proceed with cleanup. 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] kgyrtkirk commented on a change in pull request #2971: HIVE-25883: cleaner skip addendum

Posted by GitBox <gi...@apache.org>.
kgyrtkirk commented on a change in pull request #2971:
URL: https://github.com/apache/hive/pull/2971#discussion_r792867605



##########
File path: ql/src/java/org/apache/hadoop/hive/ql/txn/compactor/Cleaner.java
##########
@@ -434,8 +437,18 @@ private boolean removeFiles(String location, ValidWriteIdList writeIdList, Compa
     return success;
   }
 
-  private boolean hasDataBelowWatermark(FileSystem fs, Path path, long highWatermark) throws IOException {
-    FileStatus[] children = fs.listStatus(path);
+  private boolean hasDataBelowWatermark(AcidDirectory acidDir, FileSystem fs, Path path, long highWatermark)
+      throws IOException {
+    Set<Path> acidPaths = new HashSet<>();
+    for (ParsedDelta delta : acidDir.getCurrentDirectories()) {
+      acidPaths.add(delta.getPath());
+    }
+    if (acidDir.getBaseDirectory() != null) {
+      acidPaths.add(acidDir.getBaseDirectory());
+    }
+    FileStatus[] children = fs.listStatus(path, p -> {
+      return !acidPaths.contains(p);
+    });
     for (FileStatus child : children) {
       if (isFileBelowWatermark(child, highWatermark)) {

Review comment:
       after some thinking I convinced myself that you are right  :)
   
   * I've changed to return `true` for non-directories
   ** in the background these should appear as `obsolete` files anyway; so it should not cause any real trouble in the scope of HIVE-25883; but it makes the method live up to its name...
   * since the latest patch we are checking and excluding all the dirs the actual acid dir is using - so if we have anything below or even at the writeid level that should be considered invalid; the `nothingToCleanAfterAbortsDelta` testcase is a "complicated" case but the default case is similar to this with the new checks.
   
   pushed a new commit to update these




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] klcopp commented on a change in pull request #2971: HIVE-25883: cleaner skip addendum

Posted by GitBox <gi...@apache.org>.
klcopp commented on a change in pull request #2971:
URL: https://github.com/apache/hive/pull/2971#discussion_r793559890



##########
File path: ql/src/java/org/apache/hadoop/hive/ql/txn/compactor/Cleaner.java
##########
@@ -448,15 +461,15 @@ private boolean isFileBelowWatermark(FileStatus child, long highWatermark) {
     Path p = child.getPath();
     String fn = p.getName();
     if (!child.isDirectory()) {
-      return false;
+      return true;
     }
     if (fn.startsWith(AcidUtils.BASE_PREFIX)) {
       ParsedBaseLight b = ParsedBaseLight.parseBase(p);
-      return b.getWriteId() < highWatermark;
+      return b.getWriteId() <= highWatermark;

Review comment:
       The AcidDirectory could be based off of some older txn id (depending on how the Cleaner selects which compaction to pick up – if CQ_NEXT_TXN_ID is filled in the cleaner will only attempt to clean when it can, however if CQ_NEXT_TXN_ID==NULL then the cleaner will try to clean even if it's blocked by an obsolete dir). It could be that txnId=99 was opened, table a was created, table a is compacted many times but the cleaner will never clean anything from table a until txnid=99 is closed, because txnid=99 sees table a as totally empty.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] deniskuzZ commented on a change in pull request #2971: HIVE-25883: cleaner skip addendum

Posted by GitBox <gi...@apache.org>.
deniskuzZ commented on a change in pull request #2971:
URL: https://github.com/apache/hive/pull/2971#discussion_r793732688



##########
File path: ql/src/java/org/apache/hadoop/hive/ql/txn/compactor/Cleaner.java
##########
@@ -434,8 +437,18 @@ private boolean removeFiles(String location, ValidWriteIdList writeIdList, Compa
     return success;
   }
 
-  private boolean hasDataBelowWatermark(FileSystem fs, Path path, long highWatermark) throws IOException {
-    FileStatus[] children = fs.listStatus(path);
+  private boolean hasDataBelowWatermark(AcidDirectory acidDir, FileSystem fs, Path path, long highWatermark)
+      throws IOException {
+    Set<Path> acidPaths = new HashSet<>();
+    for (ParsedDelta delta : acidDir.getCurrentDirectories()) {
+      acidPaths.add(delta.getPath());
+    }
+    if (acidDir.getBaseDirectory() != null) {
+      acidPaths.add(acidDir.getBaseDirectory());
+    }
+    FileStatus[] children = fs.listStatus(path, p -> {
+      return !acidPaths.contains(p);
+    });
     for (FileStatus child : children) {
       if (isFileBelowWatermark(child, highWatermark)) {

Review comment:
       @klcopp the current logic is that if there is an open txn started before Compaction txn committed, we won't even pick this request for cleanup. 
   
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] kgyrtkirk merged pull request #2971: HIVE-25977: Enhance Compaction Cleaner to skip when there is nothing to do #2

Posted by GitBox <gi...@apache.org>.
kgyrtkirk merged pull request #2971:
URL: https://github.com/apache/hive/pull/2971


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] klcopp commented on a change in pull request #2971: HIVE-25883: cleaner skip addendum

Posted by GitBox <gi...@apache.org>.
klcopp commented on a change in pull request #2971:
URL: https://github.com/apache/hive/pull/2971#discussion_r793418454



##########
File path: ql/src/java/org/apache/hadoop/hive/ql/txn/compactor/Cleaner.java
##########
@@ -448,15 +461,15 @@ private boolean isFileBelowWatermark(FileStatus child, long highWatermark) {
     Path p = child.getPath();
     String fn = p.getName();
     if (!child.isDirectory()) {
-      return false;
+      return true;
     }
     if (fn.startsWith(AcidUtils.BASE_PREFIX)) {
       ParsedBaseLight b = ParsedBaseLight.parseBase(p);
-      return b.getWriteId() < highWatermark;
+      return b.getWriteId() <= highWatermark;

Review comment:
       This is good, but should also make sure that this base isn't the recently compacted base (AcidUtils#isCompactedBase should help).

##########
File path: ql/src/java/org/apache/hadoop/hive/ql/txn/compactor/Cleaner.java
##########
@@ -448,15 +461,15 @@ private boolean isFileBelowWatermark(FileStatus child, long highWatermark) {
     Path p = child.getPath();
     String fn = p.getName();
     if (!child.isDirectory()) {
-      return false;
+      return true;
     }
     if (fn.startsWith(AcidUtils.BASE_PREFIX)) {
       ParsedBaseLight b = ParsedBaseLight.parseBase(p);
-      return b.getWriteId() < highWatermark;
+      return b.getWriteId() <= highWatermark;
     }
     if (fn.startsWith(AcidUtils.DELTA_PREFIX) || fn.startsWith(AcidUtils.DELETE_DELTA_PREFIX)) {
       ParsedDeltaLight d = ParsedDeltaLight.parse(p);
-      return d.getMaxWriteId() < highWatermark;
+      return d.getMaxWriteId() <= highWatermark;

Review comment:
       Same as above: the delta might be the result of the latest compaction (visibilityTxnId > 0)




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] klcopp commented on a change in pull request #2971: HIVE-25883: cleaner skip addendum

Posted by GitBox <gi...@apache.org>.
klcopp commented on a change in pull request #2971:
URL: https://github.com/apache/hive/pull/2971#discussion_r792628396



##########
File path: ql/src/java/org/apache/hadoop/hive/ql/txn/compactor/Cleaner.java
##########
@@ -434,8 +437,18 @@ private boolean removeFiles(String location, ValidWriteIdList writeIdList, Compa
     return success;
   }
 
-  private boolean hasDataBelowWatermark(FileSystem fs, Path path, long highWatermark) throws IOException {
-    FileStatus[] children = fs.listStatus(path);
+  private boolean hasDataBelowWatermark(AcidDirectory acidDir, FileSystem fs, Path path, long highWatermark)
+      throws IOException {
+    Set<Path> acidPaths = new HashSet<>();
+    for (ParsedDelta delta : acidDir.getCurrentDirectories()) {
+      acidPaths.add(delta.getPath());
+    }
+    if (acidDir.getBaseDirectory() != null) {
+      acidPaths.add(acidDir.getBaseDirectory());
+    }
+    FileStatus[] children = fs.listStatus(path, p -> {
+      return !acidPaths.contains(p);
+    });
     for (FileStatus child : children) {
       if (isFileBelowWatermark(child, highWatermark)) {

Review comment:
       1.
   > I believe that in case there are files in the dir they already should be in the obsolete list
   
   Not necessarily, because the AcidDirectory the Cleaner uses is computed based on an older txnId (cleanerWaterMark), so there is a chance its obsolete list does not contain files that should be cleaned up eventually, which is what this method is supposed to figure out. (Right?)
   
   @deniskuzZ please correct me if I'm wrong about this since I know there have been recent changes to this logic
   
   2. I meant that if the table dir contains:
   
   - delta_5_5
   - delta_1_5_v100 (minor compacted)
   
   Then the cleaner should eventually remove delta_5_5, so there will be files to remove later, when the cleanerWaterMark is high enough




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] deniskuzZ commented on a change in pull request #2971: HIVE-25883: cleaner skip addendum

Posted by GitBox <gi...@apache.org>.
deniskuzZ commented on a change in pull request #2971:
URL: https://github.com/apache/hive/pull/2971#discussion_r793757325



##########
File path: ql/src/java/org/apache/hadoop/hive/ql/txn/compactor/Cleaner.java
##########
@@ -434,8 +437,18 @@ private boolean removeFiles(String location, ValidWriteIdList writeIdList, Compa
     return success;
   }
 
-  private boolean hasDataBelowWatermark(FileSystem fs, Path path, long highWatermark) throws IOException {
-    FileStatus[] children = fs.listStatus(path);
+  private boolean hasDataBelowWatermark(AcidDirectory acidDir, FileSystem fs, Path path, long highWatermark)
+      throws IOException {
+    Set<Path> acidPaths = new HashSet<>();
+    for (ParsedDelta delta : acidDir.getCurrentDirectories()) {
+      acidPaths.add(delta.getPath());
+    }
+    if (acidDir.getBaseDirectory() != null) {
+      acidPaths.add(acidDir.getBaseDirectory());
+    }
+    FileStatus[] children = fs.listStatus(path, p -> {
+      return !acidPaths.contains(p);
+    });
     for (FileStatus child : children) {
       if (isFileBelowWatermark(child, highWatermark)) {

Review comment:
       @kgyrtkirk, could you please try the following test case (simulate legacy behavior when CQ_NEXT_TXN_ID=NULL by commenting out updateWSCommitIdAndCleanUpMetadata in CompactionTxnHandler):
   ````
       String dbName = "default";
       String tblName = "trfcp";
       String partName = "ds=today";
       Table t = newTable(dbName, tblName, true);
       Partition p = newPartition(t, "today");
   
       // block cleaner with an open txn
       long blockingTxn = openTxn();
   
       // minor compaction
       addBaseFile(t, p, 20L, 20);
       addDeltaFile(t, p, 21L, 21L, 1);
       addDeltaFile(t, p, 22L, 22L, 1);
       burnThroughTransactions(dbName, tblName, 22);
       CompactionRequest rqst = new CompactionRequest(dbName, tblName, CompactionType.MINOR);
       rqst.setPartitionname(partName);
       compactInTxn(rqst);
       addDeltaFile(t, p, 21, 22, 2);
       startCleaner();
   
       // make sure cleaner didn't remove anything, and cleaning is still queued
       List<Path> paths = getDirectories(conf, t, p);
       Assert.assertEquals("Expected 4 files after minor compaction, instead these files were present " + paths,
         4, paths.size());
       ShowCompactResponse rsp = txnHandler.showCompact(new ShowCompactRequest());
       Assert.assertEquals("Expected 1 compaction in queue, got: " + rsp.getCompacts(), 1, rsp.getCompactsSize());
       Assert.assertEquals(TxnStore.CLEANING_RESPONSE, rsp.getCompacts().get(0).getState());
       Assert.assertEquals(CompactionType.MINOR, rsp.getCompacts().get(0).getType());
   ```` 
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] deniskuzZ commented on a change in pull request #2971: HIVE-25883: cleaner skip addendum

Posted by GitBox <gi...@apache.org>.
deniskuzZ commented on a change in pull request #2971:
URL: https://github.com/apache/hive/pull/2971#discussion_r793858490



##########
File path: ql/src/java/org/apache/hadoop/hive/ql/txn/compactor/Cleaner.java
##########
@@ -434,8 +437,18 @@ private boolean removeFiles(String location, ValidWriteIdList writeIdList, Compa
     return success;
   }
 
-  private boolean hasDataBelowWatermark(FileSystem fs, Path path, long highWatermark) throws IOException {
-    FileStatus[] children = fs.listStatus(path);
+  private boolean hasDataBelowWatermark(AcidDirectory acidDir, FileSystem fs, Path path, long highWatermark)
+      throws IOException {
+    Set<Path> acidPaths = new HashSet<>();
+    for (ParsedDelta delta : acidDir.getCurrentDirectories()) {
+      acidPaths.add(delta.getPath());
+    }
+    if (acidDir.getBaseDirectory() != null) {
+      acidPaths.add(acidDir.getBaseDirectory());
+    }
+    FileStatus[] children = fs.listStatus(path, p -> {
+      return !acidPaths.contains(p);
+    });
     for (FileStatus child : children) {
       if (isFileBelowWatermark(child, highWatermark)) {

Review comment:
       note, there is shitty/unreliable logic for openTxn. When running the above test, put a breakpoint in Cleaner at  cleanerWaterMark and set minTxnIdSeenOpen to 23 (it might be 1, due to artificial commit txn in TXNS table)   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] klcopp commented on a change in pull request #2971: HIVE-25883: cleaner skip addendum

Posted by GitBox <gi...@apache.org>.
klcopp commented on a change in pull request #2971:
URL: https://github.com/apache/hive/pull/2971#discussion_r792524567



##########
File path: ql/src/java/org/apache/hadoop/hive/ql/txn/compactor/Cleaner.java
##########
@@ -434,8 +437,18 @@ private boolean removeFiles(String location, ValidWriteIdList writeIdList, Compa
     return success;
   }
 
-  private boolean hasDataBelowWatermark(FileSystem fs, Path path, long highWatermark) throws IOException {
-    FileStatus[] children = fs.listStatus(path);
+  private boolean hasDataBelowWatermark(AcidDirectory acidDir, FileSystem fs, Path path, long highWatermark)
+      throws IOException {
+    Set<Path> acidPaths = new HashSet<>();
+    for (ParsedDelta delta : acidDir.getCurrentDirectories()) {
+      acidPaths.add(delta.getPath());
+    }
+    if (acidDir.getBaseDirectory() != null) {
+      acidPaths.add(acidDir.getBaseDirectory());
+    }
+    FileStatus[] children = fs.listStatus(path, p -> {
+      return !acidPaths.contains(p);
+    });
     for (FileStatus child : children) {
       if (isFileBelowWatermark(child, highWatermark)) {

Review comment:
       Commenting on the contents of isFileBelowWatermark since I can't comment there...
   
   1. `if (!child.isDirectory()) {
         return false;
       }`
   There could be original files in the table directory that should be deleted.
   
   2. `return b.getWriteId() < highWatermark;`
   the highWatermark is inclusive, so if for some reason the table directory contains:
   delta_5_5
   delta_1_5_v100 (minor compacted) -- this includes the data in delta_5_5.
   then isFileBelowWatermark would return false but it should return true.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] kgyrtkirk commented on a change in pull request #2971: HIVE-25883: cleaner skip addendum

Posted by GitBox <gi...@apache.org>.
kgyrtkirk commented on a change in pull request #2971:
URL: https://github.com/apache/hive/pull/2971#discussion_r792552279



##########
File path: ql/src/java/org/apache/hadoop/hive/ql/txn/compactor/Cleaner.java
##########
@@ -434,8 +437,18 @@ private boolean removeFiles(String location, ValidWriteIdList writeIdList, Compa
     return success;
   }
 
-  private boolean hasDataBelowWatermark(FileSystem fs, Path path, long highWatermark) throws IOException {
-    FileStatus[] children = fs.listStatus(path);
+  private boolean hasDataBelowWatermark(AcidDirectory acidDir, FileSystem fs, Path path, long highWatermark)
+      throws IOException {
+    Set<Path> acidPaths = new HashSet<>();
+    for (ParsedDelta delta : acidDir.getCurrentDirectories()) {
+      acidPaths.add(delta.getPath());
+    }
+    if (acidDir.getBaseDirectory() != null) {
+      acidPaths.add(acidDir.getBaseDirectory());
+    }
+    FileStatus[] children = fs.listStatus(path, p -> {
+      return !acidPaths.contains(p);
+    });
     for (FileStatus child : children) {
       if (isFileBelowWatermark(child, highWatermark)) {

Review comment:
       1. I believe that in case there are files in the dir they already should be in the `obsolete` list; I just wanted to be conservative in this method - but I think returning true there would be correct as well
   2. the `highWatermark` is inclusive; but this method's name is isBelowWatermark - so it only looks for files which are below the watermark
     w.r.t to `delta_1_5` ; I think its not below `5` because it contains data from `writeId` 5.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] kgyrtkirk commented on a change in pull request #2971: HIVE-25977: Enhance Compaction Cleaner to skip when there is nothing to do #2

Posted by GitBox <gi...@apache.org>.
kgyrtkirk commented on a change in pull request #2971:
URL: https://github.com/apache/hive/pull/2971#discussion_r820684883



##########
File path: ql/src/java/org/apache/hadoop/hive/ql/txn/compactor/Cleaner.java
##########
@@ -469,11 +469,11 @@ private boolean isFileBelowWatermark(FileStatus child, long highWatermark, long
     }
     if (fn.startsWith(AcidUtils.BASE_PREFIX)) {
       ParsedBaseLight b = ParsedBaseLight.parseBase(p);
-      return b.getWriteId() <= highWatermark;
+      return b.getWriteId() < highWatermark;
     }
     if (fn.startsWith(AcidUtils.DELTA_PREFIX) || fn.startsWith(AcidUtils.DELETE_DELTA_PREFIX)) {
       ParsedDeltaLight d = ParsedDeltaLight.parse(p);
-      return d.getMaxWriteId() <= highWatermark;

Review comment:
       exactly :facepalm: 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] deniskuzZ commented on a change in pull request #2971: HIVE-25977: Enhance Compaction Cleaner to skip when there is nothing to do #2

Posted by GitBox <gi...@apache.org>.
deniskuzZ commented on a change in pull request #2971:
URL: https://github.com/apache/hive/pull/2971#discussion_r825828027



##########
File path: ql/src/test/org/apache/hadoop/hive/ql/txn/compactor/CompactorTest.java
##########
@@ -271,6 +273,11 @@ protected void addLegacyFile(Table t, Partition p, int numRecords) throws Except
     addFile(t, p, 0, 0, numRecords, FileType.LEGACY, 2, true);
   }
 
+  protected void addDeltaFile(Table t, Partition p, long minTxn, long maxTxn, int numRecords,
+      long visibilityId) throws Exception {
+    addFile(t, p, minTxn, maxTxn, numRecords, FileType.DELTA, 2, true, visibilityId);

Review comment:
       this change has 0 effect, as it visibilityId is not used in delta file creation. Should be 
   ````
   case DELTA: filename = AcidUtils.addVisibilitySuffix(
       makeDeltaDirName(minTxn, maxTxn),
       visibilityId); 
   ````




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] klcopp commented on a change in pull request #2971: HIVE-25883: cleaner skip addendum

Posted by GitBox <gi...@apache.org>.
klcopp commented on a change in pull request #2971:
URL: https://github.com/apache/hive/pull/2971#discussion_r794396068



##########
File path: ql/src/java/org/apache/hadoop/hive/ql/txn/compactor/Cleaner.java
##########
@@ -448,15 +461,15 @@ private boolean isFileBelowWatermark(FileStatus child, long highWatermark) {
     Path p = child.getPath();
     String fn = p.getName();
     if (!child.isDirectory()) {
-      return false;
+      return true;
     }
     if (fn.startsWith(AcidUtils.BASE_PREFIX)) {
       ParsedBaseLight b = ParsedBaseLight.parseBase(p);
-      return b.getWriteId() < highWatermark;
+      return b.getWriteId() <= highWatermark;

Review comment:
       You're right. Might be worth adding test cases to make sure that the Worker#isEnoughToCompact logic isn't changed though, because if it is then it will break the cleaner.
   e.g. 
   1. CQ_NEXT_TXN_ID is never updated (set it to NULL before the cleaner runs) and table dir contains base_7
   2. CQ_NEXT_TXN_ID is not updated (set it to NULL before the cleaner runs) and table dir contains delta_1_7 and base_7




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] kgyrtkirk commented on a change in pull request #2971: HIVE-25883: cleaner skip addendum

Posted by GitBox <gi...@apache.org>.
kgyrtkirk commented on a change in pull request #2971:
URL: https://github.com/apache/hive/pull/2971#discussion_r793665040



##########
File path: ql/src/java/org/apache/hadoop/hive/ql/txn/compactor/Cleaner.java
##########
@@ -448,15 +461,15 @@ private boolean isFileBelowWatermark(FileStatus child, long highWatermark) {
     Path p = child.getPath();
     String fn = p.getName();
     if (!child.isDirectory()) {
-      return false;
+      return true;
     }
     if (fn.startsWith(AcidUtils.BASE_PREFIX)) {
       ParsedBaseLight b = ParsedBaseLight.parseBase(p);
-      return b.getWriteId() < highWatermark;
+      return b.getWriteId() <= highWatermark;

Review comment:
       I think this will not cause any problem - as I think this means the following:
   * until txind=99 is closed the aciddir looks empty
   * however we will have some dir which is not yet visible below the watermark => which will prevent the clean request from being removed
   
   I don't think the above is incorrect behaviour - or could lead to the retention of `ready for cleaning` for indefinite times?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] deniskuzZ commented on a change in pull request #2971: HIVE-25883: cleaner skip addendum

Posted by GitBox <gi...@apache.org>.
deniskuzZ commented on a change in pull request #2971:
URL: https://github.com/apache/hive/pull/2971#discussion_r793859682



##########
File path: ql/src/java/org/apache/hadoop/hive/ql/txn/compactor/Cleaner.java
##########
@@ -434,8 +437,18 @@ private boolean removeFiles(String location, ValidWriteIdList writeIdList, Compa
     return success;
   }
 
-  private boolean hasDataBelowWatermark(FileSystem fs, Path path, long highWatermark) throws IOException {
-    FileStatus[] children = fs.listStatus(path);
+  private boolean hasDataBelowWatermark(AcidDirectory acidDir, FileSystem fs, Path path, long highWatermark)
+      throws IOException {
+    Set<Path> acidPaths = new HashSet<>();
+    for (ParsedDelta delta : acidDir.getCurrentDirectories()) {
+      acidPaths.add(delta.getPath());
+    }
+    if (acidDir.getBaseDirectory() != null) {
+      acidPaths.add(acidDir.getBaseDirectory());
+    }
+    FileStatus[] children = fs.listStatus(path, p -> {
+      return !acidPaths.contains(p);
+    });
     for (FileStatus child : children) {
       if (isFileBelowWatermark(child, highWatermark)) {

Review comment:
       + minor change in compacted delta file that lacks visibility id
   ````
   case DELTA: filename = AcidUtils.addVisibilitySuffix(makeDeltaDirName(minTxn, maxTxn),visibilityId); break;
   ````




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] kgyrtkirk commented on a change in pull request #2971: HIVE-25883: cleaner skip addendum

Posted by GitBox <gi...@apache.org>.
kgyrtkirk commented on a change in pull request #2971:
URL: https://github.com/apache/hive/pull/2971#discussion_r793490303



##########
File path: ql/src/java/org/apache/hadoop/hive/ql/txn/compactor/Cleaner.java
##########
@@ -448,15 +461,15 @@ private boolean isFileBelowWatermark(FileStatus child, long highWatermark) {
     Path p = child.getPath();
     String fn = p.getName();
     if (!child.isDirectory()) {
-      return false;
+      return true;
     }
     if (fn.startsWith(AcidUtils.BASE_PREFIX)) {
       ParsedBaseLight b = ParsedBaseLight.parseBase(p);
-      return b.getWriteId() < highWatermark;
+      return b.getWriteId() <= highWatermark;

Review comment:
       I don't think we would need any more checks - we already make sure that we ignore the files we use for the real acid dir:
   * if we have anything beyond that below the watermark we should keep the request
   * if there is nothing the request should be discarded...
   
   I don't think we have any more problematic cases; do you have a testcase in mind?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] kgyrtkirk commented on a change in pull request #2971: HIVE-25977: Enhance Compaction Cleaner to skip when there is nothing to do #2

Posted by GitBox <gi...@apache.org>.
kgyrtkirk commented on a change in pull request #2971:
URL: https://github.com/apache/hive/pull/2971#discussion_r820544902



##########
File path: ql/src/java/org/apache/hadoop/hive/ql/txn/compactor/Cleaner.java
##########
@@ -439,29 +440,40 @@ private boolean removeFiles(String location, ValidWriteIdList writeIdList, Compa
     return success;
   }
 
-  private boolean hasDataBelowWatermark(FileSystem fs, Path path, long highWatermark) throws IOException {
-    FileStatus[] children = fs.listStatus(path);
+  private boolean hasDataBelowWatermark(AcidDirectory acidDir, FileSystem fs, Path path, long highWatermark,
+      long minOpenTxn)
+      throws IOException {
+    Set<Path> acidPaths = new HashSet<>();
+    for (ParsedDelta delta : acidDir.getCurrentDirectories()) {
+      acidPaths.add(delta.getPath());
+    }
+    if (acidDir.getBaseDirectory() != null) {
+      acidPaths.add(acidDir.getBaseDirectory());
+    }
+    FileStatus[] children = fs.listStatus(path, p -> {
+      return !acidPaths.contains(p);
+    });
     for (FileStatus child : children) {
-      if (isFileBelowWatermark(child, highWatermark)) {
+      if (isFileBelowWatermark(child, highWatermark, minOpenTxn)) {
         return true;
       }
     }
     return false;
   }
 
-  private boolean isFileBelowWatermark(FileStatus child, long highWatermark) {
+  private boolean isFileBelowWatermark(FileStatus child, long highWatermark, long minOpenTxn) {
     Path p = child.getPath();
     String fn = p.getName();
     if (!child.isDirectory()) {
-      return false;
+      return true;
     }
     if (fn.startsWith(AcidUtils.BASE_PREFIX)) {
       ParsedBaseLight b = ParsedBaseLight.parseBase(p);
-      return b.getWriteId() < highWatermark;
+      return b.getWriteId() <= highWatermark && b.getVisibilityTxnId() <= minOpenTxn;

Review comment:
       yes - this seemed fishy; this was changed during that zoom call; I changed this back to `b.getWriteId() < highWatermark` because I think thats the right conditional for this.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] klcopp commented on a change in pull request #2971: HIVE-25977: Enhance Compaction Cleaner to skip when there is nothing to do #2

Posted by GitBox <gi...@apache.org>.
klcopp commented on a change in pull request #2971:
URL: https://github.com/apache/hive/pull/2971#discussion_r820588551



##########
File path: ql/src/java/org/apache/hadoop/hive/ql/txn/compactor/Cleaner.java
##########
@@ -469,11 +469,11 @@ private boolean isFileBelowWatermark(FileStatus child, long highWatermark, long
     }
     if (fn.startsWith(AcidUtils.BASE_PREFIX)) {
       ParsedBaseLight b = ParsedBaseLight.parseBase(p);
-      return b.getWriteId() <= highWatermark;
+      return b.getWriteId() < highWatermark;
     }
     if (fn.startsWith(AcidUtils.DELTA_PREFIX) || fn.startsWith(AcidUtils.DELETE_DELTA_PREFIX)) {
       ParsedDeltaLight d = ParsedDeltaLight.parse(p);
-      return d.getMaxWriteId() <= highWatermark;

Review comment:
       I think you still need the <=, especially because CQ_COMMIT_TIME might be NULL
   Edit: [example](https://github.com/apache/hive/pull/2971#discussion_r817789462)




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org