You are viewing a plain text version of this content. The canonical link for it is here.
Posted to gitbox@hive.apache.org by GitBox <gi...@apache.org> on 2021/06/22 17:33:39 UTC

[GitHub] [hive] coufon opened a new pull request #2421: HIVE-25277: Avoid any duplicated calls to isEmpty when delete a partition.

coufon opened a new pull request #2421:
URL: https://github.com/apache/hive/pull/2421


   <!--
   Thanks for sending a pull request!  Here are some tips for you:
     1. If this is your first time, please read our contributor guidelines: https://cwiki.apache.org/confluence/display/Hive/HowToContribute
     2. Ensure that you have created an issue on the Hive project JIRA: https://issues.apache.org/jira/projects/HIVE/summary
     3. Ensure you have added or run the appropriate tests for your PR: 
     4. If the PR is unfinished, add '[WIP]' in your PR title, e.g., '[WIP]HIVE-XXXXX:  Your PR title ...'.
     5. Be sure to keep the PR description updated to reflect all changes.
     6. Please write your PR title to summarize what this PR proposes.
     7. If possible, provide a concise example to reproduce the issue for a faster review.
   
   -->
   
   ### What changes were proposed in this pull request?
   <!--
   Please clarify what changes you are proposing. The purpose of this section is to outline the changes and how this PR fixes the issue. 
   If possible, please consider writing useful notes for better and faster reviews in your PR. See the examples below.
     1. If you refactor some codes with changing classes, showing the class hierarchy will help reviewers.
     2. If you fix some SQL features, you can provide some references of other DBMSes.
     3. If there is design documentation, please add the link.
     4. If there is a discussion in the mailing list, please add the link.
   -->
   
   Deleting a Hive partition is slow when use a Cloud object store as the warehouse for which ListFiles is expensive. A root cause is that the recursive parent dir deletion is very inefficient: there are many duplicated calls to isEmpty (ListFiles is called at the end). This fix sorts the parents to delete according to the path size, and always processes the longest one (e.g., a/b/c is always before a/b). As a result, each parent path is only needed to be checked once.
   
   ### Why are the changes needed?
   <!--
   Please clarify why the changes are needed. For instance,
     1. If you propose a new API, clarify the use case for a new API.
     2. If you fix a bug, you can clarify why it is a bug.
   -->
   
   It improves partition deletion time. The improvement is around 7X in our test.
   
   ### Does this PR introduce _any_ user-facing change?
   <!--
   Note that it means *any* user-facing change including all aspects such as the documentation fix.
   If yes, please clarify the previous behavior and the change this PR proposes - provide the console output, description, screenshot and/or a reproducable example to show the behavior difference if possible.
   If possible, please also clarify if this is a user-facing change compared to the released Hive versions or within the unreleased branches such as master.
   If no, write 'No'.
   -->
   
   No
   
   ### How was this patch tested?
   <!--
   If tests were added, say they were added here. Please make sure to add some test cases that check the changes thoroughly including negative and positive cases if possible.
   If it was tested in a way different from regular unit tests, please clarify how you tested step by step, ideally copy and paste-able, so that other reviewers can test and check, and descendants can verify in the future.
   If tests were not added, please describe why they were not added and/or why it was difficult to add.
   -->
   
   Tested partition deletion end-to-end.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] coufon commented on a change in pull request #2421: HIVE-25277: fix slow partition deletion issue by removing duplicated isEmpty checks.

Posted by GitBox <gi...@apache.org>.
coufon commented on a change in pull request #2421:
URL: https://github.com/apache/hive/pull/2421#discussion_r702365874



##########
File path: standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/HMSHandler.java
##########
@@ -5156,14 +5155,40 @@ public boolean drop_partition(final String db_name, final String tbl_name,
         null);
   }
 
-  private static class PathAndPartValSize {
-    PathAndPartValSize(Path path, int partValSize) {
-      this.path = path;
-      this.partValSize = partValSize;
+    /** Stores a path and its size. */
+    private static class PathAndDepth implements Comparable<PathAndDepth> {
+
+      final Path path;
+      final int depth;
+
+      public PathAndDepth(Path path, int depth) {
+        this.path = path;
+        this.depth = depth;
+      }
+
+      @Override
+      public int hashCode() {
+        return Objects.hash(path.hashCode(), depth);
+      }
+
+      @Override
+      public boolean equals(Object o) {
+        if (o == this) {
+          return true;
+        }
+        if (!(o instanceof PathAndDepth)) {
+          return false;
+        }
+        PathAndDepth p = (PathAndDepth) o;
+        return path.equals(p.path) && depth == p.depth;

Review comment:
       Done.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] sunchao commented on a change in pull request #2421: HIVE-25277: fix slow partition deletion issue by removing duplicated isEmpty checks.

Posted by GitBox <gi...@apache.org>.
sunchao commented on a change in pull request #2421:
URL: https://github.com/apache/hive/pull/2421#discussion_r702364444



##########
File path: standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/HMSHandler.java
##########
@@ -5156,14 +5155,40 @@ public boolean drop_partition(final String db_name, final String tbl_name,
         null);
   }
 
-  private static class PathAndPartValSize {
-    PathAndPartValSize(Path path, int partValSize) {
-      this.path = path;
-      this.partValSize = partValSize;
+    /** Stores a path and its size. */
+    private static class PathAndDepth implements Comparable<PathAndDepth> {
+
+      final Path path;
+      final int depth;
+
+      public PathAndDepth(Path path, int depth) {
+        this.path = path;
+        this.depth = depth;
+      }
+
+      @Override
+      public int hashCode() {
+        return Objects.hash(path.hashCode(), depth);
+      }
+
+      @Override
+      public boolean equals(Object o) {
+        if (o == this) {
+          return true;
+        }
+        if (!(o instanceof PathAndDepth)) {
+          return false;
+        }
+        PathAndDepth p = (PathAndDepth) o;
+        return path.equals(p.path) && depth == p.depth;

Review comment:
       we can just use auto-generated `equals` method and should handle null here too:
   ```java
         @Override
         public boolean equals(Object o) {
           if (this == o) return true;
           if (o == null || getClass() != o.getClass()) return false;
           PathAndDepth that = (PathAndDepth) o;
           return depth == that.depth && Objects.equals(path, that.path);
         }
   ```

##########
File path: standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/HMSHandler.java
##########
@@ -5156,14 +5155,40 @@ public boolean drop_partition(final String db_name, final String tbl_name,
         null);
   }
 
-  private static class PathAndPartValSize {
-    PathAndPartValSize(Path path, int partValSize) {
-      this.path = path;
-      this.partValSize = partValSize;
+    /** Stores a path and its size. */
+    private static class PathAndDepth implements Comparable<PathAndDepth> {
+

Review comment:
       nit: extra empty line




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] medb commented on a change in pull request #2421: HIVE-25277: fix slow partition deletion issue by removing duplicated isEmpty checks.

Posted by GitBox <gi...@apache.org>.
medb commented on a change in pull request #2421:
URL: https://github.com/apache/hive/pull/2421#discussion_r658031926



##########
File path: standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/HMSHandler.java
##########
@@ -5088,10 +5089,8 @@ private boolean isDatabaseRemote(String name) {
 
   private void deleteParentRecursive(Path parent, int depth, boolean mustPurge, boolean needRecycle)

Review comment:
       Is there a reason why it can not use [HCFS API to delete dir recursively](https://hadoop.apache.org/docs/current/api/org/apache/hadoop/fs/FileSystem.html#delete-org.apache.hadoop.fs.Path-boolean-)?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] coufon commented on pull request #2421: HIVE-25277: fix slow partition deletion issue by removing duplicated isEmpty checks.

Posted by GitBox <gi...@apache.org>.
coufon commented on pull request #2421:
URL: https://github.com/apache/hive/pull/2421#issuecomment-882711397


   > Zoltan Haindrich Haymant Mangla Naveen Gangam may you take a look and merge this PR?
   
   Thank you @medb! @kgyrtkirk @hmangla98 @nrg4878 a friendly ping, could you please take a look? Thank you!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] coufon commented on a change in pull request #2421: HIVE-25277: fix slow partition deletion issue by removing duplicated isEmpty checks.

Posted by GitBox <gi...@apache.org>.
coufon commented on a change in pull request #2421:
URL: https://github.com/apache/hive/pull/2421#discussion_r658120606



##########
File path: standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/HMSHandler.java
##########
@@ -5088,10 +5089,8 @@ private boolean isDatabaseRemote(String name) {
 
   private void deleteParentRecursive(Path parent, int depth, boolean mustPurge, boolean needRecycle)

Review comment:
       I think it works if deleteDir fails when the dir is non-empty with no side effect. But we need to make sure all filesystems that Hive supports work in this way. Also I noticed that deleteDir calls moveToTrash first, so it could be more complex: https://github.com/apache/hive/blob/f2de30ca8bc2b63887496775f9a0769057a17ee0/standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/HiveMetaStoreFsImpl.java#L41. Avoiding duplicated checks seems to be safer.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] medb edited a comment on pull request #2421: HIVE-25277: fix slow partition deletion issue by removing duplicated isEmpty checks.

Posted by GitBox <gi...@apache.org>.
medb edited a comment on pull request #2421:
URL: https://github.com/apache/hive/pull/2421#issuecomment-880798336


   @kgyrtkirk @hmangla98 may you take a look and merge this PR?
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] coufon commented on pull request #2421: HIVE-25277: fix slow partition deletion issue by removing duplicated isEmpty checks.

Posted by GitBox <gi...@apache.org>.
coufon commented on pull request #2421:
URL: https://github.com/apache/hive/pull/2421#issuecomment-913086678


   > Thanks Zhou Fang . LGTM with two nits.
   
   Thank you Chao for the helpful comments.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] ywskycn commented on pull request #2421: HIVE-25277: fix slow partition deletion issue by removing duplicated isEmpty checks.

Posted by GitBox <gi...@apache.org>.
ywskycn commented on pull request #2421:
URL: https://github.com/apache/hive/pull/2421#issuecomment-912244652


   @sunchao help take a look?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] coufon commented on a change in pull request #2421: HIVE-25277: fix slow partition deletion issue by removing duplicated isEmpty checks.

Posted by GitBox <gi...@apache.org>.
coufon commented on a change in pull request #2421:
URL: https://github.com/apache/hive/pull/2421#discussion_r702075034



##########
File path: standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/HMSHandler.java
##########
@@ -5104,14 +5103,34 @@ public boolean drop_partition(final String db_name, final String tbl_name,
         null);
   }
 
-  private static class PathAndPartValSize {
-    PathAndPartValSize(Path path, int partValSize) {
-      this.path = path;
-      this.partValSize = partValSize;
+    /** Stores a path and its size. */
+    private static class PathAndPartValSize implements Comparable<PathAndPartValSize> {

Review comment:
       Done.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] medb commented on a change in pull request #2421: HIVE-25277: fix slow partition deletion issue by removing duplicated isEmpty checks.

Posted by GitBox <gi...@apache.org>.
medb commented on a change in pull request #2421:
URL: https://github.com/apache/hive/pull/2421#discussion_r658092035



##########
File path: standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/HMSHandler.java
##########
@@ -5088,10 +5089,8 @@ private boolean isDatabaseRemote(String name) {
 
   private void deleteParentRecursive(Path parent, int depth, boolean mustPurge, boolean needRecycle)

Review comment:
       I see. Did we consider just calling a non-recursive `deleteDir` on parent instead of checking if it's empty and based on the delete success/failure try to delete its parent recursively?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] ranu010101 commented on a change in pull request #2421: HIVE-25277: fix slow partition deletion issue by removing duplicated isEmpty checks.

Posted by GitBox <gi...@apache.org>.
ranu010101 commented on a change in pull request #2421:
URL: https://github.com/apache/hive/pull/2421#discussion_r658051894



##########
File path: standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/HMSHandler.java
##########
@@ -5088,10 +5089,8 @@ private boolean isDatabaseRemote(String name) {
 
   private void deleteParentRecursive(Path parent, int depth, boolean mustPurge, boolean needRecycle)

Review comment:
       HCFS Api deletes all children recursively while this method (deleteParentRecursive) deletes a file and keeps on deleting parent directories if they are empty.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] coufon commented on a change in pull request #2421: HIVE-25277: fix slow partition deletion issue by removing duplicated isEmpty checks.

Posted by GitBox <gi...@apache.org>.
coufon commented on a change in pull request #2421:
URL: https://github.com/apache/hive/pull/2421#discussion_r702365624



##########
File path: standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/HMSHandler.java
##########
@@ -5156,14 +5155,40 @@ public boolean drop_partition(final String db_name, final String tbl_name,
         null);
   }
 
-  private static class PathAndPartValSize {
-    PathAndPartValSize(Path path, int partValSize) {
-      this.path = path;
-      this.partValSize = partValSize;
+    /** Stores a path and its size. */
+    private static class PathAndDepth implements Comparable<PathAndDepth> {
+

Review comment:
       Done.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] coufon commented on a change in pull request #2421: HIVE-25277: fix slow partition deletion issue by removing duplicated isEmpty checks.

Posted by GitBox <gi...@apache.org>.
coufon commented on a change in pull request #2421:
URL: https://github.com/apache/hive/pull/2421#discussion_r702075927



##########
File path: standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/HMSHandler.java
##########
@@ -5104,14 +5103,34 @@ public boolean drop_partition(final String db_name, final String tbl_name,
         null);
   }
 
-  private static class PathAndPartValSize {
-    PathAndPartValSize(Path path, int partValSize) {
-      this.path = path;
-      this.partValSize = partValSize;
+    /** Stores a path and its size. */
+    private static class PathAndPartValSize implements Comparable<PathAndPartValSize> {
+
+      public Path path;

Review comment:
       Done




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] sunchao commented on a change in pull request #2421: HIVE-25277: fix slow partition deletion issue by removing duplicated isEmpty checks.

Posted by GitBox <gi...@apache.org>.
sunchao commented on a change in pull request #2421:
URL: https://github.com/apache/hive/pull/2421#discussion_r702046193



##########
File path: standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/HMSHandler.java
##########
@@ -5104,14 +5103,34 @@ public boolean drop_partition(final String db_name, final String tbl_name,
         null);
   }
 
-  private static class PathAndPartValSize {
-    PathAndPartValSize(Path path, int partValSize) {
-      this.path = path;
-      this.partValSize = partValSize;
+    /** Stores a path and its size. */
+    private static class PathAndPartValSize implements Comparable<PathAndPartValSize> {
+
+      public Path path;
+      int partValSize;
+
+      public PathAndPartValSize(Path path, int partValSize) {
+        this.path = path;
+        this.partValSize = partValSize;
+      }
+
+      @Override
+      public boolean equals(Object o) {
+        if (o == this) {
+          return true;
+        }
+        if (!(o instanceof PathAndPartValSize)) {
+          return false;
+        }
+        return path.equals(((PathAndPartValSize) o).path);
+      }
+
+      /** The highest {@code partValSize} is processed first in a {@link PriorityQueue}. */
+      @Override
+      public int compareTo(PathAndPartValSize o) {
+        return ((PathAndPartValSize) o).partValSize - partValSize;

Review comment:
       nit: the cast `(PathAndPartValSize) o` is unnecessary

##########
File path: standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/HMSHandler.java
##########
@@ -5104,14 +5103,34 @@ public boolean drop_partition(final String db_name, final String tbl_name,
         null);
   }
 
-  private static class PathAndPartValSize {
-    PathAndPartValSize(Path path, int partValSize) {
-      this.path = path;
-      this.partValSize = partValSize;
+    /** Stores a path and its size. */
+    private static class PathAndPartValSize implements Comparable<PathAndPartValSize> {
+
+      public Path path;
+      int partValSize;
+
+      public PathAndPartValSize(Path path, int partValSize) {
+        this.path = path;
+        this.partValSize = partValSize;
+      }
+
+      @Override
+      public boolean equals(Object o) {

Review comment:
       we should also implement `hashCode` together with `equals`?

##########
File path: standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/HMSHandler.java
##########
@@ -5240,16 +5259,38 @@ public DropPartitionsResult drop_partitions_req(
         for (Path path : archToDelete) {
           wh.deleteDir(path, true, mustPurge, needsCm);
         }
+
+        // Uses a priority queue to delete the parents of deleted directories if empty.
+        // The parent with the largest size is always processed first. It guarantees that
+        // the emptiness of a parent won't be changed once it has been processed. So duplicated
+        // processing can be avoided.
+        PriorityQueue<PathAndPartValSize> parentsToDelete = new PriorityQueue<>();
         for (PathAndPartValSize p : dirsToDelete) {
           wh.deleteDir(p.path, true, mustPurge, needsCm);
+          addParentForDel(parentsToDelete, p);
+        }
+
+        HashSet<PathAndPartValSize> processed = new HashSet<>();
+        while (!parentsToDelete.isEmpty()) {
           try {
-            deleteParentRecursive(p.path.getParent(), p.partValSize - 1, mustPurge, needsCm);
+            PathAndPartValSize p = parentsToDelete.poll();
+            if (processed.contains(p)) {
+              continue;
+            }
+            processed.add(p);
+
+            Path path = p.path;
+            if (wh.isWritable(path) && wh.isDir(path) && wh.isEmptyDir(path)) {

Review comment:
       `wh.isDir(path) && wh.isEmptyDir(path)` seems duplicated? why we need `wh.isDir` if we already have `wh.isEmptyDir`?

##########
File path: standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/HMSHandler.java
##########
@@ -5104,14 +5103,34 @@ public boolean drop_partition(final String db_name, final String tbl_name,
         null);
   }
 
-  private static class PathAndPartValSize {
-    PathAndPartValSize(Path path, int partValSize) {
-      this.path = path;
-      this.partValSize = partValSize;
+    /** Stores a path and its size. */
+    private static class PathAndPartValSize implements Comparable<PathAndPartValSize> {
+
+      public Path path;
+      int partValSize;
+
+      public PathAndPartValSize(Path path, int partValSize) {
+        this.path = path;
+        this.partValSize = partValSize;
+      }
+
+      @Override
+      public boolean equals(Object o) {
+        if (o == this) {
+          return true;
+        }
+        if (!(o instanceof PathAndPartValSize)) {
+          return false;
+        }
+        return path.equals(((PathAndPartValSize) o).path);

Review comment:
       why we only compare `path` but not `partValSize`?

##########
File path: standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/HMSHandler.java
##########
@@ -5104,14 +5103,34 @@ public boolean drop_partition(final String db_name, final String tbl_name,
         null);
   }
 
-  private static class PathAndPartValSize {
-    PathAndPartValSize(Path path, int partValSize) {
-      this.path = path;
-      this.partValSize = partValSize;
+    /** Stores a path and its size. */
+    private static class PathAndPartValSize implements Comparable<PathAndPartValSize> {
+
+      public Path path;

Review comment:
       nit: we can just use package-private instead of `public`. Also these should be final

##########
File path: standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/HMSHandler.java
##########
@@ -5104,14 +5103,34 @@ public boolean drop_partition(final String db_name, final String tbl_name,
         null);
   }
 
-  private static class PathAndPartValSize {
-    PathAndPartValSize(Path path, int partValSize) {
-      this.path = path;
-      this.partValSize = partValSize;
+    /** Stores a path and its size. */
+    private static class PathAndPartValSize implements Comparable<PathAndPartValSize> {

Review comment:
       I think it's clearer to call this `PathAndDepth` - `PartValSize` is confusing.

##########
File path: standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/HMSHandler.java
##########
@@ -5240,16 +5259,38 @@ public DropPartitionsResult drop_partitions_req(
         for (Path path : archToDelete) {
           wh.deleteDir(path, true, mustPurge, needsCm);
         }
+
+        // Uses a priority queue to delete the parents of deleted directories if empty.
+        // The parent with the largest size is always processed first. It guarantees that

Review comment:
       nit: `The parent with the largest size is always processed first ` -> `Parents with the deepest path are always processed first`




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] coufon commented on a change in pull request #2421: HIVE-25277: fix slow partition deletion issue by removing duplicated isEmpty checks.

Posted by GitBox <gi...@apache.org>.
coufon commented on a change in pull request #2421:
URL: https://github.com/apache/hive/pull/2421#discussion_r702088341



##########
File path: standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/HMSHandler.java
##########
@@ -5104,14 +5103,34 @@ public boolean drop_partition(final String db_name, final String tbl_name,
         null);
   }
 
-  private static class PathAndPartValSize {
-    PathAndPartValSize(Path path, int partValSize) {
-      this.path = path;
-      this.partValSize = partValSize;
+    /** Stores a path and its size. */
+    private static class PathAndPartValSize implements Comparable<PathAndPartValSize> {
+
+      public Path path;
+      int partValSize;
+
+      public PathAndPartValSize(Path path, int partValSize) {
+        this.path = path;
+        this.partValSize = partValSize;
+      }
+
+      @Override
+      public boolean equals(Object o) {
+        if (o == this) {
+          return true;
+        }
+        if (!(o instanceof PathAndPartValSize)) {
+          return false;
+        }
+        return path.equals(((PathAndPartValSize) o).path);
+      }
+
+      /** The highest {@code partValSize} is processed first in a {@link PriorityQueue}. */
+      @Override
+      public int compareTo(PathAndPartValSize o) {
+        return ((PathAndPartValSize) o).partValSize - partValSize;

Review comment:
       Done.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] sunchao commented on pull request #2421: HIVE-25277: fix slow partition deletion issue by removing duplicated isEmpty checks.

Posted by GitBox <gi...@apache.org>.
sunchao commented on pull request #2421:
URL: https://github.com/apache/hive/pull/2421#issuecomment-913087234


   Merged. Thanks!
   
   BTW if you want this in other branches, please open backport PRs against them accordingly. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] coufon commented on a change in pull request #2421: HIVE-25277: fix slow partition deletion issue by removing duplicated isEmpty checks.

Posted by GitBox <gi...@apache.org>.
coufon commented on a change in pull request #2421:
URL: https://github.com/apache/hive/pull/2421#discussion_r702082867



##########
File path: standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/HMSHandler.java
##########
@@ -5104,14 +5103,34 @@ public boolean drop_partition(final String db_name, final String tbl_name,
         null);
   }
 
-  private static class PathAndPartValSize {
-    PathAndPartValSize(Path path, int partValSize) {
-      this.path = path;
-      this.partValSize = partValSize;
+    /** Stores a path and its size. */
+    private static class PathAndPartValSize implements Comparable<PathAndPartValSize> {
+
+      public Path path;
+      int partValSize;
+
+      public PathAndPartValSize(Path path, int partValSize) {
+        this.path = path;
+        this.partValSize = partValSize;
+      }
+
+      @Override
+      public boolean equals(Object o) {

Review comment:
       Done.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] coufon commented on a change in pull request #2421: HIVE-25277: fix slow partition deletion issue by removing duplicated isEmpty checks.

Posted by GitBox <gi...@apache.org>.
coufon commented on a change in pull request #2421:
URL: https://github.com/apache/hive/pull/2421#discussion_r702088091



##########
File path: standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/HMSHandler.java
##########
@@ -5104,14 +5103,34 @@ public boolean drop_partition(final String db_name, final String tbl_name,
         null);
   }
 
-  private static class PathAndPartValSize {
-    PathAndPartValSize(Path path, int partValSize) {
-      this.path = path;
-      this.partValSize = partValSize;
+    /** Stores a path and its size. */
+    private static class PathAndPartValSize implements Comparable<PathAndPartValSize> {
+
+      public Path path;
+      int partValSize;
+
+      public PathAndPartValSize(Path path, int partValSize) {
+        this.path = path;
+        this.partValSize = partValSize;
+      }
+
+      @Override
+      public boolean equals(Object o) {
+        if (o == this) {
+          return true;
+        }
+        if (!(o instanceof PathAndPartValSize)) {
+          return false;
+        }
+        return path.equals(((PathAndPartValSize) o).path);

Review comment:
       Nice catch. It is a bug. The current code actually didn't correctly implement the HashSet, it just used the hashcode of the object but not (path, depth) pair.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] coufon commented on a change in pull request #2421: HIVE-25277: fix slow partition deletion issue by removing duplicated isEmpty checks.

Posted by GitBox <gi...@apache.org>.
coufon commented on a change in pull request #2421:
URL: https://github.com/apache/hive/pull/2421#discussion_r702097544



##########
File path: standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/HMSHandler.java
##########
@@ -5240,16 +5259,38 @@ public DropPartitionsResult drop_partitions_req(
         for (Path path : archToDelete) {
           wh.deleteDir(path, true, mustPurge, needsCm);
         }
+
+        // Uses a priority queue to delete the parents of deleted directories if empty.
+        // The parent with the largest size is always processed first. It guarantees that
+        // the emptiness of a parent won't be changed once it has been processed. So duplicated
+        // processing can be avoided.
+        PriorityQueue<PathAndPartValSize> parentsToDelete = new PriorityQueue<>();
         for (PathAndPartValSize p : dirsToDelete) {
           wh.deleteDir(p.path, true, mustPurge, needsCm);
+          addParentForDel(parentsToDelete, p);
+        }
+
+        HashSet<PathAndPartValSize> processed = new HashSet<>();
+        while (!parentsToDelete.isEmpty()) {
           try {
-            deleteParentRecursive(p.path.getParent(), p.partValSize - 1, mustPurge, needsCm);
+            PathAndPartValSize p = parentsToDelete.poll();
+            if (processed.contains(p)) {
+              continue;
+            }
+            processed.add(p);
+
+            Path path = p.path;
+            if (wh.isWritable(path) && wh.isDir(path) && wh.isEmptyDir(path)) {

Review comment:
       wh.isEmptyDir uses listStatus that doesn't distinguish file and dir (at least for the GCS fs implementation: https://github.com/GoogleCloudDataproc/hadoop-connectors/blob/7825ab50c839aea43f1ff587b0e2803047af99bc/gcsio/src/main/java/com/google/cloud/hadoop/gcsio/GoogleCloudStorageFileSystem.java#L997). But I agree that isEmptyDir is enough no matter the path is a file or dir.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] medb edited a comment on pull request #2421: HIVE-25277: fix slow partition deletion issue by removing duplicated isEmpty checks.

Posted by GitBox <gi...@apache.org>.
medb edited a comment on pull request #2421:
URL: https://github.com/apache/hive/pull/2421#issuecomment-880798336


   @kgyrtkirk @hmangla98 @nrg4878 may you take a look and merge this PR?
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] coufon commented on a change in pull request #2421: HIVE-25277: fix slow partition deletion issue by removing duplicated isEmpty checks.

Posted by GitBox <gi...@apache.org>.
coufon commented on a change in pull request #2421:
URL: https://github.com/apache/hive/pull/2421#discussion_r658120606



##########
File path: standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/HMSHandler.java
##########
@@ -5088,10 +5089,8 @@ private boolean isDatabaseRemote(String name) {
 
   private void deleteParentRecursive(Path parent, int depth, boolean mustPurge, boolean needRecycle)

Review comment:
       I think it works if deleteDir fails without side effect when the dir is non-empty. But we need to make sure all filesystems that Hive supports work in this way. Also I noticed that deleteDir calls moveToTrash first, so it could be more complex: https://github.com/apache/hive/blob/f2de30ca8bc2b63887496775f9a0769057a17ee0/standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/HiveMetaStoreFsImpl.java#L41. Avoiding duplicated checks seems to be safer.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] coufon commented on pull request #2421: HIVE-25277: fix slow partition deletion issue by removing duplicated isEmpty checks.

Posted by GitBox <gi...@apache.org>.
coufon commented on pull request #2421:
URL: https://github.com/apache/hive/pull/2421#issuecomment-882711397


   > Zoltan Haindrich Haymant Mangla Naveen Gangam may you take a look and merge this PR?
   
   Thank you @medb! @kgyrtkirk @hmangla98 @nrg4878 a friendly ping, could you please take a look? Thank you!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] medb commented on pull request #2421: HIVE-25277: fix slow partition deletion issue by removing duplicated isEmpty checks.

Posted by GitBox <gi...@apache.org>.
medb commented on pull request #2421:
URL: https://github.com/apache/hive/pull/2421#issuecomment-880798336


   @kgyrtkirk may you take a look and merge this PR?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] coufon commented on a change in pull request #2421: HIVE-25277: fix slow partition deletion issue by removing duplicated isEmpty checks.

Posted by GitBox <gi...@apache.org>.
coufon commented on a change in pull request #2421:
URL: https://github.com/apache/hive/pull/2421#discussion_r702090099



##########
File path: standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/HMSHandler.java
##########
@@ -5240,16 +5259,38 @@ public DropPartitionsResult drop_partitions_req(
         for (Path path : archToDelete) {
           wh.deleteDir(path, true, mustPurge, needsCm);
         }
+
+        // Uses a priority queue to delete the parents of deleted directories if empty.
+        // The parent with the largest size is always processed first. It guarantees that

Review comment:
       Done.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] sunchao merged pull request #2421: HIVE-25277: fix slow partition deletion issue by removing duplicated isEmpty checks.

Posted by GitBox <gi...@apache.org>.
sunchao merged pull request #2421:
URL: https://github.com/apache/hive/pull/2421


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] coufon commented on pull request #2421: HIVE-25277: fix slow partition deletion issue by removing duplicated isEmpty checks.

Posted by GitBox <gi...@apache.org>.
coufon commented on pull request #2421:
URL: https://github.com/apache/hive/pull/2421#issuecomment-882711397


   > Zoltan Haindrich Haymant Mangla Naveen Gangam may you take a look and merge this PR?
   
   Thank you @medb! @kgyrtkirk @hmangla98 @nrg4878 a friendly ping, could you please take a look? Thank you!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org