You are viewing a plain text version of this content. The canonical link for it is here.
Posted to gitbox@hive.apache.org by GitBox <gi...@apache.org> on 2022/09/08 09:27:10 UTC

[GitHub] [hive] pudidic opened a new pull request, #3582: HIVE-25790 Make managed table copies handle updates (FileUtils)

pudidic opened a new pull request, #3582:
URL: https://github.com/apache/hive/pull/3582

   ### What changes were proposed in this pull request?
   Changed FileUtils.copy() to skip identical files on the destination directory to improve copy performance. FileUtils.copy() originally just removed and recreated the destination directory. This change makes it compare each file and directory, and delete only different files and directories.
   
   ### Why are the changes needed?
   In an optimized replication bootstrap scenario, it copies many files from source to destination. It can copy thousands of files. If it fails during copying process, it retries. Then it has some files already copied, but its implementation removes them and copy all of them entirely. It should skip the already copied ones.
   
   ### Does this PR introduce _any_ user-facing change?
   No.
   
   ### How was this patch tested?
   It introduced few JUnit test scenarios in TestFileUtils and TestCopyUtils. It will be tested with automated regression test suites on the test server.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] pudidic commented on pull request #3582: HIVE-25790 Make managed table copies handle updates (FileUtils)

Posted by GitBox <gi...@apache.org>.
pudidic commented on PR #3582:
URL: https://github.com/apache/hive/pull/3582#issuecomment-1248969849

   I found a couple of timeouts on the previous run. I found that CopyUtils.leaveIdenticalFilesOnly should compare the source path with its corresponding path under the destination path, but it actually compared the source path with the destination path. So I fixed it.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] sonarcloud[bot] commented on pull request #3582: HIVE-25790 Make managed table copies handle updates (FileUtils)

Posted by GitBox <gi...@apache.org>.
sonarcloud[bot] commented on PR #3582:
URL: https://github.com/apache/hive/pull/3582#issuecomment-1240554900

   Kudos, SonarCloud Quality Gate passed!&nbsp; &nbsp; [![Quality Gate passed](https://sonarsource.github.io/sonarcloud-github-static-resources/v2/checks/QualityGateBadge/passed-16px.png 'Quality Gate passed')](https://sonarcloud.io/dashboard?id=apache_hive&pullRequest=3582)
   
   [![Bug](https://sonarsource.github.io/sonarcloud-github-static-resources/v2/common/bug-16px.png 'Bug')](https://sonarcloud.io/project/issues?id=apache_hive&pullRequest=3582&resolved=false&types=BUG) [![C](https://sonarsource.github.io/sonarcloud-github-static-resources/v2/checks/RatingBadge/C-16px.png 'C')](https://sonarcloud.io/project/issues?id=apache_hive&pullRequest=3582&resolved=false&types=BUG) [1 Bug](https://sonarcloud.io/project/issues?id=apache_hive&pullRequest=3582&resolved=false&types=BUG)  
   [![Vulnerability](https://sonarsource.github.io/sonarcloud-github-static-resources/v2/common/vulnerability-16px.png 'Vulnerability')](https://sonarcloud.io/project/issues?id=apache_hive&pullRequest=3582&resolved=false&types=VULNERABILITY) [![A](https://sonarsource.github.io/sonarcloud-github-static-resources/v2/checks/RatingBadge/A-16px.png 'A')](https://sonarcloud.io/project/issues?id=apache_hive&pullRequest=3582&resolved=false&types=VULNERABILITY) [0 Vulnerabilities](https://sonarcloud.io/project/issues?id=apache_hive&pullRequest=3582&resolved=false&types=VULNERABILITY)  
   [![Security Hotspot](https://sonarsource.github.io/sonarcloud-github-static-resources/v2/common/security_hotspot-16px.png 'Security Hotspot')](https://sonarcloud.io/project/security_hotspots?id=apache_hive&pullRequest=3582&resolved=false&types=SECURITY_HOTSPOT) [![A](https://sonarsource.github.io/sonarcloud-github-static-resources/v2/checks/RatingBadge/A-16px.png 'A')](https://sonarcloud.io/project/security_hotspots?id=apache_hive&pullRequest=3582&resolved=false&types=SECURITY_HOTSPOT) [0 Security Hotspots](https://sonarcloud.io/project/security_hotspots?id=apache_hive&pullRequest=3582&resolved=false&types=SECURITY_HOTSPOT)  
   [![Code Smell](https://sonarsource.github.io/sonarcloud-github-static-resources/v2/common/code_smell-16px.png 'Code Smell')](https://sonarcloud.io/project/issues?id=apache_hive&pullRequest=3582&resolved=false&types=CODE_SMELL) [![A](https://sonarsource.github.io/sonarcloud-github-static-resources/v2/checks/RatingBadge/A-16px.png 'A')](https://sonarcloud.io/project/issues?id=apache_hive&pullRequest=3582&resolved=false&types=CODE_SMELL) [6 Code Smells](https://sonarcloud.io/project/issues?id=apache_hive&pullRequest=3582&resolved=false&types=CODE_SMELL)
   
   [![No Coverage information](https://sonarsource.github.io/sonarcloud-github-static-resources/v2/checks/CoverageChart/NoCoverageInfo-16px.png 'No Coverage information')](https://sonarcloud.io/component_measures?id=apache_hive&pullRequest=3582&metric=coverage&view=list) No Coverage information  
   [![No Duplication information](https://sonarsource.github.io/sonarcloud-github-static-resources/v2/checks/Duplications/NoDuplicationInfo-16px.png 'No Duplication information')](https://sonarcloud.io/component_measures?id=apache_hive&pullRequest=3582&metric=duplicated_lines_density&view=list) No Duplication information
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] sonarcloud[bot] commented on pull request #3582: HIVE-25790 Make managed table copies handle updates (FileUtils)

Posted by GitBox <gi...@apache.org>.
sonarcloud[bot] commented on PR #3582:
URL: https://github.com/apache/hive/pull/3582#issuecomment-1249001746

   Kudos, SonarCloud Quality Gate passed!&nbsp; &nbsp; [![Quality Gate passed](https://sonarsource.github.io/sonarcloud-github-static-resources/v2/checks/QualityGateBadge/passed-16px.png 'Quality Gate passed')](https://sonarcloud.io/dashboard?id=apache_hive&pullRequest=3582)
   
   [![Bug](https://sonarsource.github.io/sonarcloud-github-static-resources/v2/common/bug-16px.png 'Bug')](https://sonarcloud.io/project/issues?id=apache_hive&pullRequest=3582&resolved=false&types=BUG) [![C](https://sonarsource.github.io/sonarcloud-github-static-resources/v2/checks/RatingBadge/C-16px.png 'C')](https://sonarcloud.io/project/issues?id=apache_hive&pullRequest=3582&resolved=false&types=BUG) [1 Bug](https://sonarcloud.io/project/issues?id=apache_hive&pullRequest=3582&resolved=false&types=BUG)  
   [![Vulnerability](https://sonarsource.github.io/sonarcloud-github-static-resources/v2/common/vulnerability-16px.png 'Vulnerability')](https://sonarcloud.io/project/issues?id=apache_hive&pullRequest=3582&resolved=false&types=VULNERABILITY) [![A](https://sonarsource.github.io/sonarcloud-github-static-resources/v2/checks/RatingBadge/A-16px.png 'A')](https://sonarcloud.io/project/issues?id=apache_hive&pullRequest=3582&resolved=false&types=VULNERABILITY) [0 Vulnerabilities](https://sonarcloud.io/project/issues?id=apache_hive&pullRequest=3582&resolved=false&types=VULNERABILITY)  
   [![Security Hotspot](https://sonarsource.github.io/sonarcloud-github-static-resources/v2/common/security_hotspot-16px.png 'Security Hotspot')](https://sonarcloud.io/project/security_hotspots?id=apache_hive&pullRequest=3582&resolved=false&types=SECURITY_HOTSPOT) [![A](https://sonarsource.github.io/sonarcloud-github-static-resources/v2/checks/RatingBadge/A-16px.png 'A')](https://sonarcloud.io/project/security_hotspots?id=apache_hive&pullRequest=3582&resolved=false&types=SECURITY_HOTSPOT) [0 Security Hotspots](https://sonarcloud.io/project/security_hotspots?id=apache_hive&pullRequest=3582&resolved=false&types=SECURITY_HOTSPOT)  
   [![Code Smell](https://sonarsource.github.io/sonarcloud-github-static-resources/v2/common/code_smell-16px.png 'Code Smell')](https://sonarcloud.io/project/issues?id=apache_hive&pullRequest=3582&resolved=false&types=CODE_SMELL) [![A](https://sonarsource.github.io/sonarcloud-github-static-resources/v2/checks/RatingBadge/A-16px.png 'A')](https://sonarcloud.io/project/issues?id=apache_hive&pullRequest=3582&resolved=false&types=CODE_SMELL) [7 Code Smells](https://sonarcloud.io/project/issues?id=apache_hive&pullRequest=3582&resolved=false&types=CODE_SMELL)
   
   [![No Coverage information](https://sonarsource.github.io/sonarcloud-github-static-resources/v2/checks/CoverageChart/NoCoverageInfo-16px.png 'No Coverage information')](https://sonarcloud.io/component_measures?id=apache_hive&pullRequest=3582&metric=coverage&view=list) No Coverage information  
   [![No Duplication information](https://sonarsource.github.io/sonarcloud-github-static-resources/v2/checks/Duplications/NoDuplicationInfo-16px.png 'No Duplication information')](https://sonarcloud.io/component_measures?id=apache_hive&pullRequest=3582&metric=duplicated_lines_density&view=list) No Duplication information
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] deniskuzZ commented on a diff in pull request #3582: HIVE-25790 Make managed table copies handle updates (FileUtils)

Posted by GitBox <gi...@apache.org>.
deniskuzZ commented on code in PR #3582:
URL: https://github.com/apache/hive/pull/3582#discussion_r1012649913


##########
ql/src/java/org/apache/hadoop/hive/ql/parse/repl/CopyUtils.java:
##########
@@ -492,20 +495,94 @@ private void doRegularCopyOnce(FileSystem sourceFs, List<Path> srcList,
         throw new IOException(e);
       }
     } else {
-      //Destination should be empty
+      // Destination should have identical files only
       if (overWrite) {
-        deleteSubDirs(destinationFs, destination);
+        leaveIdenticalFilesOnly(sourceFs, paths, destinationFs, destination);
       }
       copyFilesBetweenFS(sourceFs, paths, destinationFs, destination, false, true);
     }
   }
 
-  private void deleteSubDirs(FileSystem fs, Path path) throws IOException {
-    //Delete the root path instead of doing a listing
-    //This is more optimised
-    delete(fs, path, true);
-    //Recreate just the Root folder
-    mkdirs(fs, path);
+  /**
+   * Leave identical files only between source and destination. It gets source paths.
+   */
+  void leaveIdenticalFilesOnly(FileSystem sourceFs, Path[] srcPaths,
+      FileSystem destinationFs, Path destinationPath) throws IOException {
+    for (Path srcPath : srcPaths) {
+      leaveIdenticalFilesOnlyFileByFile(sourceFs, srcPath,
+          destinationFs, new Path(destinationPath, srcPath.getName()));
+    }
+  }
+
+  /**
+   * Leave identical files only between source and destination. It gets a source path.
+   */
+  private void leaveIdenticalFilesOnlyFileByFile(FileSystem sourceFs, Path srcPath,
+      FileSystem destinationFs, Path dstPath) throws IOException {
+    boolean toDelete = false;
+    Type dstType = getType(destinationFs, dstPath);
+    Type srcType = getType(sourceFs, srcPath);
+    LOG.debug("Source path: {}, type: {}", srcPath, srcType);
+    LOG.debug("Destination path: {}, type: {}", dstPath, dstType);
+    switch (dstType) {
+    case NONE:
+      // If destination is none, then no need to delete anything.
+      break;
+    case FILE:
+      // If destination is a file,
+      if (srcType == Type.FILE) {
+        // If source is a file, then delete the destination file if it is not same as source.
+        toDelete = !FileUtils.isSameFile(sourceFs, srcPath, destinationFs, dstPath);
+        LOG.debug("Same file?: {}", !toDelete);
+      } else {
+        // Otherwise, delete the destination file.
+        toDelete = true;
+      }
+      break;
+    case DIRECTORY:
+      // If destination is a directory,
+      if (srcType == Type.DIRECTORY) {
+        // If both are directories, visit for children of both.
+        Set<String> bothChildNames = Stream.concat(

Review Comment:
   why do you need concat? I think idea was to remove target-only files that are not identical to the source. You should only compare the common set, isn't it?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] sonarcloud[bot] commented on pull request #3582: HIVE-25790 Make managed table copies handle updates (FileUtils)

Posted by GitBox <gi...@apache.org>.
sonarcloud[bot] commented on PR #3582:
URL: https://github.com/apache/hive/pull/3582#issuecomment-1250912289

   Kudos, SonarCloud Quality Gate passed!&nbsp; &nbsp; [![Quality Gate passed](https://sonarsource.github.io/sonarcloud-github-static-resources/v2/checks/QualityGateBadge/passed-16px.png 'Quality Gate passed')](https://sonarcloud.io/dashboard?id=apache_hive&pullRequest=3582)
   
   [![Bug](https://sonarsource.github.io/sonarcloud-github-static-resources/v2/common/bug-16px.png 'Bug')](https://sonarcloud.io/project/issues?id=apache_hive&pullRequest=3582&resolved=false&types=BUG) [![C](https://sonarsource.github.io/sonarcloud-github-static-resources/v2/checks/RatingBadge/C-16px.png 'C')](https://sonarcloud.io/project/issues?id=apache_hive&pullRequest=3582&resolved=false&types=BUG) [2 Bugs](https://sonarcloud.io/project/issues?id=apache_hive&pullRequest=3582&resolved=false&types=BUG)  
   [![Vulnerability](https://sonarsource.github.io/sonarcloud-github-static-resources/v2/common/vulnerability-16px.png 'Vulnerability')](https://sonarcloud.io/project/issues?id=apache_hive&pullRequest=3582&resolved=false&types=VULNERABILITY) [![A](https://sonarsource.github.io/sonarcloud-github-static-resources/v2/checks/RatingBadge/A-16px.png 'A')](https://sonarcloud.io/project/issues?id=apache_hive&pullRequest=3582&resolved=false&types=VULNERABILITY) [0 Vulnerabilities](https://sonarcloud.io/project/issues?id=apache_hive&pullRequest=3582&resolved=false&types=VULNERABILITY)  
   [![Security Hotspot](https://sonarsource.github.io/sonarcloud-github-static-resources/v2/common/security_hotspot-16px.png 'Security Hotspot')](https://sonarcloud.io/project/security_hotspots?id=apache_hive&pullRequest=3582&resolved=false&types=SECURITY_HOTSPOT) [![A](https://sonarsource.github.io/sonarcloud-github-static-resources/v2/checks/RatingBadge/A-16px.png 'A')](https://sonarcloud.io/project/security_hotspots?id=apache_hive&pullRequest=3582&resolved=false&types=SECURITY_HOTSPOT) [0 Security Hotspots](https://sonarcloud.io/project/security_hotspots?id=apache_hive&pullRequest=3582&resolved=false&types=SECURITY_HOTSPOT)  
   [![Code Smell](https://sonarsource.github.io/sonarcloud-github-static-resources/v2/common/code_smell-16px.png 'Code Smell')](https://sonarcloud.io/project/issues?id=apache_hive&pullRequest=3582&resolved=false&types=CODE_SMELL) [![A](https://sonarsource.github.io/sonarcloud-github-static-resources/v2/checks/RatingBadge/A-16px.png 'A')](https://sonarcloud.io/project/issues?id=apache_hive&pullRequest=3582&resolved=false&types=CODE_SMELL) [10 Code Smells](https://sonarcloud.io/project/issues?id=apache_hive&pullRequest=3582&resolved=false&types=CODE_SMELL)
   
   [![No Coverage information](https://sonarsource.github.io/sonarcloud-github-static-resources/v2/checks/CoverageChart/NoCoverageInfo-16px.png 'No Coverage information')](https://sonarcloud.io/component_measures?id=apache_hive&pullRequest=3582&metric=coverage&view=list) No Coverage information  
   [![No Duplication information](https://sonarsource.github.io/sonarcloud-github-static-resources/v2/checks/Duplications/NoDuplicationInfo-16px.png 'No Duplication information')](https://sonarcloud.io/component_measures?id=apache_hive&pullRequest=3582&metric=duplicated_lines_density&view=list) No Duplication information
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] deniskuzZ commented on a diff in pull request #3582: HIVE-25790 Make managed table copies handle updates (FileUtils)

Posted by GitBox <gi...@apache.org>.
deniskuzZ commented on code in PR #3582:
URL: https://github.com/apache/hive/pull/3582#discussion_r1012655098


##########
ql/src/java/org/apache/hadoop/hive/ql/parse/repl/CopyUtils.java:
##########
@@ -492,20 +495,94 @@ private void doRegularCopyOnce(FileSystem sourceFs, List<Path> srcList,
         throw new IOException(e);
       }
     } else {
-      //Destination should be empty
+      // Destination should have identical files only
       if (overWrite) {
-        deleteSubDirs(destinationFs, destination);
+        leaveIdenticalFilesOnly(sourceFs, paths, destinationFs, destination);
       }
       copyFilesBetweenFS(sourceFs, paths, destinationFs, destination, false, true);
     }
   }
 
-  private void deleteSubDirs(FileSystem fs, Path path) throws IOException {
-    //Delete the root path instead of doing a listing
-    //This is more optimised
-    delete(fs, path, true);
-    //Recreate just the Root folder
-    mkdirs(fs, path);
+  /**
+   * Leave identical files only between source and destination. It gets source paths.
+   */
+  void leaveIdenticalFilesOnly(FileSystem sourceFs, Path[] srcPaths,
+      FileSystem destinationFs, Path destinationPath) throws IOException {
+    for (Path srcPath : srcPaths) {
+      leaveIdenticalFilesOnlyFileByFile(sourceFs, srcPath,
+          destinationFs, new Path(destinationPath, srcPath.getName()));
+    }
+  }
+
+  /**
+   * Leave identical files only between source and destination. It gets a source path.
+   */
+  private void leaveIdenticalFilesOnlyFileByFile(FileSystem sourceFs, Path srcPath,
+      FileSystem destinationFs, Path dstPath) throws IOException {
+    boolean toDelete = false;
+    Type dstType = getType(destinationFs, dstPath);
+    Type srcType = getType(sourceFs, srcPath);
+    LOG.debug("Source path: {}, type: {}", srcPath, srcType);
+    LOG.debug("Destination path: {}, type: {}", dstPath, dstType);
+    switch (dstType) {
+    case NONE:
+      // If destination is none, then no need to delete anything.
+      break;
+    case FILE:
+      // If destination is a file,
+      if (srcType == Type.FILE) {
+        // If source is a file, then delete the destination file if it is not same as source.
+        toDelete = !FileUtils.isSameFile(sourceFs, srcPath, destinationFs, dstPath);
+        LOG.debug("Same file?: {}", !toDelete);
+      } else {
+        // Otherwise, delete the destination file.
+        toDelete = true;
+      }
+      break;
+    case DIRECTORY:
+      // If destination is a directory,
+      if (srcType == Type.DIRECTORY) {
+        // If both are directories, visit for children of both.
+        Set<String> bothChildNames = Stream.concat(
+            Arrays.stream(sourceFs.listStatus(srcPath))

Review Comment:
   you are listing plenty of times: here and when calling getType, could this be optimized?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] github-actions[bot] closed pull request #3582: HIVE-25790 Make managed table copies handle updates (FileUtils)

Posted by GitBox <gi...@apache.org>.
github-actions[bot] closed pull request #3582: HIVE-25790 Make managed table copies handle updates (FileUtils)
URL: https://github.com/apache/hive/pull/3582


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] deniskuzZ commented on a diff in pull request #3582: HIVE-25790 Make managed table copies handle updates (FileUtils)

Posted by GitBox <gi...@apache.org>.
deniskuzZ commented on code in PR #3582:
URL: https://github.com/apache/hive/pull/3582#discussion_r1012643637


##########
ql/src/java/org/apache/hadoop/hive/ql/parse/repl/CopyUtils.java:
##########
@@ -492,20 +495,94 @@ private void doRegularCopyOnce(FileSystem sourceFs, List<Path> srcList,
         throw new IOException(e);
       }
     } else {
-      //Destination should be empty
+      // Destination should have identical files only
       if (overWrite) {
-        deleteSubDirs(destinationFs, destination);
+        leaveIdenticalFilesOnly(sourceFs, paths, destinationFs, destination);
       }
       copyFilesBetweenFS(sourceFs, paths, destinationFs, destination, false, true);
     }
   }
 
-  private void deleteSubDirs(FileSystem fs, Path path) throws IOException {
-    //Delete the root path instead of doing a listing
-    //This is more optimised
-    delete(fs, path, true);
-    //Recreate just the Root folder
-    mkdirs(fs, path);
+  /**
+   * Leave identical files only between source and destination. It gets source paths.
+   */
+  void leaveIdenticalFilesOnly(FileSystem sourceFs, Path[] srcPaths,
+      FileSystem destinationFs, Path destinationPath) throws IOException {
+    for (Path srcPath : srcPaths) {
+      leaveIdenticalFilesOnlyFileByFile(sourceFs, srcPath,
+          destinationFs, new Path(destinationPath, srcPath.getName()));
+    }
+  }
+
+  /**
+   * Leave identical files only between source and destination. It gets a source path.
+   */
+  private void leaveIdenticalFilesOnlyFileByFile(FileSystem sourceFs, Path srcPath,
+      FileSystem destinationFs, Path dstPath) throws IOException {
+    boolean toDelete = false;
+    Type dstType = getType(destinationFs, dstPath);
+    Type srcType = getType(sourceFs, srcPath);
+    LOG.debug("Source path: {}, type: {}", srcPath, srcType);
+    LOG.debug("Destination path: {}, type: {}", dstPath, dstType);
+    switch (dstType) {
+    case NONE:
+      // If destination is none, then no need to delete anything.
+      break;
+    case FILE:
+      // If destination is a file,
+      if (srcType == Type.FILE) {
+        // If source is a file, then delete the destination file if it is not same as source.
+        toDelete = !FileUtils.isSameFile(sourceFs, srcPath, destinationFs, dstPath);
+        LOG.debug("Same file?: {}", !toDelete);
+      } else {
+        // Otherwise, delete the destination file.
+        toDelete = true;

Review Comment:
   is it possible that srcType and dstType would be different?



##########
ql/src/java/org/apache/hadoop/hive/ql/parse/repl/CopyUtils.java:
##########
@@ -492,20 +495,94 @@ private void doRegularCopyOnce(FileSystem sourceFs, List<Path> srcList,
         throw new IOException(e);
       }
     } else {
-      //Destination should be empty
+      // Destination should have identical files only
       if (overWrite) {
-        deleteSubDirs(destinationFs, destination);
+        leaveIdenticalFilesOnly(sourceFs, paths, destinationFs, destination);
       }
       copyFilesBetweenFS(sourceFs, paths, destinationFs, destination, false, true);
     }
   }
 
-  private void deleteSubDirs(FileSystem fs, Path path) throws IOException {
-    //Delete the root path instead of doing a listing
-    //This is more optimised
-    delete(fs, path, true);
-    //Recreate just the Root folder
-    mkdirs(fs, path);
+  /**
+   * Leave identical files only between source and destination. It gets source paths.
+   */
+  void leaveIdenticalFilesOnly(FileSystem sourceFs, Path[] srcPaths,
+      FileSystem destinationFs, Path destinationPath) throws IOException {
+    for (Path srcPath : srcPaths) {
+      leaveIdenticalFilesOnlyFileByFile(sourceFs, srcPath,
+          destinationFs, new Path(destinationPath, srcPath.getName()));
+    }
+  }
+
+  /**
+   * Leave identical files only between source and destination. It gets a source path.
+   */
+  private void leaveIdenticalFilesOnlyFileByFile(FileSystem sourceFs, Path srcPath,
+      FileSystem destinationFs, Path dstPath) throws IOException {
+    boolean toDelete = false;
+    Type dstType = getType(destinationFs, dstPath);
+    Type srcType = getType(sourceFs, srcPath);
+    LOG.debug("Source path: {}, type: {}", srcPath, srcType);
+    LOG.debug("Destination path: {}, type: {}", dstPath, dstType);
+    switch (dstType) {
+    case NONE:
+      // If destination is none, then no need to delete anything.
+      break;
+    case FILE:
+      // If destination is a file,
+      if (srcType == Type.FILE) {
+        // If source is a file, then delete the destination file if it is not same as source.
+        toDelete = !FileUtils.isSameFile(sourceFs, srcPath, destinationFs, dstPath);
+        LOG.debug("Same file?: {}", !toDelete);
+      } else {
+        // Otherwise, delete the destination file.
+        toDelete = true;
+      }
+      break;
+    case DIRECTORY:
+      // If destination is a directory,
+      if (srcType == Type.DIRECTORY) {
+        // If both are directories, visit for children of both.
+        Set<String> bothChildNames = Stream.concat(

Review Comment:
   why do you need concat? I think idea was to remove target-only files that are not identical to the source. You should only compare the common set.



##########
ql/src/java/org/apache/hadoop/hive/ql/parse/repl/CopyUtils.java:
##########
@@ -492,20 +495,94 @@ private void doRegularCopyOnce(FileSystem sourceFs, List<Path> srcList,
         throw new IOException(e);
       }
     } else {
-      //Destination should be empty
+      // Destination should have identical files only
       if (overWrite) {
-        deleteSubDirs(destinationFs, destination);
+        leaveIdenticalFilesOnly(sourceFs, paths, destinationFs, destination);
       }
       copyFilesBetweenFS(sourceFs, paths, destinationFs, destination, false, true);
     }
   }
 
-  private void deleteSubDirs(FileSystem fs, Path path) throws IOException {
-    //Delete the root path instead of doing a listing
-    //This is more optimised
-    delete(fs, path, true);
-    //Recreate just the Root folder
-    mkdirs(fs, path);
+  /**
+   * Leave identical files only between source and destination. It gets source paths.
+   */
+  void leaveIdenticalFilesOnly(FileSystem sourceFs, Path[] srcPaths,
+      FileSystem destinationFs, Path destinationPath) throws IOException {
+    for (Path srcPath : srcPaths) {
+      leaveIdenticalFilesOnlyFileByFile(sourceFs, srcPath,
+          destinationFs, new Path(destinationPath, srcPath.getName()));
+    }
+  }
+
+  /**
+   * Leave identical files only between source and destination. It gets a source path.
+   */
+  private void leaveIdenticalFilesOnlyFileByFile(FileSystem sourceFs, Path srcPath,
+      FileSystem destinationFs, Path dstPath) throws IOException {
+    boolean toDelete = false;
+    Type dstType = getType(destinationFs, dstPath);
+    Type srcType = getType(sourceFs, srcPath);
+    LOG.debug("Source path: {}, type: {}", srcPath, srcType);
+    LOG.debug("Destination path: {}, type: {}", dstPath, dstType);
+    switch (dstType) {
+    case NONE:
+      // If destination is none, then no need to delete anything.
+      break;
+    case FILE:
+      // If destination is a file,
+      if (srcType == Type.FILE) {
+        // If source is a file, then delete the destination file if it is not same as source.
+        toDelete = !FileUtils.isSameFile(sourceFs, srcPath, destinationFs, dstPath);
+        LOG.debug("Same file?: {}", !toDelete);
+      } else {
+        // Otherwise, delete the destination file.
+        toDelete = true;
+      }
+      break;
+    case DIRECTORY:
+      // If destination is a directory,
+      if (srcType == Type.DIRECTORY) {
+        // If both are directories, visit for children of both.
+        Set<String> bothChildNames = Stream.concat(
+            Arrays.stream(sourceFs.listStatus(srcPath))

Review Comment:
   you are listing plenty of times, here and when calling getType



##########
ql/src/java/org/apache/hadoop/hive/ql/parse/repl/CopyUtils.java:
##########
@@ -492,20 +495,94 @@ private void doRegularCopyOnce(FileSystem sourceFs, List<Path> srcList,
         throw new IOException(e);
       }
     } else {
-      //Destination should be empty
+      // Destination should have identical files only
       if (overWrite) {
-        deleteSubDirs(destinationFs, destination);
+        leaveIdenticalFilesOnly(sourceFs, paths, destinationFs, destination);
       }
       copyFilesBetweenFS(sourceFs, paths, destinationFs, destination, false, true);
     }
   }
 
-  private void deleteSubDirs(FileSystem fs, Path path) throws IOException {
-    //Delete the root path instead of doing a listing
-    //This is more optimised
-    delete(fs, path, true);
-    //Recreate just the Root folder
-    mkdirs(fs, path);
+  /**
+   * Leave identical files only between source and destination. It gets source paths.
+   */
+  void leaveIdenticalFilesOnly(FileSystem sourceFs, Path[] srcPaths,
+      FileSystem destinationFs, Path destinationPath) throws IOException {
+    for (Path srcPath : srcPaths) {
+      leaveIdenticalFilesOnlyFileByFile(sourceFs, srcPath,
+          destinationFs, new Path(destinationPath, srcPath.getName()));
+    }
+  }
+
+  /**
+   * Leave identical files only between source and destination. It gets a source path.
+   */
+  private void leaveIdenticalFilesOnlyFileByFile(FileSystem sourceFs, Path srcPath,
+      FileSystem destinationFs, Path dstPath) throws IOException {
+    boolean toDelete = false;
+    Type dstType = getType(destinationFs, dstPath);
+    Type srcType = getType(sourceFs, srcPath);
+    LOG.debug("Source path: {}, type: {}", srcPath, srcType);
+    LOG.debug("Destination path: {}, type: {}", dstPath, dstType);
+    switch (dstType) {
+    case NONE:

Review Comment:
   formatting



##########
common/src/java/org/apache/hadoop/hive/common/FileUtils.java:
##########
@@ -773,12 +816,57 @@ public static boolean copy(FileSystem srcFS, FileStatus srcStatus, FileSystem ds
     return deleteSource ? srcFS.delete(src, true) : true;
   }
 
+  /**
+   * Checks if the source and destination are the same file. It follows the same logic as
+   * https://hadoop.apache.org/docs/stable/hadoop-distcp/DistCp.html .
+   */
+  public static boolean isSameFile(FileSystem srcFS, Path src, FileSystem dstFS, Path dst)
+      throws IOException {
+    // When both file systems are HDFS, use strong check conditions;
+    // block size, length, checksum.
+    FileStatus srcStatus = srcFS.getFileStatus(src);
+    if (srcFS.getScheme().equals("hdfs") && dstFS.getScheme().equals("hdfs")) {
+      if (!dstFS.exists(dst)) {
+        return false;
+      }
+      FileStatus dstStatus = dstFS.getFileStatus(dst);

Review Comment:
   think about if we can reduce fs listing and use a recursive iterator:
   ````
   RemoteIterator<FileStatus> it = FileUtils.listStatusIterator(fs, path, FileUtils.HIDDEN_FILES_PATH_FILTER);
   ```` 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] github-actions[bot] commented on pull request #3582: HIVE-25790 Make managed table copies handle updates (FileUtils)

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on PR #3582:
URL: https://github.com/apache/hive/pull/3582#issuecomment-1369287261

   This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.
   Feel free to reach out on the dev@hive.apache.org list if the patch is in need of reviews.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org