You are viewing a plain text version of this content. The canonical link for it is here.
Posted to gitbox@hive.apache.org by GitBox <gi...@apache.org> on 2021/02/11 13:10:59 UTC

[GitHub] [hive] pvargacl opened a new pull request #1971: HIVE-24738: Reuse committed filelist from directinsert manifest during loadPartition

pvargacl opened a new pull request #1971:
URL: https://github.com/apache/hive/pull/1971


   
   ### What changes were proposed in this pull request?
   Reuse committed file list from directinsert manifest file during loadPartition to avoid FileSystem listing
   
   
   ### Why are the changes needed?
   Performance improvement
   
   
   ### Does this PR introduce _any_ user-facing change?
   No.
   
   
   ### How was this patch tested?
   Existing unit tests.
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] pvargacl commented on a change in pull request #1971: HIVE-24738: Reuse committed filelist from directinsert manifest during loadPartition

Posted by GitBox <gi...@apache.org>.
pvargacl commented on a change in pull request #1971:
URL: https://github.com/apache/hive/pull/1971#discussion_r581121257



##########
File path: ql/src/java/org/apache/hadoop/hive/ql/exec/Utilities.java
##########
@@ -4521,7 +4523,9 @@ public static void handleDirectInsertTableFinalPath(Path specPath, String unionS
           for (int i = 0; i < partitionCount; ++i) {
             String nextPart = mdis.readUTF();
             Utilities.FILE_OP_LOGGER.debug("Looking at dynamic partition {}", nextPart);
-            dynamicPartitionSpecs.add(nextPart);
+            if (!dynamicPartitionSpecs.containsKey(nextPart)) {
+              dynamicPartitionSpecs.put(nextPart, new ArrayList<>());

Review comment:
       Not really, but it starts a refactor domino. I started it but reverted, I do not want to pollute this PR.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] pvargacl merged pull request #1971: HIVE-24738: Reuse committed filelist from directinsert manifest during loadPartition

Posted by GitBox <gi...@apache.org>.
pvargacl merged pull request #1971:
URL: https://github.com/apache/hive/pull/1971


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] pvargacl commented on a change in pull request #1971: HIVE-24738: Reuse committed filelist from directinsert manifest during loadPartition

Posted by GitBox <gi...@apache.org>.
pvargacl commented on a change in pull request #1971:
URL: https://github.com/apache/hive/pull/1971#discussion_r581012265



##########
File path: ql/src/java/org/apache/hadoop/hive/ql/exec/Utilities.java
##########
@@ -4521,7 +4523,9 @@ public static void handleDirectInsertTableFinalPath(Path specPath, String unionS
           for (int i = 0; i < partitionCount; ++i) {
             String nextPart = mdis.readUTF();
             Utilities.FILE_OP_LOGGER.debug("Looking at dynamic partition {}", nextPart);
-            dynamicPartitionSpecs.add(nextPart);
+            if (!dynamicPartitionSpecs.containsKey(nextPart)) {
+              dynamicPartitionSpecs.put(nextPart, new ArrayList<>());

Review comment:
       At the end it is used as a List in PartitionDetails, I could use a set here, but I would convert it anyway to a list.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] deniskuzZ commented on a change in pull request #1971: HIVE-24738: Reuse committed filelist from directinsert manifest during loadPartition

Posted by GitBox <gi...@apache.org>.
deniskuzZ commented on a change in pull request #1971:
URL: https://github.com/apache/hive/pull/1971#discussion_r580911572



##########
File path: ql/src/java/org/apache/hadoop/hive/ql/metadata/Hive.java
##########
@@ -2362,10 +2362,17 @@ private Partition loadPartitionInternal(Path loadPath, Table tbl, Map<String, St
               + ", Direct insert = " + isDirectInsert + ")");
         }
         if (newFiles != null) {
-          newFileStatuses = listFilesCreatedByQuery(loadPath, writeId, stmtId);
-          newFiles.addAll(newFileStatuses.stream()
-              .map(FileStatus::getPath)
-              .collect(Collectors.toList()));
+          if (!newFiles.isEmpty()) {
+            // We already know the file list from the direct insert manifestfile
+            FileSystem srcFs = loadPath.getFileSystem(conf);
+            newFileStatuses = new ArrayList<>();
+            for (Path filePath : newFiles) {

Review comment:
       if you already have streams here, could we use foreach?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] deniskuzZ commented on a change in pull request #1971: HIVE-24738: Reuse committed filelist from directinsert manifest during loadPartition

Posted by GitBox <gi...@apache.org>.
deniskuzZ commented on a change in pull request #1971:
URL: https://github.com/apache/hive/pull/1971#discussion_r580913139



##########
File path: ql/src/java/org/apache/hadoop/hive/ql/exec/Utilities.java
##########
@@ -4533,6 +4537,12 @@ public static void handleDirectInsertTableFinalPath(Path specPath, String unionS
           if (!committed.add(path)) {
             throw new HiveException(nextFile + " was specified in multiple manifests");
           }
+          for (Map.Entry<String, List<Path>> dynpath : dynamicPartitionSpecs.entrySet()){

Review comment:
       why not just filter and collect?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] pvargacl commented on a change in pull request #1971: HIVE-24738: Reuse committed filelist from directinsert manifest during loadPartition

Posted by GitBox <gi...@apache.org>.
pvargacl commented on a change in pull request #1971:
URL: https://github.com/apache/hive/pull/1971#discussion_r581010868



##########
File path: ql/src/java/org/apache/hadoop/hive/ql/metadata/Hive.java
##########
@@ -2362,10 +2362,17 @@ private Partition loadPartitionInternal(Path loadPath, Table tbl, Map<String, St
               + ", Direct insert = " + isDirectInsert + ")");
         }
         if (newFiles != null) {
-          newFileStatuses = listFilesCreatedByQuery(loadPath, writeId, stmtId);
-          newFiles.addAll(newFileStatuses.stream()
-              .map(FileStatus::getPath)
-              .collect(Collectors.toList()));
+          if (!newFiles.isEmpty()) {
+            // We already know the file list from the direct insert manifestfile
+            FileSystem srcFs = loadPath.getFileSystem(conf);
+            newFileStatuses = new ArrayList<>();
+            for (Path filePath : newFiles) {

Review comment:
       getFileStatus throws IOException, and I do not want to wrap ip

##########
File path: ql/src/java/org/apache/hadoop/hive/ql/exec/Utilities.java
##########
@@ -4533,6 +4537,12 @@ public static void handleDirectInsertTableFinalPath(Path specPath, String unionS
           if (!committed.add(path)) {
             throw new HiveException(nextFile + " was specified in multiple manifests");
           }
+          for (Map.Entry<String, List<Path>> dynpath : dynamicPartitionSpecs.entrySet()){

Review comment:
       done

##########
File path: ql/src/java/org/apache/hadoop/hive/ql/plan/FileSinkDesc.java
##########
@@ -106,7 +108,7 @@
 
   private Set<FileStatus> filesToFetch = null;
 
-  private Set<String> dynPartitionValues = new HashSet<>();
+  private Map<String, List<Path>> dynPartitionValues = new HashMap<>();

Review comment:
       done




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] pvary commented on a change in pull request #1971: HIVE-24738: Reuse committed filelist from directinsert manifest during loadPartition

Posted by GitBox <gi...@apache.org>.
pvary commented on a change in pull request #1971:
URL: https://github.com/apache/hive/pull/1971#discussion_r580992240



##########
File path: ql/src/java/org/apache/hadoop/hive/ql/plan/FileSinkDesc.java
##########
@@ -106,7 +108,7 @@
 
   private Set<FileStatus> filesToFetch = null;
 
-  private Set<String> dynPartitionValues = new HashSet<>();
+  private Map<String, List<Path>> dynPartitionValues = new HashMap<>();

Review comment:
       nit: a comment would be nice describing keys and values




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] pvargacl commented on a change in pull request #1971: HIVE-24738: Reuse committed filelist from directinsert manifest during loadPartition

Posted by GitBox <gi...@apache.org>.
pvargacl commented on a change in pull request #1971:
URL: https://github.com/apache/hive/pull/1971#discussion_r581010173



##########
File path: ql/src/java/org/apache/hadoop/hive/ql/metadata/Hive.java
##########
@@ -2362,10 +2362,17 @@ private Partition loadPartitionInternal(Path loadPath, Table tbl, Map<String, St
               + ", Direct insert = " + isDirectInsert + ")");
         }
         if (newFiles != null) {
-          newFileStatuses = listFilesCreatedByQuery(loadPath, writeId, stmtId);
-          newFiles.addAll(newFileStatuses.stream()
-              .map(FileStatus::getPath)
-              .collect(Collectors.toList()));
+          if (!newFiles.isEmpty()) {
+            // We already know the file list from the direct insert manifestfile
+            FileSystem srcFs = loadPath.getFileSystem(conf);
+            newFileStatuses = new ArrayList<>();
+            for (Path filePath : newFiles) {
+              newFileStatuses.add(srcFs.getFileStatus(filePath));
+            }
+          } else {
+            newFileStatuses = listFilesCreatedByQuery(loadPath, writeId, stmtId);
+            newFiles.addAll(newFileStatuses.stream().map(FileStatus::getPath).collect(Collectors.toList()));

Review comment:
       done




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] deniskuzZ commented on a change in pull request #1971: HIVE-24738: Reuse committed filelist from directinsert manifest during loadPartition

Posted by GitBox <gi...@apache.org>.
deniskuzZ commented on a change in pull request #1971:
URL: https://github.com/apache/hive/pull/1971#discussion_r580907455



##########
File path: ql/src/java/org/apache/hadoop/hive/ql/exec/Utilities.java
##########
@@ -2862,29 +2862,29 @@ public boolean skipProcessing(Task<?> task) {
    * corresponding to these dynamic partitions.
    */
   public static Map<Path, PartitionDetails> getFullDPSpecs(Configuration conf, DynamicPartitionCtx dpCtx,
-      Set<String> dynamicPartitionSpecs) throws HiveException {
+      Map<String, List<Path>> dynamicPartitionSpecs) throws HiveException {
 
     try {
       Path loadPath = dpCtx.getRootPath();
       FileSystem fs = loadPath.getFileSystem(conf);
       int numDPCols = dpCtx.getNumDPCols();
-      List<Path> allPartition = new ArrayList<>();
+      Map<Path, List<Path>> allPartition = new HashMap<>();
       if (dynamicPartitionSpecs != null) {
-        for (String partSpec : dynamicPartitionSpecs) {
-          allPartition.add(new Path(loadPath, partSpec));
+        for (Map.Entry<String, List<Path>> partSpec : dynamicPartitionSpecs.entrySet()) {
+          allPartition.put(new Path(loadPath, partSpec.getKey()), partSpec.getValue());
         }
       } else {
         List<FileStatus> status = HiveStatsUtils.getFileStatusRecurse(loadPath, numDPCols, fs);
         for (FileStatus fileStatus : status) {
-          allPartition.add(fileStatus.getPath());
+          allPartition.put(fileStatus.getPath(), null);

Review comment:
       Would it require much of refactor if we switch to Optional here? 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] deniskuzZ commented on a change in pull request #1971: HIVE-24738: Reuse committed filelist from directinsert manifest during loadPartition

Posted by GitBox <gi...@apache.org>.
deniskuzZ commented on a change in pull request #1971:
URL: https://github.com/apache/hive/pull/1971#discussion_r580913139



##########
File path: ql/src/java/org/apache/hadoop/hive/ql/exec/Utilities.java
##########
@@ -4533,6 +4537,12 @@ public static void handleDirectInsertTableFinalPath(Path specPath, String unionS
           if (!committed.add(path)) {
             throw new HiveException(nextFile + " was specified in multiple manifests");
           }
+          for (Map.Entry<String, List<Path>> dynpath : dynamicPartitionSpecs.entrySet()){

Review comment:
       why not just find any match?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] pvargacl commented on a change in pull request #1971: HIVE-24738: Reuse committed filelist from directinsert manifest during loadPartition

Posted by GitBox <gi...@apache.org>.
pvargacl commented on a change in pull request #1971:
URL: https://github.com/apache/hive/pull/1971#discussion_r581010051



##########
File path: ql/src/java/org/apache/hadoop/hive/ql/exec/Utilities.java
##########
@@ -2862,29 +2862,29 @@ public boolean skipProcessing(Task<?> task) {
    * corresponding to these dynamic partitions.
    */
   public static Map<Path, PartitionDetails> getFullDPSpecs(Configuration conf, DynamicPartitionCtx dpCtx,
-      Set<String> dynamicPartitionSpecs) throws HiveException {
+      Map<String, List<Path>> dynamicPartitionSpecs) throws HiveException {
 
     try {
       Path loadPath = dpCtx.getRootPath();
       FileSystem fs = loadPath.getFileSystem(conf);
       int numDPCols = dpCtx.getNumDPCols();
-      List<Path> allPartition = new ArrayList<>();
+      Map<Path, List<Path>> allPartition = new HashMap<>();
       if (dynamicPartitionSpecs != null) {
-        for (String partSpec : dynamicPartitionSpecs) {
-          allPartition.add(new Path(loadPath, partSpec));
+        for (Map.Entry<String, List<Path>> partSpec : dynamicPartitionSpecs.entrySet()) {
+          allPartition.put(new Path(loadPath, partSpec.getKey()), partSpec.getValue());
         }
       } else {
         List<FileStatus> status = HiveStatsUtils.getFileStatusRecurse(loadPath, numDPCols, fs);
         for (FileStatus fileStatus : status) {
-          allPartition.add(fileStatus.getPath());
+          allPartition.put(fileStatus.getPath(), null);

Review comment:
       done




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] pvary commented on a change in pull request #1971: HIVE-24738: Reuse committed filelist from directinsert manifest during loadPartition

Posted by GitBox <gi...@apache.org>.
pvary commented on a change in pull request #1971:
URL: https://github.com/apache/hive/pull/1971#discussion_r581112577



##########
File path: ql/src/java/org/apache/hadoop/hive/ql/exec/Utilities.java
##########
@@ -4521,7 +4523,9 @@ public static void handleDirectInsertTableFinalPath(Path specPath, String unionS
           for (int i = 0; i < partitionCount; ++i) {
             String nextPart = mdis.readUTF();
             Utilities.FILE_OP_LOGGER.debug("Looking at dynamic partition {}", nextPart);
-            dynamicPartitionSpecs.add(nextPart);
+            if (!dynamicPartitionSpecs.containsKey(nextPart)) {
+              dynamicPartitionSpecs.put(nextPart, new ArrayList<>());

Review comment:
       What is the expected order of the items in the list?
   Do we really need a list there?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] pvary commented on a change in pull request #1971: HIVE-24738: Reuse committed filelist from directinsert manifest during loadPartition

Posted by GitBox <gi...@apache.org>.
pvary commented on a change in pull request #1971:
URL: https://github.com/apache/hive/pull/1971#discussion_r580991520



##########
File path: ql/src/java/org/apache/hadoop/hive/ql/exec/Utilities.java
##########
@@ -4521,7 +4523,9 @@ public static void handleDirectInsertTableFinalPath(Path specPath, String unionS
           for (int i = 0; i < partitionCount; ++i) {
             String nextPart = mdis.readUTF();
             Utilities.FILE_OP_LOGGER.debug("Looking at dynamic partition {}", nextPart);
-            dynamicPartitionSpecs.add(nextPart);
+            if (!dynamicPartitionSpecs.containsKey(nextPart)) {
+              dynamicPartitionSpecs.put(nextPart, new ArrayList<>());

Review comment:
       nit: why List?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] deniskuzZ commented on a change in pull request #1971: HIVE-24738: Reuse committed filelist from directinsert manifest during loadPartition

Posted by GitBox <gi...@apache.org>.
deniskuzZ commented on a change in pull request #1971:
URL: https://github.com/apache/hive/pull/1971#discussion_r580912279



##########
File path: ql/src/java/org/apache/hadoop/hive/ql/metadata/Hive.java
##########
@@ -2362,10 +2362,17 @@ private Partition loadPartitionInternal(Path loadPath, Table tbl, Map<String, St
               + ", Direct insert = " + isDirectInsert + ")");
         }
         if (newFiles != null) {
-          newFileStatuses = listFilesCreatedByQuery(loadPath, writeId, stmtId);
-          newFiles.addAll(newFileStatuses.stream()
-              .map(FileStatus::getPath)
-              .collect(Collectors.toList()));
+          if (!newFiles.isEmpty()) {
+            // We already know the file list from the direct insert manifestfile
+            FileSystem srcFs = loadPath.getFileSystem(conf);
+            newFileStatuses = new ArrayList<>();
+            for (Path filePath : newFiles) {
+              newFileStatuses.add(srcFs.getFileStatus(filePath));
+            }
+          } else {
+            newFileStatuses = listFilesCreatedByQuery(loadPath, writeId, stmtId);
+            newFiles.addAll(newFileStatuses.stream().map(FileStatus::getPath).collect(Collectors.toList()));

Review comment:
       can you please format it for better readability




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org