You are viewing a plain text version of this content. The canonical link for it is here.
Posted to gitbox@hive.apache.org by GitBox <gi...@apache.org> on 2021/01/27 08:59:35 UTC

[GitHub] [hive] pvargacl opened a new pull request #1915: HIVE-24682: Collect dynamic partition info in FileSink for direct insert

pvargacl opened a new pull request #1915:
URL: https://github.com/apache/hive/pull/1915


   
   
   ### What changes were proposed in this pull request?
   The dynamic partition infos can be collected from the manifest files, no need to do a costly file listing later in the MoveTask
   
   
   ### Why are the changes needed?
   Performance improvement
   
   ### Does this PR introduce _any_ user-facing change?
   No.
   
   
   ### How was this patch tested?
   All unit tests are working, local performance tests
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] pvary commented on a change in pull request #1915: HIVE-24682: Collect dynamic partition info in FileSink for direct insert

Posted by GitBox <gi...@apache.org>.
pvary commented on a change in pull request #1915:
URL: https://github.com/apache/hive/pull/1915#discussion_r573880385



##########
File path: ql/src/java/org/apache/hadoop/hive/ql/io/AcidUtils.java
##########
@@ -1470,16 +1470,15 @@ private static ValidTxnList getValidTxnList(Configuration conf) {
     return dirToSnapshots;
   }
 
-  private static boolean isChildOfDelta(Path childDir, Path rootPath) {
+  public static boolean isChildOfDelta(Path childDir, Path rootPath) {
     if (childDir.toUri().toString().length() <= rootPath.toUri().toString().length()) {
       return false;
     }
     // We do not want to look outside the original directory
     String fullName = childDir.toUri().toString().substring(rootPath.toUri().toString().length() + 1);
     String dirName = childDir.getName();
-    return (fullName.startsWith(BASE_PREFIX) && !dirName.startsWith(BASE_PREFIX)) ||
-        (fullName.startsWith(DELTA_PREFIX) && !dirName.startsWith(DELTA_PREFIX)) ||
-        (fullName.startsWith(DELETE_DELTA_PREFIX) && !dirName.startsWith(DELETE_DELTA_PREFIX));
+    return (!dirName.startsWith(BASE_PREFIX)) && !dirName.startsWith(DELTA_PREFIX) && !dirName.startsWith(DELETE_DELTA_PREFIX)
+          && (fullName.contains(BASE_PREFIX) || fullName.contains(DELTA_PREFIX) || fullName.contains(DELETE_DELTA_PREFIX));

Review comment:
       If I understand correctly this is changing the logic. Is it intentional?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] pvary commented on a change in pull request #1915: HIVE-24682: Collect dynamic partition info in FileSink for direct insert

Posted by GitBox <gi...@apache.org>.
pvary commented on a change in pull request #1915:
URL: https://github.com/apache/hive/pull/1915#discussion_r573868615



##########
File path: ql/src/test/results/clientpositive/llap/mm_all.q.out
##########
@@ -1649,8 +1649,6 @@ POSTHOOK: Lineage: ###Masked###
 POSTHOOK: Lineage: ###Masked###
 POSTHOOK: Lineage: ###Masked###
 POSTHOOK: Lineage: ###Masked###
-POSTHOOK: Lineage: ###Masked###

Review comment:
       Were these also duplicated rows? Could you please check the q.out.orig files before the masking?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] pvargacl commented on a change in pull request #1915: HIVE-24682: Collect dynamic partition info in FileSink for direct insert

Posted by GitBox <gi...@apache.org>.
pvargacl commented on a change in pull request #1915:
URL: https://github.com/apache/hive/pull/1915#discussion_r573910911



##########
File path: ql/src/test/results/clientpositive/llap/acid_multiinsert_dyn_part.q.out
##########
@@ -649,16 +649,10 @@ POSTHOOK: Output: default@multiinsert_test_mm@c=2222
 POSTHOOK: Output: default@multiinsert_test_mm@c=__HIVE_DEFAULT_PARTITION__
 POSTHOOK: Lineage: multiinsert_test_mm PARTITION(c=1111).a SIMPLE [(multiinsert_test_text)a.FieldSchema(name:a, type:int, comment:null), ]
 POSTHOOK: Lineage: multiinsert_test_mm PARTITION(c=1111).b SIMPLE [(multiinsert_test_text)a.FieldSchema(name:b, type:int, comment:null), ]
-POSTHOOK: Lineage: multiinsert_test_mm PARTITION(c=1111).a SIMPLE [(multiinsert_test_text)a.FieldSchema(name:a, type:int, comment:null), ]
-POSTHOOK: Lineage: multiinsert_test_mm PARTITION(c=1111).b SIMPLE [(multiinsert_test_text)a.FieldSchema(name:b, type:int, comment:null), ]
-POSTHOOK: Lineage: multiinsert_test_mm PARTITION(c=2222).a SIMPLE [(multiinsert_test_text)a.FieldSchema(name:a, type:int, comment:null), ]
-POSTHOOK: Lineage: multiinsert_test_mm PARTITION(c=2222).b SIMPLE [(multiinsert_test_text)a.FieldSchema(name:b, type:int, comment:null), ]
 POSTHOOK: Lineage: multiinsert_test_mm PARTITION(c=2222).a SIMPLE [(multiinsert_test_text)a.FieldSchema(name:a, type:int, comment:null), ]
 POSTHOOK: Lineage: multiinsert_test_mm PARTITION(c=2222).b SIMPLE [(multiinsert_test_text)a.FieldSchema(name:b, type:int, comment:null), ]
 POSTHOOK: Lineage: multiinsert_test_mm PARTITION(c=__HIVE_DEFAULT_PARTITION__).a SIMPLE [(multiinsert_test_text)a.FieldSchema(name:a, type:int, comment:null), ]
 POSTHOOK: Lineage: multiinsert_test_mm PARTITION(c=__HIVE_DEFAULT_PARTITION__).b SIMPLE [(multiinsert_test_text)a.FieldSchema(name:b, type:int, comment:null), ]
-POSTHOOK: Lineage: multiinsert_test_mm PARTITION(c=__HIVE_DEFAULT_PARTITION__).a SIMPLE [(multiinsert_test_text)a.FieldSchema(name:a, type:int, comment:null), ]

Review comment:
       yes these were duplicates




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] pvargacl commented on a change in pull request #1915: HIVE-24682: Collect dynamic partition info in FileSink for direct insert

Posted by GitBox <gi...@apache.org>.
pvargacl commented on a change in pull request #1915:
URL: https://github.com/apache/hive/pull/1915#discussion_r573909577



##########
File path: ql/src/java/org/apache/hadoop/hive/ql/io/AcidUtils.java
##########
@@ -1470,16 +1470,15 @@ private static ValidTxnList getValidTxnList(Configuration conf) {
     return dirToSnapshots;
   }
 
-  private static boolean isChildOfDelta(Path childDir, Path rootPath) {
+  public static boolean isChildOfDelta(Path childDir, Path rootPath) {
     if (childDir.toUri().toString().length() <= rootPath.toUri().toString().length()) {
       return false;
     }
     // We do not want to look outside the original directory
     String fullName = childDir.toUri().toString().substring(rootPath.toUri().toString().length() + 1);
     String dirName = childDir.getName();
-    return (fullName.startsWith(BASE_PREFIX) && !dirName.startsWith(BASE_PREFIX)) ||
-        (fullName.startsWith(DELTA_PREFIX) && !dirName.startsWith(DELTA_PREFIX)) ||
-        (fullName.startsWith(DELETE_DELTA_PREFIX) && !dirName.startsWith(DELETE_DELTA_PREFIX));
+    return (!dirName.startsWith(BASE_PREFIX)) && !dirName.startsWith(DELTA_PREFIX) && !dirName.startsWith(DELETE_DELTA_PREFIX)
+          && (fullName.contains(BASE_PREFIX) || fullName.contains(DELTA_PREFIX) || fullName.contains(DELETE_DELTA_PREFIX));

Review comment:
       Yes. Previously it was working only for rootpath like delta_00001/subdir/, now it works for table/partition=1/delta_00001/subdir
   The reordering was needed because fullName.contains(delta_prefix) && !dirname.startWith(delta_prefix) is true for delete_delta_000001, and it would yield wrong result




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] pvary commented on a change in pull request #1915: HIVE-24682: Collect dynamic partition info in FileSink for direct insert

Posted by GitBox <gi...@apache.org>.
pvary commented on a change in pull request #1915:
URL: https://github.com/apache/hive/pull/1915#discussion_r573868139



##########
File path: ql/src/test/results/clientpositive/llap/acid_multiinsert_dyn_part.q.out
##########
@@ -649,16 +649,10 @@ POSTHOOK: Output: default@multiinsert_test_mm@c=2222
 POSTHOOK: Output: default@multiinsert_test_mm@c=__HIVE_DEFAULT_PARTITION__
 POSTHOOK: Lineage: multiinsert_test_mm PARTITION(c=1111).a SIMPLE [(multiinsert_test_text)a.FieldSchema(name:a, type:int, comment:null), ]
 POSTHOOK: Lineage: multiinsert_test_mm PARTITION(c=1111).b SIMPLE [(multiinsert_test_text)a.FieldSchema(name:b, type:int, comment:null), ]
-POSTHOOK: Lineage: multiinsert_test_mm PARTITION(c=1111).a SIMPLE [(multiinsert_test_text)a.FieldSchema(name:a, type:int, comment:null), ]
-POSTHOOK: Lineage: multiinsert_test_mm PARTITION(c=1111).b SIMPLE [(multiinsert_test_text)a.FieldSchema(name:b, type:int, comment:null), ]
-POSTHOOK: Lineage: multiinsert_test_mm PARTITION(c=2222).a SIMPLE [(multiinsert_test_text)a.FieldSchema(name:a, type:int, comment:null), ]
-POSTHOOK: Lineage: multiinsert_test_mm PARTITION(c=2222).b SIMPLE [(multiinsert_test_text)a.FieldSchema(name:b, type:int, comment:null), ]
 POSTHOOK: Lineage: multiinsert_test_mm PARTITION(c=2222).a SIMPLE [(multiinsert_test_text)a.FieldSchema(name:a, type:int, comment:null), ]
 POSTHOOK: Lineage: multiinsert_test_mm PARTITION(c=2222).b SIMPLE [(multiinsert_test_text)a.FieldSchema(name:b, type:int, comment:null), ]
 POSTHOOK: Lineage: multiinsert_test_mm PARTITION(c=__HIVE_DEFAULT_PARTITION__).a SIMPLE [(multiinsert_test_text)a.FieldSchema(name:a, type:int, comment:null), ]
 POSTHOOK: Lineage: multiinsert_test_mm PARTITION(c=__HIVE_DEFAULT_PARTITION__).b SIMPLE [(multiinsert_test_text)a.FieldSchema(name:b, type:int, comment:null), ]
-POSTHOOK: Lineage: multiinsert_test_mm PARTITION(c=__HIVE_DEFAULT_PARTITION__).a SIMPLE [(multiinsert_test_text)a.FieldSchema(name:a, type:int, comment:null), ]

Review comment:
       are these duplicate rows in the lineage info?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] pvargacl commented on a change in pull request #1915: HIVE-24682: Collect dynamic partition info in FileSink for direct insert

Posted by GitBox <gi...@apache.org>.
pvargacl commented on a change in pull request #1915:
URL: https://github.com/apache/hive/pull/1915#discussion_r573913730



##########
File path: ql/src/java/org/apache/hadoop/hive/ql/io/AcidUtils.java
##########
@@ -1470,16 +1470,15 @@ private static ValidTxnList getValidTxnList(Configuration conf) {
     return dirToSnapshots;
   }
 
-  private static boolean isChildOfDelta(Path childDir, Path rootPath) {
+  public static boolean isChildOfDelta(Path childDir, Path rootPath) {
     if (childDir.toUri().toString().length() <= rootPath.toUri().toString().length()) {
       return false;
     }
     // We do not want to look outside the original directory
     String fullName = childDir.toUri().toString().substring(rootPath.toUri().toString().length() + 1);
     String dirName = childDir.getName();
-    return (fullName.startsWith(BASE_PREFIX) && !dirName.startsWith(BASE_PREFIX)) ||
-        (fullName.startsWith(DELTA_PREFIX) && !dirName.startsWith(DELTA_PREFIX)) ||
-        (fullName.startsWith(DELETE_DELTA_PREFIX) && !dirName.startsWith(DELETE_DELTA_PREFIX));
+    return (!dirName.startsWith(BASE_PREFIX)) && !dirName.startsWith(DELTA_PREFIX) && !dirName.startsWith(DELETE_DELTA_PREFIX)

Review comment:
       fixed




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] pvargacl commented on pull request #1915: HIVE-24682: Collect dynamic partition info in FileSink for direct insert

Posted by GitBox <gi...@apache.org>.
pvargacl commented on pull request #1915:
URL: https://github.com/apache/hive/pull/1915#issuecomment-768139094


   @kuczoram could you please review this?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] pvargacl merged pull request #1915: HIVE-24682: Collect dynamic partition info in FileSink for direct insert

Posted by GitBox <gi...@apache.org>.
pvargacl merged pull request #1915:
URL: https://github.com/apache/hive/pull/1915


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] pvary commented on a change in pull request #1915: HIVE-24682: Collect dynamic partition info in FileSink for direct insert

Posted by GitBox <gi...@apache.org>.
pvary commented on a change in pull request #1915:
URL: https://github.com/apache/hive/pull/1915#discussion_r573869599



##########
File path: ql/src/java/org/apache/hadoop/hive/ql/io/AcidUtils.java
##########
@@ -1470,16 +1470,15 @@ private static ValidTxnList getValidTxnList(Configuration conf) {
     return dirToSnapshots;
   }
 
-  private static boolean isChildOfDelta(Path childDir, Path rootPath) {
+  public static boolean isChildOfDelta(Path childDir, Path rootPath) {
     if (childDir.toUri().toString().length() <= rootPath.toUri().toString().length()) {
       return false;
     }
     // We do not want to look outside the original directory
     String fullName = childDir.toUri().toString().substring(rootPath.toUri().toString().length() + 1);
     String dirName = childDir.getName();
-    return (fullName.startsWith(BASE_PREFIX) && !dirName.startsWith(BASE_PREFIX)) ||
-        (fullName.startsWith(DELTA_PREFIX) && !dirName.startsWith(DELTA_PREFIX)) ||
-        (fullName.startsWith(DELETE_DELTA_PREFIX) && !dirName.startsWith(DELETE_DELTA_PREFIX));
+    return (!dirName.startsWith(BASE_PREFIX)) && !dirName.startsWith(DELTA_PREFIX) && !dirName.startsWith(DELETE_DELTA_PREFIX)

Review comment:
       Is this an extra `()` on the first one?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org