Posted to gitbox@hive.apache.org by GitBox <gi...@apache.org> on 2020/07/14 09:48:59 UTC

[GitHub] [hive] pvary opened a new pull request #1251: HIVE-23840: Use LLAP to get orc metadata

pvary opened a new pull request #1251:
URL: https://github.com/apache/hive/pull/1251


   Started using the new LLAP getOrcTailFromCache
   Refactored the code to use the ORC tail instead of the reader-related objects
   Added some unit tests for the new, smaller components


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] szlta commented on a change in pull request #1251: HIVE-23840: Use LLAP to get orc metadata

Posted by GitBox <gi...@apache.org>.
szlta commented on a change in pull request #1251:
URL: https://github.com/apache/hive/pull/1251#discussion_r454393621



##########
File path: ql/src/java/org/apache/hadoop/hive/ql/io/orc/VectorizedOrcAcidRowBatchReader.java
##########
@@ -129,6 +137,16 @@
    */
   private SearchArgument deleteEventSarg = null;
 
+  /**
+   * Cachetag associated with the Split
+   */
+  private final CacheTag cacheTag;
+
+  /**
+   * Skip using Llap IO cache for checking delete_delta files if the configuration is not correct
+   */
+  private static boolean skipLlapCache = true;

Review comment:
       Is this initialized to true on purpose for now? If not, I don't see it ever being set to false.
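
       To illustrate the concern, here is a minimal sketch rather than the PR's actual code. It assumes the flag is meant to start out false and be flipped only after a cache lookup fails because of configuration; that policy is a reading of the Javadoc, not something the hunk confirms. TailSource, getTail and cacheLookups are hypothetical names, and an instance field is used instead of the static one only to keep the sketch easy to run.

```java
import java.util.function.Function;

/** Hypothetical sketch; none of these names exist in Hive. */
class TailSource {
  // Plays the role of skipLlapCache: when true, the cache branch is never taken.
  private boolean skipCache;
  // Counts how often the cache was actually consulted.
  int cacheLookups = 0;

  TailSource(boolean initialSkip) {
    this.skipCache = initialSkip;
  }

  String getTail(String path, Function<String, String> cache) {
    if (!skipCache) {
      cacheLookups++;
      try {
        return cache.apply(path);
      } catch (IllegalStateException e) {
        // Assumed policy: stop trying the cache after a configuration failure.
        skipCache = true;
      }
    }
    // Fallback: read the tail directly from the file.
    return "direct:" + path;
  }

  public static void main(String[] args) {
    Function<String, String> cache = p -> "cached:" + p;
    TailSource stuck = new TailSource(true);   // flag starts as true
    TailSource fixed = new TailSource(false);  // flag starts as false
    System.out.println(stuck.getTail("f", cache) + " lookups=" + stuck.cacheLookups); // direct:f lookups=0
    System.out.println(fixed.getTail("f", cache) + " lookups=" + fixed.cacheLookups); // cached:f lookups=1
  }
}
```

       With the flag starting as true and nothing resetting it, the cache branch is effectively dead code, which is what the question above is getting at.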

##########
File path: ql/src/java/org/apache/hadoop/hive/ql/io/orc/VectorizedOrcAcidRowBatchReader.java
##########
@@ -1562,20 +1580,31 @@ public int compareTo(CompressedOwid other) {
       try {
         final Path[] deleteDeltaDirs = getDeleteDeltaDirsFromSplit(orcSplit);
         if (deleteDeltaDirs.length > 0) {
+          FileSystem fs = orcSplit.getPath().getFileSystem(conf);
+          AcidOutputFormat.Options orcSplitMinMaxWriteIds =
+              AcidUtils.parseBaseOrDeltaBucketFilename(orcSplit.getPath(), conf);
           int totalDeleteEventCount = 0;
           for (Path deleteDeltaDir : deleteDeltaDirs) {
-            FileSystem fs = deleteDeltaDir.getFileSystem(conf);
+            if (!isQualifiedDeleteDeltaForSplit(orcSplitMinMaxWriteIds, deleteDeltaDir)) {
+              continue;
+            }
             Path[] deleteDeltaFiles = OrcRawRecordMerger.getDeltaFiles(deleteDeltaDir, bucket,
                 new OrcRawRecordMerger.Options().isCompacting(false), null);
             for (Path deleteDeltaFile : deleteDeltaFiles) {
               try {
-                /**
-                 * todo: we have OrcSplit.orcTail so we should be able to get stats from there
-                 */
-                Reader deleteDeltaReader = OrcFile.createReader(deleteDeltaFile, OrcFile.readerOptions(conf));
-                if (deleteDeltaReader.getNumberOfRows() <= 0) {
+                ReaderData readerData = getOrcTail(deleteDeltaFile, conf, cacheTag);
+                OrcTail orcTail = readerData.orcTail;
+                if (orcTail.getFooter().getNumberOfRows() <= 0) {
                   continue; // just a safe check to ensure that we are not reading empty delete files.
                 }
+                OrcRawRecordMerger.KeyInterval deleteKeyInterval = findDeleteMinMaxKeys(orcTail, deleteDeltaFile);
+                if (!deleteKeyInterval.isIntersects(keyInterval)) {
+                  // If there is no intersection between data and delete delta, do not read delete file
+                  continue;
+                }
+                // Create the reader if we got the OrcTail from cache

Review comment:
       nit: the comment could be more detailed, e.g.: The reader can be reused if it was already created (only in the non-LLAP-cache case); otherwise we need to create it here.
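
       The reuse-or-create behavior described in this comment could be sketched as follows. This is an illustration under assumptions, not the PR's implementation: ReaderDataSketch and readerOrCreate are hypothetical names, strings stand in for OrcTail and Reader, and it is assumed that a holder built from the cache carries no reader.

```java
import java.util.function.Supplier;

/** Hypothetical holder, loosely modeled on the ReaderData seen in the diff. */
class ReaderDataSketch {
  final String tail;      // stands in for OrcTail
  private String reader;  // stands in for Reader; null when the tail came from the cache

  ReaderDataSketch(String tail, String reader) {
    this.tail = tail;
    this.reader = reader;
  }

  /** Returns the reader created earlier, or creates one on first use. */
  String readerOrCreate(Supplier<String> factory) {
    if (reader == null) {
      reader = factory.get();
    }
    return reader;
  }

  public static void main(String[] args) {
    int[] created = {0};
    Supplier<String> factory = () -> {
      created[0]++;
      return "new-reader";
    };
    // Tail read from the file: a reader already exists and is reused.
    ReaderDataSketch fromFile = new ReaderDataSketch("tail", "existing-reader");
    // Tail served from the cache: no reader yet, so one is created lazily.
    ReaderDataSketch fromCache = new ReaderDataSketch("tail", null);
    System.out.println(fromFile.readerOrCreate(factory));  // existing-reader
    System.out.println(fromCache.readerOrCreate(factory)); // new-reader
    System.out.println(fromCache.readerOrCreate(factory)); // new-reader
    System.out.println("created=" + created[0]);           // created=1
  }
}
```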






[GitHub] [hive] pvary commented on a change in pull request #1251: HIVE-23840: Use LLAP to get orc metadata

Posted by GitBox <gi...@apache.org>.
pvary commented on a change in pull request #1251:
URL: https://github.com/apache/hive/pull/1251#discussion_r454602727



##########
File path: ql/src/java/org/apache/hadoop/hive/ql/io/orc/VectorizedOrcAcidRowBatchReader.java
##########
@@ -232,6 +250,17 @@ private VectorizedOrcAcidRowBatchReader(JobConf conf, OrcSplit orcSplit, Reporte
 
     this.syntheticProps = orcSplit.getSyntheticAcidProps();
 
+    if (LlapHiveUtils.isLlapMode(conf) && LlapProxy.isDaemon()
+            && HiveConf.getBoolVar(conf, ConfVars.LLAP_TRACK_CACHE_USAGE))
+    {
+      MapWork mapWork = LlapHiveUtils.findMapWork(conf);

Review comment:
       Good idea, done!






[GitHub] [hive] szlta commented on a change in pull request #1251: HIVE-23840: Use LLAP to get orc metadata

Posted by GitBox <gi...@apache.org>.
szlta commented on a change in pull request #1251:
URL: https://github.com/apache/hive/pull/1251#discussion_r454390429



##########
File path: ql/src/java/org/apache/hadoop/hive/ql/io/orc/VectorizedOrcAcidRowBatchReader.java
##########
@@ -232,6 +250,17 @@ private VectorizedOrcAcidRowBatchReader(JobConf conf, OrcSplit orcSplit, Reporte
 
     this.syntheticProps = orcSplit.getSyntheticAcidProps();
 
+    if (LlapHiveUtils.isLlapMode(conf) && LlapProxy.isDaemon()
+            && HiveConf.getBoolVar(conf, ConfVars.LLAP_TRACK_CACHE_USAGE))
+    {
+      MapWork mapWork = LlapHiveUtils.findMapWork(conf);

Review comment:
       We could avoid deserializing MapWork from the JobConf here if we passed the MapWork instance already present in LlapRecordReader to the VectorizedOrcAcidRowBatchReader constructor. (The downside is that we would in turn need to adjust the other constructors of VectorizedOrcAcidRowBatchReader too.)
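
       A rough sketch of the shape this could take, assuming the existing constructors keep working by delegating to a new one that accepts the already-resolved plan. ReaderSketch, Plan and fromConf are hypothetical stand-ins (for the reader, MapWork and the lookup from the JobConf), and a plain Map stands in for the configuration.

```java
import java.util.Collections;
import java.util.Map;

/** Hypothetical stand-in for the reader; not Hive code. */
class ReaderSketch {
  /** Hypothetical stand-in for MapWork. */
  static class Plan {
    static int deserializations = 0;  // counts the expensive lookups
    final String name;

    Plan(String name) {
      this.name = name;
    }

    /** Stands in for deserializing the plan from the configuration. */
    static Plan fromConf(Map<String, String> conf) {
      deserializations++;
      return new Plan(conf.get("plan"));
    }
  }

  final Map<String, String> conf;
  final Plan plan;

  /** Existing-style constructor: resolves the plan itself, then delegates. */
  ReaderSketch(Map<String, String> conf) {
    this(conf, Plan.fromConf(conf));
  }

  /** New constructor: the caller passes the plan it already holds. */
  ReaderSketch(Map<String, String> conf, Plan plan) {
    this.conf = conf;
    this.plan = plan;
  }

  public static void main(String[] args) {
    Map<String, String> conf = Collections.singletonMap("plan", "p1");
    Plan shared = Plan.fromConf(conf);  // done once by the caller
    new ReaderSketch(conf, shared);     // no further deserialization
    new ReaderSketch(conf, shared);
    System.out.println("deserializations=" + Plan.deserializations); // deserializations=1
  }
}
```

       Delegation keeps the other constructors to a one-line change each, which limits the downside mentioned above.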






[GitHub] [hive] pvary commented on a change in pull request #1251: HIVE-23840: Use LLAP to get orc metadata

Posted by GitBox <gi...@apache.org>.
pvary commented on a change in pull request #1251:
URL: https://github.com/apache/hive/pull/1251#discussion_r454602904



##########
File path: ql/src/java/org/apache/hadoop/hive/ql/io/orc/VectorizedOrcAcidRowBatchReader.java
##########
@@ -129,6 +137,16 @@
    */
   private SearchArgument deleteEventSarg = null;
 
+  /**
+   * Cachetag associated with the Split
+   */
+  private final CacheTag cacheTag;
+
+  /**
+   * Skip using Llap IO cache for checking delete_delta files if the configuration is not correct
+   */
+  private static boolean skipLlapCache = true;

Review comment:
       That was a mistake. Corrected; it is now initialized to false.






[GitHub] [hive] pvary commented on a change in pull request #1251: HIVE-23840: Use LLAP to get orc metadata

Posted by GitBox <gi...@apache.org>.
pvary commented on a change in pull request #1251:
URL: https://github.com/apache/hive/pull/1251#discussion_r454603042



##########
File path: ql/src/java/org/apache/hadoop/hive/ql/io/orc/VectorizedOrcAcidRowBatchReader.java
##########
@@ -1562,20 +1580,31 @@ public int compareTo(CompressedOwid other) {
       try {
         final Path[] deleteDeltaDirs = getDeleteDeltaDirsFromSplit(orcSplit);
         if (deleteDeltaDirs.length > 0) {
+          FileSystem fs = orcSplit.getPath().getFileSystem(conf);
+          AcidOutputFormat.Options orcSplitMinMaxWriteIds =
+              AcidUtils.parseBaseOrDeltaBucketFilename(orcSplit.getPath(), conf);
           int totalDeleteEventCount = 0;
           for (Path deleteDeltaDir : deleteDeltaDirs) {
-            FileSystem fs = deleteDeltaDir.getFileSystem(conf);
+            if (!isQualifiedDeleteDeltaForSplit(orcSplitMinMaxWriteIds, deleteDeltaDir)) {
+              continue;
+            }
             Path[] deleteDeltaFiles = OrcRawRecordMerger.getDeltaFiles(deleteDeltaDir, bucket,
                 new OrcRawRecordMerger.Options().isCompacting(false), null);
             for (Path deleteDeltaFile : deleteDeltaFiles) {
               try {
-                /**
-                 * todo: we have OrcSplit.orcTail so we should be able to get stats from there
-                 */
-                Reader deleteDeltaReader = OrcFile.createReader(deleteDeltaFile, OrcFile.readerOptions(conf));
-                if (deleteDeltaReader.getNumberOfRows() <= 0) {
+                ReaderData readerData = getOrcTail(deleteDeltaFile, conf, cacheTag);
+                OrcTail orcTail = readerData.orcTail;
+                if (orcTail.getFooter().getNumberOfRows() <= 0) {
                   continue; // just a safe check to ensure that we are not reading empty delete files.
                 }
+                OrcRawRecordMerger.KeyInterval deleteKeyInterval = findDeleteMinMaxKeys(orcTail, deleteDeltaFile);
+                if (!deleteKeyInterval.isIntersects(keyInterval)) {
+                  // If there is no intersection between data and delete delta, do not read delete file
+                  continue;
+                }
+                // Create the reader if we got the OrcTail from cache

Review comment:
       Added a more detailed comment.
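
       For readers following the quoted hunk: the isIntersects check is what allows a delete delta file to be skipped without being read. A minimal sketch of such an interval test is below. IntervalSketch is a hypothetical class over plain long keys, and treating a null bound as "unbounded" is an assumption of the sketch, not something the hunk shows about KeyInterval.

```java
/** Hypothetical closed interval over long keys; a null bound means unbounded (assumption). */
class IntervalSketch {
  final Long min;
  final Long max;

  IntervalSketch(Long min, Long max) {
    this.min = min;
    this.max = max;
  }

  /** True when the two closed intervals share at least one key. */
  boolean intersects(IntervalSketch other) {
    boolean startsBeforeOtherEnds = min == null || other.max == null || min <= other.max;
    boolean otherStartsBeforeThisEnds = other.min == null || max == null || other.min <= max;
    return startsBeforeOtherEnds && otherStartsBeforeThisEnds;
  }

  public static void main(String[] args) {
    IntervalSketch data = new IntervalSketch(1L, 5L);
    System.out.println(data.intersects(new IntervalSketch(6L, 9L)));   // false: file can be skipped
    System.out.println(data.intersects(new IntervalSketch(5L, 9L)));   // true: file must be read
    System.out.println(data.intersects(new IntervalSketch(null, 0L))); // false
  }
}
```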






[GitHub] [hive] pvary merged pull request #1251: HIVE-23840: Use LLAP to get orc metadata

Posted by GitBox <gi...@apache.org>.
pvary merged pull request #1251:
URL: https://github.com/apache/hive/pull/1251


   

