Posted to gitbox@hive.apache.org by GitBox <gi...@apache.org> on 2021/01/25 20:20:06 UTC

[GitHub] [hive] mustafaiman commented on a change in pull request #1823: HIVE-23553: Upgrade ORC version to 1.6.7

mustafaiman commented on a change in pull request #1823:
URL: https://github.com/apache/hive/pull/1823#discussion_r563956041



##########
File path: common/src/java/org/apache/hadoop/hive/conf/HiveConf.java
##########
@@ -4509,7 +4509,7 @@ private static void populateLlapDaemonVarsSet(Set<String> llapDaemonVarsSetLocal
         "Minimum allocation possible from LLAP buddy allocator. Allocations below that are\n" +
         "padded to minimum allocation. For ORC, should generally be the same as the expected\n" +
         "compression buffer size, or next lowest power of 2. Must be a power of 2."),
-    LLAP_ALLOCATOR_MAX_ALLOC("hive.llap.io.allocator.alloc.max", "16Mb", new SizeValidator(),
+    LLAP_ALLOCATOR_MAX_ALLOC("hive.llap.io.allocator.alloc.max", "4Mb", new SizeValidator(),

Review comment:
       You say this is changed to be compatible with the ORC setting, but I do not understand why that is necessary or what its impact is. This looks like a change that should not be taken lightly.
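   To be explicit about what keeping the old behavior would take: if the patch only changes the default, an operator could still pin the previous ceiling. A minimal sketch, assuming standard HiveConf usage and that nothing else in the patch depends on the smaller value:

```java
// Hedged sketch, not part of this patch: keep the previous 16Mb ceiling so the
// new 4Mb default does not change allocator behavior for an existing deployment.
// Requires org.apache.hadoop.hive.conf.HiveConf; the value must stay a power of 2.
HiveConf conf = new HiveConf();
conf.set("hive.llap.io.allocator.alloc.max", "16Mb");
```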

##########
File path: llap-server/src/java/org/apache/hadoop/hive/llap/io/encoded/LlapRecordReaderUtils.java
##########
@@ -0,0 +1,438 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hadoop.hive.llap.io.encoded;
+
+import org.apache.hadoop.fs.FSDataInputStream;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.hive.common.io.DiskRangeList;
+import org.apache.hadoop.hive.ql.io.orc.encoded.LlapDataReader;
+import org.apache.orc.CompressionCodec;
+import org.apache.orc.CompressionKind;
+import org.apache.orc.OrcFile;
+import org.apache.orc.OrcProto;
+import org.apache.orc.StripeInformation;
+import org.apache.orc.TypeDescription;
+import org.apache.orc.impl.BufferChunk;
+import org.apache.orc.impl.DataReaderProperties;
+import org.apache.orc.impl.DirectDecompressionCodec;
+import org.apache.orc.impl.HadoopShims;
+import org.apache.orc.impl.HadoopShimsFactory;
+import org.apache.orc.impl.InStream;
+import org.apache.orc.impl.OrcCodecPool;
+import org.apache.orc.impl.OrcIndex;
+import org.apache.orc.impl.RecordReaderUtils;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import java.io.IOException;
+import java.nio.ByteBuffer;
+import java.util.List;
+import java.util.function.Supplier;
+
+public class LlapRecordReaderUtils {
+
+  private static final HadoopShims SHIMS = HadoopShimsFactory.get();
+  private static final Logger LOG = LoggerFactory.getLogger(LlapRecordReaderUtils.class);
+
+  static HadoopShims.ZeroCopyReaderShim createZeroCopyShim(FSDataInputStream file, CompressionCodec codec, RecordReaderUtils.ByteBufferAllocatorPool pool) throws IOException {
+    return codec != null && (!(codec instanceof DirectDecompressionCodec) || !((DirectDecompressionCodec)codec).isAvailable()) ? null : SHIMS.getZeroCopyReader(file, pool);

Review comment:
       Can you invert this condition? There are a lot of negatives, which makes it hard to understand.
   `codec == null || (codec instanceof DirectDecompressionCodec && ((DirectDecompressionCodec) codec).isAvailable()) ? SHIMS.getZeroCopyReader(file, pool) : null` is much easier to follow.
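   Spelled out as a full method, the inverted form might read like this (just a sketch; it is intended to be behaviorally identical to the ternary in the diff):

```java
static HadoopShims.ZeroCopyReaderShim createZeroCopyShim(FSDataInputStream file,
    CompressionCodec codec, RecordReaderUtils.ByteBufferAllocatorPool pool) throws IOException {
  // Zero-copy reads only work with no codec at all, or with a
  // direct-decompression codec that is actually available.
  if (codec == null
      || (codec instanceof DirectDecompressionCodec
          && ((DirectDecompressionCodec) codec).isAvailable())) {
    return SHIMS.getZeroCopyReader(file, pool);
  }
  return null;
}
```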

##########
File path: ql/src/java/org/apache/hadoop/hive/ql/io/orc/VectorizedOrcAcidRowBatchReader.java
##########
@@ -443,7 +444,8 @@ public void setBaseAndInnerReader(
       return new OrcRawRecordMerger.KeyInterval(null, null);
     }
 
-    OrcTail orcTail = getOrcTail(orcSplit.getPath(), conf, cacheTag, orcSplit.getFileKey()).orcTail;
+    VectorizedOrcAcidRowBatchReader.ReaderData orcReaderData =

Review comment:
       I am not sure about this. Previously we did not create the full reader, so why do we need to create it now? All calls from here use orcTail anyway, except `List<StripeStatistics> stats = orcReaderData.reader.getVariantStripeStatistics(null);`, and we could build that from the information in the tail as well.

##########
File path: llap-server/src/java/org/apache/hadoop/hive/llap/io/encoded/OrcEncodedDataReader.java
##########
@@ -631,10 +630,19 @@ private OrcFileMetadata getFileFooterFromCacheOrDisk() throws IOException {
           OrcTail orcTail = getOrcTailFromLlapBuffers(tailBuffers);
           counters.incrCounter(LlapIOCounters.METADATA_CACHE_HIT);
           FileTail tail = orcTail.getFileTail();
-          stats = orcTail.getStripeStatisticsProto();
+          CompressionKind compressionKind = orcTail.getCompressionKind();
+          InStream.StreamOptions options = null;
+          if (compressionKind != CompressionKind.NONE) {
+            options = InStream.options()
+                .withCodec(OrcCodecPool.getCodec(compressionKind)).withBufferSize(orcTail.getCompressionBufferSize());
+          }
+          InStream stream = InStream.create("stripe stats", orcTail.getTailBuffer(),

Review comment:
       Please extract "stripe stats" to a constant.
   I understand the correct way to get the stripe stats is from OrcReader.getStripeStats. However, we want to avoid creating a reader when we already have the metadata in the cache, which, as far as I understand, is why we do this here. It would be better to extract this part into a helper method with some javadoc explaining why we do it; see the sketch below.
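   As a sketch of what I mean (the constant and method names are made up here; the body is just the code from this hunk moved into one place, assuming the same imports are present in OrcEncodedDataReader):

```java
private static final String STRIPE_STATS_STREAM_NAME = "stripe stats";

/**
 * Parses stripe-level statistics directly from a cached ORC tail. We do this
 * instead of building a full Reader so that a metadata cache hit does not force
 * us to go back to the file footer just to read the stripe statistics.
 */
private static List<OrcProto.StripeStatistics> getStripeStatsFromOrcTail(OrcTail orcTail)
    throws IOException {
  CompressionKind compressionKind = orcTail.getCompressionKind();
  InStream.StreamOptions options = null;
  if (compressionKind != CompressionKind.NONE) {
    options = InStream.options()
        .withCodec(OrcCodecPool.getCodec(compressionKind))
        .withBufferSize(orcTail.getCompressionBufferSize());
  }
  InStream stream = InStream.create(STRIPE_STATS_STREAM_NAME, orcTail.getTailBuffer(),
      orcTail.getMetadataOffset(), orcTail.getMetadataSize(), options);
  return OrcProto.Metadata.parseFrom(InStream.createCodedInputStream(stream))
      .getStripeStatsList();
}
```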

##########
File path: llap-server/src/java/org/apache/hadoop/hive/llap/io/encoded/OrcEncodedDataReader.java
##########
@@ -631,10 +630,19 @@ private OrcFileMetadata getFileFooterFromCacheOrDisk() throws IOException {
           OrcTail orcTail = getOrcTailFromLlapBuffers(tailBuffers);
           counters.incrCounter(LlapIOCounters.METADATA_CACHE_HIT);
           FileTail tail = orcTail.getFileTail();
-          stats = orcTail.getStripeStatisticsProto();
+          CompressionKind compressionKind = orcTail.getCompressionKind();
+          InStream.StreamOptions options = null;
+          if (compressionKind != CompressionKind.NONE) {
+            options = InStream.options()
+                .withCodec(OrcCodecPool.getCodec(compressionKind)).withBufferSize(orcTail.getCompressionBufferSize());
+          }
+          InStream stream = InStream.create("stripe stats", orcTail.getTailBuffer(),
+              orcTail.getMetadataOffset(), orcTail.getMetadataSize(), options);
+          stats = OrcProto.Metadata.parseFrom(InStream.createCodedInputStream(stream)).getStripeStatsList();
           stripes = new ArrayList<>(tail.getFooter().getStripesCount());
+          int stripeIdx = 0;
           for (OrcProto.StripeInformation stripeProto : tail.getFooter().getStripesList()) {
-            stripes.add(new ReaderImpl.StripeInformationImpl(stripeProto));
+            stripes.add(new ReaderImpl.StripeInformationImpl(stripeProto, stripeIdx++, -1, null));

Review comment:
       I am confused by this part. Are the previous stripe id and the encryption keys really never relevant here?

##########
File path: ql/src/java/org/apache/hadoop/hive/ql/io/orc/LocalCache.java
##########
@@ -82,8 +82,7 @@ public void put(Path path, OrcTail tail) {
     if (bb.capacity() != bb.remaining()) {
       throw new RuntimeException("Bytebuffer allocated for path: " + path + " has remaining: " + bb.remaining() + " != capacity: " + bb.capacity());
     }
-    cache.put(path, new TailAndFileData(tail.getFileTail().getFileLength(),
-        tail.getFileModificationTime(), bb.duplicate()));
+    cache.put(path, new TailAndFileData(bb.limit(), tail.getFileModificationTime(), bb.duplicate()));

Review comment:
       getFileLength() is still available. Why this change?

##########
File path: ql/src/java/org/apache/hadoop/hive/ql/io/orc/VectorizedOrcAcidRowBatchReader.java
##########
@@ -2003,6 +2005,22 @@ private static IntegerColumnStatistics deserializeIntColumnStatistics(List<OrcPr
    * @param colStats The statistics array
    * @return The min record key
    */
+  private static OrcRawRecordMerger.KeyInterval getKeyInterval(ColumnStatistics[] colStats) {

Review comment:
       This and `getKeyInterval(List<OrcProto.ColumnStatistics> colStats)` are almost the same. Can you get rid of this duplication?
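   One hedged way to do it, assuming `ColumnStatisticsImpl.deserialize(TypeDescription, OrcProto.ColumnStatistics)` from the ORC API is usable here (the delegation direction is only a suggestion):

```java
private static OrcRawRecordMerger.KeyInterval getKeyInterval(
    List<OrcProto.ColumnStatistics> protoStats) {
  // Convert the protobuf statistics and delegate to the array-based overload,
  // so the min/max key logic lives in exactly one place. Passing null for the
  // schema is an assumption made for this sketch.
  ColumnStatistics[] colStats = new ColumnStatistics[protoStats.size()];
  for (int i = 0; i < protoStats.size(); i++) {
    colStats[i] = ColumnStatisticsImpl.deserialize(null, protoStats.get(i));
  }
  return getKeyInterval(colStats);
}
```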




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org