You are viewing a plain text version of this content. The canonical link for it is here.
Posted to gitbox@hive.apache.org by GitBox <gi...@apache.org> on 2021/01/03 22:45:18 UTC

[GitHub] [hive] pgaref opened a new pull request #1823: HIVE-23553: Upgrade ORC version to 1.6.6

pgaref opened a new pull request #1823:
URL: https://github.com/apache/hive/pull/1823


   ### What changes were proposed in this pull request?
   Bump apache ORC version to 1.6.6
   
   ### Why are the changes needed?
   So hive can take advantage of the latest features and bug fixes
   
   
   ### Does this PR introduce _any_ user-facing change?
   no
   
   
   ### How was this patch tested?
   Internal tests + q files


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] dongjoon-hyun commented on a change in pull request #1823: HIVE-23553: Upgrade ORC version to 1.6.X

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on a change in pull request #1823:
URL: https://github.com/apache/hive/pull/1823#discussion_r557630482



##########
File path: pom.xml
##########
@@ -179,7 +179,7 @@
     <mariadb.version>2.5.0</mariadb.version>
     <postgres.version>42.2.14</postgres.version>
     <opencsv.version>2.3</opencsv.version>
-    <orc.version>1.5.12</orc.version>
+    <orc.version>1.6.7-SNAPSHOT</orc.version>

Review comment:
       Thanks. That's invaluable. Please mention that when you send an email to `dev@orc` mailing list, too.
   - https://issues.apache.org/jira/browse/ORC-705
   - https://issues.apache.org/jira/browse/ORC-706
   - https://issues.apache.org/jira/browse/ORC-709
   - https://issues.apache.org/jira/browse/ORC-724




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] pgaref commented on a change in pull request #1823: HIVE-23553: Upgrade ORC version to 1.6.7

Posted by GitBox <gi...@apache.org>.
pgaref commented on a change in pull request #1823:
URL: https://github.com/apache/hive/pull/1823#discussion_r568174337



##########
File path: common/src/java/org/apache/hadoop/hive/conf/HiveConf.java
##########
@@ -4509,7 +4509,7 @@ private static void populateLlapDaemonVarsSet(Set<String> llapDaemonVarsSetLocal
         "Minimum allocation possible from LLAP buddy allocator. Allocations below that are\n" +
         "padded to minimum allocation. For ORC, should generally be the same as the expected\n" +
         "compression buffer size, or next lowest power of 2. Must be a power of 2."),
-    LLAP_ALLOCATOR_MAX_ALLOC("hive.llap.io.allocator.alloc.max", "16Mb", new SizeValidator(),
+    LLAP_ALLOCATOR_MAX_ALLOC("hive.llap.io.allocator.alloc.max", "4Mb", new SizeValidator(),

Review comment:
       LLAP_ALLOCATOR_MAX_ALLOC is used both for the LowLevelCacheImpl (buddyAllocator) and bufferSize on [WriterOptions](https://github.com/apache/hive/blob/da1aa077716a65c2a02d850828b16cdeece1f574/llap-server/src/java/org/apache/hadoop/hive/llap/io/encoded/SerDeEncodedDataReader.java#L1553)  Please check how this propagated from [SerDeEncodedDataReader](https://github.com/apache/hive/blob/da1aa077716a65c2a02d850828b16cdeece1f574/llap-server/src/java/org/apache/hadoop/hive/llap/io/encoded/SerDeEncodedDataReader.java#L248) 
   
   Llap is tightly coupled to ORC, thus it could make sense to use the same buffer size for serialized Buffers, and the ORC writer as we would not need to split/merge them  -- however I have nothing against splitting the conf or checking is the 8Mb limit is a hard one.
   All I am trying to say here is that this is orthogonal to the ORC version bump.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] mustafaiman commented on a change in pull request #1823: HIVE-23553: Upgrade ORC version to 1.6.7

Posted by GitBox <gi...@apache.org>.
mustafaiman commented on a change in pull request #1823:
URL: https://github.com/apache/hive/pull/1823#discussion_r563956041



##########
File path: common/src/java/org/apache/hadoop/hive/conf/HiveConf.java
##########
@@ -4509,7 +4509,7 @@ private static void populateLlapDaemonVarsSet(Set<String> llapDaemonVarsSetLocal
         "Minimum allocation possible from LLAP buddy allocator. Allocations below that are\n" +
         "padded to minimum allocation. For ORC, should generally be the same as the expected\n" +
         "compression buffer size, or next lowest power of 2. Must be a power of 2."),
-    LLAP_ALLOCATOR_MAX_ALLOC("hive.llap.io.allocator.alloc.max", "16Mb", new SizeValidator(),
+    LLAP_ALLOCATOR_MAX_ALLOC("hive.llap.io.allocator.alloc.max", "4Mb", new SizeValidator(),

Review comment:
       You say this is changed to be compatible with ORC setting. I do not understand why this is necessary and what its impact is. This looks like a change that is not to be taken lightly

##########
File path: llap-server/src/java/org/apache/hadoop/hive/llap/io/encoded/LlapRecordReaderUtils.java
##########
@@ -0,0 +1,438 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hadoop.hive.llap.io.encoded;
+
+import org.apache.hadoop.fs.FSDataInputStream;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.hive.common.io.DiskRangeList;
+import org.apache.hadoop.hive.ql.io.orc.encoded.LlapDataReader;
+import org.apache.orc.CompressionCodec;
+import org.apache.orc.CompressionKind;
+import org.apache.orc.OrcFile;
+import org.apache.orc.OrcProto;
+import org.apache.orc.StripeInformation;
+import org.apache.orc.TypeDescription;
+import org.apache.orc.impl.BufferChunk;
+import org.apache.orc.impl.DataReaderProperties;
+import org.apache.orc.impl.DirectDecompressionCodec;
+import org.apache.orc.impl.HadoopShims;
+import org.apache.orc.impl.HadoopShimsFactory;
+import org.apache.orc.impl.InStream;
+import org.apache.orc.impl.OrcCodecPool;
+import org.apache.orc.impl.OrcIndex;
+import org.apache.orc.impl.RecordReaderUtils;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import java.io.IOException;
+import java.nio.ByteBuffer;
+import java.util.List;
+import java.util.function.Supplier;
+
+public class LlapRecordReaderUtils {
+
+  private static final HadoopShims SHIMS = HadoopShimsFactory.get();
+  private static final Logger LOG = LoggerFactory.getLogger(LlapRecordReaderUtils.class);
+
+  static HadoopShims.ZeroCopyReaderShim createZeroCopyShim(FSDataInputStream file, CompressionCodec codec, RecordReaderUtils.ByteBufferAllocatorPool pool) throws IOException {
+    return codec != null && (!(codec instanceof DirectDecompressionCodec) || !((DirectDecompressionCodec)codec).isAvailable()) ? null : SHIMS.getZeroCopyReader(file, pool);

Review comment:
       Can you invert this condition. There are a lot of negatives making this hard to understand
   `codec == null || (codec instanceof DirectDecompressionCodec && ((DirectDecompressionCodec) codec).isAvailable()) ? SHIMS.getZeroCopyReader(file, pool) : null` is much more understandable

##########
File path: ql/src/java/org/apache/hadoop/hive/ql/io/orc/VectorizedOrcAcidRowBatchReader.java
##########
@@ -443,7 +444,8 @@ public void setBaseAndInnerReader(
       return new OrcRawRecordMerger.KeyInterval(null, null);
     }
 
-    OrcTail orcTail = getOrcTail(orcSplit.getPath(), conf, cacheTag, orcSplit.getFileKey()).orcTail;
+    VectorizedOrcAcidRowBatchReader.ReaderData orcReaderData =

Review comment:
       I am not sure about this. Previously we did not create the full reader. Why do we need to create the reader now? All calls from here use orcTail anyway except `List<StripeStatistics> stats = orcReaderData.reader.getVariantStripeStatistics(null);`. However, we can create this also from the info in tail, too.

##########
File path: llap-server/src/java/org/apache/hadoop/hive/llap/io/encoded/OrcEncodedDataReader.java
##########
@@ -631,10 +630,19 @@ private OrcFileMetadata getFileFooterFromCacheOrDisk() throws IOException {
           OrcTail orcTail = getOrcTailFromLlapBuffers(tailBuffers);
           counters.incrCounter(LlapIOCounters.METADATA_CACHE_HIT);
           FileTail tail = orcTail.getFileTail();
-          stats = orcTail.getStripeStatisticsProto();
+          CompressionKind compressionKind = orcTail.getCompressionKind();
+          InStream.StreamOptions options = null;
+          if (compressionKind != CompressionKind.NONE) {
+            options = InStream.options()
+                .withCodec(OrcCodecPool.getCodec(compressionKind)).withBufferSize(orcTail.getCompressionBufferSize());
+          }
+          InStream stream = InStream.create("stripe stats", orcTail.getTailBuffer(),

Review comment:
       Please extract "stripe stats" to a constant.
   I understand the correct way to get the stripe stats is to get it from OrcReader.getStripeStats. However, we want to avoid creating a reader when we have the metadata in cache. That is why we do this bit here as far as I understand. It would be better to extract this part to a helper method with some javadoc explaining why we do this.

##########
File path: llap-server/src/java/org/apache/hadoop/hive/llap/io/encoded/OrcEncodedDataReader.java
##########
@@ -631,10 +630,19 @@ private OrcFileMetadata getFileFooterFromCacheOrDisk() throws IOException {
           OrcTail orcTail = getOrcTailFromLlapBuffers(tailBuffers);
           counters.incrCounter(LlapIOCounters.METADATA_CACHE_HIT);
           FileTail tail = orcTail.getFileTail();
-          stats = orcTail.getStripeStatisticsProto();
+          CompressionKind compressionKind = orcTail.getCompressionKind();
+          InStream.StreamOptions options = null;
+          if (compressionKind != CompressionKind.NONE) {
+            options = InStream.options()
+                .withCodec(OrcCodecPool.getCodec(compressionKind)).withBufferSize(orcTail.getCompressionBufferSize());
+          }
+          InStream stream = InStream.create("stripe stats", orcTail.getTailBuffer(),
+              orcTail.getMetadataOffset(), orcTail.getMetadataSize(), options);
+          stats = OrcProto.Metadata.parseFrom(InStream.createCodedInputStream(stream)).getStripeStatsList();
           stripes = new ArrayList<>(tail.getFooter().getStripesCount());
+          int stripeIdx = 0;
           for (OrcProto.StripeInformation stripeProto : tail.getFooter().getStripesList()) {
-            stripes.add(new ReaderImpl.StripeInformationImpl(stripeProto));
+            stripes.add(new ReaderImpl.StripeInformationImpl(stripeProto, stripeIdx++, -1, null));

Review comment:
       I am confused with this part. previouse stripe id and encryption keys are never relevant?

##########
File path: ql/src/java/org/apache/hadoop/hive/ql/io/orc/LocalCache.java
##########
@@ -82,8 +82,7 @@ public void put(Path path, OrcTail tail) {
     if (bb.capacity() != bb.remaining()) {
       throw new RuntimeException("Bytebuffer allocated for path: " + path + " has remaining: " + bb.remaining() + " != capacity: " + bb.capacity());
     }
-    cache.put(path, new TailAndFileData(tail.getFileTail().getFileLength(),
-        tail.getFileModificationTime(), bb.duplicate()));
+    cache.put(path, new TailAndFileData(bb.limit(), tail.getFileModificationTime(), bb.duplicate()));

Review comment:
       getFileLength() is still available. Why this change?

##########
File path: ql/src/java/org/apache/hadoop/hive/ql/io/orc/VectorizedOrcAcidRowBatchReader.java
##########
@@ -2003,6 +2005,22 @@ private static IntegerColumnStatistics deserializeIntColumnStatistics(List<OrcPr
    * @param colStats The statistics array
    * @return The min record key
    */
+  private static OrcRawRecordMerger.KeyInterval getKeyInterval(ColumnStatistics[] colStats) {

Review comment:
       This and `getKeyInterval(List<OrcProto.ColumnStatistics> colStats)` are almost the same. Can you get rid of this duplication?

##########
File path: ql/src/java/org/apache/hadoop/hive/ql/io/orc/encoded/EncodedTreeReaderFactory.java
##########
@@ -2585,6 +2590,7 @@ private static TreeReader getPrimitiveTreeReader(final int columnIndex,
             .setColumnEncoding(columnEncoding)
             .setVectors(vectors)
             .setContext(context)
+            .setIsInstant(columnType.getCategory()  == TypeDescription.Category.TIMESTAMP_INSTANT)

Review comment:
       Isn't this always `false`? Don't we need another case for TIMESTAMP_INSTANT?

##########
File path: ql/src/java/org/apache/hadoop/hive/ql/io/orc/encoded/LlapDataReader.java
##########
@@ -0,0 +1,93 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hadoop.hive.ql.io.orc.encoded;
+
+import org.apache.hadoop.hive.common.io.DiskRangeList;
+import org.apache.orc.CompressionCodec;
+import org.apache.orc.OrcFile;
+import org.apache.orc.OrcProto;
+import org.apache.orc.StripeInformation;
+import org.apache.orc.TypeDescription;
+import org.apache.orc.impl.OrcIndex;
+
+import java.io.IOException;
+import java.nio.ByteBuffer;
+
+/** An abstract data reader that IO formats can use to read bytes from underlying storage. */
+public interface LlapDataReader extends AutoCloseable, Cloneable {
+
+  /** Opens the DataReader, making it ready to use. */
+  void open() throws IOException;
+
+  OrcIndex readRowIndex(StripeInformation stripe,
+      TypeDescription fileSchema,
+      OrcProto.StripeFooter footer,
+      boolean ignoreNonUtf8BloomFilter,
+      boolean[] included,
+      OrcProto.RowIndex[] indexes,
+      boolean[] sargColumns,
+      OrcFile.WriterVersion version,
+      OrcProto.Stream.Kind[] bloomFilterKinds,
+      OrcProto.BloomFilterIndex[] bloomFilterIndices
+  ) throws IOException;
+
+  OrcProto.StripeFooter readStripeFooter(StripeInformation stripe) throws IOException;
+
+  /** Reads the data.
+   *
+   * Note that for the cases such as zero-copy read, caller must release the disk ranges
+   * produced after being done with them. Call isTrackingDiskRanges to find out if this is needed.
+   * @param range List if disk ranges to read. Ranges with data will be ignored.
+   * @param baseOffset Base offset from the start of the file of the ranges in disk range list.
+   * @param doForceDirect Whether the data should be read into direct buffers.
+   * @return New or modified list of DiskRange-s, where all the ranges are filled with data.
+   */
+  DiskRangeList readFileData(
+      DiskRangeList range, long baseOffset, boolean doForceDirect) throws IOException;
+
+
+  /**
+   * Whether the user should release buffers created by readFileData. See readFileData javadoc.
+   */
+  boolean isTrackingDiskRanges();
+
+  /**
+   * Releases buffers created by readFileData. See readFileData javadoc.
+   * @param toRelease The buffer to release.
+   */
+  void releaseBuffer(ByteBuffer toRelease);
+
+  /**
+   * Clone the entire state of the DataReader with the assumption that the
+   * clone will be closed at a different time. Thus, any file handles in the
+   * implementation need to be cloned.
+   * @return a new instance
+   */
+  LlapDataReader clone();
+
+  @Override
+  void close() throws IOException;
+
+  /**
+   * Returns the compression codec used by this datareader.
+   * We should consider removing this from the interface.
+   * @return the compression codec
+   */
+  CompressionCodec getCompressionCodec();

Review comment:
       This interface looks like a copy of ORC's DataReader except this method. ORC's DataReader returns as StreamOptions instead of CompressionCodec. As far as i understand, StreamOptions includes the compression codec and more info. Morevover, I see other parts of the code already make use of StreamOptions so it should be possible to integrate this part of the code too. I do not understand why LlapDataReader interface is necessary.

##########
File path: ql/src/java/org/apache/hadoop/hive/ql/io/sarg/ConvertAstToSearchArg.java
##########
@@ -68,6 +70,10 @@
 
   private static final int KRYO_OUTPUT_BUFFER_SIZE = 4 * 1024;
   private static final int KRYO_OUTPUT_BUFFER_MAX_SIZE = 10 * 1024 * 1024;
+  private static final GregorianCalendar PROLEPTIC = new GregorianCalendar();

Review comment:
       This is not used anywhere?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] pgaref commented on a change in pull request #1823: HIVE-23553: Upgrade ORC version to 1.6.7

Posted by GitBox <gi...@apache.org>.
pgaref commented on a change in pull request #1823:
URL: https://github.com/apache/hive/pull/1823#discussion_r570105800



##########
File path: ql/src/java/org/apache/hadoop/hive/ql/io/orc/encoded/EncodedTreeReaderFactory.java
##########
@@ -2585,6 +2590,7 @@ private static TreeReader getPrimitiveTreeReader(final int columnIndex,
             .setColumnEncoding(columnEncoding)
             .setVectors(vectors)
             .setContext(context)
+            .setIsInstant(columnType.getCategory()  == TypeDescription.Category.TIMESTAMP_INSTANT)

Review comment:
       Sure thing! This is now tracked as HIVE-24735 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] mustafaiman commented on a change in pull request #1823: HIVE-23553: Upgrade ORC version to 1.6.7

Posted by GitBox <gi...@apache.org>.
mustafaiman commented on a change in pull request #1823:
URL: https://github.com/apache/hive/pull/1823#discussion_r568157322



##########
File path: common/src/java/org/apache/hadoop/hive/conf/HiveConf.java
##########
@@ -4509,7 +4509,7 @@ private static void populateLlapDaemonVarsSet(Set<String> llapDaemonVarsSetLocal
         "Minimum allocation possible from LLAP buddy allocator. Allocations below that are\n" +
         "padded to minimum allocation. For ORC, should generally be the same as the expected\n" +
         "compression buffer size, or next lowest power of 2. Must be a power of 2."),
-    LLAP_ALLOCATOR_MAX_ALLOC("hive.llap.io.allocator.alloc.max", "16Mb", new SizeValidator(),
+    LLAP_ALLOCATOR_MAX_ALLOC("hive.llap.io.allocator.alloc.max", "4Mb", new SizeValidator(),

Review comment:
       I still do not understand why we need to change LLAP Allocator's maximum allocation size. Does LLAP allocator serve only ORC writers? I think it is used for other buffer needs too.
   
   Hive depends on ORC. So I dont understand how ORC uses LLAP_ALLOCATOR_MAX_ALLOC for anything. We pass orc writers the appropriate configs. If ORC writers need smaller buffer, we can configure that for those writers via WriterOptions. There is no need to change llap allocator's settings for that.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] dongjoon-hyun commented on pull request #1823: HIVE-23553: Upgrade ORC version to 1.6.X

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on pull request #1823:
URL: https://github.com/apache/hive/pull/1823#issuecomment-760332420


   cc @sunchao 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] kgyrtkirk commented on pull request #1823: HIVE-23553: Upgrade ORC version to 1.6.X

Posted by GitBox <gi...@apache.org>.
kgyrtkirk commented on pull request #1823:
URL: https://github.com/apache/hive/pull/1823#issuecomment-756759959


   the thing is that before I added that repo I pressed the test button which reported the error - because I think it also does the same for blackout detection - it would have probably not worked...
   
   I've added the repsy to the artifactory and added to the virtual repo precommit uses


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] pgaref edited a comment on pull request #1823: HIVE-23553: Upgrade ORC version to 1.6.7

Posted by GitBox <gi...@apache.org>.
pgaref edited a comment on pull request #1823:
URL: https://github.com/apache/hive/pull/1823#issuecomment-769755146


   > I only partially reviewed this. Will continue reviewing.
   > One question: I see we do not care about column encryption related arguments in multiple places. Is it because column encryption is not supported?
   
   Hey @mustafaiman good question with a complicated answer -- while creating this I also did some digging to find out whats supported and what not. To sum up my findings: 
   
   -  It looks like we are currently able to encrypt entire tables and/or data on hdfs using kms: HIVE-8065
   -  Support for column level encryption/decryption (passing some encryption setting to the Table props and let Hive take care of the rest) started more than a while ago as part of HIVE-6329 
   -  There was a community discussion as part of HIVE-21848 to unify encryption table properties (at least for ORC and Parquet) that concluded in the accepted options
   - However, these properties are still not propagated to the tables: HIVE-21849
   
   I believe part of the reason is that Hive already integrates with Apache Ranger that can restrict user access to particular columns and also adds data-masking on top.
   However, I am more than happy discussing the revival of column encryption at some point.
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] pgaref commented on a change in pull request #1823: HIVE-23553: Upgrade ORC version to 1.6.7

Posted by GitBox <gi...@apache.org>.
pgaref commented on a change in pull request #1823:
URL: https://github.com/apache/hive/pull/1823#discussion_r566919513



##########
File path: ql/src/test/results/clientpositive/llap/orc_file_dump.q.out
##########
@@ -249,15 +249,15 @@ Stripes:
       Entry 1: numHashFunctions: 4 bitCount: 6272 popCount: 182 loadFactor: 0.029 expectedFpp: 7.090246E-7
       Stripe level merge: numHashFunctions: 4 bitCount: 6272 popCount: 1772 loadFactor: 0.2825 expectedFpp: 0.0063713384
     Row group indices for column 9:
-      Entry 0: count: 1000 hasNull: false min: 2013-03-01 09:11:58.703 max: 2013-03-01 09:11:58.703 positions: 0,0,0,0,0,0
-      Entry 1: count: 49 hasNull: false min: 2013-03-01 09:11:58.703 max: 2013-03-01 09:11:58.703 positions: 0,7,488,0,1538,488
+      Entry 0: count: 1000 hasNull: false min: 2013-03-01 09:11:58.70307 max: 2013-03-01 09:11:58.703325 positions: 0,0,0,0,0,0
+      Entry 1: count: 49 hasNull: false min: 2013-03-01 09:11:58.703076 max: 2013-03-01 09:11:58.703325 positions: 0,7,488,0,1538,488
     Bloom filters for column 9:
       Entry 0: numHashFunctions: 4 bitCount: 6272 popCount: 4 loadFactor: 0.0006 expectedFpp: 1.6543056E-13
       Entry 1: numHashFunctions: 4 bitCount: 6272 popCount: 4 loadFactor: 0.0006 expectedFpp: 1.6543056E-13
       Stripe level merge: numHashFunctions: 4 bitCount: 6272 popCount: 4 loadFactor: 0.0006 expectedFpp: 1.6543056E-13
     Row group indices for column 10:
-      Entry 0: count: 1000 hasNull: false min: 8 max: 9994 sum: 5118211 positions: 0,0,0,0,0
-      Entry 1: count: 49 hasNull: false min: 248 max: 9490 sum: 246405 positions: 0,2194,0,4,488
+      Entry 0: count: 1000 hasNull: false min: 0.08 max: 99.94 sum: 51182.11 positions: 0,0,0,0,0

Review comment:
       yeah, this is related to the TS conversion issue described above




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] dongjoon-hyun commented on pull request #1823: HIVE-23553: Upgrade ORC version to 1.6.X

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on pull request #1823:
URL: https://github.com/apache/hive/pull/1823#issuecomment-765763006


   Apache ORC 1.6.7 is officially release. Could you update the PR to use the official one, @pgaref ?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] pgaref commented on pull request #1823: HIVE-23553: Upgrade ORC version to 1.6.7

Posted by GitBox <gi...@apache.org>.
pgaref commented on pull request #1823:
URL: https://github.com/apache/hive/pull/1823#issuecomment-772842805


   Gentle ping @mustafaiman @jcamachor  -- any further comments here?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] pgaref commented on a change in pull request #1823: HIVE-23553: Upgrade ORC version to 1.6.7

Posted by GitBox <gi...@apache.org>.
pgaref commented on a change in pull request #1823:
URL: https://github.com/apache/hive/pull/1823#discussion_r566941929



##########
File path: ql/src/test/org/apache/hadoop/hive/ql/io/orc/TestOrcFile.java
##########
@@ -325,7 +326,7 @@ public void testReadFormat_0_11() throws Exception {
         + "binary,string1:string,middle:struct<list:array<struct<int1:int,"
         + "string1:string>>>,list:array<struct<int1:int,string1:string>>,"
         + "map:map<string,struct<int1:int,string1:string>>,ts:timestamp,"
-        + "decimal1:decimal(38,18)>", readerInspector.getTypeName());
+        + "decimal1:decimal(38,10)>", readerInspector.getTypeName());

Review comment:
       Seems that that type scale was not properly propaged before and was using HiveDecimal.SYSTEM_DEFAULT_SCALE which is 18: https://github.com/apache/hive/blob/ff6f3565e50148b7bcfbcf19b970379f2bd59290/ql/src/java/org/apache/hadoop/hive/ql/io/orc/OrcStruct.java#L607
   
   This is actually in-line with ORC tests for the same file:
   https://github.com/apache/orc/blob/b54d10cedf5ec1529cf06d77268510c216402cba/java/core/src/test/org/apache/orc/impl/TestRecordReaderImpl.java#L141




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] pgaref commented on pull request #1823: HIVE-23553: Upgrade ORC version to 1.6.6

Posted by GitBox <gi...@apache.org>.
pgaref commented on pull request #1823:
URL: https://github.com/apache/hive/pull/1823#issuecomment-755961113


   Gentle ping @kgyrtkirk 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] pgaref commented on a change in pull request #1823: HIVE-23553: Upgrade ORC version to 1.6.7

Posted by GitBox <gi...@apache.org>.
pgaref commented on a change in pull request #1823:
URL: https://github.com/apache/hive/pull/1823#discussion_r568166455



##########
File path: llap-server/src/java/org/apache/hadoop/hive/llap/io/encoded/LlapRecordReaderUtils.java
##########
@@ -0,0 +1,440 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hadoop.hive.llap.io.encoded;
+
+import org.apache.hadoop.fs.FSDataInputStream;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.hive.common.io.DiskRangeList;
+import org.apache.hadoop.hive.ql.io.orc.encoded.LlapDataReader;
+import org.apache.orc.CompressionCodec;
+import org.apache.orc.CompressionKind;
+import org.apache.orc.OrcFile;
+import org.apache.orc.OrcProto;
+import org.apache.orc.StripeInformation;
+import org.apache.orc.TypeDescription;
+import org.apache.orc.impl.BufferChunk;
+import org.apache.orc.impl.DataReaderProperties;
+import org.apache.orc.impl.DirectDecompressionCodec;
+import org.apache.orc.impl.HadoopShims;
+import org.apache.orc.impl.HadoopShimsFactory;
+import org.apache.orc.impl.InStream;
+import org.apache.orc.impl.OrcCodecPool;
+import org.apache.orc.impl.OrcIndex;
+import org.apache.orc.impl.RecordReaderUtils;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import java.io.IOException;
+import java.nio.ByteBuffer;
+import java.util.List;
+import java.util.function.Supplier;
+
+public class LlapRecordReaderUtils {
+
+  private static final HadoopShims SHIMS = HadoopShimsFactory.get();
+  private static final Logger LOG = LoggerFactory.getLogger(LlapRecordReaderUtils.class);
+
+  static HadoopShims.ZeroCopyReaderShim createZeroCopyShim(FSDataInputStream file, CompressionCodec codec,
+      RecordReaderUtils.ByteBufferAllocatorPool pool) throws IOException {
+    return codec == null || (codec instanceof DirectDecompressionCodec && ((DirectDecompressionCodec) codec)
+        .isAvailable()) ? null : SHIMS.getZeroCopyReader(file, pool);

Review comment:
       Good catch, FIXed thanks!




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] mustafaiman commented on a change in pull request #1823: HIVE-23553: Upgrade ORC version to 1.6.7

Posted by GitBox <gi...@apache.org>.
mustafaiman commented on a change in pull request #1823:
URL: https://github.com/apache/hive/pull/1823#discussion_r568200811



##########
File path: common/src/java/org/apache/hadoop/hive/conf/HiveConf.java
##########
@@ -4509,7 +4509,7 @@ private static void populateLlapDaemonVarsSet(Set<String> llapDaemonVarsSetLocal
         "Minimum allocation possible from LLAP buddy allocator. Allocations below that are\n" +
         "padded to minimum allocation. For ORC, should generally be the same as the expected\n" +
         "compression buffer size, or next lowest power of 2. Must be a power of 2."),
-    LLAP_ALLOCATOR_MAX_ALLOC("hive.llap.io.allocator.alloc.max", "16Mb", new SizeValidator(),
+    LLAP_ALLOCATOR_MAX_ALLOC("hive.llap.io.allocator.alloc.max", "4Mb", new SizeValidator(),

Review comment:
       I see ORC strictly enforces this now. I would set the appropriate setting at Hive-ORC boundary and leave the LLAP_ALLOCATOR_MAX_ALLOC as it is (Math.min(llap.allocator.max, what ORC enforces). If you think we should set LLAP_ALLOCATOR_MAX_ALLOC to be the same as what ORC enforces, that can be done in a seperate ticket. Like you said this is orthogonal to ORC version bump, therefore should be discussed in its own ticket.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] dongjoon-hyun commented on a change in pull request #1823: HIVE-23553: Upgrade ORC version to 1.6.X

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on a change in pull request #1823:
URL: https://github.com/apache/hive/pull/1823#discussion_r557553348



##########
File path: pom.xml
##########
@@ -179,7 +179,7 @@
     <mariadb.version>2.5.0</mariadb.version>
     <postgres.version>42.2.14</postgres.version>
     <opencsv.version>2.3</opencsv.version>
-    <orc.version>1.5.12</orc.version>
+    <orc.version>1.6.7-SNAPSHOT</orc.version>

Review comment:
       So, we cannot use `1.6.6`? What ORC issue was the blocker for this?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] pgaref commented on a change in pull request #1823: HIVE-23553: Upgrade ORC version to 1.6.X

Posted by GitBox <gi...@apache.org>.
pgaref commented on a change in pull request #1823:
URL: https://github.com/apache/hive/pull/1823#discussion_r557574540



##########
File path: pom.xml
##########
@@ -179,7 +179,7 @@
     <mariadb.version>2.5.0</mariadb.version>
     <postgres.version>42.2.14</postgres.version>
     <opencsv.version>2.3</opencsv.version>
-    <orc.version>1.5.12</orc.version>
+    <orc.version>1.6.7-SNAPSHOT</orc.version>

Review comment:
       Hey @dongjoon-hyun -- yeah there are a few issues that are recently resolved but not released as part of 1.6.6, such as: ORC-705 , ORC-706 , ORC-709 and ORC-724




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] pgaref commented on pull request #1823: HIVE-23553: Upgrade ORC version to 1.6.7

Posted by GitBox <gi...@apache.org>.
pgaref commented on pull request #1823:
URL: https://github.com/apache/hive/pull/1823#issuecomment-771227835


   > All q.out files show data size increase for tables. Since most of them are consistently additional 4 bytes per row, that seems like not a bug. However, I found some irregular increases too like 16 bytes per row. Can you explain why data size increased so we can check the irregularities and make sure they are expected?
   
   Hey @mustafaiman -- the main size differences are on Timestamp columns where we now support nanosecond precision (using 2 extra variables for the lower and the upper precision as part of the stats -- see [ORC-611](https://issues.apache.org/jira/browse/ORC-611)).
   
   Other than that there are other changes that can also affect size, such as: Trimming StringStatistics minimum and maximum values as part of ORC-203  or List and Map column statistics that was recently added as part of ORC-398.
   
   Happy to check further if you have doubts about a particular query.
   
   
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] pgaref commented on a change in pull request #1823: HIVE-23553: Upgrade ORC version to 1.6.7

Posted by GitBox <gi...@apache.org>.
pgaref commented on a change in pull request #1823:
URL: https://github.com/apache/hive/pull/1823#discussion_r567807669



##########
File path: common/src/java/org/apache/hadoop/hive/conf/HiveConf.java
##########
@@ -4509,7 +4509,7 @@ private static void populateLlapDaemonVarsSet(Set<String> llapDaemonVarsSetLocal
         "Minimum allocation possible from LLAP buddy allocator. Allocations below that are\n" +
         "padded to minimum allocation. For ORC, should generally be the same as the expected\n" +
         "compression buffer size, or next lowest power of 2. Must be a power of 2."),
-    LLAP_ALLOCATOR_MAX_ALLOC("hive.llap.io.allocator.alloc.max", "16Mb", new SizeValidator(),
+    LLAP_ALLOCATOR_MAX_ALLOC("hive.llap.io.allocator.alloc.max", "4Mb", new SizeValidator(),

Review comment:
       The issue here is that LLAP_ALLOCATOR_MAX_ALLOC is also used as the ORC Writer buffer size (thus the change).
   
   Initial buffer size check was introduced in [ORC-238](https://github.com/apache/orc/pull/171/files) even though it was only applied when buffer size was enforced from table properties. Later, on ORC-1.6 this was enforced for the [Writer buffer size in general](https://github.com/apache/orc/blob/0128f817b0ab28fa2d0660737234ac966f0f5c50/java/core/src/java/org/apache/orc/impl/WriterImpl.java#L171).
   
   The max bufferSize argument can be up to 2^(3*8 - 1) -- meaning less than 8Mb and since we enforce the size to be power of 2 the next available is 4Mb.
   
   The main question here is if there is a reason for the upper limit to be < 8 Mb (cc @prasanthj that might know more here) -- or if we should decouple the two configuration (LLAP alloc and ORC Writer buffer size).
   
   I believe the best thing to do for now is open a new Ticket to track this (as this will either require more work on LLAP, or a new release on ORC) -- and I do not expect this to cause any major issues until then. @mustafaiman what do you think?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] jcamachor merged pull request #1823: HIVE-23553: Upgrade ORC version to 1.6.7

Posted by GitBox <gi...@apache.org>.
jcamachor merged pull request #1823:
URL: https://github.com/apache/hive/pull/1823


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] pgaref commented on a change in pull request #1823: HIVE-23553: Upgrade ORC version to 1.6.7

Posted by GitBox <gi...@apache.org>.
pgaref commented on a change in pull request #1823:
URL: https://github.com/apache/hive/pull/1823#discussion_r568204044



##########
File path: ql/src/java/org/apache/hadoop/hive/ql/io/orc/encoded/EncodedTreeReaderFactory.java
##########
@@ -2585,6 +2590,7 @@ private static TreeReader getPrimitiveTreeReader(final int columnIndex,
             .setColumnEncoding(columnEncoding)
             .setVectors(vectors)
             .setContext(context)
+            .setIsInstant(columnType.getCategory()  == TypeDescription.Category.TIMESTAMP_INSTANT)

Review comment:
       Even though TimeStamp with local timezone was added as part of [ORC-189](https://issues.apache.org/jira/browse/ORC-189) 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] pgaref commented on a change in pull request #1823: HIVE-23553: Upgrade ORC version to 1.6.7

Posted by GitBox <gi...@apache.org>.
pgaref commented on a change in pull request #1823:
URL: https://github.com/apache/hive/pull/1823#discussion_r566867490



##########
File path: ql/src/java/org/apache/hadoop/hive/ql/io/orc/VectorizedOrcAcidRowBatchReader.java
##########
@@ -443,7 +444,8 @@ public void setBaseAndInnerReader(
       return new OrcRawRecordMerger.KeyInterval(null, null);
     }
 
-    OrcTail orcTail = getOrcTail(orcSplit.getPath(), conf, cacheTag, orcSplit.getFileKey()).orcTail;
+    VectorizedOrcAcidRowBatchReader.ReaderData orcReaderData =

Review comment:
       This is one of the breaking ORC changes introduced by encryption support.
   As Tail and thus StripeStatistics may be encrypted, we always need a reader instance to retrieve them.
   
   OrcTail maintained the API call for backwards compatibility but it still expects a reader to actually retrieve the stats:
   https://github.com/apache/orc/blob/d78cc39a9299b62bc8a5d8f5c3fac9201e03cb8b/java/core/src/java/org/apache/orc/impl/OrcTail.java#L210




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] pgaref commented on pull request #1823: HIVE-23553: Upgrade ORC version to 1.6.6

Posted by GitBox <gi...@apache.org>.
pgaref commented on pull request #1823:
URL: https://github.com/apache/hive/pull/1823#issuecomment-754556917


   > note: I tend to use distinct version numbers because snapshots might get cached and not updated - but that was an issue with the old ptest infra; I guess the current setup will handle that better...
   > 
   > anything will do which can serve a web page - I wanted to add https://raw.githubusercontent.com/pgaref/mave-repo/main/ it for you - however that page returns error 400 for everything...
   
   Hey Zoltan -- I noticed the 400 myself (for listing and non-existing files) but maven pulling seems to work, for example, check https://raw.githubusercontent.com/pgaref/mave-repo/main/org/apache/orc/orc-core/maven-metadata-local.xml


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] pgaref commented on a change in pull request #1823: HIVE-23553: Upgrade ORC version to 1.6.7

Posted by GitBox <gi...@apache.org>.
pgaref commented on a change in pull request #1823:
URL: https://github.com/apache/hive/pull/1823#discussion_r568567405



##########
File path: common/src/java/org/apache/hadoop/hive/conf/HiveConf.java
##########
@@ -4509,7 +4509,7 @@ private static void populateLlapDaemonVarsSet(Set<String> llapDaemonVarsSetLocal
         "Minimum allocation possible from LLAP buddy allocator. Allocations below that are\n" +
         "padded to minimum allocation. For ORC, should generally be the same as the expected\n" +
         "compression buffer size, or next lowest power of 2. Must be a power of 2."),
-    LLAP_ALLOCATOR_MAX_ALLOC("hive.llap.io.allocator.alloc.max", "16Mb", new SizeValidator(),
+    LLAP_ALLOCATOR_MAX_ALLOC("hive.llap.io.allocator.alloc.max", "4Mb", new SizeValidator(),

Review comment:
       Sure, addressed this as part of 2eca3b9de1f2332beabc3bde9ac0f89d62ec1527  -- also opened HIVE-24721 to investigate this further

##########
File path: common/src/java/org/apache/hadoop/hive/conf/HiveConf.java
##########
@@ -4509,7 +4509,7 @@ private static void populateLlapDaemonVarsSet(Set<String> llapDaemonVarsSetLocal
         "Minimum allocation possible from LLAP buddy allocator. Allocations below that are\n" +
         "padded to minimum allocation. For ORC, should generally be the same as the expected\n" +
         "compression buffer size, or next lowest power of 2. Must be a power of 2."),
-    LLAP_ALLOCATOR_MAX_ALLOC("hive.llap.io.allocator.alloc.max", "16Mb", new SizeValidator(),
+    LLAP_ALLOCATOR_MAX_ALLOC("hive.llap.io.allocator.alloc.max", "4Mb", new SizeValidator(),

Review comment:
       Sure, addressed this as part of 2eca3b9de1f2332beabc3bde9ac0f89d62ec1527  -- also opened HIVE-24721 to investigate further




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] jcamachor commented on pull request #1823: HIVE-23553: Upgrade ORC version to 1.6.7

Posted by GitBox <gi...@apache.org>.
jcamachor commented on pull request #1823:
URL: https://github.com/apache/hive/pull/1823#issuecomment-772850624


   Thanks for addressing the comments @pgaref . I am fine from my side, +1.
   
   I'd like to hear from @mustafaiman , if it's fine from his side, we can merge it.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] pgaref commented on pull request #1823: HIVE-23553: Upgrade ORC version to 1.6.6

Posted by GitBox <gi...@apache.org>.
pgaref commented on pull request #1823:
URL: https://github.com/apache/hive/pull/1823#issuecomment-753694143


   Seems like the current Config File provider for Jenkins does not allow snapshots for testing Hive (needed for the above changes until we do an ORC release).
   @kgyrtkirk any idea if we could add apache snapshots in the existing Jenkins maven repo configuration? Looks like there is restricted access to that Config file.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] mustafaiman commented on a change in pull request #1823: HIVE-23553: Upgrade ORC version to 1.6.7

Posted by GitBox <gi...@apache.org>.
mustafaiman commented on a change in pull request #1823:
URL: https://github.com/apache/hive/pull/1823#discussion_r563956041



##########
File path: common/src/java/org/apache/hadoop/hive/conf/HiveConf.java
##########
@@ -4509,7 +4509,7 @@ private static void populateLlapDaemonVarsSet(Set<String> llapDaemonVarsSetLocal
         "Minimum allocation possible from LLAP buddy allocator. Allocations below that are\n" +
         "padded to minimum allocation. For ORC, should generally be the same as the expected\n" +
         "compression buffer size, or next lowest power of 2. Must be a power of 2."),
-    LLAP_ALLOCATOR_MAX_ALLOC("hive.llap.io.allocator.alloc.max", "16Mb", new SizeValidator(),
+    LLAP_ALLOCATOR_MAX_ALLOC("hive.llap.io.allocator.alloc.max", "4Mb", new SizeValidator(),

Review comment:
       You say this is changed to be compatible with ORC setting. I do not understand why this is necessary and what its impact is. This looks like a change that is not to be taken lightly

##########
File path: llap-server/src/java/org/apache/hadoop/hive/llap/io/encoded/LlapRecordReaderUtils.java
##########
@@ -0,0 +1,438 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hadoop.hive.llap.io.encoded;
+
+import org.apache.hadoop.fs.FSDataInputStream;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.hive.common.io.DiskRangeList;
+import org.apache.hadoop.hive.ql.io.orc.encoded.LlapDataReader;
+import org.apache.orc.CompressionCodec;
+import org.apache.orc.CompressionKind;
+import org.apache.orc.OrcFile;
+import org.apache.orc.OrcProto;
+import org.apache.orc.StripeInformation;
+import org.apache.orc.TypeDescription;
+import org.apache.orc.impl.BufferChunk;
+import org.apache.orc.impl.DataReaderProperties;
+import org.apache.orc.impl.DirectDecompressionCodec;
+import org.apache.orc.impl.HadoopShims;
+import org.apache.orc.impl.HadoopShimsFactory;
+import org.apache.orc.impl.InStream;
+import org.apache.orc.impl.OrcCodecPool;
+import org.apache.orc.impl.OrcIndex;
+import org.apache.orc.impl.RecordReaderUtils;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import java.io.IOException;
+import java.nio.ByteBuffer;
+import java.util.List;
+import java.util.function.Supplier;
+
+public class LlapRecordReaderUtils {
+
+  private static final HadoopShims SHIMS = HadoopShimsFactory.get();
+  private static final Logger LOG = LoggerFactory.getLogger(LlapRecordReaderUtils.class);
+
+  static HadoopShims.ZeroCopyReaderShim createZeroCopyShim(FSDataInputStream file, CompressionCodec codec, RecordReaderUtils.ByteBufferAllocatorPool pool) throws IOException {
+    return codec != null && (!(codec instanceof DirectDecompressionCodec) || !((DirectDecompressionCodec)codec).isAvailable()) ? null : SHIMS.getZeroCopyReader(file, pool);

Review comment:
       Can you invert this condition. There are a lot of negatives making this hard to understand
   `codec == null || (codec instanceof DirectDecompressionCodec && ((DirectDecompressionCodec) codec).isAvailable()) ? SHIMS.getZeroCopyReader(file, pool) : null` is much more understandable

##########
File path: ql/src/java/org/apache/hadoop/hive/ql/io/orc/VectorizedOrcAcidRowBatchReader.java
##########
@@ -443,7 +444,8 @@ public void setBaseAndInnerReader(
       return new OrcRawRecordMerger.KeyInterval(null, null);
     }
 
-    OrcTail orcTail = getOrcTail(orcSplit.getPath(), conf, cacheTag, orcSplit.getFileKey()).orcTail;
+    VectorizedOrcAcidRowBatchReader.ReaderData orcReaderData =

Review comment:
       I am not sure about this. Previously we did not create the full reader. Why do we need to create the reader now? All calls from here use orcTail anyway except `List<StripeStatistics> stats = orcReaderData.reader.getVariantStripeStatistics(null);`. However, we can create this also from the info in tail, too.

##########
File path: llap-server/src/java/org/apache/hadoop/hive/llap/io/encoded/OrcEncodedDataReader.java
##########
@@ -631,10 +630,19 @@ private OrcFileMetadata getFileFooterFromCacheOrDisk() throws IOException {
           OrcTail orcTail = getOrcTailFromLlapBuffers(tailBuffers);
           counters.incrCounter(LlapIOCounters.METADATA_CACHE_HIT);
           FileTail tail = orcTail.getFileTail();
-          stats = orcTail.getStripeStatisticsProto();
+          CompressionKind compressionKind = orcTail.getCompressionKind();
+          InStream.StreamOptions options = null;
+          if (compressionKind != CompressionKind.NONE) {
+            options = InStream.options()
+                .withCodec(OrcCodecPool.getCodec(compressionKind)).withBufferSize(orcTail.getCompressionBufferSize());
+          }
+          InStream stream = InStream.create("stripe stats", orcTail.getTailBuffer(),

Review comment:
       Please extract "stripe stats" to a constant.
   I understand the correct way to get the stripe stats is to get it from OrcReader.getStripeStats. However, we want to avoid creating a reader when we have the metadata in cache. That is why we do this bit here as far as I understand. It would be better to extract this part to a helper method with some javadoc explaining why we do this.

##########
File path: llap-server/src/java/org/apache/hadoop/hive/llap/io/encoded/OrcEncodedDataReader.java
##########
@@ -631,10 +630,19 @@ private OrcFileMetadata getFileFooterFromCacheOrDisk() throws IOException {
           OrcTail orcTail = getOrcTailFromLlapBuffers(tailBuffers);
           counters.incrCounter(LlapIOCounters.METADATA_CACHE_HIT);
           FileTail tail = orcTail.getFileTail();
-          stats = orcTail.getStripeStatisticsProto();
+          CompressionKind compressionKind = orcTail.getCompressionKind();
+          InStream.StreamOptions options = null;
+          if (compressionKind != CompressionKind.NONE) {
+            options = InStream.options()
+                .withCodec(OrcCodecPool.getCodec(compressionKind)).withBufferSize(orcTail.getCompressionBufferSize());
+          }
+          InStream stream = InStream.create("stripe stats", orcTail.getTailBuffer(),
+              orcTail.getMetadataOffset(), orcTail.getMetadataSize(), options);
+          stats = OrcProto.Metadata.parseFrom(InStream.createCodedInputStream(stream)).getStripeStatsList();
           stripes = new ArrayList<>(tail.getFooter().getStripesCount());
+          int stripeIdx = 0;
           for (OrcProto.StripeInformation stripeProto : tail.getFooter().getStripesList()) {
-            stripes.add(new ReaderImpl.StripeInformationImpl(stripeProto));
+            stripes.add(new ReaderImpl.StripeInformationImpl(stripeProto, stripeIdx++, -1, null));

Review comment:
       I am confused with this part. previouse stripe id and encryption keys are never relevant?

##########
File path: ql/src/java/org/apache/hadoop/hive/ql/io/orc/LocalCache.java
##########
@@ -82,8 +82,7 @@ public void put(Path path, OrcTail tail) {
     if (bb.capacity() != bb.remaining()) {
       throw new RuntimeException("Bytebuffer allocated for path: " + path + " has remaining: " + bb.remaining() + " != capacity: " + bb.capacity());
     }
-    cache.put(path, new TailAndFileData(tail.getFileTail().getFileLength(),
-        tail.getFileModificationTime(), bb.duplicate()));
+    cache.put(path, new TailAndFileData(bb.limit(), tail.getFileModificationTime(), bb.duplicate()));

Review comment:
       getFileLength() is still available. Why this change?

##########
File path: ql/src/java/org/apache/hadoop/hive/ql/io/orc/VectorizedOrcAcidRowBatchReader.java
##########
@@ -2003,6 +2005,22 @@ private static IntegerColumnStatistics deserializeIntColumnStatistics(List<OrcPr
    * @param colStats The statistics array
    * @return The min record key
    */
+  private static OrcRawRecordMerger.KeyInterval getKeyInterval(ColumnStatistics[] colStats) {

Review comment:
       This and `getKeyInterval(List<OrcProto.ColumnStatistics> colStats)` are almost the same. Can you get rid of this duplication?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] pgaref commented on a change in pull request #1823: HIVE-23553: Upgrade ORC version to 1.6.7

Posted by GitBox <gi...@apache.org>.
pgaref commented on a change in pull request #1823:
URL: https://github.com/apache/hive/pull/1823#discussion_r566908774



##########
File path: ql/src/test/results/clientpositive/tez/orc_merge12.q.out
##########
@@ -162,8 +162,8 @@ Stripe Statistics:
     Column 6: count: 9174 hasNull: true min: -16379.0 max: 9763215.5639 sum: 5.62236530305E7
     Column 7: count: 12288 hasNull: false min: 00020767-dd8f-4f4d-bd68-4b7be64b8e44 max: fffa3516-e219-4027-b0d3-72bb2e676c52 sum: 442368
     Column 8: count: 12288 hasNull: false min: 000976f7-7075-4f3f-a564-5a375fafcc101416a2b7-7f64-41b7-851f-97d15405037e max: fffd0642-5f01-48cd-8d97-3428faee49e9b39f2b4c-efdc-4e5f-9ab5-4aa5394cb156 sum: 884736
-    Column 9: count: 9173 hasNull: true min: 1969-12-31 15:59:30.929 max: 1969-12-31 16:00:30.808
-    Column 10: count: 9174 hasNull: true min: 1969-12-31 15:59:30.929 max: 1969-12-31 16:00:30.808
+    Column 9: count: 9173 hasNull: true min: 1969-12-31 15:59:30.929 max: 1969-12-31 16:00:30.808999999

Review comment:
       Yes, this is expected as we are now supporting Nanosecond precision for Timestamps: https://issues.apache.org/jira/browse/ORC-663




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] pgaref edited a comment on pull request #1823: HIVE-23553: Upgrade ORC version to 1.6.6

Posted by GitBox <gi...@apache.org>.
pgaref edited a comment on pull request #1823:
URL: https://github.com/apache/hive/pull/1823#issuecomment-754556917


   > note: I tend to use distinct version numbers because snapshots might get cached and not updated - but that was an issue with the old ptest infra; I guess the current setup will handle that better...
   > 
   > anything will do which can serve a web page - I wanted to add https://raw.githubusercontent.com/pgaref/mave-repo/main/ it for you - however that page returns error 400 for everything...
   
   Hey Zoltan -- I noticed the 400 myself for listing and non-existing files but maven pulling seems to work, for example, check https://raw.githubusercontent.com/pgaref/mave-repo/main/org/apache/orc/orc-core/maven-metadata-local.xml


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] jcamachor commented on a change in pull request #1823: HIVE-23553: Upgrade ORC version to 1.6.7

Posted by GitBox <gi...@apache.org>.
jcamachor commented on a change in pull request #1823:
URL: https://github.com/apache/hive/pull/1823#discussion_r569775411



##########
File path: ql/src/java/org/apache/hadoop/hive/ql/io/orc/encoded/EncodedTreeReaderFactory.java
##########
@@ -2585,6 +2590,7 @@ private static TreeReader getPrimitiveTreeReader(final int columnIndex,
             .setColumnEncoding(columnEncoding)
             .setVectors(vectors)
             .setContext(context)
+            .setIsInstant(columnType.getCategory()  == TypeDescription.Category.TIMESTAMP_INSTANT)

Review comment:
       @pgaref , can we create a follow-up JIRA to implement TIMESTAMP WITH LOCAL TIME ZONE integration with ORC so we do not forget about it?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] mustafaiman commented on pull request #1823: HIVE-23553: Upgrade ORC version to 1.6.7

Posted by GitBox <gi...@apache.org>.
mustafaiman commented on pull request #1823:
URL: https://github.com/apache/hive/pull/1823#issuecomment-772864136


   Looks good to me. Thanks for the effort @pgaref 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] pgaref commented on a change in pull request #1823: HIVE-23553: Upgrade ORC version to 1.6.7

Posted by GitBox <gi...@apache.org>.
pgaref commented on a change in pull request #1823:
URL: https://github.com/apache/hive/pull/1823#discussion_r568174337



##########
File path: common/src/java/org/apache/hadoop/hive/conf/HiveConf.java
##########
@@ -4509,7 +4509,7 @@ private static void populateLlapDaemonVarsSet(Set<String> llapDaemonVarsSetLocal
         "Minimum allocation possible from LLAP buddy allocator. Allocations below that are\n" +
         "padded to minimum allocation. For ORC, should generally be the same as the expected\n" +
         "compression buffer size, or next lowest power of 2. Must be a power of 2."),
-    LLAP_ALLOCATOR_MAX_ALLOC("hive.llap.io.allocator.alloc.max", "16Mb", new SizeValidator(),
+    LLAP_ALLOCATOR_MAX_ALLOC("hive.llap.io.allocator.alloc.max", "4Mb", new SizeValidator(),

Review comment:
       LLAP_ALLOCATOR_MAX_ALLOC is used both for the LowLevelCacheImpl (buddyAllocator) and bufferSize on [WriterOptions](https://github.com/apache/hive/blob/da1aa077716a65c2a02d850828b16cdeece1f574/llap-server/src/java/org/apache/hadoop/hive/llap/io/encoded/SerDeEncodedDataReader.java#L1553)  Please check how this propagated from [SerDeEncodedDataReader](https://github.com/apache/hive/blob/da1aa077716a65c2a02d850828b16cdeece1f574/llap-server/src/java/org/apache/hadoop/hive/llap/io/encoded/SerDeEncodedDataReader.java#L248) 
   
   Llap is tightly coupled to ORC, thus it could make sense to use the same buffer size for serialized Buffers, and the ORC writer as we would not need to split/merge them  -- however I have nothing against splitting the conf or checking is the 8Mb limit is a hard one.
   All I am trying to say here is that this is orthogonal the ORC version bump.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] mustafaiman commented on a change in pull request #1823: HIVE-23553: Upgrade ORC version to 1.6.7

Posted by GitBox <gi...@apache.org>.
mustafaiman commented on a change in pull request #1823:
URL: https://github.com/apache/hive/pull/1823#discussion_r568161096



##########
File path: llap-server/src/java/org/apache/hadoop/hive/llap/io/encoded/LlapRecordReaderUtils.java
##########
@@ -0,0 +1,440 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hadoop.hive.llap.io.encoded;
+
+import org.apache.hadoop.fs.FSDataInputStream;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.hive.common.io.DiskRangeList;
+import org.apache.hadoop.hive.ql.io.orc.encoded.LlapDataReader;
+import org.apache.orc.CompressionCodec;
+import org.apache.orc.CompressionKind;
+import org.apache.orc.OrcFile;
+import org.apache.orc.OrcProto;
+import org.apache.orc.StripeInformation;
+import org.apache.orc.TypeDescription;
+import org.apache.orc.impl.BufferChunk;
+import org.apache.orc.impl.DataReaderProperties;
+import org.apache.orc.impl.DirectDecompressionCodec;
+import org.apache.orc.impl.HadoopShims;
+import org.apache.orc.impl.HadoopShimsFactory;
+import org.apache.orc.impl.InStream;
+import org.apache.orc.impl.OrcCodecPool;
+import org.apache.orc.impl.OrcIndex;
+import org.apache.orc.impl.RecordReaderUtils;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import java.io.IOException;
+import java.nio.ByteBuffer;
+import java.util.List;
+import java.util.function.Supplier;
+
+public class LlapRecordReaderUtils {
+
+  private static final HadoopShims SHIMS = HadoopShimsFactory.get();
+  private static final Logger LOG = LoggerFactory.getLogger(LlapRecordReaderUtils.class);
+
+  static HadoopShims.ZeroCopyReaderShim createZeroCopyShim(FSDataInputStream file, CompressionCodec codec,
+      RecordReaderUtils.ByteBufferAllocatorPool pool) throws IOException {
+    return codec == null || (codec instanceof DirectDecompressionCodec && ((DirectDecompressionCodec) codec)
+        .isAvailable()) ? null : SHIMS.getZeroCopyReader(file, pool);

Review comment:
       I think this was equivalent to `codec == null || (codec instanceof DirectDecompressionCodec && ((DirectDecompressionCodec) codec).isAvailable()) ? SHIMS.getZeroCopyReader(file, pool) : null`
    before. Looks like `null: SHIMS.getZeroCopyReader` thing got inverted.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] kgyrtkirk commented on pull request #1823: HIVE-23553: Upgrade ORC version to 1.6.6

Posted by GitBox <gi...@apache.org>.
kgyrtkirk commented on pull request #1823:
URL: https://github.com/apache/hive/pull/1823#issuecomment-754550816


   note: I tend to use distinct version numbers because snapshots might get cached and not updated - but that was an issue with the old ptest infra; I guess the current setup will handle that better...
   
   anything will do which can serve a web page - I wanted to add https://raw.githubusercontent.com/pgaref/mave-repo/main/  it for you - however that page returns error 400 for everything...
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] pgaref commented on a change in pull request #1823: HIVE-23553: Upgrade ORC version to 1.6.7

Posted by GitBox <gi...@apache.org>.
pgaref commented on a change in pull request #1823:
URL: https://github.com/apache/hive/pull/1823#discussion_r567800512



##########
File path: ql/src/java/org/apache/hadoop/hive/ql/io/orc/LocalCache.java
##########
@@ -82,8 +82,7 @@ public void put(Path path, OrcTail tail) {
     if (bb.capacity() != bb.remaining()) {
       throw new RuntimeException("Bytebuffer allocated for path: " + path + " has remaining: " + bb.remaining() + " != capacity: " + bb.capacity());
     }
-    cache.put(path, new TailAndFileData(tail.getFileTail().getFileLength(),
-        tail.getFileModificationTime(), bb.duplicate()));
+    cache.put(path, new TailAndFileData(bb.limit(), tail.getFileModificationTime(), bb.duplicate()));

Review comment:
       But I agree, cache should be populated with the original **getFileTail().getFileLength()**  as it is afterward used for comparison (thus reverted this change) -- however, where ReaderImpl.extractFileTail is now called uses the buffer size instead.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] mustafaiman commented on a change in pull request #1823: HIVE-23553: Upgrade ORC version to 1.6.7

Posted by GitBox <gi...@apache.org>.
mustafaiman commented on a change in pull request #1823:
URL: https://github.com/apache/hive/pull/1823#discussion_r564077084



##########
File path: ql/src/java/org/apache/hadoop/hive/ql/io/orc/encoded/EncodedTreeReaderFactory.java
##########
@@ -2585,6 +2590,7 @@ private static TreeReader getPrimitiveTreeReader(final int columnIndex,
             .setColumnEncoding(columnEncoding)
             .setVectors(vectors)
             .setContext(context)
+            .setIsInstant(columnType.getCategory()  == TypeDescription.Category.TIMESTAMP_INSTANT)

Review comment:
       Isn't this always `false`? Don't we need another case for TIMESTAMP_INSTANT?

##########
File path: ql/src/java/org/apache/hadoop/hive/ql/io/orc/encoded/LlapDataReader.java
##########
@@ -0,0 +1,93 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hadoop.hive.ql.io.orc.encoded;
+
+import org.apache.hadoop.hive.common.io.DiskRangeList;
+import org.apache.orc.CompressionCodec;
+import org.apache.orc.OrcFile;
+import org.apache.orc.OrcProto;
+import org.apache.orc.StripeInformation;
+import org.apache.orc.TypeDescription;
+import org.apache.orc.impl.OrcIndex;
+
+import java.io.IOException;
+import java.nio.ByteBuffer;
+
+/** An abstract data reader that IO formats can use to read bytes from underlying storage. */
+public interface LlapDataReader extends AutoCloseable, Cloneable {
+
+  /** Opens the DataReader, making it ready to use. */
+  void open() throws IOException;
+
+  OrcIndex readRowIndex(StripeInformation stripe,
+      TypeDescription fileSchema,
+      OrcProto.StripeFooter footer,
+      boolean ignoreNonUtf8BloomFilter,
+      boolean[] included,
+      OrcProto.RowIndex[] indexes,
+      boolean[] sargColumns,
+      OrcFile.WriterVersion version,
+      OrcProto.Stream.Kind[] bloomFilterKinds,
+      OrcProto.BloomFilterIndex[] bloomFilterIndices
+  ) throws IOException;
+
+  OrcProto.StripeFooter readStripeFooter(StripeInformation stripe) throws IOException;
+
+  /** Reads the data.
+   *
+   * Note that for the cases such as zero-copy read, caller must release the disk ranges
+   * produced after being done with them. Call isTrackingDiskRanges to find out if this is needed.
+   * @param range List if disk ranges to read. Ranges with data will be ignored.
+   * @param baseOffset Base offset from the start of the file of the ranges in disk range list.
+   * @param doForceDirect Whether the data should be read into direct buffers.
+   * @return New or modified list of DiskRange-s, where all the ranges are filled with data.
+   */
+  DiskRangeList readFileData(
+      DiskRangeList range, long baseOffset, boolean doForceDirect) throws IOException;
+
+
+  /**
+   * Whether the user should release buffers created by readFileData. See readFileData javadoc.
+   */
+  boolean isTrackingDiskRanges();
+
+  /**
+   * Releases buffers created by readFileData. See readFileData javadoc.
+   * @param toRelease The buffer to release.
+   */
+  void releaseBuffer(ByteBuffer toRelease);
+
+  /**
+   * Clone the entire state of the DataReader with the assumption that the
+   * clone will be closed at a different time. Thus, any file handles in the
+   * implementation need to be cloned.
+   * @return a new instance
+   */
+  LlapDataReader clone();
+
+  @Override
+  void close() throws IOException;
+
+  /**
+   * Returns the compression codec used by this datareader.
+   * We should consider removing this from the interface.
+   * @return the compression codec
+   */
+  CompressionCodec getCompressionCodec();

Review comment:
       This interface looks like a copy of ORC's DataReader except this method. ORC's DataReader returns as StreamOptions instead of CompressionCodec. As far as i understand, StreamOptions includes the compression codec and more info. Morevover, I see other parts of the code already make use of StreamOptions so it should be possible to integrate this part of the code too. I do not understand why LlapDataReader interface is necessary.

##########
File path: ql/src/java/org/apache/hadoop/hive/ql/io/sarg/ConvertAstToSearchArg.java
##########
@@ -68,6 +70,10 @@
 
   private static final int KRYO_OUTPUT_BUFFER_SIZE = 4 * 1024;
   private static final int KRYO_OUTPUT_BUFFER_MAX_SIZE = 10 * 1024 * 1024;
+  private static final GregorianCalendar PROLEPTIC = new GregorianCalendar();

Review comment:
       This is not used anywhere?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] dongjoon-hyun commented on pull request #1823: HIVE-23553: Upgrade ORC version to 1.6.7

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on pull request #1823:
URL: https://github.com/apache/hive/pull/1823#issuecomment-773584083


   Thank you so much all! It's great!


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] kgyrtkirk commented on pull request #1823: HIVE-23553: Upgrade ORC version to 1.6.6

Posted by GitBox <gi...@apache.org>.
kgyrtkirk commented on pull request #1823:
URL: https://github.com/apache/hive/pull/1823#issuecomment-753947139


   its better to not use intermediate artifacts; snapshots doesnt play well with reproducible builds - if you just want to do a pre-flight check for the next orc version that could be done in a few different ways:
   * you may publish some pirate artifacts somewhere and we could add that repo to the artfiactory (and use some distinctive version number like: 1.6.6-pg01)
   * other way around could be to kill the rat for your runs by uising `-Drat.skip` (but...snapshots are not always updated so it will take the previous....)
   * you may also add instructions to also build orc during before the main hive build - that way it will build whatever you want
   
   last time when I had to do something similar I've choosen the  first option...


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] pgaref commented on a change in pull request #1823: HIVE-23553: Upgrade ORC version to 1.6.7

Posted by GitBox <gi...@apache.org>.
pgaref commented on a change in pull request #1823:
URL: https://github.com/apache/hive/pull/1823#discussion_r566908774



##########
File path: ql/src/test/results/clientpositive/tez/orc_merge12.q.out
##########
@@ -162,8 +162,8 @@ Stripe Statistics:
     Column 6: count: 9174 hasNull: true min: -16379.0 max: 9763215.5639 sum: 5.62236530305E7
     Column 7: count: 12288 hasNull: false min: 00020767-dd8f-4f4d-bd68-4b7be64b8e44 max: fffa3516-e219-4027-b0d3-72bb2e676c52 sum: 442368
     Column 8: count: 12288 hasNull: false min: 000976f7-7075-4f3f-a564-5a375fafcc101416a2b7-7f64-41b7-851f-97d15405037e max: fffd0642-5f01-48cd-8d97-3428faee49e9b39f2b4c-efdc-4e5f-9ab5-4aa5394cb156 sum: 884736
-    Column 9: count: 9173 hasNull: true min: 1969-12-31 15:59:30.929 max: 1969-12-31 16:00:30.808
-    Column 10: count: 9174 hasNull: true min: 1969-12-31 15:59:30.929 max: 1969-12-31 16:00:30.808
+    Column 9: count: 9173 hasNull: true min: 1969-12-31 15:59:30.929 max: 1969-12-31 16:00:30.808999999

Review comment:
       Yes, this is expected as are now supporting Nanosecond precision for Timestamps: https://issues.apache.org/jira/browse/ORC-663




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] pgaref commented on a change in pull request #1823: HIVE-23553: Upgrade ORC version to 1.6.7

Posted by GitBox <gi...@apache.org>.
pgaref commented on a change in pull request #1823:
URL: https://github.com/apache/hive/pull/1823#discussion_r568204044



##########
File path: ql/src/java/org/apache/hadoop/hive/ql/io/orc/encoded/EncodedTreeReaderFactory.java
##########
@@ -2585,6 +2590,7 @@ private static TreeReader getPrimitiveTreeReader(final int columnIndex,
             .setColumnEncoding(columnEncoding)
             .setVectors(vectors)
             .setContext(context)
+            .setIsInstant(columnType.getCategory()  == TypeDescription.Category.TIMESTAMP_INSTANT)

Review comment:
       Even though TimeStamp with local timezone was added as part of [ORC-189](https://issues.apache.org/jira/browse/ORC-189) 

##########
File path: common/src/java/org/apache/hadoop/hive/conf/HiveConf.java
##########
@@ -4509,7 +4509,7 @@ private static void populateLlapDaemonVarsSet(Set<String> llapDaemonVarsSetLocal
         "Minimum allocation possible from LLAP buddy allocator. Allocations below that are\n" +
         "padded to minimum allocation. For ORC, should generally be the same as the expected\n" +
         "compression buffer size, or next lowest power of 2. Must be a power of 2."),
-    LLAP_ALLOCATOR_MAX_ALLOC("hive.llap.io.allocator.alloc.max", "16Mb", new SizeValidator(),
+    LLAP_ALLOCATOR_MAX_ALLOC("hive.llap.io.allocator.alloc.max", "4Mb", new SizeValidator(),

Review comment:
       LLAP_ALLOCATOR_MAX_ALLOC is used both for the LowLevelCacheImpl (buddyAllocator) and bufferSize on [WriterOptions](https://github.com/apache/hive/blob/da1aa077716a65c2a02d850828b16cdeece1f574/llap-server/src/java/org/apache/hadoop/hive/llap/io/encoded/SerDeEncodedDataReader.java#L1553)  Please check how this propagated from [SerDeEncodedDataReader](https://github.com/apache/hive/blob/da1aa077716a65c2a02d850828b16cdeece1f574/llap-server/src/java/org/apache/hadoop/hive/llap/io/encoded/SerDeEncodedDataReader.java#L248) 
   
   Llap is tightly coupled to ORC, thus it could make sense to use the same buffer size for serialized Buffers, and the ORC writer as we would not need to split/merge them  -- however I have nothing against splitting the conf or checking is the 8Mb limit is a hard one.
   All I am trying to say here is that this is orthogonal to the ORC version bump.

##########
File path: common/src/java/org/apache/hadoop/hive/conf/HiveConf.java
##########
@@ -4509,7 +4509,7 @@ private static void populateLlapDaemonVarsSet(Set<String> llapDaemonVarsSetLocal
         "Minimum allocation possible from LLAP buddy allocator. Allocations below that are\n" +
         "padded to minimum allocation. For ORC, should generally be the same as the expected\n" +
         "compression buffer size, or next lowest power of 2. Must be a power of 2."),
-    LLAP_ALLOCATOR_MAX_ALLOC("hive.llap.io.allocator.alloc.max", "16Mb", new SizeValidator(),
+    LLAP_ALLOCATOR_MAX_ALLOC("hive.llap.io.allocator.alloc.max", "4Mb", new SizeValidator(),

Review comment:
       Sure, addressed as part of 2eca3b9de1f2332beabc3bde9ac0f89d62ec1527  -- also opened HIVE-24721 to investigate this further

##########
File path: common/src/java/org/apache/hadoop/hive/conf/HiveConf.java
##########
@@ -4509,7 +4509,7 @@ private static void populateLlapDaemonVarsSet(Set<String> llapDaemonVarsSetLocal
         "Minimum allocation possible from LLAP buddy allocator. Allocations below that are\n" +
         "padded to minimum allocation. For ORC, should generally be the same as the expected\n" +
         "compression buffer size, or next lowest power of 2. Must be a power of 2."),
-    LLAP_ALLOCATOR_MAX_ALLOC("hive.llap.io.allocator.alloc.max", "16Mb", new SizeValidator(),
+    LLAP_ALLOCATOR_MAX_ALLOC("hive.llap.io.allocator.alloc.max", "4Mb", new SizeValidator(),

Review comment:
       Sure, addressed this as part of 2eca3b9de1f2332beabc3bde9ac0f89d62ec1527  -- also opened HIVE-24721 to investigate this further

##########
File path: common/src/java/org/apache/hadoop/hive/conf/HiveConf.java
##########
@@ -4509,7 +4509,7 @@ private static void populateLlapDaemonVarsSet(Set<String> llapDaemonVarsSetLocal
         "Minimum allocation possible from LLAP buddy allocator. Allocations below that are\n" +
         "padded to minimum allocation. For ORC, should generally be the same as the expected\n" +
         "compression buffer size, or next lowest power of 2. Must be a power of 2."),
-    LLAP_ALLOCATOR_MAX_ALLOC("hive.llap.io.allocator.alloc.max", "16Mb", new SizeValidator(),
+    LLAP_ALLOCATOR_MAX_ALLOC("hive.llap.io.allocator.alloc.max", "4Mb", new SizeValidator(),

Review comment:
       Sure, addressed this as part of 2eca3b9de1f2332beabc3bde9ac0f89d62ec1527  -- also opened HIVE-24721 to investigate further




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] dongjoon-hyun commented on pull request #1823: HIVE-23553: Upgrade ORC version to 1.6.7

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on pull request #1823:
URL: https://github.com/apache/hive/pull/1823#issuecomment-773584083


   Thank you so much all! It's great!


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] pgaref commented on a change in pull request #1823: HIVE-23553: Upgrade ORC version to 1.6.7

Posted by GitBox <gi...@apache.org>.
pgaref commented on a change in pull request #1823:
URL: https://github.com/apache/hive/pull/1823#discussion_r566784743



##########
File path: llap-server/src/java/org/apache/hadoop/hive/llap/io/encoded/OrcEncodedDataReader.java
##########
@@ -631,10 +630,19 @@ private OrcFileMetadata getFileFooterFromCacheOrDisk() throws IOException {
           OrcTail orcTail = getOrcTailFromLlapBuffers(tailBuffers);
           counters.incrCounter(LlapIOCounters.METADATA_CACHE_HIT);
           FileTail tail = orcTail.getFileTail();
-          stats = orcTail.getStripeStatisticsProto();
+          CompressionKind compressionKind = orcTail.getCompressionKind();
+          InStream.StreamOptions options = null;
+          if (compressionKind != CompressionKind.NONE) {
+            options = InStream.options()
+                .withCodec(OrcCodecPool.getCodec(compressionKind)).withBufferSize(orcTail.getCompressionBufferSize());
+          }
+          InStream stream = InStream.create("stripe stats", orcTail.getTailBuffer(),

Review comment:
       Sure, makes sense -- extracted method above.

##########
File path: llap-server/src/java/org/apache/hadoop/hive/llap/io/encoded/OrcEncodedDataReader.java
##########
@@ -631,10 +630,19 @@ private OrcFileMetadata getFileFooterFromCacheOrDisk() throws IOException {
           OrcTail orcTail = getOrcTailFromLlapBuffers(tailBuffers);
           counters.incrCounter(LlapIOCounters.METADATA_CACHE_HIT);
           FileTail tail = orcTail.getFileTail();
-          stats = orcTail.getStripeStatisticsProto();
+          CompressionKind compressionKind = orcTail.getCompressionKind();
+          InStream.StreamOptions options = null;
+          if (compressionKind != CompressionKind.NONE) {
+            options = InStream.options()
+                .withCodec(OrcCodecPool.getCodec(compressionKind)).withBufferSize(orcTail.getCompressionBufferSize());
+          }
+          InStream stream = InStream.create("stripe stats", orcTail.getTailBuffer(),
+              orcTail.getMetadataOffset(), orcTail.getMetadataSize(), options);
+          stats = OrcProto.Metadata.parseFrom(InStream.createCodedInputStream(stream)).getStripeStatsList();
           stripes = new ArrayList<>(tail.getFooter().getStripesCount());
+          int stripeIdx = 0;
           for (OrcProto.StripeInformation stripeProto : tail.getFooter().getStripesList()) {
-            stripes.add(new ReaderImpl.StripeInformationImpl(stripeProto));
+            stripes.add(new ReaderImpl.StripeInformationImpl(stripeProto, stripeIdx++, -1, null));

Review comment:
       In ORC-1.5 encryption is not supported -- Stripe info can also be encrypted since ORC-523 and thus the extra arguments here. Since we are not yet using encryption on LLAP the last 2 params can be null we I am keeping an incremental StripeId as it is used in a couple of places like the StripePlanner.

##########
File path: ql/src/java/org/apache/hadoop/hive/ql/io/orc/encoded/LlapDataReader.java
##########
@@ -0,0 +1,93 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hadoop.hive.ql.io.orc.encoded;
+
+import org.apache.hadoop.hive.common.io.DiskRangeList;
+import org.apache.orc.CompressionCodec;
+import org.apache.orc.OrcFile;
+import org.apache.orc.OrcProto;
+import org.apache.orc.StripeInformation;
+import org.apache.orc.TypeDescription;
+import org.apache.orc.impl.OrcIndex;
+
+import java.io.IOException;
+import java.nio.ByteBuffer;
+
+/** An abstract data reader that IO formats can use to read bytes from underlying storage. */
+public interface LlapDataReader extends AutoCloseable, Cloneable {
+
+  /** Opens the DataReader, making it ready to use. */
+  void open() throws IOException;
+
+  OrcIndex readRowIndex(StripeInformation stripe,
+      TypeDescription fileSchema,
+      OrcProto.StripeFooter footer,
+      boolean ignoreNonUtf8BloomFilter,
+      boolean[] included,
+      OrcProto.RowIndex[] indexes,
+      boolean[] sargColumns,
+      OrcFile.WriterVersion version,
+      OrcProto.Stream.Kind[] bloomFilterKinds,
+      OrcProto.BloomFilterIndex[] bloomFilterIndices
+  ) throws IOException;
+
+  OrcProto.StripeFooter readStripeFooter(StripeInformation stripe) throws IOException;
+
+  /** Reads the data.
+   *
+   * Note that for the cases such as zero-copy read, caller must release the disk ranges
+   * produced after being done with them. Call isTrackingDiskRanges to find out if this is needed.
+   * @param range List if disk ranges to read. Ranges with data will be ignored.
+   * @param baseOffset Base offset from the start of the file of the ranges in disk range list.
+   * @param doForceDirect Whether the data should be read into direct buffers.
+   * @return New or modified list of DiskRange-s, where all the ranges are filled with data.
+   */
+  DiskRangeList readFileData(
+      DiskRangeList range, long baseOffset, boolean doForceDirect) throws IOException;
+
+
+  /**
+   * Whether the user should release buffers created by readFileData. See readFileData javadoc.
+   */
+  boolean isTrackingDiskRanges();
+
+  /**
+   * Releases buffers created by readFileData. See readFileData javadoc.
+   * @param toRelease The buffer to release.
+   */
+  void releaseBuffer(ByteBuffer toRelease);
+
+  /**
+   * Clone the entire state of the DataReader with the assumption that the
+   * clone will be closed at a different time. Thus, any file handles in the
+   * implementation need to be cloned.
+   * @return a new instance
+   */
+  LlapDataReader clone();
+
+  @Override
+  void close() throws IOException;
+
+  /**
+   * Returns the compression codec used by this datareader.
+   * We should consider removing this from the interface.
+   * @return the compression codec
+   */
+  CompressionCodec getCompressionCodec();

Review comment:
       The interface is almost identical indeed, however the main issue the readFileData method above that takes a DiskRangeList instead of a BufferChunkList.
   The issue here is that ORC-1.6 uses a separate StripePlanner that reads RowGroups converting them to BufferChunks while in LLAP we are doing our own custom planning with DiskRanges.
   
   I am opening another thicket for this (moving LLAP to Stripe planning) but for now I believe it makes sense to keep the class.

##########
File path: ql/src/java/org/apache/hadoop/hive/ql/io/orc/OrcFileFormatProxy.java
##########
@@ -47,12 +47,14 @@ public SplitInfos applySargToMetadata(
     OrcTail orcTail = ReaderImpl.extractFileTail(fileMetadata);
     OrcProto.Footer footer = orcTail.getFooter();
     int stripeCount = footer.getStripesCount();
-    boolean writerUsedProlepticGregorian = footer.hasCalendar()
-        ? footer.getCalendar() == OrcProto.CalendarKind.PROLEPTIC_GREGORIAN
-        : OrcConf.PROLEPTIC_GREGORIAN_DEFAULT.getBoolean(conf);
+    // Always convert To PROLEPTIC_GREGORIAN

Review comment:
       Hive is already using the proleptic calendar as default for processing and we seem to be converting the StripeStats to that since HIVE-22589.
   https://github.com/apache/orc/blob/32be030290905de9c2cd5b8cd84e210d8c0cf25c/java/core/src/java/org/apache/orc/impl/OrcTail.java#L114
   
   The main difference here is that the ORC Reader is now doing this transparently:
   https://github.com/apache/orc/blob/7542d04a2fee8437f2c12f72f9c647b4325bc7bb/java/core/src/java/org/apache/orc/impl/ReaderImpl.java#L817

##########
File path: ql/src/java/org/apache/hadoop/hive/ql/io/orc/VectorizedOrcAcidRowBatchReader.java
##########
@@ -2003,6 +2005,22 @@ private static IntegerColumnStatistics deserializeIntColumnStatistics(List<OrcPr
    * @param colStats The statistics array
    * @return The min record key
    */
+  private static OrcRawRecordMerger.KeyInterval getKeyInterval(ColumnStatistics[] colStats) {

Review comment:
       Sure, created a wrapper instead reusing the same logic

##########
File path: ql/src/java/org/apache/hadoop/hive/ql/io/orc/LocalCache.java
##########
@@ -82,8 +82,7 @@ public void put(Path path, OrcTail tail) {
     if (bb.capacity() != bb.remaining()) {
       throw new RuntimeException("Bytebuffer allocated for path: " + path + " has remaining: " + bb.remaining() + " != capacity: " + bb.capacity());
     }
-    cache.put(path, new TailAndFileData(tail.getFileTail().getFileLength(),
-        tail.getFileModificationTime(), bb.duplicate()));
+    cache.put(path, new TailAndFileData(bb.limit(), tail.getFileModificationTime(), bb.duplicate()));

Review comment:
       This is related to ORC-685 where we added a way for the ReaderImpl to directly extractFileTail from a ByteBuffer.
   Using the buffer size is the right thing to do here as there can be a missmatch with the FileLength leading to bad tail extraction.
   
   

##########
File path: ql/src/test/results/clientpositive/llap/schema_evol_orc_nonvec_part_all_primitive.q.out
##########
@@ -687,11 +687,11 @@ POSTHOOK: Input: default@part_change_various_various_timestamp_n6
 POSTHOOK: Input: default@part_change_various_various_timestamp_n6@part=1
 #### A masked pattern was here ####
 insert_num	part	c1	c2	c3	c4	c5	c6	c7	c8	c9	c10	c11	c12	b
-101	1	1970-01-01 00:00:00.001	1969-12-31 23:59:59.872	NULL	1969-12-07 03:28:36.352	NULL	NULL	NULL	NULL	6229-06-28 02:54:28.970117179	6229-06-28 02:54:28.97011	6229-06-28 02:54:28.97011	1950-12-18 00:00:00	original
-102	1	1970-01-01 00:00:00	1970-01-01 00:00:00.127	1970-01-01 00:00:32.767	1970-01-25 20:31:23.647	NULL	NULL	NULL	NULL	5966-07-09 03:30:50.597	5966-07-09 03:30:50.597	5966-07-09 03:30:50.597	2049-12-18 00:00:00	original
+101	1	1970-01-01 00:00:01	1969-12-31 23:57:52	NULL	1901-12-13 20:45:52	NULL	NULL	NULL	NULL	6229-06-28 02:54:28.970117179	6229-06-28 02:54:28.97011	6229-06-28 02:54:28.97011	1950-12-18 00:00:00	original

Review comment:
       Yeah, I spend some time figuring this out myself, the root cause of this is a change in Integer to Timestamp conversion. In ORC-1.6 we use seconds while ORC-1.5 we use milliseconds.
   As a result casting (+1) to Timestamp will be: 1970-01-01 00:00:00.001 in ORC-1.5 while 1970-01-01 00:00:01 in ORC-1.6. 
    https://github.com/apache/orc/blob/f7dce53fb39cf9654641edfe2d3ad68c0a8ef7b8/java/core/src/java/org/apache/orc/impl/ConvertTreeReaderFactory.java#L1499
    
   PS: also added this as user-facing change on the PR

##########
File path: ql/src/java/org/apache/hadoop/hive/ql/io/orc/VectorizedOrcAcidRowBatchReader.java
##########
@@ -443,7 +444,8 @@ public void setBaseAndInnerReader(
       return new OrcRawRecordMerger.KeyInterval(null, null);
     }
 
-    OrcTail orcTail = getOrcTail(orcSplit.getPath(), conf, cacheTag, orcSplit.getFileKey()).orcTail;
+    VectorizedOrcAcidRowBatchReader.ReaderData orcReaderData =

Review comment:
       Thats one of the breaking changes that case with encryption support for ORC.
   As Tail and thus StripeStatistics may be encrypted, we always need a reader instance to retrieve them.
   
   OrcTail maintained the API call for backwards compatibility but it still expects a reader to actually retrieve the stats:
   https://github.com/apache/orc/blob/d78cc39a9299b62bc8a5d8f5c3fac9201e03cb8b/java/core/src/java/org/apache/orc/impl/OrcTail.java#L210

##########
File path: ql/src/java/org/apache/hadoop/hive/ql/io/sarg/ConvertAstToSearchArg.java
##########
@@ -68,6 +70,10 @@
 
   private static final int KRYO_OUTPUT_BUFFER_SIZE = 4 * 1024;
   private static final int KRYO_OUTPUT_BUFFER_MAX_SIZE = 10 * 1024 * 1024;
+  private static final GregorianCalendar PROLEPTIC = new GregorianCalendar();

Review comment:
       ACK

##########
File path: ql/src/java/org/apache/hadoop/hive/ql/io/orc/encoded/EncodedTreeReaderFactory.java
##########
@@ -2585,6 +2590,7 @@ private static TreeReader getPrimitiveTreeReader(final int columnIndex,
             .setColumnEncoding(columnEncoding)
             .setVectors(vectors)
             .setContext(context)
+            .setIsInstant(columnType.getCategory()  == TypeDescription.Category.TIMESTAMP_INSTANT)

Review comment:
       ACK

##########
File path: ql/src/java/org/apache/hadoop/hive/ql/io/orc/encoded/EncodedReaderImpl.java
##########
@@ -282,6 +280,56 @@ public String toString() {
     }
   }
 
+  public static boolean[] findPresentStreamsByColumn(

Review comment:
       Sure, done

##########
File path: ql/src/test/results/clientpositive/llap/acid_bloom_filter_orc_file_dump.q.out
##########
@@ -76,7 +76,7 @@ PREHOOK: Input: default@bloomtest
 #### A masked pattern was here ####
 -- BEGIN ORC FILE DUMP --
 #### A masked pattern was here ####
-File Version: 0.12 with ORC_517
+File Version: 0.12 with ORC_14

Review comment:
       Yes, ORC_14 is the latest Writer version (for 1.6) as per:
   
   https://github.com/apache/orc/blob/a1671f1e8abf748178c07e2a03b03f00268c8863/java/core/src/java/org/apache/orc/OrcFile.java#L177




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] pgaref edited a comment on pull request #1823: HIVE-23553: Upgrade ORC version to 1.6.6

Posted by GitBox <gi...@apache.org>.
pgaref edited a comment on pull request #1823:
URL: https://github.com/apache/hive/pull/1823#issuecomment-754556917


   > note: I tend to use distinct version numbers because snapshots might get cached and not updated - but that was an issue with the old ptest infra; I guess the current setup will handle that better...
   > 
   > anything will do which can serve a web page - I wanted to add https://raw.githubusercontent.com/pgaref/mave-repo/main/ it for you - however that page returns error 400 for everything...
   
   Hey Zoltan -- I noticed the 400 myself for listing and non-existing files (GitHub policy?) but maven pulling seems to work, for example, check https://raw.githubusercontent.com/pgaref/mave-repo/main/org/apache/orc/orc-core/maven-metadata-local.xml


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] pgaref commented on a change in pull request #1823: HIVE-23553: Upgrade ORC version to 1.6.7

Posted by GitBox <gi...@apache.org>.
pgaref commented on a change in pull request #1823:
URL: https://github.com/apache/hive/pull/1823#discussion_r570105800



##########
File path: ql/src/java/org/apache/hadoop/hive/ql/io/orc/encoded/EncodedTreeReaderFactory.java
##########
@@ -2585,6 +2590,7 @@ private static TreeReader getPrimitiveTreeReader(final int columnIndex,
             .setColumnEncoding(columnEncoding)
             .setVectors(vectors)
             .setContext(context)
+            .setIsInstant(columnType.getCategory()  == TypeDescription.Category.TIMESTAMP_INSTANT)

Review comment:
       Sure thing! This is now tracked as HIVE-24735 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] pgaref commented on a change in pull request #1823: HIVE-23553: Upgrade ORC version to 1.6.7

Posted by GitBox <gi...@apache.org>.
pgaref commented on a change in pull request #1823:
URL: https://github.com/apache/hive/pull/1823#discussion_r566941929



##########
File path: ql/src/test/org/apache/hadoop/hive/ql/io/orc/TestOrcFile.java
##########
@@ -325,7 +326,7 @@ public void testReadFormat_0_11() throws Exception {
         + "binary,string1:string,middle:struct<list:array<struct<int1:int,"
         + "string1:string>>>,list:array<struct<int1:int,string1:string>>,"
         + "map:map<string,struct<int1:int,string1:string>>,ts:timestamp,"
-        + "decimal1:decimal(38,18)>", readerInspector.getTypeName());
+        + "decimal1:decimal(38,10)>", readerInspector.getTypeName());

Review comment:
       Seems that that type scale was not properly propaged before and was using HiveDecimal.SYSTEM_DEFAULT_SCALE which is 18: https://github.com/apache/hive/blob/ff6f3565e50148b7bcfbcf19b970379f2bd59290/ql/src/java/org/apache/hadoop/hive/ql/io/orc/OrcStruct.java#L607
   
   Current schema (decimal with scale 10) is actually in-line with ORC tests for the same file:
   https://github.com/apache/orc/blob/b54d10cedf5ec1529cf06d77268510c216402cba/java/core/src/test/org/apache/orc/impl/TestRecordReaderImpl.java#L141




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] kgyrtkirk commented on pull request #1823: HIVE-23553: Upgrade ORC version to 1.6.X

Posted by GitBox <gi...@apache.org>.
kgyrtkirk commented on pull request #1823:
URL: https://github.com/apache/hive/pull/1823#issuecomment-756759959


   the thing is that before I added that repo I pressed the test button which reported the error - because I think it also does the same for blackout detection - it would have probably not worked...
   
   I've added the repsy to the artifactory and added to the virtual repo precommit uses


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] pgaref commented on pull request #1823: HIVE-23553: Upgrade ORC version to 1.6.7

Posted by GitBox <gi...@apache.org>.
pgaref commented on pull request #1823:
URL: https://github.com/apache/hive/pull/1823#issuecomment-769755146


   > I only partially reviewed this. Will continue reviewing.
   > One question: I see we do not care about column encryption related arguments in multiple places. Is it because column encryption is not supported?
   
   Hey @mustage good question with a complicated answer -- while creating this I also did some digging to find out whats supported and what not. To sum up my findings: 
   
   -  It looks like we are currently able to encrypt entire tables and/or data on hdfs using kms: HIVE-8065
   -  Support for column level encryption/decryption (passing some encryption setting to the Table props and let Hive take care of the rest) started more than a while ago as part of HIVE-6329 
   -  There was a community discussion as part of HIVE-21848 to unify encryption table properties (at least for ORC and Parquet) that concluded in the accepted options
   - However, these properties are still not propagated to the tables: HIVE-21849
   
   I believe part of the reason is that Hive already integrates with Apache Ranger that can restrict user access to particular columns and also adds data-masking on top.
   However, I am more than happy discussing the revival of column encryption at some point.
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] pgaref commented on pull request #1823: HIVE-23553: Upgrade ORC version to 1.6.7

Posted by GitBox <gi...@apache.org>.
pgaref commented on pull request #1823:
URL: https://github.com/apache/hive/pull/1823#issuecomment-771057135


   Tests just passed and comments are addressed above.
   @mustafaiman @jcamachor please take another look and let me know what you think :) 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] pgaref commented on a change in pull request #1823: HIVE-23553: Upgrade ORC version to 1.6.7

Posted by GitBox <gi...@apache.org>.
pgaref commented on a change in pull request #1823:
URL: https://github.com/apache/hive/pull/1823#discussion_r568567405



##########
File path: common/src/java/org/apache/hadoop/hive/conf/HiveConf.java
##########
@@ -4509,7 +4509,7 @@ private static void populateLlapDaemonVarsSet(Set<String> llapDaemonVarsSetLocal
         "Minimum allocation possible from LLAP buddy allocator. Allocations below that are\n" +
         "padded to minimum allocation. For ORC, should generally be the same as the expected\n" +
         "compression buffer size, or next lowest power of 2. Must be a power of 2."),
-    LLAP_ALLOCATOR_MAX_ALLOC("hive.llap.io.allocator.alloc.max", "16Mb", new SizeValidator(),
+    LLAP_ALLOCATOR_MAX_ALLOC("hive.llap.io.allocator.alloc.max", "4Mb", new SizeValidator(),

Review comment:
       Sure, addressed as part of 2eca3b9de1f2332beabc3bde9ac0f89d62ec1527  -- also opened HIVE-24721 to investigate this further




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] pgaref edited a comment on pull request #1823: HIVE-23553: Upgrade ORC version to 1.6.6

Posted by GitBox <gi...@apache.org>.
pgaref edited a comment on pull request #1823:
URL: https://github.com/apache/hive/pull/1823#issuecomment-754556917


   > note: I tend to use distinct version numbers because snapshots might get cached and not updated - but that was an issue with the old ptest infra; I guess the current setup will handle that better...
   > 
   > anything will do which can serve a web page - I wanted to add https://raw.githubusercontent.com/pgaref/mave-repo/main/ it for you - however that page returns error 400 for everything...
   
   Hey Zoltan -- I noticed the 400 myself for listing and non-existing files (GitHub policy?) but maven pulling seems to work, for example, check https://raw.githubusercontent.com/pgaref/mave-repo/main/org/apache/orc/orc-core/maven-metadata-local.xml
   
   To be on the safe side though I created this Repsy public repo: https://repo.repsy.io/mvn/pgaref/repository
   Feel free to add this instead :) 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] pgaref commented on a change in pull request #1823: HIVE-23553: Upgrade ORC version to 1.6.7

Posted by GitBox <gi...@apache.org>.
pgaref commented on a change in pull request #1823:
URL: https://github.com/apache/hive/pull/1823#discussion_r568174337



##########
File path: common/src/java/org/apache/hadoop/hive/conf/HiveConf.java
##########
@@ -4509,7 +4509,7 @@ private static void populateLlapDaemonVarsSet(Set<String> llapDaemonVarsSetLocal
         "Minimum allocation possible from LLAP buddy allocator. Allocations below that are\n" +
         "padded to minimum allocation. For ORC, should generally be the same as the expected\n" +
         "compression buffer size, or next lowest power of 2. Must be a power of 2."),
-    LLAP_ALLOCATOR_MAX_ALLOC("hive.llap.io.allocator.alloc.max", "16Mb", new SizeValidator(),
+    LLAP_ALLOCATOR_MAX_ALLOC("hive.llap.io.allocator.alloc.max", "4Mb", new SizeValidator(),

Review comment:
       LLAP_ALLOCATOR_MAX_ALLOC is used both for the LowLevelCacheImpl (buddyAllocator) and bufferSize on [WriterOptions](https://github.com/apache/hive/blob/da1aa077716a65c2a02d850828b16cdeece1f574/llap-server/src/java/org/apache/hadoop/hive/llap/io/encoded/SerDeEncodedDataReader.java#L1553) 
   
   Please check how this propagated from [SerDeEncodedDataReader](https://github.com/apache/hive/blob/da1aa077716a65c2a02d850828b16cdeece1f574/llap-server/src/java/org/apache/hadoop/hive/llap/io/encoded/SerDeEncodedDataReader.java#L248) 
   
   Llap is tightly coupled to ORC, thus it could make sense to use the same buffer size for serialized Buffers, and the ORC writer as we would not need to split/merge them  -- however I have nothing against splitting the conf or checking is the 8Mb limit is a hard one.
   All I am trying to say here is that this is orthogonal the ORC version bump.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] pgaref commented on a change in pull request #1823: HIVE-23553: Upgrade ORC version to 1.6.7

Posted by GitBox <gi...@apache.org>.
pgaref commented on a change in pull request #1823:
URL: https://github.com/apache/hive/pull/1823#discussion_r567799059



##########
File path: ql/src/test/results/clientpositive/llap/dynamic_semijoin_reduction_multicol.q.out
##########
@@ -355,7 +355,7 @@ Stage-1 HIVE COUNTERS:
    RECORDS_OUT_OPERATOR_TS_3: 800
    TOTAL_TABLE_ROWS_WRITTEN: 0
 Stage-1 LLAP IO COUNTERS:
-   CACHE_HIT_BYTES: 138344
+   CACHE_MISS_BYTES: 138342

Review comment:
       This was a bit more complex, CacheWriter.getSparseOrcIndexFromDenseDest was called with colId = 0 from SerDeEncodedDataReader -- causing IndexOutOfBounds and Cache not being populated.
   
   This is now addressed by https://github.com/apache/hive/pull/1823/commits/da1aa077716a65c2a02d850828b16cdeece1f574




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] jcamachor commented on a change in pull request #1823: HIVE-23553: Upgrade ORC version to 1.6.7

Posted by GitBox <gi...@apache.org>.
jcamachor commented on a change in pull request #1823:
URL: https://github.com/apache/hive/pull/1823#discussion_r564992785



##########
File path: ql/src/java/org/apache/hadoop/hive/ql/io/orc/OrcFileFormatProxy.java
##########
@@ -47,12 +47,14 @@ public SplitInfos applySargToMetadata(
     OrcTail orcTail = ReaderImpl.extractFileTail(fileMetadata);
     OrcProto.Footer footer = orcTail.getFooter();
     int stripeCount = footer.getStripesCount();
-    boolean writerUsedProlepticGregorian = footer.hasCalendar()
-        ? footer.getCalendar() == OrcProto.CalendarKind.PROLEPTIC_GREGORIAN
-        : OrcConf.PROLEPTIC_GREGORIAN_DEFAULT.getBoolean(conf);
+    // Always convert To PROLEPTIC_GREGORIAN

Review comment:
       Why is it OK to use proleptic calendar always here? Could we leave short explanation in the comment for when we need to revisit this code?

##########
File path: ql/src/java/org/apache/hadoop/hive/ql/io/orc/encoded/EncodedReaderImpl.java
##########
@@ -282,6 +280,56 @@ public String toString() {
     }
   }
 
+  public static boolean[] findPresentStreamsByColumn(

Review comment:
       Can we add javadoc for these public static utility methods? If they are used only in this class, should we change their visibility?

##########
File path: ql/src/test/results/clientpositive/llap/schema_evol_orc_nonvec_part_all_primitive.q.out
##########
@@ -687,11 +687,11 @@ POSTHOOK: Input: default@part_change_various_various_timestamp_n6
 POSTHOOK: Input: default@part_change_various_various_timestamp_n6@part=1
 #### A masked pattern was here ####
 insert_num	part	c1	c2	c3	c4	c5	c6	c7	c8	c9	c10	c11	c12	b
-101	1	1970-01-01 00:00:00.001	1969-12-31 23:59:59.872	NULL	1969-12-07 03:28:36.352	NULL	NULL	NULL	NULL	6229-06-28 02:54:28.970117179	6229-06-28 02:54:28.97011	6229-06-28 02:54:28.97011	1950-12-18 00:00:00	original
-102	1	1970-01-01 00:00:00	1970-01-01 00:00:00.127	1970-01-01 00:00:32.767	1970-01-25 20:31:23.647	NULL	NULL	NULL	NULL	5966-07-09 03:30:50.597	5966-07-09 03:30:50.597	5966-07-09 03:30:50.597	2049-12-18 00:00:00	original
+101	1	1970-01-01 00:00:01	1969-12-31 23:57:52	NULL	1901-12-13 20:45:52	NULL	NULL	NULL	NULL	6229-06-28 02:54:28.970117179	6229-06-28 02:54:28.97011	6229-06-28 02:54:28.97011	1950-12-18 00:00:00	original

Review comment:
       This shifting for timestamp values does not seem right (or at least I cannot make sense of it). Could you explain what is going on here? Some of the shifting is significant: For those, I remember there were some backwards incompatible changes in schema evolution in 1.6.x, it may be related to that? However, other shifting seems a bit more suspicious, e.g., 1 second, ~2 minutes?

##########
File path: ql/src/java/org/apache/hadoop/hive/ql/io/orc/encoded/EncodedTreeReaderFactory.java
##########
@@ -2585,6 +2590,7 @@ private static TreeReader getPrimitiveTreeReader(final int columnIndex,
             .setColumnEncoding(columnEncoding)
             .setVectors(vectors)
             .setContext(context)
+            .setIsInstant(columnType.getCategory()  == TypeDescription.Category.TIMESTAMP_INSTANT)

Review comment:
       As @mustafaiman  mentioned, I think this should be always false indeed: TIMESTAMP_INSTANT is equivalent to TIMESTAMP_WITH_LOCAL_TIME_ZONE type in Hive. AFAIK support to read/write timestamp with local time zone in ORC is not implemented yet.

##########
File path: ql/src/test/org/apache/hadoop/hive/ql/io/orc/TestOrcFile.java
##########
@@ -325,7 +326,7 @@ public void testReadFormat_0_11() throws Exception {
         + "binary,string1:string,middle:struct<list:array<struct<int1:int,"
         + "string1:string>>>,list:array<struct<int1:int,string1:string>>,"
         + "map:map<string,struct<int1:int,string1:string>>,ts:timestamp,"
-        + "decimal1:decimal(38,18)>", readerInspector.getTypeName());
+        + "decimal1:decimal(38,10)>", readerInspector.getTypeName());

Review comment:
       Change in decimal scale. Expected?

##########
File path: ql/src/test/results/clientpositive/llap/orc_file_dump.q.out
##########
@@ -249,15 +249,15 @@ Stripes:
       Entry 1: numHashFunctions: 4 bitCount: 6272 popCount: 182 loadFactor: 0.029 expectedFpp: 7.090246E-7
       Stripe level merge: numHashFunctions: 4 bitCount: 6272 popCount: 1772 loadFactor: 0.2825 expectedFpp: 0.0063713384
     Row group indices for column 9:
-      Entry 0: count: 1000 hasNull: false min: 2013-03-01 09:11:58.703 max: 2013-03-01 09:11:58.703 positions: 0,0,0,0,0,0
-      Entry 1: count: 49 hasNull: false min: 2013-03-01 09:11:58.703 max: 2013-03-01 09:11:58.703 positions: 0,7,488,0,1538,488
+      Entry 0: count: 1000 hasNull: false min: 2013-03-01 09:11:58.70307 max: 2013-03-01 09:11:58.703325 positions: 0,0,0,0,0,0
+      Entry 1: count: 49 hasNull: false min: 2013-03-01 09:11:58.703076 max: 2013-03-01 09:11:58.703325 positions: 0,7,488,0,1538,488
     Bloom filters for column 9:
       Entry 0: numHashFunctions: 4 bitCount: 6272 popCount: 4 loadFactor: 0.0006 expectedFpp: 1.6543056E-13
       Entry 1: numHashFunctions: 4 bitCount: 6272 popCount: 4 loadFactor: 0.0006 expectedFpp: 1.6543056E-13
       Stripe level merge: numHashFunctions: 4 bitCount: 6272 popCount: 4 loadFactor: 0.0006 expectedFpp: 1.6543056E-13
     Row group indices for column 10:
-      Entry 0: count: 1000 hasNull: false min: 8 max: 9994 sum: 5118211 positions: 0,0,0,0,0
-      Entry 1: count: 49 hasNull: false min: 248 max: 9490 sum: 246405 positions: 0,2194,0,4,488
+      Entry 0: count: 1000 hasNull: false min: 0.08 max: 99.94 sum: 51182.11 positions: 0,0,0,0,0

Review comment:
       min/max changed. Expected?

##########
File path: ql/src/test/results/clientpositive/llap/dynamic_semijoin_reduction_multicol.q.out
##########
@@ -355,7 +355,7 @@ Stage-1 HIVE COUNTERS:
    RECORDS_OUT_OPERATOR_TS_3: 800
    TOTAL_TABLE_ROWS_WRITTEN: 0
 Stage-1 LLAP IO COUNTERS:
-   CACHE_HIT_BYTES: 138344
+   CACHE_MISS_BYTES: 138342

Review comment:
       Expected? Was it a typo?

##########
File path: ql/src/test/results/clientpositive/llap/orc_file_dump.q.out
##########
@@ -111,7 +111,7 @@ Stripe Statistics:
     Column 6: count: 1049 hasNull: false bytesOnDisk: 3323 min: 0.02 max: 49.85 sum: 26286.349999999977
     Column 7: count: 1049 hasNull: false bytesOnDisk: 137 true: 526
     Column 8: count: 1049 hasNull: false bytesOnDisk: 3430 min:  max: zach zipper sum: 13443
-    Column 9: count: 1049 hasNull: false bytesOnDisk: 1802 min: 2013-03-01 09:11:58.703 max: 2013-03-01 09:11:58.703
+    Column 9: count: 1049 hasNull: false bytesOnDisk: 1802 min: 2013-03-01 09:11:58.70307 max: 2013-03-01 09:11:58.703325

Review comment:
       Timestamp precision change. Expected?

##########
File path: ql/src/test/results/clientpositive/llap/acid_bloom_filter_orc_file_dump.q.out
##########
@@ -76,7 +76,7 @@ PREHOOK: Input: default@bloomtest
 #### A masked pattern was here ####
 -- BEGIN ORC FILE DUMP --
 #### A masked pattern was here ####
-File Version: 0.12 with ORC_517
+File Version: 0.12 with ORC_14

Review comment:
       Is this change expected?

##########
File path: ql/src/test/results/clientpositive/tez/orc_merge12.q.out
##########
@@ -162,8 +162,8 @@ Stripe Statistics:
     Column 6: count: 9174 hasNull: true min: -16379.0 max: 9763215.5639 sum: 5.62236530305E7
     Column 7: count: 12288 hasNull: false min: 00020767-dd8f-4f4d-bd68-4b7be64b8e44 max: fffa3516-e219-4027-b0d3-72bb2e676c52 sum: 442368
     Column 8: count: 12288 hasNull: false min: 000976f7-7075-4f3f-a564-5a375fafcc101416a2b7-7f64-41b7-851f-97d15405037e max: fffd0642-5f01-48cd-8d97-3428faee49e9b39f2b4c-efdc-4e5f-9ab5-4aa5394cb156 sum: 884736
-    Column 9: count: 9173 hasNull: true min: 1969-12-31 15:59:30.929 max: 1969-12-31 16:00:30.808
-    Column 10: count: 9174 hasNull: true min: 1969-12-31 15:59:30.929 max: 1969-12-31 16:00:30.808
+    Column 9: count: 9173 hasNull: true min: 1969-12-31 15:59:30.929 max: 1969-12-31 16:00:30.808999999

Review comment:
       Change in timestamp precision in stats. Expected?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] pgaref commented on a change in pull request #1823: HIVE-23553: Upgrade ORC version to 1.6.7

Posted by GitBox <gi...@apache.org>.
pgaref commented on a change in pull request #1823:
URL: https://github.com/apache/hive/pull/1823#discussion_r566909094



##########
File path: ql/src/test/results/clientpositive/llap/orc_file_dump.q.out
##########
@@ -111,7 +111,7 @@ Stripe Statistics:
     Column 6: count: 1049 hasNull: false bytesOnDisk: 3323 min: 0.02 max: 49.85 sum: 26286.349999999977
     Column 7: count: 1049 hasNull: false bytesOnDisk: 137 true: 526
     Column 8: count: 1049 hasNull: false bytesOnDisk: 3430 min:  max: zach zipper sum: 13443
-    Column 9: count: 1049 hasNull: false bytesOnDisk: 1802 min: 2013-03-01 09:11:58.703 max: 2013-03-01 09:11:58.703
+    Column 9: count: 1049 hasNull: false bytesOnDisk: 1802 min: 2013-03-01 09:11:58.70307 max: 2013-03-01 09:11:58.703325

Review comment:
       Yes, this is expected as are now supporting Nanosecond precision for Timestamps: https://issues.apache.org/jira/browse/ORC-663




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] pgaref edited a comment on pull request #1823: HIVE-23553: Upgrade ORC version to 1.6.6

Posted by GitBox <gi...@apache.org>.
pgaref edited a comment on pull request #1823:
URL: https://github.com/apache/hive/pull/1823#issuecomment-754556917


   > note: I tend to use distinct version numbers because snapshots might get cached and not updated - but that was an issue with the old ptest infra; I guess the current setup will handle that better...
   > 
   > anything will do which can serve a web page - I wanted to add https://raw.githubusercontent.com/pgaref/mave-repo/main/ it for you - however that page returns error 400 for everything...
   
   Hey Zoltan -- I noticed the 400 myself for listings and non-existing files (GitHub policy?) but maven pulling seems to work, for example, check https://raw.githubusercontent.com/pgaref/mave-repo/main/org/apache/orc/orc-core/maven-metadata-local.xml
   
   To be on the safe side though I created this Repsy public repo: https://repo.repsy.io/mvn/pgaref/repository
   Feel free to add this instead :) 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] pgaref commented on pull request #1823: HIVE-23553: Upgrade ORC version to 1.6.7

Posted by GitBox <gi...@apache.org>.
pgaref commented on pull request #1823:
URL: https://github.com/apache/hive/pull/1823#issuecomment-771227835


   > All q.out files show data size increase for tables. Since most of them are consistently additional 4 bytes per row, that seems like not a bug. However, I found some irregular increases too like 16 bytes per row. Can you explain why data size increased so we can check the irregularities and make sure they are expected?
   
   Hey @mustafaiman -- the main size differences are on Timestamp columns where we now support nanosecond precision (using 2 extra variables for the lower and the upper precision as part of the stats -- see [ORC-611](https://issues.apache.org/jira/browse/ORC-611)).
   
   Other than that there are other changes that can also affect size, such as: Trimming StringStatistics minimum and maximum values as part of ORC-203  or List and Map column statistics that was recently added as part of ORC-398.
   
   Happy to check further if you have doubts about a particular query.
   
   
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] pgaref commented on pull request #1823: HIVE-23553: Upgrade ORC version to 1.6.6

Posted by GitBox <gi...@apache.org>.
pgaref commented on pull request #1823:
URL: https://github.com/apache/hive/pull/1823#issuecomment-754115075


   > its better to not use intermediate artifacts; snapshots doesnt play well with reproducible builds - if you just want to do a pre-flight check for the next orc version that could be done in a few different ways:
   > 
   > * you may publish some pirate artifacts somewhere and we could add that repo to the artfiactory (and use some distinctive version number like: 1.6.6-pg01)
   > * other way around could be to kill the rat for your runs by uising `-Drat.skip` (but...snapshots are not always updated so it will take the previous....)
   > * you may also add instructions to also build orc during before the main hive build - that way it will build whatever you want
   > 
   > last time when I had to do something similar I've choosen the first option...
   
   Thank you for the detailed explanation @kgyrtkirk ! I did try the first option as well locally (that can be useful in the future as well).
   For a private repo I used a GitHub repo as the simplest solution -- used 1.6.7-SNAPSHOT for testing but can be replaced with any distinct version number for the actual build).
   Shall we add **https://raw.githubusercontent.com/pgaref/mave-repo/main/** to the Jenkins artifactory for this?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org