You are viewing a plain text version of this content. The canonical link for it is here.

Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2021/05/19 12:57:52 UTC

[GitHub] [iceberg] szlta opened a new pull request #2613: Hive: Vectorized ORC reads for Hive

szlta opened a new pull request #2613:
URL: https://github.com/apache/iceberg/pull/2613


   Currently Hive relies on org.apache.iceberg.mr.mapreduce.IcebergInputFormat for reading data from Iceberg. This in turn relies on the internal reader implementations for the various file formats, such as GenericOrcReader(s). Unfortunately for Hive, the whole read pipeline is record based, whereas GenericRecord instances hold the information on which Hive has to use the Iceberg object inspectors to retrieve the encoded data from.
   
   From a design point of view I see 3 layers where during reading data flows upwards:
   - Execution engine: Hive
   - Table format: Iceberg
   - File format: ORC/Parquet/Avro
   
   The current design however combines the two bottom layers, and although Hive already has support for vectorized execution, and already has implementations for vectorized reading from various file formats; with Iceberg in the picture it can't have the advantages of vectorization.
   
   In this change I propose to swap out the bottom layer with the implementations Hive already has, so in the case of ORC: with VectorizedOrcInputFormat. This produces VectorizedRowBatch instances which Hive expects as its vectorized format. This could be used as the 'Hive' in memory data model, something IcebergInputFormat already distinguishes upon.
   
   One big obstacle is that Iceberg depends on a type of ORC that has the nohive classifier. This means that classes that are actually coming from Hive/storage-api are shaded inside ORC with an ORC classname. In order to revert this while also causing the minimal disruption to other parts of Iceberg I propose to create an intermediary module for iceberg-hive3 to depend on (rather then depending on iceberg-orc and iceberg-data)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] pvary commented on a change in pull request #2613: Hive: Vectorized ORC reads for Hive

Posted by GitBox <gi...@apache.org>.

pvary commented on a change in pull request #2613:
URL: https://github.com/apache/iceberg/pull/2613#discussion_r636015324



##########
File path: mr/src/main/java/org/apache/iceberg/mr/hive/HiveIcebergStorageHandler.java
##########
@@ -126,6 +127,10 @@ public void configureJobConf(TableDesc tableDesc, JobConf jobConf) {
         jobConf.set(InputFormatConfig.TABLE_CATALOG_PREFIX + tableName, catalogName);
       }
     }
+    if (HiveConf.getBoolVar(jobConf, HiveConf.ConfVars.HIVE_VECTORIZATION_ENABLED)) {

Review comment:
       nit: newline




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] rymurr commented on a change in pull request #2613: Hive: Vectorized ORC reads for Hive

Posted by GitBox <gi...@apache.org>.

rymurr commented on a change in pull request #2613:
URL: https://github.com/apache/iceberg/pull/2613#discussion_r640465542



##########
File path: build.gradle
##########
@@ -551,6 +551,60 @@ project(':iceberg-mr') {
 }
 
 if (jdkVersion == '8') {
+  // The purpose of this module is to re-shade org.apache.orc.storage to the original org.apache.hadoop.hive package
+  // name. This is to be used by Hive3 for features including e.g. vectorization.
+  project(':iceberg-hive3-orc-bundle') {

Review comment:
       how does this compare to the original hive bundle? Do we need to support both? Apologies if thats a dumb question, I am just not 100% sure I understand what will be in this resulting bundle and who will use it.

##########
File path: mr/src/main/java/org/apache/hadoop/hive/ql/exec/vector/VectorizedSupport.java
##########
@@ -0,0 +1,45 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.hadoop.hive.ql.exec.vector;
+
+import java.util.HashMap;
+import java.util.Map;
+
+/**
+ * Copied here from Hive for compatibility
+ */
+@SuppressWarnings({"Indentation", "WhitespaceAfter", "VisibilityModifier"})

Review comment:
       Same as previous comment

##########
File path: hive3/src/main/java/org/apache/iceberg/mr/hive/vector/CompatibilityHiveVectorUtils.java
##########
@@ -0,0 +1,300 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.iceberg.mr.hive.vector;
+
+import java.sql.Date;
+import java.sql.Timestamp;
+import org.apache.commons.lang3.StringUtils;
+import org.apache.hadoop.hive.common.type.HiveDecimal;
+import org.apache.hadoop.hive.common.type.HiveIntervalDayTime;
+import org.apache.hadoop.hive.common.type.HiveIntervalYearMonth;
+import org.apache.hadoop.hive.ql.exec.Utilities;
+import org.apache.hadoop.hive.ql.exec.tez.DagUtils;
+import org.apache.hadoop.hive.ql.exec.vector.BytesColumnVector;
+import org.apache.hadoop.hive.ql.exec.vector.ColumnVector;
+import org.apache.hadoop.hive.ql.exec.vector.DecimalColumnVector;
+import org.apache.hadoop.hive.ql.exec.vector.DoubleColumnVector;
+import org.apache.hadoop.hive.ql.exec.vector.IntervalDayTimeColumnVector;
+import org.apache.hadoop.hive.ql.exec.vector.LongColumnVector;
+import org.apache.hadoop.hive.ql.exec.vector.TimestampColumnVector;
+import org.apache.hadoop.hive.ql.plan.BaseWork;
+import org.apache.hadoop.hive.ql.plan.MapWork;
+import org.apache.hadoop.hive.serde2.io.DateWritable;
+import org.apache.hadoop.hive.serde2.typeinfo.PrimitiveTypeInfo;
+import org.apache.hadoop.hive.serde2.typeinfo.TypeInfo;
+import org.apache.hadoop.mapred.JobConf;
+import org.apache.iceberg.relocated.com.google.common.collect.Lists;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+/**
+ * Contains ported code snippets from later Hive sources. We should get rid of this class as soon as Hive 4 is released
+ * and Iceberg makes a dependency to that version.
+ */
+public class CompatibilityHiveVectorUtils {
+
+  private static final Logger LOG = LoggerFactory.getLogger(CompatibilityHiveVectorUtils.class);
+
+  private CompatibilityHiveVectorUtils() {
+
+  }
+
+  /**
+   * Returns serialized mapwork instance from a job conf - ported from Hive source code LlapHiveUtils#findMapWork
+   *
+   * @param job JobConf instance
+   * @return
+   */
+  @SuppressWarnings("Slf4jConstantLogMessage")
+  public static MapWork findMapWork(JobConf job) {
+    String inputName = job.get(Utilities.INPUT_NAME, null);
+    if (LOG.isDebugEnabled()) {
+      LOG.debug("Initializing for input " + inputName);
+    }
+    String prefixes = job.get(DagUtils.TEZ_MERGE_WORK_FILE_PREFIXES);
+    if (prefixes != null && !StringUtils.isBlank(prefixes)) {
+      // Currently SMB is broken, so we cannot check if it's  compatible with IO elevator.
+      // So, we don't use the below code that would get the correct MapWork. See HIVE-16985.
+      return null;
+    }
+
+    BaseWork work = null;
+    // HIVE-16985: try to find the fake merge work for SMB join, that is really another MapWork.
+    if (inputName != null) {
+      if (prefixes == null || !Lists.newArrayList(prefixes.split(",")).contains(inputName)) {
+        inputName = null;
+      }
+    }
+    if (inputName != null) {
+      work = Utilities.getMergeWork(job, inputName);
+    }
+
+    if (!(work instanceof MapWork)) {
+      work = Utilities.getMapWork(job);
+    }
+    return (MapWork) work;
+  }
+
+
+  /**
+   * Ported from Hive source code VectorizedRowBatchCtx#addPartitionColsToBatch
+   *
+   * @param col ColumnVector to write the partition value into
+   * @param value partition value
+   * @param partitionColumnName partition key
+   * @param rowColumnTypeInfo column type description
+   */
+  @SuppressWarnings({"AvoidNestedBlocks", "FallThrough", "MethodLength", "CyclomaticComplexity", "Indentation"})

Review comment:
       Any chance that instead of porting ropey code into iceberg we could write this better and upstream the change?

##########
File path: hive3/src/main/java/org/apache/iceberg/mr/hive/vector/CompatibilityHiveVectorUtils.java
##########
@@ -0,0 +1,300 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.iceberg.mr.hive.vector;
+
+import java.sql.Date;
+import java.sql.Timestamp;
+import org.apache.commons.lang3.StringUtils;
+import org.apache.hadoop.hive.common.type.HiveDecimal;
+import org.apache.hadoop.hive.common.type.HiveIntervalDayTime;
+import org.apache.hadoop.hive.common.type.HiveIntervalYearMonth;
+import org.apache.hadoop.hive.ql.exec.Utilities;
+import org.apache.hadoop.hive.ql.exec.tez.DagUtils;
+import org.apache.hadoop.hive.ql.exec.vector.BytesColumnVector;
+import org.apache.hadoop.hive.ql.exec.vector.ColumnVector;
+import org.apache.hadoop.hive.ql.exec.vector.DecimalColumnVector;
+import org.apache.hadoop.hive.ql.exec.vector.DoubleColumnVector;
+import org.apache.hadoop.hive.ql.exec.vector.IntervalDayTimeColumnVector;
+import org.apache.hadoop.hive.ql.exec.vector.LongColumnVector;
+import org.apache.hadoop.hive.ql.exec.vector.TimestampColumnVector;
+import org.apache.hadoop.hive.ql.plan.BaseWork;
+import org.apache.hadoop.hive.ql.plan.MapWork;
+import org.apache.hadoop.hive.serde2.io.DateWritable;
+import org.apache.hadoop.hive.serde2.typeinfo.PrimitiveTypeInfo;
+import org.apache.hadoop.hive.serde2.typeinfo.TypeInfo;
+import org.apache.hadoop.mapred.JobConf;
+import org.apache.iceberg.relocated.com.google.common.collect.Lists;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+/**
+ * Contains ported code snippets from later Hive sources. We should get rid of this class as soon as Hive 4 is released
+ * and Iceberg makes a dependency to that version.
+ */
+public class CompatibilityHiveVectorUtils {
+
+  private static final Logger LOG = LoggerFactory.getLogger(CompatibilityHiveVectorUtils.class);
+
+  private CompatibilityHiveVectorUtils() {
+
+  }
+
+  /**
+   * Returns serialized mapwork instance from a job conf - ported from Hive source code LlapHiveUtils#findMapWork
+   *
+   * @param job JobConf instance
+   * @return
+   */
+  @SuppressWarnings("Slf4jConstantLogMessage")

Review comment:
       Why suppress the warning instead of fixing the issue?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] pvary commented on a change in pull request #2613: Hive: Vectorized ORC reads for Hive

Posted by GitBox <gi...@apache.org>.

pvary commented on a change in pull request #2613:
URL: https://github.com/apache/iceberg/pull/2613#discussion_r636015915



##########
File path: mr/src/main/java/org/apache/iceberg/mr/mapred/AbstractMapredIcebergRecordReader.java
##########
@@ -0,0 +1,70 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.mr.mapred;
+
+import java.io.IOException;
+import org.apache.hadoop.mapred.JobConf;
+import org.apache.hadoop.mapred.RecordReader;
+import org.apache.hadoop.mapred.Reporter;
+import org.apache.hadoop.mapreduce.TaskAttemptContext;
+import org.apache.iceberg.mr.mapreduce.IcebergSplit;
+
+@SuppressWarnings("checkstyle:VisibilityModifier")

Review comment:
       nit: Why was this needed?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] szlta commented on a change in pull request #2613: Hive: Vectorized ORC reads for Hive

Posted by GitBox <gi...@apache.org>.

szlta commented on a change in pull request #2613:
URL: https://github.com/apache/iceberg/pull/2613#discussion_r636074965



##########
File path: mr/src/main/java/org/apache/iceberg/mr/mapred/AbstractMapredIcebergRecordReader.java
##########
@@ -0,0 +1,70 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.mr.mapred;
+
+import java.io.IOException;
+import org.apache.hadoop.mapred.JobConf;
+import org.apache.hadoop.mapred.RecordReader;
+import org.apache.hadoop.mapred.Reporter;
+import org.apache.hadoop.mapreduce.TaskAttemptContext;
+import org.apache.iceberg.mr.mapreduce.IcebergSplit;
+
+@SuppressWarnings("checkstyle:VisibilityModifier")

Review comment:
       So that innerReader can be accessed in inheriting classes. IMHO checkstyle is too strict about using protected access modifier for this case...




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] pvary commented on pull request #2613: Hive: Vectorized ORC reads for Hive

Posted by GitBox <gi...@apache.org>.

pvary commented on pull request #2613:
URL: https://github.com/apache/iceberg/pull/2613#issuecomment-853816726


   Thanks for the review @rymurr!
   
   @rdblue, @RussellSpitzer: Would you like to take a look before we merge this change? If you have time, I would really appreciate it!
   
   Thanks,
   Peter


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] szlta commented on a change in pull request #2613: Hive: Vectorized ORC reads for Hive

Posted by GitBox <gi...@apache.org>.

szlta commented on a change in pull request #2613:
URL: https://github.com/apache/iceberg/pull/2613#discussion_r636072594



##########
File path: mr/src/main/java/org/apache/iceberg/mr/mapreduce/IcebergInputFormat.java
##########
@@ -166,14 +168,32 @@ private static void checkResiduals(CombinedScanTask task) {
   }
 
   private static final class IcebergRecordReader<T> extends RecordReader<Void, T> {
+
+    private static final String HIVE_VECTORIZED_READER_CLASS = "org.apache.iceberg.mr.hive.vector.HiveVectorizedReader";
+    private static final DynMethods.StaticMethod HIVE_VECTORIZED_READER_BUILDER;
+
+    static {
+      if (MetastoreUtil.hive3PresentOnClasspath()) {
+        HIVE_VECTORIZED_READER_BUILDER = DynMethods.builder("reader")

Review comment:
       This is the HiveVectorizedReader utility class that belongs to Iceberg codebase, it is located in iceberg-hive3 module. Vectorization for Hive is only supported in iceberg-hive3 module, so here in the MR module we need to dynamically load this class (IcebergInputFormat is reused in iceberg-hive3 too)




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] szlta commented on pull request #2613: Hive: Vectorized ORC reads for Hive

Posted by GitBox <gi...@apache.org>.

szlta commented on pull request #2613:
URL: https://github.com/apache/iceberg/pull/2613#issuecomment-855829298


   Thanks for the reviews and the commit!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] pvary commented on a change in pull request #2613: Hive: Vectorized ORC reads for Hive

Posted by GitBox <gi...@apache.org>.

pvary commented on a change in pull request #2613:
URL: https://github.com/apache/iceberg/pull/2613#discussion_r636013819



##########
File path: hive3/src/main/java/org/apache/iceberg/mr/hive/vector/VectorizedRowBatchIterator.java
##########
@@ -0,0 +1,90 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.iceberg.mr.hive.vector;
+
+import java.io.IOException;
+import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;
+import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatchCtx;
+import org.apache.hadoop.io.NullWritable;
+import org.apache.hadoop.mapred.JobConf;
+import org.apache.hadoop.mapred.RecordReader;
+import org.apache.iceberg.io.CloseableIterator;
+
+/**
+ * Iterator wrapper around Hive's VectorizedRowBatch producer (MRv1 implementing) record readers.
+ */
+public final class VectorizedRowBatchIterator implements CloseableIterator<VectorizedRowBatch> {
+
+  private final RecordReader<NullWritable, VectorizedRowBatch> recordReader;
+  private final NullWritable key;
+  private final VectorizedRowBatch batch;
+  private final VectorizedRowBatchCtx vrbCtx;
+  private final int[] partitionColIndices;
+  private final Object[] partitionValues;
+  private boolean advanced = false;
+
+  VectorizedRowBatchIterator(RecordReader<NullWritable, VectorizedRowBatch> recordReader, JobConf job,
+                             int[] partitionColIndices, Object[] partitionValues) {
+    this.recordReader = recordReader;
+    this.key = recordReader.createKey();
+    this.batch = recordReader.createValue();
+    this.vrbCtx = CompatibilityHiveVectorUtils.findMapWork(job).getVectorizedRowBatchCtx();
+    this.partitionColIndices = partitionColIndices;
+    this.partitionValues = partitionValues;
+  }
+
+  @Override
+  public void close() throws IOException {
+    this.recordReader.close();
+  }
+
+  private void advance() {
+    if (!advanced) {
+      try {
+        if (!recordReader.next(key, batch)) {
+          batch.size = 0;
+        }

Review comment:
       nit: newline




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] szlta commented on pull request #2613: Hive: Vectorized ORC reads for Hive

Posted by GitBox <gi...@apache.org>.

szlta commented on pull request #2613:
URL: https://github.com/apache/iceberg/pull/2613#issuecomment-847930235


   cc: @rdblue @rymurr 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] szlta commented on a change in pull request #2613: Hive: Vectorized ORC reads for Hive

Posted by GitBox <gi...@apache.org>.

szlta commented on a change in pull request #2613:
URL: https://github.com/apache/iceberg/pull/2613#discussion_r642430232



##########
File path: build.gradle
##########
@@ -551,6 +551,60 @@ project(':iceberg-mr') {
 }
 
 if (jdkVersion == '8') {
+  // The purpose of this module is to re-shade org.apache.orc.storage to the original org.apache.hadoop.hive package
+  // name. This is to be used by Hive3 for features including e.g. vectorization.
+  project(':iceberg-hive3-orc-bundle') {

Review comment:
       Thanks for taking a look on this @rymurr!
   Originally I wanted to stick to not modifying any code snippets that needs to be copied from Hive to Iceberg as it's meant to be temporary. But.. since I don't see Hive4 getting released in the foreseeable future yet I have now rather amended these as per checkstyle.
   
   About this hive3-orc-bundle: it's something that only hive3 depends on, and the soul purpose of it is to re-shade hive-storage-api classes that are currently being referred to as org.apache.orc.storage... within Iceberg.
   This bundle will produce iceberg-data, and iceberg-orc classes with the original package name references to storage-api classes which is required runtime, where Hive3 code (from Hive codebase) is already linked with the original class names.
   The reason this is not a dependency of iceberg-mr module, is that we won't support vectorization for the Hive2 integration of Iceberg




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] pvary commented on pull request #2613: Hive: Vectorized ORC reads for Hive

Posted by GitBox <gi...@apache.org>.

pvary commented on pull request #2613:
URL: https://github.com/apache/iceberg/pull/2613#issuecomment-853816726


   Thanks for the review @rymurr!
   
   @rdblue, @RussellSpitzer: Would you like to take a look before we merge this change? If you have time, I would really appreciate it!
   
   Thanks,
   Peter


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] pvary commented on a change in pull request #2613: Hive: Vectorized ORC reads for Hive

Posted by GitBox <gi...@apache.org>.

pvary commented on a change in pull request #2613:
URL: https://github.com/apache/iceberg/pull/2613#discussion_r636016297



##########
File path: mr/src/main/java/org/apache/iceberg/mr/mapred/MapredIcebergInputFormat.java
##########
@@ -66,9 +67,9 @@ public MapredIcebergInputFormat() {
   @Override
   public InputSplit[] getSplits(JobConf job, int numSplits) throws IOException {
     return innerInputFormat.getSplits(newJobContext(job))
-                           .stream()
-                           .map(InputSplit.class::cast)
-                           .toArray(InputSplit[]::new);
+            .stream()

Review comment:
       nit: Formatting only change?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] pvary commented on a change in pull request #2613: Hive: Vectorized ORC reads for Hive

Posted by GitBox <gi...@apache.org>.

pvary commented on a change in pull request #2613:
URL: https://github.com/apache/iceberg/pull/2613#discussion_r636014090



##########
File path: hive3/src/main/java/org/apache/iceberg/mr/hive/vector/HiveVectorizedReader.java
##########
@@ -0,0 +1,137 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.mr.hive.vector;
+
+import java.io.IOException;
+import java.util.List;
+import java.util.Map;
+import org.apache.commons.lang3.ArrayUtils;
+import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;
+import org.apache.hadoop.hive.ql.io.orc.OrcSplit;
+import org.apache.hadoop.hive.ql.io.orc.VectorizedOrcInputFormat;
+import org.apache.hadoop.hive.serde2.ColumnProjectionUtils;
+import org.apache.hadoop.io.NullWritable;
+import org.apache.hadoop.mapred.InputSplit;
+import org.apache.hadoop.mapred.JobConf;
+import org.apache.hadoop.mapred.RecordReader;
+import org.apache.hadoop.mapred.Reporter;
+import org.apache.hadoop.mapreduce.TaskAttemptContext;
+import org.apache.iceberg.FileFormat;
+import org.apache.iceberg.FileScanTask;
+import org.apache.iceberg.PartitionField;
+import org.apache.iceberg.PartitionSpec;
+import org.apache.iceberg.io.CloseableIterable;
+import org.apache.iceberg.io.CloseableIterator;
+import org.apache.iceberg.io.InputFile;
+import org.apache.iceberg.mr.mapred.MapredIcebergInputFormat;
+import org.apache.iceberg.relocated.com.google.common.collect.Lists;
+
+/**
+ * Utility class to create vectorized readers for Hive.
+ * As per the file format of the task, it will create a matching vectorized record reader that is already implemented
+ * in Hive. It will also do some tweaks on the produced vectors for Iceberg's use e.g. partition column handling.
+ */
+public class HiveVectorizedReader {
+
+
+  private HiveVectorizedReader() {
+
+  }
+
+  public static <D> CloseableIterable<D> reader(InputFile inputFile, FileScanTask task, Map<Integer, ?> idToConstant,
+      TaskAttemptContext context) {
+    JobConf job = (JobConf) context.getConfiguration();
+    Path path = new Path(inputFile.location());
+    FileFormat format = task.file().format();
+    Reporter reporter = ((MapredIcebergInputFormat.CompatibilityTaskAttemptContextImpl) context).getLegacyReporter();
+
+    // Hive by default requires partition columns to be read too. This is not required for identity partition
+    // columns, as we will add this as constants later.
+
+    int[] partitionColIndices = null;
+    Object[] partitionValues = null;
+    PartitionSpec partitionSpec = task.spec();
+    if (!partitionSpec.isUnpartitioned()) {
+      List<Integer> readColumnIds = ColumnProjectionUtils.getReadColumnIDs(job);
+
+      List<PartitionField> fields = partitionSpec.fields();
+      List<Integer> partitionColIndicesList = Lists.newLinkedList();
+      List<Object> partitionValuesList = Lists.newLinkedList();
+
+      for (PartitionField field : fields) {
+        if (field.transform().isIdentity()) {
+          // Skip reading identity partition columns from source file...
+          int hiveColIndex = field.sourceId() - 1;
+          readColumnIds.remove((Integer) hiveColIndex);
+
+          // ...and use the corresponding constant value instead
+          partitionColIndicesList.add(hiveColIndex);
+          partitionValuesList.add(idToConstant.get(field.sourceId()));
+
+        }
+      }
+      partitionColIndices = ArrayUtils.toPrimitive(partitionColIndicesList.toArray(new Integer[0]));

Review comment:
       nit: newline




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] pvary commented on pull request #2613: Hive: Vectorized ORC reads for Hive

Posted by GitBox <gi...@apache.org>.

pvary commented on pull request #2613:
URL: https://github.com/apache/iceberg/pull/2613#issuecomment-855742959


   Thanks @szlta for the PR and @rymurr for the review!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] pvary commented on a change in pull request #2613: Hive: Vectorized ORC reads for Hive

Posted by GitBox <gi...@apache.org>.

pvary commented on a change in pull request #2613:
URL: https://github.com/apache/iceberg/pull/2613#discussion_r636017257



##########
File path: mr/src/main/java/org/apache/iceberg/mr/mapreduce/IcebergInputFormat.java
##########
@@ -166,14 +168,32 @@ private static void checkResiduals(CombinedScanTask task) {
   }
 
   private static final class IcebergRecordReader<T> extends RecordReader<Void, T> {
+
+    private static final String HIVE_VECTORIZED_READER_CLASS = "org.apache.iceberg.mr.hive.vector.HiveVectorizedReader";
+    private static final DynMethods.StaticMethod HIVE_VECTORIZED_READER_BUILDER;
+
+    static {
+      if (MetastoreUtil.hive3PresentOnClasspath()) {
+        HIVE_VECTORIZED_READER_BUILDER = DynMethods.builder("reader")

Review comment:
       Are these classes are missing in Hive2?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] pvary commented on a change in pull request #2613: Hive: Vectorized ORC reads for Hive

Posted by GitBox <gi...@apache.org>.

pvary commented on a change in pull request #2613:
URL: https://github.com/apache/iceberg/pull/2613#discussion_r636013473



##########
File path: mr/src/main/java/org/apache/iceberg/mr/hive/HiveIcebergInputFormat.java
##########
@@ -70,20 +95,40 @@
 
     String location = job.get(InputFormatConfig.TABLE_LOCATION);
     return Arrays.stream(super.getSplits(job, numSplits))
-                 .map(split -> new HiveIcebergSplit((IcebergSplit) split, location))
-                 .toArray(InputSplit[]::new);
+            .map(split -> new HiveIcebergSplit((IcebergSplit) split, location))
+            .toArray(InputSplit[]::new);
   }
 
   @Override
   public RecordReader<Void, Container<Record>> getRecordReader(InputSplit split, JobConf job,
                                                                Reporter reporter) throws IOException {
     String[] selectedColumns = ColumnProjectionUtils.getReadColumnNames(job);
     job.setStrings(InputFormatConfig.SELECTED_COLUMNS, selectedColumns);
-    return super.getRecordReader(split, job, reporter);
+
+    if (HiveConf.getBoolVar(job, HiveConf.ConfVars.HIVE_VECTORIZATION_ENABLED)) {
+      assert MetastoreUtil.hive3PresentOnClasspath();

Review comment:
       Should we check this on runtime as well?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] pvary merged pull request #2613: Hive: Vectorized ORC reads for Hive

Posted by GitBox <gi...@apache.org>.

pvary merged pull request #2613:
URL: https://github.com/apache/iceberg/pull/2613


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] pvary commented on a change in pull request #2613: Hive: Vectorized ORC reads for Hive

Posted by GitBox <gi...@apache.org>.

pvary commented on a change in pull request #2613:
URL: https://github.com/apache/iceberg/pull/2613#discussion_r636013620



##########
File path: mr/src/main/java/org/apache/iceberg/mr/hive/HiveIcebergInputFormat.java
##########
@@ -70,20 +95,40 @@
 
     String location = job.get(InputFormatConfig.TABLE_LOCATION);
     return Arrays.stream(super.getSplits(job, numSplits))
-                 .map(split -> new HiveIcebergSplit((IcebergSplit) split, location))
-                 .toArray(InputSplit[]::new);
+            .map(split -> new HiveIcebergSplit((IcebergSplit) split, location))

Review comment:
       Is this formatting only change?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org