You are viewing a plain text version of this content. The canonical link for it is here.

Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2020/04/30 07:02:26 UTC

[GitHub] [incubator-iceberg] rdsr opened a new pull request #989: [WIP] Orc nested partition support

rdsr opened a new pull request #989:
URL: https://github.com/apache/incubator-iceberg/pull/989


   Fixes #897
   
   The  following changes are made
   1. Added a OrcTypeWithSchemaVisitor
   2. Added support for nested identity partition similar to #896 
   3. Refactored a lot of data reader functions which can now be shared across Spark and Iceberg Generics for ORC
   4. SparkOrcReader is simplified in the process. I've moved away from  Spark `UnsafeRow` and used GenericInternalRow instead similar to Spark Avro and Parquet readers. The downside is that the performance had reduced when compared with #900 . I'll add some performance numbers later. The reason could be that I'm not reusing GenericInternalRow instances. I'll see if I can fix that
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #989: [WIP] Orc nested Identity partition support

Posted by GitBox <gi...@apache.org>.

rdblue commented on a change in pull request #989:
URL: https://github.com/apache/incubator-iceberg/pull/989#discussion_r421676500



##########
File path: spark/src/test/java/org/apache/iceberg/spark/data/TestSparkOrcReader.java
##########
@@ -71,7 +71,7 @@ private void writeAndValidateRecords(Schema schema, Iterable<InternalRow> expect
 
     try (CloseableIterable<InternalRow> reader = ORC.read(Files.localInput(testFile))
         .project(schema)
-        .createReaderFunc(SparkOrcReader::new)

Review comment:
       I think this should test with and without container reuse if that is implemented in this PR. Probably just make this test parameterized.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [incubator-iceberg] rdsr commented on a change in pull request #989: [WIP] Orc nested partition support

Posted by GitBox <gi...@apache.org>.

rdsr commented on a change in pull request #989:
URL: https://github.com/apache/incubator-iceberg/pull/989#discussion_r417797286



##########
File path: orc/src/main/java/org/apache/iceberg/orc/OrcValReader.java
##########
@@ -0,0 +1,36 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.orc;
+
+import org.apache.orc.storage.ql.exec.vector.ColumnVector;
+
+
+public interface OrcValReader<T> {

Review comment:
       I want to rename this to OrcValueReader, but an interface with that name already exists. I can rename the other interface OrcRowReader instead. If folks are OK doing this as part of this RB, let me know.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [incubator-iceberg] rdblue commented on pull request #989: [WIP] ORC nested Identity partition support

Posted by GitBox <gi...@apache.org>.

rdblue commented on pull request #989:
URL: https://github.com/apache/incubator-iceberg/pull/989#issuecomment-632767571


   This looks good to me, except for the conflicts. Can you rebase and we can commit it?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [incubator-iceberg] rdsr commented on a change in pull request #989: [WIP] ORC nested Identity partition support

Posted by GitBox <gi...@apache.org>.

rdsr commented on a change in pull request #989:
URL: https://github.com/apache/incubator-iceberg/pull/989#discussion_r428298714



##########
File path: spark/src/main/java/org/apache/iceberg/spark/data/SparkOrcValueReaders.java
##########
@@ -0,0 +1,206 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.spark.data;
+
+import com.google.common.collect.Lists;
+import java.math.BigDecimal;
+import java.util.List;
+import java.util.Map;
+import org.apache.iceberg.orc.OrcValReader;
+import org.apache.iceberg.orc.OrcValueReaders;
+import org.apache.iceberg.types.Types;
+import org.apache.orc.storage.ql.exec.vector.BytesColumnVector;
+import org.apache.orc.storage.ql.exec.vector.ColumnVector;
+import org.apache.orc.storage.ql.exec.vector.DecimalColumnVector;
+import org.apache.orc.storage.ql.exec.vector.ListColumnVector;
+import org.apache.orc.storage.ql.exec.vector.MapColumnVector;
+import org.apache.orc.storage.ql.exec.vector.TimestampColumnVector;
+import org.apache.orc.storage.serde2.io.HiveDecimalWritable;
+import org.apache.spark.sql.catalyst.InternalRow;
+import org.apache.spark.sql.catalyst.expressions.GenericInternalRow;
+import org.apache.spark.sql.catalyst.util.ArrayBasedMapData;
+import org.apache.spark.sql.catalyst.util.ArrayData;
+import org.apache.spark.sql.catalyst.util.GenericArrayData;
+import org.apache.spark.sql.catalyst.util.MapData;
+import org.apache.spark.sql.types.Decimal;
+import org.apache.spark.unsafe.types.UTF8String;
+
+
+class SparkOrcValueReaders {
+  private SparkOrcValueReaders() {
+  }
+
+  static OrcValReader<UTF8String> strings() {
+    return StringReader.INSTANCE;
+  }
+
+  static OrcValReader<?> timestampTzs() {
+    return TimestampTzReader.INSTANCE;
+  }
+
+  static OrcValReader<?> struct(
+      List<OrcValReader<?>> readers, Types.StructType struct, Map<Integer, ?> idToConstant) {
+    return new StructReader(readers, struct, idToConstant);
+  }
+
+  static OrcValReader<?> array(OrcValReader<?> elementReader) {
+    return new ArrayReader(elementReader);
+  }
+
+  static OrcValReader<?> map(OrcValReader<?> keyReader, OrcValReader<?> valueReader) {
+    return new MapReader(keyReader, valueReader);
+  }
+
+  private static class ArrayReader implements OrcValReader<ArrayData> {
+    private final OrcValReader<?> elementReader;
+    private final List<Object> reusedList = Lists.newArrayList();
+
+    private ArrayReader(OrcValReader<?> elementReader) {
+      this.elementReader = elementReader;
+    }
+
+    @Override
+    public ArrayData nonNullRead(ColumnVector vector, int row) {
+      reusedList.clear();
+      ListColumnVector listVector = (ListColumnVector) vector;
+      int offset = (int) listVector.offsets[row];
+      int length = (int) listVector.lengths[row];
+      for (int c = 0; c < length; ++c) {
+        reusedList.add(elementReader.read(listVector.child, offset + c));
+      }
+      return new GenericArrayData(reusedList.toArray());
+    }
+  }
+
+  private static class MapReader implements OrcValReader<MapData> {
+    private final OrcValReader<?> keyReader;
+    private final OrcValReader<?> valueReader;
+
+    private final List<Object> reusedKeyList = Lists.newArrayList();
+    private final List<Object> reusedValueList = Lists.newArrayList();
+
+    private MapReader(OrcValReader<?> keyReader, OrcValReader<?> valueReader) {
+      this.keyReader = keyReader;
+      this.valueReader = valueReader;
+    }
+
+    @Override
+    public MapData nonNullRead(ColumnVector vector, int row) {
+      reusedKeyList.clear();
+      reusedValueList.clear();
+      MapColumnVector mapVector = (MapColumnVector) vector;
+      int offset = (int) mapVector.offsets[row];
+      long length = mapVector.lengths[row];
+      for (int c = 0; c < length; c++) {
+        reusedKeyList.add(keyReader.read(mapVector.keys, offset + c));
+        reusedValueList.add(valueReader.read(mapVector.values, offset + c));
+      }
+
+      return new ArrayBasedMapData(
+          new GenericArrayData(reusedKeyList.toArray()),
+          new GenericArrayData(reusedValueList.toArray()));
+    }
+  }
+
+  static class StructReader extends OrcValueReaders.StructReader<InternalRow> {
+    private final int numFields;
+    private final InternalRow internalRow;
+
+    protected StructReader(List<OrcValReader<?>> readers, Types.StructType struct, Map<Integer, ?> idToConstant) {
+      super(readers, struct, idToConstant);
+      this.numFields = readers.size();
+      internalRow = createOrReuse();
+    }
+
+    @Override
+    protected InternalRow createOrReuse() {
+      return new GenericInternalRow(numFields);

Review comment:
       @rdblue . Not sure I follow  this..
   > The Spark/Parquet reader doesn't currently reuse containers.
   
   I did the JMH testing for ORC before and after my patch 
   
   




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [incubator-iceberg] rdsr commented on a change in pull request #989: [WIP] Orc nested partition support

Posted by GitBox <gi...@apache.org>.

rdsr commented on a change in pull request #989:
URL: https://github.com/apache/incubator-iceberg/pull/989#discussion_r417798193



##########
File path: spark/src/main/java/org/apache/iceberg/spark/data/SparkOrcValueReaders.java
##########
@@ -0,0 +1,206 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.spark.data;
+
+import com.google.common.collect.Lists;
+import java.math.BigDecimal;
+import java.util.List;
+import java.util.Map;
+import org.apache.iceberg.orc.OrcValReader;
+import org.apache.iceberg.orc.OrcValueReaders;
+import org.apache.iceberg.types.Types;
+import org.apache.orc.storage.ql.exec.vector.BytesColumnVector;
+import org.apache.orc.storage.ql.exec.vector.ColumnVector;
+import org.apache.orc.storage.ql.exec.vector.DecimalColumnVector;
+import org.apache.orc.storage.ql.exec.vector.ListColumnVector;
+import org.apache.orc.storage.ql.exec.vector.MapColumnVector;
+import org.apache.orc.storage.ql.exec.vector.TimestampColumnVector;
+import org.apache.orc.storage.serde2.io.HiveDecimalWritable;
+import org.apache.spark.sql.catalyst.InternalRow;
+import org.apache.spark.sql.catalyst.expressions.GenericInternalRow;
+import org.apache.spark.sql.catalyst.util.ArrayBasedMapData;
+import org.apache.spark.sql.catalyst.util.ArrayData;
+import org.apache.spark.sql.catalyst.util.GenericArrayData;
+import org.apache.spark.sql.catalyst.util.MapData;
+import org.apache.spark.sql.types.Decimal;
+import org.apache.spark.unsafe.types.UTF8String;
+
+
+class SparkOrcValueReaders {
+  private SparkOrcValueReaders() {
+  }
+
+  static OrcValReader<UTF8String> strings() {
+    return StringReader.INSTANCE;
+  }
+
+  static OrcValReader<?> timestampTzs() {
+    return TimestampTzReader.INSTANCE;
+  }
+
+  static OrcValReader<?> struct(
+      List<OrcValReader<?>> readers, Types.StructType struct, Map<Integer, ?> idToConstant) {
+    return new StructReader(readers, struct, idToConstant);
+  }
+
+  static OrcValReader<?> array(OrcValReader<?> elementReader) {
+    return new ArrayReader(elementReader);
+  }
+
+  static OrcValReader<?> map(OrcValReader<?> keyReader, OrcValReader<?> valueReader) {
+    return new MapReader(keyReader, valueReader);
+  }
+
+  private static class ArrayReader implements OrcValReader<ArrayData> {
+    private final OrcValReader<?> elementReader;
+    private final List<Object> reusedList = Lists.newArrayList();
+
+    private ArrayReader(OrcValReader<?> elementReader) {
+      this.elementReader = elementReader;
+    }
+
+    @Override
+    public ArrayData nonNullRead(ColumnVector vector, int row) {
+      reusedList.clear();
+      ListColumnVector listVector = (ListColumnVector) vector;
+      int offset = (int) listVector.offsets[row];
+      int length = (int) listVector.lengths[row];
+      for (int c = 0; c < length; ++c) {
+        reusedList.add(elementReader.read(listVector.child, offset + c));
+      }
+      return new GenericArrayData(reusedList.toArray());
+    }
+  }
+
+  private static class MapReader implements OrcValReader<MapData> {
+    private final OrcValReader<?> keyReader;
+    private final OrcValReader<?> valueReader;
+
+    private final List<Object> reusedKeyList = Lists.newArrayList();
+    private final List<Object> reusedValueList = Lists.newArrayList();
+
+    private MapReader(OrcValReader<?> keyReader, OrcValReader<?> valueReader) {
+      this.keyReader = keyReader;
+      this.valueReader = valueReader;
+    }
+
+    @Override
+    public MapData nonNullRead(ColumnVector vector, int row) {
+      reusedKeyList.clear();
+      reusedValueList.clear();
+      MapColumnVector mapVector = (MapColumnVector) vector;
+      int offset = (int) mapVector.offsets[row];
+      long length = mapVector.lengths[row];
+      for (int c = 0; c < length; c++) {
+        reusedKeyList.add(keyReader.read(mapVector.keys, offset + c));
+        reusedValueList.add(valueReader.read(mapVector.values, offset + c));
+      }
+
+      return new ArrayBasedMapData(
+          new GenericArrayData(reusedKeyList.toArray()),
+          new GenericArrayData(reusedValueList.toArray()));
+    }
+  }
+
+  static class StructReader extends OrcValueReaders.StructReader<InternalRow> {
+    private final int numFields;
+    private final InternalRow internalRow;
+
+    protected StructReader(List<OrcValReader<?>> readers, Types.StructType struct, Map<Integer, ?> idToConstant) {
+      super(readers, struct, idToConstant);
+      this.numFields = readers.size();
+      internalRow = createOrReuse();
+    }
+
+    @Override
+    protected InternalRow createOrReuse() {
+      return new GenericInternalRow(numFields);

Review comment:
       My guess would be this is what is tanking the perf. Spark[Avro|Parquet]Reader seem to reuse this object. Doing the same for ORC results in test failures




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #989: [WIP] ORC nested Identity partition support

Posted by GitBox <gi...@apache.org>.

rdblue commented on a change in pull request #989:
URL: https://github.com/apache/incubator-iceberg/pull/989#discussion_r429328821



##########
File path: spark/src/main/java/org/apache/iceberg/spark/source/RowDataReader.java
##########
@@ -109,23 +104,8 @@
     Iterator<InternalRow> iter;
 
     if (hasJoinedPartitionColumns) {
-      if (SUPPORTS_CONSTANTS.contains(file.format())) {
-        iterSchema = requiredSchema;
-        iter = open(task, requiredSchema, PartitionUtil.constantsMap(task, RowDataReader::convertConstant));
-      } else {
-        // schema used to read data files
-        Schema readSchema = TypeUtil.selectNot(requiredSchema, idColumns);
-        Schema partitionSchema = TypeUtil.select(requiredSchema, idColumns);
-        PartitionRowConverter convertToRow = new PartitionRowConverter(partitionSchema, spec);
-        JoinedRow joined = new JoinedRow();
-
-        InternalRow partition = convertToRow.apply(file.partition());
-        joined.withRight(partition);
-
-        // create joined rows and project from the joined schema to the final schema
-        iterSchema = TypeUtil.join(readSchema, partitionSchema);
-        iter = Iterators.transform(open(task, readSchema, ImmutableMap.of()), joined::withLeft);
-      }
+      iterSchema = requiredSchema;
+      iter = open(task, requiredSchema, PartitionUtil.constantsMap(task, RowDataReader::convertConstant));

Review comment:
       #1004 is in. That might be why this has conflicts.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [incubator-iceberg] rdsr commented on a change in pull request #989: [WIP] ORC nested Identity partition support

Posted by GitBox <gi...@apache.org>.

rdsr commented on a change in pull request #989:
URL: https://github.com/apache/incubator-iceberg/pull/989#discussion_r429051777



##########
File path: spark/src/main/java/org/apache/iceberg/spark/data/SparkOrcValueReaders.java
##########
@@ -0,0 +1,213 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.spark.data;
+
+import com.google.common.collect.Lists;
+import java.math.BigDecimal;
+import java.util.List;
+import java.util.Map;
+import org.apache.iceberg.orc.OrcValReader;
+import org.apache.iceberg.orc.OrcValueReaders;
+import org.apache.iceberg.types.Types;
+import org.apache.orc.storage.ql.exec.vector.BytesColumnVector;
+import org.apache.orc.storage.ql.exec.vector.ColumnVector;
+import org.apache.orc.storage.ql.exec.vector.DecimalColumnVector;
+import org.apache.orc.storage.ql.exec.vector.ListColumnVector;
+import org.apache.orc.storage.ql.exec.vector.MapColumnVector;
+import org.apache.orc.storage.ql.exec.vector.TimestampColumnVector;
+import org.apache.orc.storage.serde2.io.HiveDecimalWritable;
+import org.apache.spark.sql.catalyst.InternalRow;
+import org.apache.spark.sql.catalyst.expressions.GenericInternalRow;
+import org.apache.spark.sql.catalyst.util.ArrayBasedMapData;
+import org.apache.spark.sql.catalyst.util.ArrayData;
+import org.apache.spark.sql.catalyst.util.GenericArrayData;
+import org.apache.spark.sql.catalyst.util.MapData;
+import org.apache.spark.sql.types.Decimal;
+import org.apache.spark.unsafe.types.UTF8String;
+
+
+class SparkOrcValueReaders {
+  private SparkOrcValueReaders() {
+  }
+
+  static OrcValReader<UTF8String> strings() {
+    return StringReader.INSTANCE;
+  }
+
+  static OrcValReader<?> timestampTzs() {
+    return TimestampTzReader.INSTANCE;
+  }
+
+  static OrcValReader<?> struct(
+      List<OrcValReader<?>> readers, Types.StructType struct, Map<Integer, ?> idToConstant) {
+    return new StructReader(readers, struct, idToConstant);
+  }
+
+  static OrcValReader<?> array(OrcValReader<?> elementReader) {
+    return new ArrayReader(elementReader);
+  }
+
+  static OrcValReader<?> map(OrcValReader<?> keyReader, OrcValReader<?> valueReader) {
+    return new MapReader(keyReader, valueReader);
+  }
+
+  private static class ArrayReader implements OrcValReader<ArrayData> {
+    private final OrcValReader<?> elementReader;
+    private final List<Object> reusedList = Lists.newArrayList();
+
+    private ArrayReader(OrcValReader<?> elementReader) {
+      this.elementReader = elementReader;
+    }
+
+    @Override
+    public ArrayData nonNullRead(ColumnVector vector, int row) {
+      reusedList.clear();
+      ListColumnVector listVector = (ListColumnVector) vector;
+      int offset = (int) listVector.offsets[row];
+      int length = (int) listVector.lengths[row];
+      for (int c = 0; c < length; ++c) {
+        reusedList.add(elementReader.read(listVector.child, offset + c));
+      }
+      return new GenericArrayData(reusedList.toArray());
+    }
+  }
+
+  private static class MapReader implements OrcValReader<MapData> {
+    private final OrcValReader<?> keyReader;
+    private final OrcValReader<?> valueReader;
+
+    private final List<Object> reusedKeyList = Lists.newArrayList();
+    private final List<Object> reusedValueList = Lists.newArrayList();
+
+    private MapReader(OrcValReader<?> keyReader, OrcValReader<?> valueReader) {
+      this.keyReader = keyReader;
+      this.valueReader = valueReader;
+    }
+
+    @Override
+    public MapData nonNullRead(ColumnVector vector, int row) {
+      reusedKeyList.clear();
+      reusedValueList.clear();
+      MapColumnVector mapVector = (MapColumnVector) vector;
+      int offset = (int) mapVector.offsets[row];
+      long length = mapVector.lengths[row];
+      for (int c = 0; c < length; c++) {
+        reusedKeyList.add(keyReader.read(mapVector.keys, offset + c));
+        reusedValueList.add(valueReader.read(mapVector.values, offset + c));
+      }
+
+      return new ArrayBasedMapData(
+          new GenericArrayData(reusedKeyList.toArray()),
+          new GenericArrayData(reusedValueList.toArray()));
+    }
+  }
+
+  static class StructReader extends OrcValueReaders.StructReader<InternalRow> {
+    private final int numFields;
+    private InternalRow internalRow;

Review comment:
       In this case the reuse is only being done for the top level row and not for any internal row. Do u still see this as an issue?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #989: [WIP] Orc nested Identity partition support

Posted by GitBox <gi...@apache.org>.

rdblue commented on a change in pull request #989:
URL: https://github.com/apache/incubator-iceberg/pull/989#discussion_r420997677



##########
File path: orc/src/main/java/org/apache/iceberg/orc/ORCSchemaUtil.java
##########
@@ -496,7 +503,7 @@ private static boolean isRequired(TypeDescription orcType) {
     }
   }
 
-  private static int getMaxIcebergId(TypeDescription originalOrcSchema) {
+  static int getMaxIcebergId(TypeDescription originalOrcSchema) {

Review comment:
       I don't see this used other than in the conversion logic to assign new IDs. Because that assignment is actually a correctness bug, we don't need this method at all. Also, since it isn't used anywhere else in this PR there is no need to make it package-private instead of private.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [incubator-iceberg] rdsr commented on pull request #989: [WIP] ORC nested Identity partition support

Posted by GitBox <gi...@apache.org>.

rdsr commented on pull request #989:
URL: https://github.com/apache/incubator-iceberg/pull/989#issuecomment-632545272


   @rdblue This is ready for another round of review


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #989: [WIP] Orc nested Identity partition support

Posted by GitBox <gi...@apache.org>.

rdblue commented on a change in pull request #989:
URL: https://github.com/apache/incubator-iceberg/pull/989#discussion_r421666462



##########
File path: spark/src/main/java/org/apache/iceberg/spark/data/SparkOrcReader.java
##########
@@ -53,425 +40,83 @@
  * It minimizes allocations by reusing most of the objects in the implementation.
  */
 public class SparkOrcReader implements OrcValueReader<InternalRow> {
-  private static final int INITIAL_SIZE = 128 * 1024;
-  private final List<TypeDescription> columns;
-  private final Converter[] converters;
-  private final UnsafeRowWriter rowWriter;
+  private final SparkOrcValueReaders.StructReader reader;
 
-  public SparkOrcReader(TypeDescription readOrcSchema) {
-    columns = readOrcSchema.getChildren();
-    converters = buildConverters();
-    rowWriter = new UnsafeRowWriter(columns.size(), INITIAL_SIZE);
+  public SparkOrcReader(org.apache.iceberg.Schema expectedSchema, TypeDescription readSchema) {
+    this(expectedSchema, readSchema, ImmutableMap.of());
   }
 
-  private Converter[] buildConverters() {
-    final Converter[] newConverters = new Converter[columns.size()];
-    for (int c = 0; c < newConverters.length; ++c) {
-      newConverters[c] = buildConverter(columns.get(c));
-    }
-    return newConverters;
+  @SuppressWarnings("unchecked")
+  public SparkOrcReader(
+      org.apache.iceberg.Schema expectedSchema, TypeDescription readOrcSchema, Map<Integer, ?> idToConstant) {
+    reader = (SparkOrcValueReaders.StructReader) OrcSchemaWithTypeVisitor.visit(
+        expectedSchema, readOrcSchema, new ReadBuilder(idToConstant));
   }
 
   @Override
   public InternalRow read(VectorizedRowBatch batch, int row) {
-    rowWriter.reset();
-    rowWriter.zeroOutNullBytes();
-    for (int c = 0; c < batch.cols.length; ++c) {
-      converters[c].convert(rowWriter, c, batch.cols[c], row);
-    }
-    return rowWriter.getRow();
+    return reader.read(batch, row);

Review comment:
       This is where you could wrap the columns from `VectorizedRowBatch` in a `StructColumnVector`.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [incubator-iceberg] rdsr commented on pull request #989: [WIP] Orc nested Identity partition support

Posted by GitBox <gi...@apache.org>.

rdsr commented on pull request #989:
URL: https://github.com/apache/incubator-iceberg/pull/989#issuecomment-622412537


   > I reran the Spark jmh tests which @shardulm94 had written on a fresh checkout of iceberg and against my patch. Here's the output
   > 
   > Fresh checkout of Iceberg
   > 
   > ```
   > Benchmark                                                                          Mode  Cnt   Score   Error  Units
   > IcebergSourceNestedORCDataReadBenchmark.readFileSourceNonVectorized                  ss    5  11.004 ± 0.330   s/op
   > IcebergSourceNestedORCDataReadBenchmark.readFileSourceVectorized                     ss    5  10.920 ± 1.424   s/op
   > IcebergSourceNestedORCDataReadBenchmark.readIceberg                                  ss    5   2.028 ± 0.103   s/op
   > IcebergSourceNestedORCDataReadBenchmark.readWithProjectionFileSourceNonVectorized    ss    5  10.695 ± 0.203   s/op
   > IcebergSourceNestedORCDataReadBenchmark.readWithProjectionFileSourceVectorized       ss    5  10.293 ± 0.236   s/op
   > IcebergSourceNestedORCDataReadBenchmark.readWithProjectionIceberg                    ss    5   1.850 ± 0.211   s/op
   > ```
   > 
   > With the ORC nested partition patch applied
   > 
   > ```
   > Benchmark                                                                          Mode  Cnt   Score   Error  Units
   > IcebergSourceNestedORCDataReadBenchmark.readFileSourceNonVectorized                  ss    5  13.984 ± 0.440   s/op
   > IcebergSourceNestedORCDataReadBenchmark.readFileSourceVectorized                     ss    5  10.454 ± 0.438   s/op
   > IcebergSourceNestedORCDataReadBenchmark.readIceberg                                  ss    5   2.264 ± 0.091   s/op
   > IcebergSourceNestedORCDataReadBenchmark.readWithProjectionFileSourceNonVectorized    ss    5  10.037 ± 0.241   s/op
   > IcebergSourceNestedORCDataReadBenchmark.readWithProjectionFileSourceVectorized       ss    5  10.615 ± 0.336   s/op
   > IcebergSourceNestedORCDataReadBenchmark.readWithProjectionIceberg                    ss    5   1.980 ± 0.167   s/op
   > ```
   > 
   > Seems like there isn't much difference.
   
   cc @shardulm94 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [incubator-iceberg] rdsr commented on a change in pull request #989: [WIP] ORC nested Identity partition support

Posted by GitBox <gi...@apache.org>.

rdsr commented on a change in pull request #989:
URL: https://github.com/apache/incubator-iceberg/pull/989#discussion_r429090561



##########
File path: spark/src/test/java/org/apache/iceberg/spark/data/TestSparkOrcReader.java
##########
@@ -71,7 +71,7 @@ private void writeAndValidateRecords(Schema schema, Iterable<InternalRow> expect
 
     try (CloseableIterable<InternalRow> reader = ORC.read(Files.localInput(testFile))
         .project(schema)
-        .createReaderFunc(SparkOrcReader::new)

Review comment:
       For now I've removed the reuse code. We can tackle than in followup




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #989: [WIP] ORC nested Identity partition support

Posted by GitBox <gi...@apache.org>.

rdblue commented on a change in pull request #989:
URL: https://github.com/apache/incubator-iceberg/pull/989#discussion_r429327828



##########
File path: orc/src/main/java/org/apache/iceberg/orc/OrcValueReaders.java
##########
@@ -0,0 +1,231 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.orc;
+
+import com.google.common.collect.Lists;
+import java.util.Arrays;
+import java.util.List;
+import java.util.Map;
+import org.apache.iceberg.types.Types;
+import org.apache.orc.storage.ql.exec.vector.BytesColumnVector;
+import org.apache.orc.storage.ql.exec.vector.ColumnVector;
+import org.apache.orc.storage.ql.exec.vector.DoubleColumnVector;
+import org.apache.orc.storage.ql.exec.vector.LongColumnVector;
+import org.apache.orc.storage.ql.exec.vector.StructColumnVector;
+import org.apache.orc.storage.ql.exec.vector.VectorizedRowBatch;
+
+
+public class OrcValueReaders {
+  private OrcValueReaders() {
+  }
+
+  public static OrcValReader<?> booleans() {
+    return BooleanReader.INSTANCE;
+  }
+
+  public static OrcValReader<?> shorts() {
+    return ShortReader.INSTANCE;
+  }
+
+  public static OrcValReader<?> ints() {
+    return IntegerReader.INSTANCE;
+  }
+
+  public static OrcValReader<?> longs() {
+    return LongReader.INSTANCE;
+  }
+
+  public static OrcValReader<?> floats() {
+    return FloatReader.INSTANCE;
+  }
+
+  public static OrcValReader<?> doubles() {
+    return DoubleReader.INSTANCE;
+  }
+
+  public static OrcValReader<?> bytes() {
+    return BytesReader.INSTANCE;
+  }
+
+  public static OrcValReader<?> byteReader() {
+    return ByteReader.INSTANCE;
+  }
+
+  private static class BooleanReader implements OrcValReader<Boolean> {
+    static final OrcValReader<?> INSTANCE = new BooleanReader();
+
+    private BooleanReader() {
+    }
+
+    @Override
+    public Boolean nonNullRead(ColumnVector vector, int row) {
+      return ((LongColumnVector) vector).vector[row] != 0;
+    }
+  }
+
+  private static class ShortReader implements OrcValReader<Short> {
+    static final OrcValReader<?> INSTANCE = new ShortReader();
+
+    private ShortReader() {
+    }
+
+    @Override
+    public Short nonNullRead(ColumnVector vector, int row) {
+      return (short) ((LongColumnVector) vector).vector[row];
+    }
+  }
+
+  private static class IntegerReader implements OrcValReader<Integer> {
+    static final OrcValReader<?> INSTANCE = new IntegerReader();
+
+    private IntegerReader() {
+    }
+
+    @Override
+    public Integer nonNullRead(ColumnVector vector, int row) {
+      return (int) ((LongColumnVector) vector).vector[row];
+    }
+  }
+
+  private static class LongReader implements OrcValReader<Long> {
+    static final OrcValReader<?> INSTANCE = new LongReader();
+
+    private LongReader() {
+    }
+
+    @Override
+    public Long nonNullRead(ColumnVector vector, int row) {
+      return ((LongColumnVector) vector).vector[row];
+    }
+  }
+
+  private static class FloatReader implements OrcValReader<Float> {
+    private static final FloatReader INSTANCE = new FloatReader();
+
+    private FloatReader() {
+    }
+
+    @Override
+    public Float nonNullRead(ColumnVector vector, int row) {
+      return (float) ((DoubleColumnVector) vector).vector[row];
+    }
+  }
+
+  private static class DoubleReader implements OrcValReader<Double> {
+    private static final DoubleReader INSTANCE = new DoubleReader();
+
+    private DoubleReader() {
+    }
+
+    @Override
+    public Double nonNullRead(ColumnVector vector, int row) {
+      return ((DoubleColumnVector) vector).vector[row];
+    }
+  }
+
+  private static class ByteReader implements OrcValReader<Byte> {
+    private static final ByteReader INSTANCE = new ByteReader();
+
+    private ByteReader() {
+    }
+
+    @Override
+    public Byte nonNullRead(ColumnVector vector, int row) {
+      return (byte) ((LongColumnVector) vector).vector[row];
+    }
+  }
+
+  private static class BytesReader implements OrcValReader<byte[]> {
+    private static final BytesReader INSTANCE = new BytesReader();
+
+    private BytesReader() {
+    }
+
+    @Override
+    public byte[] nonNullRead(ColumnVector vector, int row) {
+      BytesColumnVector bytesVector = (BytesColumnVector) vector;
+
+      return Arrays.copyOfRange(
+          bytesVector.vector[row], bytesVector.start[row], bytesVector.start[row] + bytesVector.length[row]);
+    }
+  }
+
+  public abstract static class StructReader<T> implements OrcValReader<T> {
+    private final OrcValReader<?>[] readers;
+    private final int[] positions;
+    private final Object[] constants;
+
+    protected StructReader(List<OrcValReader<?>> readers) {
+      this.readers = readers.toArray(new OrcValReader[0]);
+      this.positions = new int[0];
+      this.constants = new Object[0];
+    }
+
+    protected StructReader(List<OrcValReader<?>> readers, Types.StructType struct, Map<Integer, ?> idToConstant) {
+      this.readers = readers.toArray(new OrcValReader[0]);
+      List<Types.NestedField> fields = struct.fields();
+      List<Integer> positionList = Lists.newArrayListWithCapacity(fields.size());
+      List<Object> constantList = Lists.newArrayListWithCapacity(fields.size());
+      for (int pos = 0; pos < fields.size(); pos += 1) {
+        Types.NestedField field = fields.get(pos);
+        Object constant = idToConstant.get(field.fieldId());
+        if (constant != null) {
+          positionList.add(pos);
+          constantList.add(idToConstant.get(field.fieldId()));
+        }
+      }
+
+      this.positions = positionList.stream().mapToInt(Integer::intValue).toArray();
+      this.constants = constantList.toArray();
+    }
+
+    protected abstract T create();
+
+    protected abstract T reuseOrCreate();
+
+    protected abstract void set(T struct, int pos, Object value);
+
+    public OrcValReader<?> reader(int pos) {
+      return readers[pos];
+    }
+
+    @Override
+    public T nonNullRead(ColumnVector vector, int row) {
+      StructColumnVector structVector = (StructColumnVector) vector;
+      return readInternal(create(), structVector.fields, row);
+    }
+
+    public T read(VectorizedRowBatch batch, int row) {
+      return readInternal(reuseOrCreate(), batch.cols, row);
+    }
+
+    private T readInternal(T struct, ColumnVector[] columnVectors, int row) {
+      for (int c = 0; c < readers.length; ++c) {
+        set(struct, c, reader(c).read(columnVectors[c], row));

Review comment:
       Yeah, that sounds good.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #989: [WIP] ORC nested Identity partition support

Posted by GitBox <gi...@apache.org>.

rdblue commented on a change in pull request #989:
URL: https://github.com/apache/incubator-iceberg/pull/989#discussion_r429326033



##########
File path: orc/src/main/java/org/apache/iceberg/orc/ORCSchemaUtil.java
##########
@@ -308,7 +309,7 @@ private static TypeDescription buildOrcProjection(Integer fieldId, Type type, bo
           orcType = convert(fieldId, type, false);
         }
     }
-
+    orcType.setAttribute(ICEBERG_ID_ATTRIBUTE, fieldId.toString());

Review comment:
       It would be great to know what's going on here. Since this is just a projection schema and the reader is built with the Iceberg schema (that has required/optional), I don't think it is really a blocker. But setting a property here shouldn't cause ORC to fail, right?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [incubator-iceberg] rdsr commented on a change in pull request #989: [WIP] ORC nested Identity partition support

Posted by GitBox <gi...@apache.org>.

rdsr commented on a change in pull request #989:
URL: https://github.com/apache/incubator-iceberg/pull/989#discussion_r429339656



##########
File path: orc/src/main/java/org/apache/iceberg/orc/ORCSchemaUtil.java
##########
@@ -308,7 +309,7 @@ private static TypeDescription buildOrcProjection(Integer fieldId, Type type, bo
           orcType = convert(fieldId, type, false);
         }
     }
-
+    orcType.setAttribute(ICEBERG_ID_ATTRIBUTE, fieldId.toString());

Review comment:
       I'll file the necessary followups. 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [incubator-iceberg] rdsr commented on a change in pull request #989: [WIP] Orc nested Identity partition support

Posted by GitBox <gi...@apache.org>.

rdsr commented on a change in pull request #989:
URL: https://github.com/apache/incubator-iceberg/pull/989#discussion_r417798193



##########
File path: spark/src/main/java/org/apache/iceberg/spark/data/SparkOrcValueReaders.java
##########
@@ -0,0 +1,206 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.spark.data;
+
+import com.google.common.collect.Lists;
+import java.math.BigDecimal;
+import java.util.List;
+import java.util.Map;
+import org.apache.iceberg.orc.OrcValReader;
+import org.apache.iceberg.orc.OrcValueReaders;
+import org.apache.iceberg.types.Types;
+import org.apache.orc.storage.ql.exec.vector.BytesColumnVector;
+import org.apache.orc.storage.ql.exec.vector.ColumnVector;
+import org.apache.orc.storage.ql.exec.vector.DecimalColumnVector;
+import org.apache.orc.storage.ql.exec.vector.ListColumnVector;
+import org.apache.orc.storage.ql.exec.vector.MapColumnVector;
+import org.apache.orc.storage.ql.exec.vector.TimestampColumnVector;
+import org.apache.orc.storage.serde2.io.HiveDecimalWritable;
+import org.apache.spark.sql.catalyst.InternalRow;
+import org.apache.spark.sql.catalyst.expressions.GenericInternalRow;
+import org.apache.spark.sql.catalyst.util.ArrayBasedMapData;
+import org.apache.spark.sql.catalyst.util.ArrayData;
+import org.apache.spark.sql.catalyst.util.GenericArrayData;
+import org.apache.spark.sql.catalyst.util.MapData;
+import org.apache.spark.sql.types.Decimal;
+import org.apache.spark.unsafe.types.UTF8String;
+
+
+class SparkOrcValueReaders {
+  private SparkOrcValueReaders() {
+  }
+
+  static OrcValReader<UTF8String> strings() {
+    return StringReader.INSTANCE;
+  }
+
+  static OrcValReader<?> timestampTzs() {
+    return TimestampTzReader.INSTANCE;
+  }
+
+  static OrcValReader<?> struct(
+      List<OrcValReader<?>> readers, Types.StructType struct, Map<Integer, ?> idToConstant) {
+    return new StructReader(readers, struct, idToConstant);
+  }
+
+  static OrcValReader<?> array(OrcValReader<?> elementReader) {
+    return new ArrayReader(elementReader);
+  }
+
+  static OrcValReader<?> map(OrcValReader<?> keyReader, OrcValReader<?> valueReader) {
+    return new MapReader(keyReader, valueReader);
+  }
+
+  private static class ArrayReader implements OrcValReader<ArrayData> {
+    private final OrcValReader<?> elementReader;
+    private final List<Object> reusedList = Lists.newArrayList();
+
+    private ArrayReader(OrcValReader<?> elementReader) {
+      this.elementReader = elementReader;
+    }
+
+    @Override
+    public ArrayData nonNullRead(ColumnVector vector, int row) {
+      reusedList.clear();
+      ListColumnVector listVector = (ListColumnVector) vector;
+      int offset = (int) listVector.offsets[row];
+      int length = (int) listVector.lengths[row];
+      for (int c = 0; c < length; ++c) {
+        reusedList.add(elementReader.read(listVector.child, offset + c));
+      }
+      return new GenericArrayData(reusedList.toArray());
+    }
+  }
+
+  private static class MapReader implements OrcValReader<MapData> {
+    private final OrcValReader<?> keyReader;
+    private final OrcValReader<?> valueReader;
+
+    private final List<Object> reusedKeyList = Lists.newArrayList();
+    private final List<Object> reusedValueList = Lists.newArrayList();
+
+    private MapReader(OrcValReader<?> keyReader, OrcValReader<?> valueReader) {
+      this.keyReader = keyReader;
+      this.valueReader = valueReader;
+    }
+
+    @Override
+    public MapData nonNullRead(ColumnVector vector, int row) {
+      reusedKeyList.clear();
+      reusedValueList.clear();
+      MapColumnVector mapVector = (MapColumnVector) vector;
+      int offset = (int) mapVector.offsets[row];
+      long length = mapVector.lengths[row];
+      for (int c = 0; c < length; c++) {
+        reusedKeyList.add(keyReader.read(mapVector.keys, offset + c));
+        reusedValueList.add(valueReader.read(mapVector.values, offset + c));
+      }
+
+      return new ArrayBasedMapData(
+          new GenericArrayData(reusedKeyList.toArray()),
+          new GenericArrayData(reusedValueList.toArray()));
+    }
+  }
+
+  static class StructReader extends OrcValueReaders.StructReader<InternalRow> {
+    private final int numFields;
+    private final InternalRow internalRow;
+
+    protected StructReader(List<OrcValReader<?>> readers, Types.StructType struct, Map<Integer, ?> idToConstant) {
+      super(readers, struct, idToConstant);
+      this.numFields = readers.size();
+      internalRow = createOrReuse();
+    }
+
+    @Override
+    protected InternalRow createOrReuse() {
+      return new GenericInternalRow(numFields);

Review comment:
       ~~My guess would be this is what is tanking the perf. Spark[Avro|Parquet]Reader seem to reuse this object. Doing the same for ORC results in test failures~~




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [incubator-iceberg] abti commented on a change in pull request #989: [WIP] Orc nested partition support

Posted by GitBox <gi...@apache.org>.

abti commented on a change in pull request #989:
URL: https://github.com/apache/incubator-iceberg/pull/989#discussion_r418198139



##########
File path: orc/src/main/java/org/apache/iceberg/orc/OrcSchemaWithTypeVisitor.java
##########
@@ -0,0 +1,90 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.orc;
+
+import com.google.common.collect.Lists;
+import java.util.List;
+import org.apache.iceberg.types.Type;
+import org.apache.iceberg.types.Types;
+import org.apache.orc.TypeDescription;
+
+
+public abstract class OrcSchemaWithTypeVisitor<T> {
+  public static <T> T visit(
+      org.apache.iceberg.Schema iSchema, TypeDescription schema, OrcSchemaWithTypeVisitor<T> visitor) {
+    return visit(iSchema.asStruct(), schema, visitor);
+  }
+
+  public static <T> T visit(Type iType, TypeDescription schema, OrcSchemaWithTypeVisitor<T> visitor) {
+    switch (schema.getCategory()) {
+      case STRUCT:
+        return visitRecord(iType.asStructType(), schema, visitor);
+
+      case UNION:
+        // We don't have an answer for union types.
+        throw new IllegalArgumentException("Can't handle " + schema);

Review comment:
       how about using UnsupportedOperationException instead?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [incubator-iceberg] rdblue merged pull request #989: ORC: Supported nested identity partition data

Posted by GitBox <gi...@apache.org>.

rdblue merged pull request #989:
URL: https://github.com/apache/incubator-iceberg/pull/989


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [incubator-iceberg] rdsr commented on a change in pull request #989: [WIP] ORC nested Identity partition support

Posted by GitBox <gi...@apache.org>.

rdsr commented on a change in pull request #989:
URL: https://github.com/apache/incubator-iceberg/pull/989#discussion_r421921069



##########
File path: orc/src/main/java/org/apache/iceberg/orc/OrcValueReaders.java
##########
@@ -0,0 +1,231 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.orc;
+
+import com.google.common.collect.Lists;
+import java.util.Arrays;
+import java.util.List;
+import java.util.Map;
+import org.apache.iceberg.types.Types;
+import org.apache.orc.storage.ql.exec.vector.BytesColumnVector;
+import org.apache.orc.storage.ql.exec.vector.ColumnVector;
+import org.apache.orc.storage.ql.exec.vector.DoubleColumnVector;
+import org.apache.orc.storage.ql.exec.vector.LongColumnVector;
+import org.apache.orc.storage.ql.exec.vector.StructColumnVector;
+import org.apache.orc.storage.ql.exec.vector.VectorizedRowBatch;
+
+
+public class OrcValueReaders {
+  private OrcValueReaders() {
+  }
+
+  public static OrcValReader<?> booleans() {
+    return BooleanReader.INSTANCE;
+  }
+
+  public static OrcValReader<?> shorts() {
+    return ShortReader.INSTANCE;
+  }
+
+  public static OrcValReader<?> ints() {
+    return IntegerReader.INSTANCE;
+  }
+
+  public static OrcValReader<?> longs() {
+    return LongReader.INSTANCE;
+  }
+
+  public static OrcValReader<?> floats() {
+    return FloatReader.INSTANCE;
+  }
+
+  public static OrcValReader<?> doubles() {
+    return DoubleReader.INSTANCE;
+  }
+
+  public static OrcValReader<?> bytes() {
+    return BytesReader.INSTANCE;
+  }
+
+  public static OrcValReader<?> byteReader() {
+    return ByteReader.INSTANCE;
+  }
+
+  private static class BooleanReader implements OrcValReader<Boolean> {
+    static final OrcValReader<?> INSTANCE = new BooleanReader();
+
+    private BooleanReader() {
+    }
+
+    @Override
+    public Boolean nonNullRead(ColumnVector vector, int row) {
+      return ((LongColumnVector) vector).vector[row] != 0;
+    }
+  }
+
+  private static class ShortReader implements OrcValReader<Short> {
+    static final OrcValReader<?> INSTANCE = new ShortReader();
+
+    private ShortReader() {
+    }
+
+    @Override
+    public Short nonNullRead(ColumnVector vector, int row) {
+      return (short) ((LongColumnVector) vector).vector[row];
+    }
+  }
+
+  private static class IntegerReader implements OrcValReader<Integer> {
+    static final OrcValReader<?> INSTANCE = new IntegerReader();
+
+    private IntegerReader() {
+    }
+
+    @Override
+    public Integer nonNullRead(ColumnVector vector, int row) {
+      return (int) ((LongColumnVector) vector).vector[row];
+    }
+  }
+
+  private static class LongReader implements OrcValReader<Long> {
+    static final OrcValReader<?> INSTANCE = new LongReader();
+
+    private LongReader() {
+    }
+
+    @Override
+    public Long nonNullRead(ColumnVector vector, int row) {
+      return ((LongColumnVector) vector).vector[row];
+    }
+  }
+
+  private static class FloatReader implements OrcValReader<Float> {
+    private static final FloatReader INSTANCE = new FloatReader();
+
+    private FloatReader() {
+    }
+
+    @Override
+    public Float nonNullRead(ColumnVector vector, int row) {
+      return (float) ((DoubleColumnVector) vector).vector[row];
+    }
+  }
+
+  private static class DoubleReader implements OrcValReader<Double> {
+    private static final DoubleReader INSTANCE = new DoubleReader();
+
+    private DoubleReader() {
+    }
+
+    @Override
+    public Double nonNullRead(ColumnVector vector, int row) {
+      return ((DoubleColumnVector) vector).vector[row];
+    }
+  }
+
+  private static class ByteReader implements OrcValReader<Byte> {
+    private static final ByteReader INSTANCE = new ByteReader();
+
+    private ByteReader() {
+    }
+
+    @Override
+    public Byte nonNullRead(ColumnVector vector, int row) {
+      return (byte) ((LongColumnVector) vector).vector[row];
+    }
+  }
+
+  private static class BytesReader implements OrcValReader<byte[]> {
+    private static final BytesReader INSTANCE = new BytesReader();
+
+    private BytesReader() {
+    }
+
+    @Override
+    public byte[] nonNullRead(ColumnVector vector, int row) {
+      BytesColumnVector bytesVector = (BytesColumnVector) vector;
+
+      return Arrays.copyOfRange(
+          bytesVector.vector[row], bytesVector.start[row], bytesVector.start[row] + bytesVector.length[row]);
+    }
+  }
+
+  public abstract static class StructReader<T> implements OrcValReader<T> {
+    private final OrcValReader<?>[] readers;
+    private final int[] positions;
+    private final Object[] constants;
+
+    protected StructReader(List<OrcValReader<?>> readers) {
+      this.readers = readers.toArray(new OrcValReader[0]);
+      this.positions = new int[0];
+      this.constants = new Object[0];
+    }
+
+    protected StructReader(List<OrcValReader<?>> readers, Types.StructType struct, Map<Integer, ?> idToConstant) {
+      this.readers = readers.toArray(new OrcValReader[0]);
+      List<Types.NestedField> fields = struct.fields();
+      List<Integer> positionList = Lists.newArrayListWithCapacity(fields.size());
+      List<Object> constantList = Lists.newArrayListWithCapacity(fields.size());
+      for (int pos = 0; pos < fields.size(); pos += 1) {
+        Types.NestedField field = fields.get(pos);
+        Object constant = idToConstant.get(field.fieldId());
+        if (constant != null) {
+          positionList.add(pos);
+          constantList.add(idToConstant.get(field.fieldId()));
+        }
+      }
+
+      this.positions = positionList.stream().mapToInt(Integer::intValue).toArray();
+      this.constants = constantList.toArray();
+    }
+
+    protected abstract T create();
+
+    protected abstract T reuseOrCreate();
+
+    protected abstract void set(T struct, int pos, Object value);
+
+    public OrcValReader<?> reader(int pos) {
+      return readers[pos];
+    }
+
+    @Override
+    public T nonNullRead(ColumnVector vector, int row) {
+      StructColumnVector structVector = (StructColumnVector) vector;
+      return readInternal(create(), structVector.fields, row);
+    }
+
+    public T read(VectorizedRowBatch batch, int row) {

Review comment:
       I think I did what you are suggesting before, but ran into issues. Let me try again to get the actual erorr.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #989: [WIP] Orc nested Identity partition support

Posted by GitBox <gi...@apache.org>.

rdblue commented on a change in pull request #989:
URL: https://github.com/apache/incubator-iceberg/pull/989#discussion_r421656784



##########
File path: orc/src/main/java/org/apache/iceberg/orc/OrcValueReaders.java
##########
@@ -0,0 +1,231 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.orc;
+
+import com.google.common.collect.Lists;
+import java.util.Arrays;
+import java.util.List;
+import java.util.Map;
+import org.apache.iceberg.types.Types;
+import org.apache.orc.storage.ql.exec.vector.BytesColumnVector;
+import org.apache.orc.storage.ql.exec.vector.ColumnVector;
+import org.apache.orc.storage.ql.exec.vector.DoubleColumnVector;
+import org.apache.orc.storage.ql.exec.vector.LongColumnVector;
+import org.apache.orc.storage.ql.exec.vector.StructColumnVector;
+import org.apache.orc.storage.ql.exec.vector.VectorizedRowBatch;
+
+
+public class OrcValueReaders {
+  private OrcValueReaders() {
+  }
+
+  public static OrcValReader<?> booleans() {
+    return BooleanReader.INSTANCE;
+  }
+
+  public static OrcValReader<?> shorts() {
+    return ShortReader.INSTANCE;
+  }
+
+  public static OrcValReader<?> ints() {
+    return IntegerReader.INSTANCE;
+  }
+
+  public static OrcValReader<?> longs() {
+    return LongReader.INSTANCE;
+  }
+
+  public static OrcValReader<?> floats() {
+    return FloatReader.INSTANCE;
+  }
+
+  public static OrcValReader<?> doubles() {
+    return DoubleReader.INSTANCE;
+  }
+
+  public static OrcValReader<?> bytes() {
+    return BytesReader.INSTANCE;
+  }
+
+  public static OrcValReader<?> byteReader() {
+    return ByteReader.INSTANCE;
+  }
+
+  private static class BooleanReader implements OrcValReader<Boolean> {
+    static final OrcValReader<?> INSTANCE = new BooleanReader();
+
+    private BooleanReader() {
+    }
+
+    @Override
+    public Boolean nonNullRead(ColumnVector vector, int row) {
+      return ((LongColumnVector) vector).vector[row] != 0;
+    }
+  }
+
+  private static class ShortReader implements OrcValReader<Short> {
+    static final OrcValReader<?> INSTANCE = new ShortReader();
+
+    private ShortReader() {
+    }
+
+    @Override
+    public Short nonNullRead(ColumnVector vector, int row) {
+      return (short) ((LongColumnVector) vector).vector[row];
+    }
+  }
+
+  private static class IntegerReader implements OrcValReader<Integer> {
+    static final OrcValReader<?> INSTANCE = new IntegerReader();
+
+    private IntegerReader() {
+    }
+
+    @Override
+    public Integer nonNullRead(ColumnVector vector, int row) {
+      return (int) ((LongColumnVector) vector).vector[row];
+    }
+  }
+
+  private static class LongReader implements OrcValReader<Long> {
+    static final OrcValReader<?> INSTANCE = new LongReader();
+
+    private LongReader() {
+    }
+
+    @Override
+    public Long nonNullRead(ColumnVector vector, int row) {
+      return ((LongColumnVector) vector).vector[row];
+    }
+  }
+
+  private static class FloatReader implements OrcValReader<Float> {
+    private static final FloatReader INSTANCE = new FloatReader();
+
+    private FloatReader() {
+    }
+
+    @Override
+    public Float nonNullRead(ColumnVector vector, int row) {
+      return (float) ((DoubleColumnVector) vector).vector[row];
+    }
+  }
+
+  private static class DoubleReader implements OrcValReader<Double> {
+    private static final DoubleReader INSTANCE = new DoubleReader();
+
+    private DoubleReader() {
+    }
+
+    @Override
+    public Double nonNullRead(ColumnVector vector, int row) {
+      return ((DoubleColumnVector) vector).vector[row];
+    }
+  }
+
+  private static class ByteReader implements OrcValReader<Byte> {
+    private static final ByteReader INSTANCE = new ByteReader();
+
+    private ByteReader() {
+    }
+
+    @Override
+    public Byte nonNullRead(ColumnVector vector, int row) {
+      return (byte) ((LongColumnVector) vector).vector[row];
+    }
+  }
+
+  private static class BytesReader implements OrcValReader<byte[]> {
+    private static final BytesReader INSTANCE = new BytesReader();
+
+    private BytesReader() {
+    }
+
+    @Override
+    public byte[] nonNullRead(ColumnVector vector, int row) {
+      BytesColumnVector bytesVector = (BytesColumnVector) vector;
+
+      return Arrays.copyOfRange(
+          bytesVector.vector[row], bytesVector.start[row], bytesVector.start[row] + bytesVector.length[row]);
+    }
+  }
+
+  public abstract static class StructReader<T> implements OrcValReader<T> {
+    private final OrcValReader<?>[] readers;
+    private final int[] positions;
+    private final Object[] constants;
+
+    protected StructReader(List<OrcValReader<?>> readers) {

Review comment:
       Where is this used? It doesn't look like using 0-length arrays for positions and constants is correct.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #989: [WIP] Orc nested Identity partition support

Posted by GitBox <gi...@apache.org>.

rdblue commented on a change in pull request #989:
URL: https://github.com/apache/incubator-iceberg/pull/989#discussion_r421654780



##########
File path: orc/src/main/java/org/apache/iceberg/orc/OrcValueReaders.java
##########
@@ -0,0 +1,231 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.orc;
+
+import com.google.common.collect.Lists;
+import java.util.Arrays;
+import java.util.List;
+import java.util.Map;
+import org.apache.iceberg.types.Types;
+import org.apache.orc.storage.ql.exec.vector.BytesColumnVector;
+import org.apache.orc.storage.ql.exec.vector.ColumnVector;
+import org.apache.orc.storage.ql.exec.vector.DoubleColumnVector;
+import org.apache.orc.storage.ql.exec.vector.LongColumnVector;
+import org.apache.orc.storage.ql.exec.vector.StructColumnVector;
+import org.apache.orc.storage.ql.exec.vector.VectorizedRowBatch;
+
+
+public class OrcValueReaders {
+  private OrcValueReaders() {
+  }
+
+  public static OrcValReader<?> booleans() {
+    return BooleanReader.INSTANCE;
+  }
+
+  public static OrcValReader<?> shorts() {
+    return ShortReader.INSTANCE;
+  }
+
+  public static OrcValReader<?> ints() {
+    return IntegerReader.INSTANCE;
+  }
+
+  public static OrcValReader<?> longs() {
+    return LongReader.INSTANCE;
+  }
+
+  public static OrcValReader<?> floats() {
+    return FloatReader.INSTANCE;
+  }
+
+  public static OrcValReader<?> doubles() {
+    return DoubleReader.INSTANCE;
+  }
+
+  public static OrcValReader<?> bytes() {
+    return BytesReader.INSTANCE;
+  }
+
+  public static OrcValReader<?> byteReader() {
+    return ByteReader.INSTANCE;
+  }
+
+  private static class BooleanReader implements OrcValReader<Boolean> {
+    static final OrcValReader<?> INSTANCE = new BooleanReader();
+
+    private BooleanReader() {
+    }
+
+    @Override
+    public Boolean nonNullRead(ColumnVector vector, int row) {
+      return ((LongColumnVector) vector).vector[row] != 0;
+    }
+  }
+
+  private static class ShortReader implements OrcValReader<Short> {

Review comment:
       Do we need a short reader? Iceberg doesn't support short values in a table schema, so we shouldn't be expecting to ever return a short value.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [incubator-iceberg] rdblue commented on pull request #989: ORC: Supported nested identity partition data

Posted by GitBox <gi...@apache.org>.

rdblue commented on pull request #989:
URL: https://github.com/apache/incubator-iceberg/pull/989#issuecomment-632848728


   Looks good to me! I'll merge it. Thanks, @rdsr!


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [incubator-iceberg] rdsr commented on a change in pull request #989: [WIP] ORC nested Identity partition support

Posted by GitBox <gi...@apache.org>.

rdsr commented on a change in pull request #989:
URL: https://github.com/apache/incubator-iceberg/pull/989#discussion_r428302411



##########
File path: orc/src/main/java/org/apache/iceberg/orc/OrcValueReaders.java
##########
@@ -0,0 +1,231 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.orc;
+
+import com.google.common.collect.Lists;
+import java.util.Arrays;
+import java.util.List;
+import java.util.Map;
+import org.apache.iceberg.types.Types;
+import org.apache.orc.storage.ql.exec.vector.BytesColumnVector;
+import org.apache.orc.storage.ql.exec.vector.ColumnVector;
+import org.apache.orc.storage.ql.exec.vector.DoubleColumnVector;
+import org.apache.orc.storage.ql.exec.vector.LongColumnVector;
+import org.apache.orc.storage.ql.exec.vector.StructColumnVector;
+import org.apache.orc.storage.ql.exec.vector.VectorizedRowBatch;
+
+
+public class OrcValueReaders {
+  private OrcValueReaders() {
+  }
+
+  public static OrcValReader<?> booleans() {
+    return BooleanReader.INSTANCE;
+  }
+
+  public static OrcValReader<?> shorts() {
+    return ShortReader.INSTANCE;
+  }
+
+  public static OrcValReader<?> ints() {
+    return IntegerReader.INSTANCE;
+  }
+
+  public static OrcValReader<?> longs() {
+    return LongReader.INSTANCE;
+  }
+
+  public static OrcValReader<?> floats() {
+    return FloatReader.INSTANCE;
+  }
+
+  public static OrcValReader<?> doubles() {
+    return DoubleReader.INSTANCE;
+  }
+
+  public static OrcValReader<?> bytes() {
+    return BytesReader.INSTANCE;
+  }
+
+  public static OrcValReader<?> byteReader() {
+    return ByteReader.INSTANCE;
+  }
+
+  private static class BooleanReader implements OrcValReader<Boolean> {
+    static final OrcValReader<?> INSTANCE = new BooleanReader();
+
+    private BooleanReader() {
+    }
+
+    @Override
+    public Boolean nonNullRead(ColumnVector vector, int row) {
+      return ((LongColumnVector) vector).vector[row] != 0;
+    }
+  }
+
+  private static class ShortReader implements OrcValReader<Short> {

Review comment:
       Ohk. I can change this to ints




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #989: [WIP] Orc nested Identity partition support

Posted by GitBox <gi...@apache.org>.

rdblue commented on a change in pull request #989:
URL: https://github.com/apache/incubator-iceberg/pull/989#discussion_r421663251



##########
File path: orc/src/main/java/org/apache/iceberg/orc/OrcValueReaders.java
##########
@@ -0,0 +1,231 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.orc;
+
+import com.google.common.collect.Lists;
+import java.util.Arrays;
+import java.util.List;
+import java.util.Map;
+import org.apache.iceberg.types.Types;
+import org.apache.orc.storage.ql.exec.vector.BytesColumnVector;
+import org.apache.orc.storage.ql.exec.vector.ColumnVector;
+import org.apache.orc.storage.ql.exec.vector.DoubleColumnVector;
+import org.apache.orc.storage.ql.exec.vector.LongColumnVector;
+import org.apache.orc.storage.ql.exec.vector.StructColumnVector;
+import org.apache.orc.storage.ql.exec.vector.VectorizedRowBatch;
+
+
+public class OrcValueReaders {
+  private OrcValueReaders() {
+  }
+
+  public static OrcValReader<?> booleans() {
+    return BooleanReader.INSTANCE;
+  }
+
+  public static OrcValReader<?> shorts() {
+    return ShortReader.INSTANCE;
+  }
+
+  public static OrcValReader<?> ints() {
+    return IntegerReader.INSTANCE;
+  }
+
+  public static OrcValReader<?> longs() {
+    return LongReader.INSTANCE;
+  }
+
+  public static OrcValReader<?> floats() {
+    return FloatReader.INSTANCE;
+  }
+
+  public static OrcValReader<?> doubles() {
+    return DoubleReader.INSTANCE;
+  }
+
+  public static OrcValReader<?> bytes() {
+    return BytesReader.INSTANCE;
+  }
+
+  public static OrcValReader<?> byteReader() {
+    return ByteReader.INSTANCE;
+  }
+
+  private static class BooleanReader implements OrcValReader<Boolean> {
+    static final OrcValReader<?> INSTANCE = new BooleanReader();
+
+    private BooleanReader() {
+    }
+
+    @Override
+    public Boolean nonNullRead(ColumnVector vector, int row) {
+      return ((LongColumnVector) vector).vector[row] != 0;
+    }
+  }
+
+  private static class ShortReader implements OrcValReader<Short> {
+    static final OrcValReader<?> INSTANCE = new ShortReader();
+
+    private ShortReader() {
+    }
+
+    @Override
+    public Short nonNullRead(ColumnVector vector, int row) {
+      return (short) ((LongColumnVector) vector).vector[row];
+    }
+  }
+
+  private static class IntegerReader implements OrcValReader<Integer> {
+    static final OrcValReader<?> INSTANCE = new IntegerReader();
+
+    private IntegerReader() {
+    }
+
+    @Override
+    public Integer nonNullRead(ColumnVector vector, int row) {
+      return (int) ((LongColumnVector) vector).vector[row];
+    }
+  }
+
+  private static class LongReader implements OrcValReader<Long> {
+    static final OrcValReader<?> INSTANCE = new LongReader();
+
+    private LongReader() {
+    }
+
+    @Override
+    public Long nonNullRead(ColumnVector vector, int row) {
+      return ((LongColumnVector) vector).vector[row];
+    }
+  }
+
+  private static class FloatReader implements OrcValReader<Float> {
+    private static final FloatReader INSTANCE = new FloatReader();
+
+    private FloatReader() {
+    }
+
+    @Override
+    public Float nonNullRead(ColumnVector vector, int row) {
+      return (float) ((DoubleColumnVector) vector).vector[row];
+    }
+  }
+
+  private static class DoubleReader implements OrcValReader<Double> {
+    private static final DoubleReader INSTANCE = new DoubleReader();
+
+    private DoubleReader() {
+    }
+
+    @Override
+    public Double nonNullRead(ColumnVector vector, int row) {
+      return ((DoubleColumnVector) vector).vector[row];
+    }
+  }
+
+  private static class ByteReader implements OrcValReader<Byte> {
+    private static final ByteReader INSTANCE = new ByteReader();
+
+    private ByteReader() {
+    }
+
+    @Override
+    public Byte nonNullRead(ColumnVector vector, int row) {
+      return (byte) ((LongColumnVector) vector).vector[row];
+    }
+  }
+
+  private static class BytesReader implements OrcValReader<byte[]> {
+    private static final BytesReader INSTANCE = new BytesReader();
+
+    private BytesReader() {
+    }
+
+    @Override
+    public byte[] nonNullRead(ColumnVector vector, int row) {
+      BytesColumnVector bytesVector = (BytesColumnVector) vector;
+
+      return Arrays.copyOfRange(
+          bytesVector.vector[row], bytesVector.start[row], bytesVector.start[row] + bytesVector.length[row]);
+    }
+  }
+
+  public abstract static class StructReader<T> implements OrcValReader<T> {
+    private final OrcValReader<?>[] readers;
+    private final int[] positions;
+    private final Object[] constants;
+
+    protected StructReader(List<OrcValReader<?>> readers) {

Review comment:
       Nevermind, I see what's happening.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [incubator-iceberg] rdsr commented on a change in pull request #989: [WIP] Orc nested Identity partition support

Posted by GitBox <gi...@apache.org>.

rdsr commented on a change in pull request #989:
URL: https://github.com/apache/incubator-iceberg/pull/989#discussion_r419022550



##########
File path: spark/src/main/java/org/apache/iceberg/spark/data/SparkOrcValueReaders.java
##########
@@ -0,0 +1,206 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.spark.data;
+
+import com.google.common.collect.Lists;
+import java.math.BigDecimal;
+import java.util.List;
+import java.util.Map;
+import org.apache.iceberg.orc.OrcValReader;
+import org.apache.iceberg.orc.OrcValueReaders;
+import org.apache.iceberg.types.Types;
+import org.apache.orc.storage.ql.exec.vector.BytesColumnVector;
+import org.apache.orc.storage.ql.exec.vector.ColumnVector;
+import org.apache.orc.storage.ql.exec.vector.DecimalColumnVector;
+import org.apache.orc.storage.ql.exec.vector.ListColumnVector;
+import org.apache.orc.storage.ql.exec.vector.MapColumnVector;
+import org.apache.orc.storage.ql.exec.vector.TimestampColumnVector;
+import org.apache.orc.storage.serde2.io.HiveDecimalWritable;
+import org.apache.spark.sql.catalyst.InternalRow;
+import org.apache.spark.sql.catalyst.expressions.GenericInternalRow;
+import org.apache.spark.sql.catalyst.util.ArrayBasedMapData;
+import org.apache.spark.sql.catalyst.util.ArrayData;
+import org.apache.spark.sql.catalyst.util.GenericArrayData;
+import org.apache.spark.sql.catalyst.util.MapData;
+import org.apache.spark.sql.types.Decimal;
+import org.apache.spark.unsafe.types.UTF8String;
+
+
+class SparkOrcValueReaders {
+  private SparkOrcValueReaders() {
+  }
+
+  static OrcValReader<UTF8String> strings() {
+    return StringReader.INSTANCE;
+  }
+
+  static OrcValReader<?> timestampTzs() {
+    return TimestampTzReader.INSTANCE;
+  }
+
+  static OrcValReader<?> struct(
+      List<OrcValReader<?>> readers, Types.StructType struct, Map<Integer, ?> idToConstant) {
+    return new StructReader(readers, struct, idToConstant);
+  }
+
+  static OrcValReader<?> array(OrcValReader<?> elementReader) {
+    return new ArrayReader(elementReader);
+  }
+
+  static OrcValReader<?> map(OrcValReader<?> keyReader, OrcValReader<?> valueReader) {
+    return new MapReader(keyReader, valueReader);
+  }
+
+  private static class ArrayReader implements OrcValReader<ArrayData> {
+    private final OrcValReader<?> elementReader;
+    private final List<Object> reusedList = Lists.newArrayList();
+
+    private ArrayReader(OrcValReader<?> elementReader) {
+      this.elementReader = elementReader;
+    }
+
+    @Override
+    public ArrayData nonNullRead(ColumnVector vector, int row) {
+      reusedList.clear();
+      ListColumnVector listVector = (ListColumnVector) vector;
+      int offset = (int) listVector.offsets[row];
+      int length = (int) listVector.lengths[row];
+      for (int c = 0; c < length; ++c) {
+        reusedList.add(elementReader.read(listVector.child, offset + c));
+      }
+      return new GenericArrayData(reusedList.toArray());
+    }
+  }
+
+  private static class MapReader implements OrcValReader<MapData> {
+    private final OrcValReader<?> keyReader;
+    private final OrcValReader<?> valueReader;
+
+    private final List<Object> reusedKeyList = Lists.newArrayList();
+    private final List<Object> reusedValueList = Lists.newArrayList();
+
+    private MapReader(OrcValReader<?> keyReader, OrcValReader<?> valueReader) {
+      this.keyReader = keyReader;
+      this.valueReader = valueReader;
+    }
+
+    @Override
+    public MapData nonNullRead(ColumnVector vector, int row) {
+      reusedKeyList.clear();
+      reusedValueList.clear();
+      MapColumnVector mapVector = (MapColumnVector) vector;
+      int offset = (int) mapVector.offsets[row];
+      long length = mapVector.lengths[row];
+      for (int c = 0; c < length; c++) {
+        reusedKeyList.add(keyReader.read(mapVector.keys, offset + c));
+        reusedValueList.add(valueReader.read(mapVector.values, offset + c));
+      }
+
+      return new ArrayBasedMapData(
+          new GenericArrayData(reusedKeyList.toArray()),
+          new GenericArrayData(reusedValueList.toArray()));
+    }
+  }
+
+  static class StructReader extends OrcValueReaders.StructReader<InternalRow> {
+    private final int numFields;
+    private final InternalRow internalRow;
+
+    protected StructReader(List<OrcValReader<?>> readers, Types.StructType struct, Map<Integer, ?> idToConstant) {
+      super(readers, struct, idToConstant);
+      this.numFields = readers.size();
+      internalRow = createOrReuse();
+    }
+
+    @Override
+    protected InternalRow createOrReuse() {
+      return new GenericInternalRow(numFields);

Review comment:
       do not see any differece :)

##########
File path: spark/src/main/java/org/apache/iceberg/spark/data/SparkOrcValueReaders.java
##########
@@ -0,0 +1,206 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.spark.data;
+
+import com.google.common.collect.Lists;
+import java.math.BigDecimal;
+import java.util.List;
+import java.util.Map;
+import org.apache.iceberg.orc.OrcValReader;
+import org.apache.iceberg.orc.OrcValueReaders;
+import org.apache.iceberg.types.Types;
+import org.apache.orc.storage.ql.exec.vector.BytesColumnVector;
+import org.apache.orc.storage.ql.exec.vector.ColumnVector;
+import org.apache.orc.storage.ql.exec.vector.DecimalColumnVector;
+import org.apache.orc.storage.ql.exec.vector.ListColumnVector;
+import org.apache.orc.storage.ql.exec.vector.MapColumnVector;
+import org.apache.orc.storage.ql.exec.vector.TimestampColumnVector;
+import org.apache.orc.storage.serde2.io.HiveDecimalWritable;
+import org.apache.spark.sql.catalyst.InternalRow;
+import org.apache.spark.sql.catalyst.expressions.GenericInternalRow;
+import org.apache.spark.sql.catalyst.util.ArrayBasedMapData;
+import org.apache.spark.sql.catalyst.util.ArrayData;
+import org.apache.spark.sql.catalyst.util.GenericArrayData;
+import org.apache.spark.sql.catalyst.util.MapData;
+import org.apache.spark.sql.types.Decimal;
+import org.apache.spark.unsafe.types.UTF8String;
+
+
+class SparkOrcValueReaders {
+  private SparkOrcValueReaders() {
+  }
+
+  static OrcValReader<UTF8String> strings() {
+    return StringReader.INSTANCE;
+  }
+
+  static OrcValReader<?> timestampTzs() {
+    return TimestampTzReader.INSTANCE;
+  }
+
+  static OrcValReader<?> struct(
+      List<OrcValReader<?>> readers, Types.StructType struct, Map<Integer, ?> idToConstant) {
+    return new StructReader(readers, struct, idToConstant);
+  }
+
+  static OrcValReader<?> array(OrcValReader<?> elementReader) {
+    return new ArrayReader(elementReader);
+  }
+
+  static OrcValReader<?> map(OrcValReader<?> keyReader, OrcValReader<?> valueReader) {
+    return new MapReader(keyReader, valueReader);
+  }
+
+  private static class ArrayReader implements OrcValReader<ArrayData> {
+    private final OrcValReader<?> elementReader;
+    private final List<Object> reusedList = Lists.newArrayList();
+
+    private ArrayReader(OrcValReader<?> elementReader) {
+      this.elementReader = elementReader;
+    }
+
+    @Override
+    public ArrayData nonNullRead(ColumnVector vector, int row) {
+      reusedList.clear();
+      ListColumnVector listVector = (ListColumnVector) vector;
+      int offset = (int) listVector.offsets[row];
+      int length = (int) listVector.lengths[row];
+      for (int c = 0; c < length; ++c) {
+        reusedList.add(elementReader.read(listVector.child, offset + c));
+      }
+      return new GenericArrayData(reusedList.toArray());
+    }
+  }
+
+  private static class MapReader implements OrcValReader<MapData> {
+    private final OrcValReader<?> keyReader;
+    private final OrcValReader<?> valueReader;
+
+    private final List<Object> reusedKeyList = Lists.newArrayList();
+    private final List<Object> reusedValueList = Lists.newArrayList();
+
+    private MapReader(OrcValReader<?> keyReader, OrcValReader<?> valueReader) {
+      this.keyReader = keyReader;
+      this.valueReader = valueReader;
+    }
+
+    @Override
+    public MapData nonNullRead(ColumnVector vector, int row) {
+      reusedKeyList.clear();
+      reusedValueList.clear();
+      MapColumnVector mapVector = (MapColumnVector) vector;
+      int offset = (int) mapVector.offsets[row];
+      long length = mapVector.lengths[row];
+      for (int c = 0; c < length; c++) {
+        reusedKeyList.add(keyReader.read(mapVector.keys, offset + c));
+        reusedValueList.add(valueReader.read(mapVector.values, offset + c));
+      }
+
+      return new ArrayBasedMapData(
+          new GenericArrayData(reusedKeyList.toArray()),
+          new GenericArrayData(reusedValueList.toArray()));
+    }
+  }
+
+  static class StructReader extends OrcValueReaders.StructReader<InternalRow> {
+    private final int numFields;
+    private final InternalRow internalRow;
+
+    protected StructReader(List<OrcValReader<?>> readers, Types.StructType struct, Map<Integer, ?> idToConstant) {
+      super(readers, struct, idToConstant);
+      this.numFields = readers.size();
+      internalRow = createOrReuse();
+    }
+
+    @Override
+    protected InternalRow createOrReuse() {
+      return new GenericInternalRow(numFields);

Review comment:
       do not see any difference :)




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #989: [WIP] Orc nested Identity partition support

Posted by GitBox <gi...@apache.org>.

rdblue commented on a change in pull request #989:
URL: https://github.com/apache/incubator-iceberg/pull/989#discussion_r421661929



##########
File path: orc/src/main/java/org/apache/iceberg/orc/OrcValueReaders.java
##########
@@ -0,0 +1,231 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.orc;
+
+import com.google.common.collect.Lists;
+import java.util.Arrays;
+import java.util.List;
+import java.util.Map;
+import org.apache.iceberg.types.Types;
+import org.apache.orc.storage.ql.exec.vector.BytesColumnVector;
+import org.apache.orc.storage.ql.exec.vector.ColumnVector;
+import org.apache.orc.storage.ql.exec.vector.DoubleColumnVector;
+import org.apache.orc.storage.ql.exec.vector.LongColumnVector;
+import org.apache.orc.storage.ql.exec.vector.StructColumnVector;
+import org.apache.orc.storage.ql.exec.vector.VectorizedRowBatch;
+
+
+public class OrcValueReaders {
+  private OrcValueReaders() {
+  }
+
+  public static OrcValReader<?> booleans() {
+    return BooleanReader.INSTANCE;
+  }
+
+  public static OrcValReader<?> shorts() {
+    return ShortReader.INSTANCE;
+  }
+
+  public static OrcValReader<?> ints() {
+    return IntegerReader.INSTANCE;
+  }
+
+  public static OrcValReader<?> longs() {
+    return LongReader.INSTANCE;
+  }
+
+  public static OrcValReader<?> floats() {
+    return FloatReader.INSTANCE;
+  }
+
+  public static OrcValReader<?> doubles() {
+    return DoubleReader.INSTANCE;
+  }
+
+  public static OrcValReader<?> bytes() {
+    return BytesReader.INSTANCE;
+  }
+
+  public static OrcValReader<?> byteReader() {
+    return ByteReader.INSTANCE;
+  }
+
+  private static class BooleanReader implements OrcValReader<Boolean> {
+    static final OrcValReader<?> INSTANCE = new BooleanReader();
+
+    private BooleanReader() {
+    }
+
+    @Override
+    public Boolean nonNullRead(ColumnVector vector, int row) {
+      return ((LongColumnVector) vector).vector[row] != 0;
+    }
+  }
+
+  private static class ShortReader implements OrcValReader<Short> {
+    static final OrcValReader<?> INSTANCE = new ShortReader();
+
+    private ShortReader() {
+    }
+
+    @Override
+    public Short nonNullRead(ColumnVector vector, int row) {
+      return (short) ((LongColumnVector) vector).vector[row];
+    }
+  }
+
+  private static class IntegerReader implements OrcValReader<Integer> {
+    static final OrcValReader<?> INSTANCE = new IntegerReader();
+
+    private IntegerReader() {
+    }
+
+    @Override
+    public Integer nonNullRead(ColumnVector vector, int row) {
+      return (int) ((LongColumnVector) vector).vector[row];
+    }
+  }
+
+  private static class LongReader implements OrcValReader<Long> {
+    static final OrcValReader<?> INSTANCE = new LongReader();
+
+    private LongReader() {
+    }
+
+    @Override
+    public Long nonNullRead(ColumnVector vector, int row) {
+      return ((LongColumnVector) vector).vector[row];
+    }
+  }
+
+  private static class FloatReader implements OrcValReader<Float> {
+    private static final FloatReader INSTANCE = new FloatReader();
+
+    private FloatReader() {
+    }
+
+    @Override
+    public Float nonNullRead(ColumnVector vector, int row) {
+      return (float) ((DoubleColumnVector) vector).vector[row];
+    }
+  }
+
+  private static class DoubleReader implements OrcValReader<Double> {
+    private static final DoubleReader INSTANCE = new DoubleReader();
+
+    private DoubleReader() {
+    }
+
+    @Override
+    public Double nonNullRead(ColumnVector vector, int row) {
+      return ((DoubleColumnVector) vector).vector[row];
+    }
+  }
+
+  private static class ByteReader implements OrcValReader<Byte> {
+    private static final ByteReader INSTANCE = new ByteReader();
+
+    private ByteReader() {
+    }
+
+    @Override
+    public Byte nonNullRead(ColumnVector vector, int row) {
+      return (byte) ((LongColumnVector) vector).vector[row];
+    }
+  }
+
+  private static class BytesReader implements OrcValReader<byte[]> {
+    private static final BytesReader INSTANCE = new BytesReader();
+
+    private BytesReader() {
+    }
+
+    @Override
+    public byte[] nonNullRead(ColumnVector vector, int row) {
+      BytesColumnVector bytesVector = (BytesColumnVector) vector;
+
+      return Arrays.copyOfRange(
+          bytesVector.vector[row], bytesVector.start[row], bytesVector.start[row] + bytesVector.length[row]);
+    }
+  }
+
+  public abstract static class StructReader<T> implements OrcValReader<T> {
+    private final OrcValReader<?>[] readers;
+    private final int[] positions;
+    private final Object[] constants;
+
+    protected StructReader(List<OrcValReader<?>> readers) {
+      this.readers = readers.toArray(new OrcValReader[0]);
+      this.positions = new int[0];
+      this.constants = new Object[0];
+    }
+
+    protected StructReader(List<OrcValReader<?>> readers, Types.StructType struct, Map<Integer, ?> idToConstant) {
+      this.readers = readers.toArray(new OrcValReader[0]);
+      List<Types.NestedField> fields = struct.fields();
+      List<Integer> positionList = Lists.newArrayListWithCapacity(fields.size());
+      List<Object> constantList = Lists.newArrayListWithCapacity(fields.size());
+      for (int pos = 0; pos < fields.size(); pos += 1) {
+        Types.NestedField field = fields.get(pos);
+        Object constant = idToConstant.get(field.fieldId());
+        if (constant != null) {
+          positionList.add(pos);
+          constantList.add(idToConstant.get(field.fieldId()));
+        }
+      }
+
+      this.positions = positionList.stream().mapToInt(Integer::intValue).toArray();
+      this.constants = constantList.toArray();
+    }
+
+    protected abstract T create();
+
+    protected abstract T reuseOrCreate();
+
+    protected abstract void set(T struct, int pos, Object value);
+
+    public OrcValReader<?> reader(int pos) {
+      return readers[pos];
+    }
+
+    @Override
+    public T nonNullRead(ColumnVector vector, int row) {
+      StructColumnVector structVector = (StructColumnVector) vector;
+      return readInternal(create(), structVector.fields, row);
+    }
+
+    public T read(VectorizedRowBatch batch, int row) {

Review comment:
       Instead of having an interface that uses `VectorizedRowBatch`, why not just have the top-level reader create a `StructColumnVector` and pass the batch's columns using that?
   
   You can even make a reusable `StructColumnVector` to avoid needing to create a new one for every batch if you want:
   
   ```java
     class ReusableStructColumnVector extends StructColumnVector {
       public ReusableStructColumnVector(int len, ColumnVector... fields) {
         super(len, fields);
       }
   
       public ReusableStructColumnVector replaceVectors(ColumnVector[] fields) {
         this.fields = fields;
         return this;
       }
     }
   ```
   
   Then we'd only need one `OrcValueReader` interface.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [incubator-iceberg] rdsr commented on a change in pull request #989: [WIP] ORC nested Identity partition support

Posted by GitBox <gi...@apache.org>.

rdsr commented on a change in pull request #989:
URL: https://github.com/apache/incubator-iceberg/pull/989#discussion_r429082398



##########
File path: spark/src/main/java/org/apache/iceberg/spark/data/SparkOrcValueReaders.java
##########
@@ -0,0 +1,213 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.spark.data;
+
+import com.google.common.collect.Lists;
+import java.math.BigDecimal;
+import java.util.List;
+import java.util.Map;
+import org.apache.iceberg.orc.OrcValReader;
+import org.apache.iceberg.orc.OrcValueReaders;
+import org.apache.iceberg.types.Types;
+import org.apache.orc.storage.ql.exec.vector.BytesColumnVector;
+import org.apache.orc.storage.ql.exec.vector.ColumnVector;
+import org.apache.orc.storage.ql.exec.vector.DecimalColumnVector;
+import org.apache.orc.storage.ql.exec.vector.ListColumnVector;
+import org.apache.orc.storage.ql.exec.vector.MapColumnVector;
+import org.apache.orc.storage.ql.exec.vector.TimestampColumnVector;
+import org.apache.orc.storage.serde2.io.HiveDecimalWritable;
+import org.apache.spark.sql.catalyst.InternalRow;
+import org.apache.spark.sql.catalyst.expressions.GenericInternalRow;
+import org.apache.spark.sql.catalyst.util.ArrayBasedMapData;
+import org.apache.spark.sql.catalyst.util.ArrayData;
+import org.apache.spark.sql.catalyst.util.GenericArrayData;
+import org.apache.spark.sql.catalyst.util.MapData;
+import org.apache.spark.sql.types.Decimal;
+import org.apache.spark.unsafe.types.UTF8String;
+
+
+class SparkOrcValueReaders {
+  private SparkOrcValueReaders() {
+  }
+
+  static OrcValReader<UTF8String> strings() {
+    return StringReader.INSTANCE;
+  }
+
+  static OrcValReader<?> timestampTzs() {
+    return TimestampTzReader.INSTANCE;
+  }
+
+  static OrcValReader<?> struct(
+      List<OrcValReader<?>> readers, Types.StructType struct, Map<Integer, ?> idToConstant) {
+    return new StructReader(readers, struct, idToConstant);
+  }
+
+  static OrcValReader<?> array(OrcValReader<?> elementReader) {
+    return new ArrayReader(elementReader);
+  }
+
+  static OrcValReader<?> map(OrcValReader<?> keyReader, OrcValReader<?> valueReader) {
+    return new MapReader(keyReader, valueReader);
+  }
+
+  private static class ArrayReader implements OrcValReader<ArrayData> {
+    private final OrcValReader<?> elementReader;
+    private final List<Object> reusedList = Lists.newArrayList();
+
+    private ArrayReader(OrcValReader<?> elementReader) {
+      this.elementReader = elementReader;
+    }
+
+    @Override
+    public ArrayData nonNullRead(ColumnVector vector, int row) {
+      reusedList.clear();
+      ListColumnVector listVector = (ListColumnVector) vector;
+      int offset = (int) listVector.offsets[row];
+      int length = (int) listVector.lengths[row];
+      for (int c = 0; c < length; ++c) {
+        reusedList.add(elementReader.read(listVector.child, offset + c));
+      }
+      return new GenericArrayData(reusedList.toArray());
+    }
+  }
+
+  private static class MapReader implements OrcValReader<MapData> {
+    private final OrcValReader<?> keyReader;
+    private final OrcValReader<?> valueReader;
+
+    private final List<Object> reusedKeyList = Lists.newArrayList();
+    private final List<Object> reusedValueList = Lists.newArrayList();
+
+    private MapReader(OrcValReader<?> keyReader, OrcValReader<?> valueReader) {
+      this.keyReader = keyReader;
+      this.valueReader = valueReader;
+    }
+
+    @Override
+    public MapData nonNullRead(ColumnVector vector, int row) {
+      reusedKeyList.clear();
+      reusedValueList.clear();
+      MapColumnVector mapVector = (MapColumnVector) vector;
+      int offset = (int) mapVector.offsets[row];
+      long length = mapVector.lengths[row];
+      for (int c = 0; c < length; c++) {
+        reusedKeyList.add(keyReader.read(mapVector.keys, offset + c));
+        reusedValueList.add(valueReader.read(mapVector.values, offset + c));
+      }
+
+      return new ArrayBasedMapData(
+          new GenericArrayData(reusedKeyList.toArray()),
+          new GenericArrayData(reusedValueList.toArray()));
+    }
+  }
+
+  static class StructReader extends OrcValueReaders.StructReader<InternalRow> {
+    private final int numFields;
+    private InternalRow internalRow;

Review comment:
       I ended up removing this for now.

##########
File path: spark/src/main/java/org/apache/iceberg/spark/data/SparkOrcReader.java
##########
@@ -53,425 +40,83 @@
  * It minimizes allocations by reusing most of the objects in the implementation.
  */
 public class SparkOrcReader implements OrcValueReader<InternalRow> {
-  private static final int INITIAL_SIZE = 128 * 1024;
-  private final List<TypeDescription> columns;
-  private final Converter[] converters;
-  private final UnsafeRowWriter rowWriter;
+  private final SparkOrcValueReaders.StructReader reader;
 
-  public SparkOrcReader(TypeDescription readOrcSchema) {
-    columns = readOrcSchema.getChildren();
-    converters = buildConverters();
-    rowWriter = new UnsafeRowWriter(columns.size(), INITIAL_SIZE);
+  public SparkOrcReader(org.apache.iceberg.Schema expectedSchema, TypeDescription readSchema) {
+    this(expectedSchema, readSchema, ImmutableMap.of());
   }
 
-  private Converter[] buildConverters() {
-    final Converter[] newConverters = new Converter[columns.size()];
-    for (int c = 0; c < newConverters.length; ++c) {
-      newConverters[c] = buildConverter(columns.get(c));
-    }
-    return newConverters;
+  @SuppressWarnings("unchecked")
+  public SparkOrcReader(
+      org.apache.iceberg.Schema expectedSchema, TypeDescription readOrcSchema, Map<Integer, ?> idToConstant) {
+    reader = (SparkOrcValueReaders.StructReader) OrcSchemaWithTypeVisitor.visit(
+        expectedSchema, readOrcSchema, new ReadBuilder(idToConstant));
   }
 
   @Override
   public InternalRow read(VectorizedRowBatch batch, int row) {
-    rowWriter.reset();
-    rowWriter.zeroOutNullBytes();
-    for (int c = 0; c < batch.cols.length; ++c) {
-      converters[c].convert(rowWriter, c, batch.cols[c], row);
-    }
-    return rowWriter.getRow();
+    return reader.read(batch, row);

Review comment:
       It worked!




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [incubator-iceberg] rdsr commented on a change in pull request #989: [WIP] Orc nested partition support

Posted by GitBox <gi...@apache.org>.

rdsr commented on a change in pull request #989:
URL: https://github.com/apache/incubator-iceberg/pull/989#discussion_r417797509



##########
File path: orc/src/main/java/org/apache/iceberg/orc/OrcValueReaders.java
##########
@@ -0,0 +1,231 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.orc;
+
+import com.google.common.collect.Lists;
+import java.util.Arrays;
+import java.util.List;
+import java.util.Map;
+import org.apache.iceberg.types.Types;
+import org.apache.orc.storage.ql.exec.vector.BytesColumnVector;
+import org.apache.orc.storage.ql.exec.vector.ColumnVector;
+import org.apache.orc.storage.ql.exec.vector.DoubleColumnVector;
+import org.apache.orc.storage.ql.exec.vector.LongColumnVector;
+import org.apache.orc.storage.ql.exec.vector.StructColumnVector;
+import org.apache.orc.storage.ql.exec.vector.VectorizedRowBatch;
+
+
+public class OrcValueReaders {

Review comment:
       `GenericOrcReader` can share a lot of the code here.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #989: [WIP] Orc nested Identity partition support

Posted by GitBox <gi...@apache.org>.

rdblue commented on a change in pull request #989:
URL: https://github.com/apache/incubator-iceberg/pull/989#discussion_r421668822



##########
File path: spark/src/main/java/org/apache/iceberg/spark/data/SparkOrcValueReaders.java
##########
@@ -0,0 +1,213 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.spark.data;
+
+import com.google.common.collect.Lists;
+import java.math.BigDecimal;
+import java.util.List;
+import java.util.Map;
+import org.apache.iceberg.orc.OrcValReader;
+import org.apache.iceberg.orc.OrcValueReaders;
+import org.apache.iceberg.types.Types;
+import org.apache.orc.storage.ql.exec.vector.BytesColumnVector;
+import org.apache.orc.storage.ql.exec.vector.ColumnVector;
+import org.apache.orc.storage.ql.exec.vector.DecimalColumnVector;
+import org.apache.orc.storage.ql.exec.vector.ListColumnVector;
+import org.apache.orc.storage.ql.exec.vector.MapColumnVector;
+import org.apache.orc.storage.ql.exec.vector.TimestampColumnVector;
+import org.apache.orc.storage.serde2.io.HiveDecimalWritable;
+import org.apache.spark.sql.catalyst.InternalRow;
+import org.apache.spark.sql.catalyst.expressions.GenericInternalRow;
+import org.apache.spark.sql.catalyst.util.ArrayBasedMapData;
+import org.apache.spark.sql.catalyst.util.ArrayData;
+import org.apache.spark.sql.catalyst.util.GenericArrayData;
+import org.apache.spark.sql.catalyst.util.MapData;
+import org.apache.spark.sql.types.Decimal;
+import org.apache.spark.unsafe.types.UTF8String;
+
+
+class SparkOrcValueReaders {
+  private SparkOrcValueReaders() {
+  }
+
+  static OrcValReader<UTF8String> strings() {

Review comment:
       Consider renaming this to something that makes it more obvious that it is returning `UTF8String`. Avro has a factory method named `utf8s` to distinguish between strings and its custom `Utf8` class.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [incubator-iceberg] rdsr commented on a change in pull request #989: [WIP] ORC nested Identity partition support

Posted by GitBox <gi...@apache.org>.

rdsr commented on a change in pull request #989:
URL: https://github.com/apache/incubator-iceberg/pull/989#discussion_r429082701



##########
File path: orc/src/main/java/org/apache/iceberg/orc/OrcValueReaders.java
##########
@@ -0,0 +1,231 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.orc;
+
+import com.google.common.collect.Lists;
+import java.util.Arrays;
+import java.util.List;
+import java.util.Map;
+import org.apache.iceberg.types.Types;
+import org.apache.orc.storage.ql.exec.vector.BytesColumnVector;
+import org.apache.orc.storage.ql.exec.vector.ColumnVector;
+import org.apache.orc.storage.ql.exec.vector.DoubleColumnVector;
+import org.apache.orc.storage.ql.exec.vector.LongColumnVector;
+import org.apache.orc.storage.ql.exec.vector.StructColumnVector;
+import org.apache.orc.storage.ql.exec.vector.VectorizedRowBatch;
+
+
+public class OrcValueReaders {
+  private OrcValueReaders() {
+  }
+
+  public static OrcValReader<?> booleans() {
+    return BooleanReader.INSTANCE;
+  }
+
+  public static OrcValReader<?> shorts() {
+    return ShortReader.INSTANCE;
+  }
+
+  public static OrcValReader<?> ints() {
+    return IntegerReader.INSTANCE;
+  }
+
+  public static OrcValReader<?> longs() {
+    return LongReader.INSTANCE;
+  }
+
+  public static OrcValReader<?> floats() {
+    return FloatReader.INSTANCE;
+  }
+
+  public static OrcValReader<?> doubles() {
+    return DoubleReader.INSTANCE;
+  }
+
+  public static OrcValReader<?> bytes() {
+    return BytesReader.INSTANCE;
+  }
+
+  public static OrcValReader<?> byteReader() {
+    return ByteReader.INSTANCE;
+  }
+
+  private static class BooleanReader implements OrcValReader<Boolean> {
+    static final OrcValReader<?> INSTANCE = new BooleanReader();
+
+    private BooleanReader() {
+    }
+
+    @Override
+    public Boolean nonNullRead(ColumnVector vector, int row) {
+      return ((LongColumnVector) vector).vector[row] != 0;
+    }
+  }
+
+  private static class ShortReader implements OrcValReader<Short> {
+    static final OrcValReader<?> INSTANCE = new ShortReader();
+
+    private ShortReader() {
+    }
+
+    @Override
+    public Short nonNullRead(ColumnVector vector, int row) {
+      return (short) ((LongColumnVector) vector).vector[row];
+    }
+  }
+
+  private static class IntegerReader implements OrcValReader<Integer> {
+    static final OrcValReader<?> INSTANCE = new IntegerReader();
+
+    private IntegerReader() {
+    }
+
+    @Override
+    public Integer nonNullRead(ColumnVector vector, int row) {
+      return (int) ((LongColumnVector) vector).vector[row];
+    }
+  }
+
+  private static class LongReader implements OrcValReader<Long> {
+    static final OrcValReader<?> INSTANCE = new LongReader();
+
+    private LongReader() {
+    }
+
+    @Override
+    public Long nonNullRead(ColumnVector vector, int row) {
+      return ((LongColumnVector) vector).vector[row];
+    }
+  }
+
+  private static class FloatReader implements OrcValReader<Float> {
+    private static final FloatReader INSTANCE = new FloatReader();
+
+    private FloatReader() {
+    }
+
+    @Override
+    public Float nonNullRead(ColumnVector vector, int row) {
+      return (float) ((DoubleColumnVector) vector).vector[row];
+    }
+  }
+
+  private static class DoubleReader implements OrcValReader<Double> {
+    private static final DoubleReader INSTANCE = new DoubleReader();
+
+    private DoubleReader() {
+    }
+
+    @Override
+    public Double nonNullRead(ColumnVector vector, int row) {
+      return ((DoubleColumnVector) vector).vector[row];
+    }
+  }
+
+  private static class ByteReader implements OrcValReader<Byte> {
+    private static final ByteReader INSTANCE = new ByteReader();
+
+    private ByteReader() {
+    }
+
+    @Override
+    public Byte nonNullRead(ColumnVector vector, int row) {
+      return (byte) ((LongColumnVector) vector).vector[row];
+    }
+  }
+
+  private static class BytesReader implements OrcValReader<byte[]> {
+    private static final BytesReader INSTANCE = new BytesReader();
+
+    private BytesReader() {
+    }
+
+    @Override
+    public byte[] nonNullRead(ColumnVector vector, int row) {
+      BytesColumnVector bytesVector = (BytesColumnVector) vector;
+
+      return Arrays.copyOfRange(
+          bytesVector.vector[row], bytesVector.start[row], bytesVector.start[row] + bytesVector.length[row]);
+    }
+  }
+
+  public abstract static class StructReader<T> implements OrcValReader<T> {
+    private final OrcValReader<?>[] readers;
+    private final int[] positions;
+    private final Object[] constants;
+
+    protected StructReader(List<OrcValReader<?>> readers) {
+      this.readers = readers.toArray(new OrcValReader[0]);
+      this.positions = new int[0];
+      this.constants = new Object[0];
+    }
+
+    protected StructReader(List<OrcValReader<?>> readers, Types.StructType struct, Map<Integer, ?> idToConstant) {
+      this.readers = readers.toArray(new OrcValReader[0]);
+      List<Types.NestedField> fields = struct.fields();
+      List<Integer> positionList = Lists.newArrayListWithCapacity(fields.size());
+      List<Object> constantList = Lists.newArrayListWithCapacity(fields.size());
+      for (int pos = 0; pos < fields.size(); pos += 1) {
+        Types.NestedField field = fields.get(pos);
+        Object constant = idToConstant.get(field.fieldId());
+        if (constant != null) {
+          positionList.add(pos);
+          constantList.add(idToConstant.get(field.fieldId()));
+        }
+      }
+
+      this.positions = positionList.stream().mapToInt(Integer::intValue).toArray();
+      this.constants = constantList.toArray();
+    }
+
+    protected abstract T create();
+
+    protected abstract T reuseOrCreate();
+
+    protected abstract void set(T struct, int pos, Object value);
+
+    public OrcValReader<?> reader(int pos) {
+      return readers[pos];
+    }
+
+    @Override
+    public T nonNullRead(ColumnVector vector, int row) {
+      StructColumnVector structVector = (StructColumnVector) vector;
+      return readInternal(create(), structVector.fields, row);
+    }
+
+    public T read(VectorizedRowBatch batch, int row) {

Review comment:
       It worked this time.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #989: [WIP] Orc nested Identity partition support

Posted by GitBox <gi...@apache.org>.

rdblue commented on a change in pull request #989:
URL: https://github.com/apache/incubator-iceberg/pull/989#discussion_r421672821



##########
File path: spark/src/main/java/org/apache/iceberg/spark/data/SparkOrcValueReaders.java
##########
@@ -0,0 +1,206 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.spark.data;
+
+import com.google.common.collect.Lists;
+import java.math.BigDecimal;
+import java.util.List;
+import java.util.Map;
+import org.apache.iceberg.orc.OrcValReader;
+import org.apache.iceberg.orc.OrcValueReaders;
+import org.apache.iceberg.types.Types;
+import org.apache.orc.storage.ql.exec.vector.BytesColumnVector;
+import org.apache.orc.storage.ql.exec.vector.ColumnVector;
+import org.apache.orc.storage.ql.exec.vector.DecimalColumnVector;
+import org.apache.orc.storage.ql.exec.vector.ListColumnVector;
+import org.apache.orc.storage.ql.exec.vector.MapColumnVector;
+import org.apache.orc.storage.ql.exec.vector.TimestampColumnVector;
+import org.apache.orc.storage.serde2.io.HiveDecimalWritable;
+import org.apache.spark.sql.catalyst.InternalRow;
+import org.apache.spark.sql.catalyst.expressions.GenericInternalRow;
+import org.apache.spark.sql.catalyst.util.ArrayBasedMapData;
+import org.apache.spark.sql.catalyst.util.ArrayData;
+import org.apache.spark.sql.catalyst.util.GenericArrayData;
+import org.apache.spark.sql.catalyst.util.MapData;
+import org.apache.spark.sql.types.Decimal;
+import org.apache.spark.unsafe.types.UTF8String;
+
+
+class SparkOrcValueReaders {
+  private SparkOrcValueReaders() {
+  }
+
+  static OrcValReader<UTF8String> strings() {
+    return StringReader.INSTANCE;
+  }
+
+  static OrcValReader<?> timestampTzs() {
+    return TimestampTzReader.INSTANCE;
+  }
+
+  static OrcValReader<?> struct(
+      List<OrcValReader<?>> readers, Types.StructType struct, Map<Integer, ?> idToConstant) {
+    return new StructReader(readers, struct, idToConstant);
+  }
+
+  static OrcValReader<?> array(OrcValReader<?> elementReader) {
+    return new ArrayReader(elementReader);
+  }
+
+  static OrcValReader<?> map(OrcValReader<?> keyReader, OrcValReader<?> valueReader) {
+    return new MapReader(keyReader, valueReader);
+  }
+
+  private static class ArrayReader implements OrcValReader<ArrayData> {
+    private final OrcValReader<?> elementReader;
+    private final List<Object> reusedList = Lists.newArrayList();
+
+    private ArrayReader(OrcValReader<?> elementReader) {
+      this.elementReader = elementReader;
+    }
+
+    @Override
+    public ArrayData nonNullRead(ColumnVector vector, int row) {
+      reusedList.clear();
+      ListColumnVector listVector = (ListColumnVector) vector;
+      int offset = (int) listVector.offsets[row];
+      int length = (int) listVector.lengths[row];
+      for (int c = 0; c < length; ++c) {
+        reusedList.add(elementReader.read(listVector.child, offset + c));
+      }
+      return new GenericArrayData(reusedList.toArray());
+    }
+  }
+
+  private static class MapReader implements OrcValReader<MapData> {
+    private final OrcValReader<?> keyReader;
+    private final OrcValReader<?> valueReader;
+
+    private final List<Object> reusedKeyList = Lists.newArrayList();
+    private final List<Object> reusedValueList = Lists.newArrayList();
+
+    private MapReader(OrcValReader<?> keyReader, OrcValReader<?> valueReader) {
+      this.keyReader = keyReader;
+      this.valueReader = valueReader;
+    }
+
+    @Override
+    public MapData nonNullRead(ColumnVector vector, int row) {
+      reusedKeyList.clear();
+      reusedValueList.clear();
+      MapColumnVector mapVector = (MapColumnVector) vector;
+      int offset = (int) mapVector.offsets[row];
+      long length = mapVector.lengths[row];
+      for (int c = 0; c < length; c++) {
+        reusedKeyList.add(keyReader.read(mapVector.keys, offset + c));
+        reusedValueList.add(valueReader.read(mapVector.values, offset + c));
+      }
+
+      return new ArrayBasedMapData(
+          new GenericArrayData(reusedKeyList.toArray()),
+          new GenericArrayData(reusedValueList.toArray()));
+    }
+  }
+
+  static class StructReader extends OrcValueReaders.StructReader<InternalRow> {
+    private final int numFields;
+    private final InternalRow internalRow;
+
+    protected StructReader(List<OrcValReader<?>> readers, Types.StructType struct, Map<Integer, ?> idToConstant) {
+      super(readers, struct, idToConstant);
+      this.numFields = readers.size();
+      internalRow = createOrReuse();
+    }
+
+    @Override
+    protected InternalRow createOrReuse() {
+      return new GenericInternalRow(numFields);

Review comment:
       The Spark/Parquet reader doesn't currently reuse containers.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #989: [WIP] Orc nested Identity partition support

Posted by GitBox <gi...@apache.org>.

rdblue commented on a change in pull request #989:
URL: https://github.com/apache/incubator-iceberg/pull/989#discussion_r421650510



##########
File path: orc/src/main/java/org/apache/iceberg/orc/OrcSchemaWithTypeVisitor.java
##########
@@ -0,0 +1,90 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.orc;
+
+import com.google.common.collect.Lists;
+import java.util.List;
+import org.apache.iceberg.types.Type;
+import org.apache.iceberg.types.Types;
+import org.apache.orc.TypeDescription;
+
+
+public abstract class OrcSchemaWithTypeVisitor<T> {
+  public static <T> T visit(
+      org.apache.iceberg.Schema iSchema, TypeDescription schema, OrcSchemaWithTypeVisitor<T> visitor) {
+    return visit(iSchema.asStruct(), schema, visitor);
+  }
+
+  public static <T> T visit(Type iType, TypeDescription schema, OrcSchemaWithTypeVisitor<T> visitor) {
+    switch (schema.getCategory()) {
+      case STRUCT:
+        return visitRecord(iType.asStructType(), schema, visitor);
+
+      case UNION:
+        // We don't have an answer for union types.
+        throw new IllegalArgumentException("Can't handle " + schema);
+
+      case LIST:
+        Types.ListType list = iType.asListType();
+        return visitor.array(

Review comment:
       In other visitors, we try to name the method after the type in the schema that is being visited. That's why Avro uses `array` but Iceberg uses `list`. Since the category for ORC is `LIST`, should the visitor method be named `list`?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [incubator-iceberg] shardulm94 commented on a change in pull request #989: [WIP] ORC nested Identity partition support

Posted by GitBox <gi...@apache.org>.

shardulm94 commented on a change in pull request #989:
URL: https://github.com/apache/incubator-iceberg/pull/989#discussion_r428432880



##########
File path: orc/src/main/java/org/apache/iceberg/orc/OrcSchemaWithTypeVisitor.java
##########
@@ -0,0 +1,90 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.orc;
+
+import com.google.common.collect.Lists;
+import java.util.List;
+import org.apache.iceberg.types.Type;
+import org.apache.iceberg.types.Types;
+import org.apache.orc.TypeDescription;
+
+
+public abstract class OrcSchemaWithTypeVisitor<T> {
+  public static <T> T visit(
+      org.apache.iceberg.Schema iSchema, TypeDescription schema, OrcSchemaWithTypeVisitor<T> visitor) {
+    return visit(iSchema.asStruct(), schema, visitor);
+  }
+
+  public static <T> T visit(Type iType, TypeDescription schema, OrcSchemaWithTypeVisitor<T> visitor) {
+    switch (schema.getCategory()) {
+      case STRUCT:
+        return visitRecord(iType.asStructType(), schema, visitor);

Review comment:
       iType can be null if a corresponding field is not found in the Iceberg schema for the current field in the ORC schema. So we should put a null check here and in other places where Type.asXXX() is being done

##########
File path: orc/src/main/java/org/apache/iceberg/orc/ORCSchemaUtil.java
##########
@@ -308,7 +309,7 @@ private static TypeDescription buildOrcProjection(Integer fieldId, Type type, bo
           orcType = convert(fieldId, type, false);
         }
     }
-
+    orcType.setAttribute(ICEBERG_ID_ATTRIBUTE, fieldId.toString());

Review comment:
       For completeness sake, also set ICEBERG_REQUIRED_ATTRIBUTE?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #989: [WIP] Orc nested Identity partition support

Posted by GitBox <gi...@apache.org>.

rdblue commented on a change in pull request #989:
URL: https://github.com/apache/incubator-iceberg/pull/989#discussion_r421674571



##########
File path: spark/src/main/java/org/apache/iceberg/spark/source/RowDataReader.java
##########
@@ -109,23 +104,8 @@
     Iterator<InternalRow> iter;
 
     if (hasJoinedPartitionColumns) {
-      if (SUPPORTS_CONSTANTS.contains(file.format())) {
-        iterSchema = requiredSchema;
-        iter = open(task, requiredSchema, PartitionUtil.constantsMap(task, RowDataReader::convertConstant));
-      } else {
-        // schema used to read data files
-        Schema readSchema = TypeUtil.selectNot(requiredSchema, idColumns);
-        Schema partitionSchema = TypeUtil.select(requiredSchema, idColumns);
-        PartitionRowConverter convertToRow = new PartitionRowConverter(partitionSchema, spec);
-        JoinedRow joined = new JoinedRow();
-
-        InternalRow partition = convertToRow.apply(file.partition());
-        joined.withRight(partition);
-
-        // create joined rows and project from the joined schema to the final schema
-        iterSchema = TypeUtil.join(readSchema, partitionSchema);
-        iter = Iterators.transform(open(task, readSchema, ImmutableMap.of()), joined::withLeft);
-      }
+      iterSchema = requiredSchema;
+      iter = open(task, requiredSchema, PartitionUtil.constantsMap(task, RowDataReader::convertConstant));

Review comment:
       Once this and #1004 are in, we can remove the Spark projection!




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [incubator-iceberg] rdsr commented on a change in pull request #989: [WIP] ORC nested Identity partition support

Posted by GitBox <gi...@apache.org>.

rdsr commented on a change in pull request #989:
URL: https://github.com/apache/incubator-iceberg/pull/989#discussion_r429051777



##########
File path: spark/src/main/java/org/apache/iceberg/spark/data/SparkOrcValueReaders.java
##########
@@ -0,0 +1,213 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.spark.data;
+
+import com.google.common.collect.Lists;
+import java.math.BigDecimal;
+import java.util.List;
+import java.util.Map;
+import org.apache.iceberg.orc.OrcValReader;
+import org.apache.iceberg.orc.OrcValueReaders;
+import org.apache.iceberg.types.Types;
+import org.apache.orc.storage.ql.exec.vector.BytesColumnVector;
+import org.apache.orc.storage.ql.exec.vector.ColumnVector;
+import org.apache.orc.storage.ql.exec.vector.DecimalColumnVector;
+import org.apache.orc.storage.ql.exec.vector.ListColumnVector;
+import org.apache.orc.storage.ql.exec.vector.MapColumnVector;
+import org.apache.orc.storage.ql.exec.vector.TimestampColumnVector;
+import org.apache.orc.storage.serde2.io.HiveDecimalWritable;
+import org.apache.spark.sql.catalyst.InternalRow;
+import org.apache.spark.sql.catalyst.expressions.GenericInternalRow;
+import org.apache.spark.sql.catalyst.util.ArrayBasedMapData;
+import org.apache.spark.sql.catalyst.util.ArrayData;
+import org.apache.spark.sql.catalyst.util.GenericArrayData;
+import org.apache.spark.sql.catalyst.util.MapData;
+import org.apache.spark.sql.types.Decimal;
+import org.apache.spark.unsafe.types.UTF8String;
+
+
+class SparkOrcValueReaders {
+  private SparkOrcValueReaders() {
+  }
+
+  static OrcValReader<UTF8String> strings() {
+    return StringReader.INSTANCE;
+  }
+
+  static OrcValReader<?> timestampTzs() {
+    return TimestampTzReader.INSTANCE;
+  }
+
+  static OrcValReader<?> struct(
+      List<OrcValReader<?>> readers, Types.StructType struct, Map<Integer, ?> idToConstant) {
+    return new StructReader(readers, struct, idToConstant);
+  }
+
+  static OrcValReader<?> array(OrcValReader<?> elementReader) {
+    return new ArrayReader(elementReader);
+  }
+
+  static OrcValReader<?> map(OrcValReader<?> keyReader, OrcValReader<?> valueReader) {
+    return new MapReader(keyReader, valueReader);
+  }
+
+  private static class ArrayReader implements OrcValReader<ArrayData> {
+    private final OrcValReader<?> elementReader;
+    private final List<Object> reusedList = Lists.newArrayList();
+
+    private ArrayReader(OrcValReader<?> elementReader) {
+      this.elementReader = elementReader;
+    }
+
+    @Override
+    public ArrayData nonNullRead(ColumnVector vector, int row) {
+      reusedList.clear();
+      ListColumnVector listVector = (ListColumnVector) vector;
+      int offset = (int) listVector.offsets[row];
+      int length = (int) listVector.lengths[row];
+      for (int c = 0; c < length; ++c) {
+        reusedList.add(elementReader.read(listVector.child, offset + c));
+      }
+      return new GenericArrayData(reusedList.toArray());
+    }
+  }
+
+  private static class MapReader implements OrcValReader<MapData> {
+    private final OrcValReader<?> keyReader;
+    private final OrcValReader<?> valueReader;
+
+    private final List<Object> reusedKeyList = Lists.newArrayList();
+    private final List<Object> reusedValueList = Lists.newArrayList();
+
+    private MapReader(OrcValReader<?> keyReader, OrcValReader<?> valueReader) {
+      this.keyReader = keyReader;
+      this.valueReader = valueReader;
+    }
+
+    @Override
+    public MapData nonNullRead(ColumnVector vector, int row) {
+      reusedKeyList.clear();
+      reusedValueList.clear();
+      MapColumnVector mapVector = (MapColumnVector) vector;
+      int offset = (int) mapVector.offsets[row];
+      long length = mapVector.lengths[row];
+      for (int c = 0; c < length; c++) {
+        reusedKeyList.add(keyReader.read(mapVector.keys, offset + c));
+        reusedValueList.add(valueReader.read(mapVector.values, offset + c));
+      }
+
+      return new ArrayBasedMapData(
+          new GenericArrayData(reusedKeyList.toArray()),
+          new GenericArrayData(reusedValueList.toArray()));
+    }
+  }
+
+  static class StructReader extends OrcValueReaders.StructReader<InternalRow> {
+    private final int numFields;
+    private InternalRow internalRow;

Review comment:
       In this case the reuse is only being done for the top level row and not for any internal row. Is this still a problem?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [incubator-iceberg] rdsr commented on a change in pull request #989: [WIP] ORC nested Identity partition support

Posted by GitBox <gi...@apache.org>.

rdsr commented on a change in pull request #989:
URL: https://github.com/apache/incubator-iceberg/pull/989#discussion_r429083528



##########
File path: orc/src/main/java/org/apache/iceberg/orc/ORCSchemaUtil.java
##########
@@ -308,7 +309,7 @@ private static TypeDescription buildOrcProjection(Integer fieldId, Type type, bo
           orcType = convert(fieldId, type, false);
         }
     }
-
+    orcType.setAttribute(ICEBERG_ID_ATTRIBUTE, fieldId.toString());

Review comment:
       Adding this is actually causing failures
   ```
   org.apache.iceberg.data.orc.TestGenericReadProjection > testRenamedAddedField FAILED
       java.lang.IllegalArgumentException: No conversion of type LONG to self needed
           at org.apache.orc.impl.ConvertTreeReaderFactory.createAnyIntegerConvertTreeReader(ConvertTreeReaderFactory.java:1671)
           at org.apache.orc.impl.ConvertTreeReaderFactory.createConvertTreeReader(ConvertTreeReaderFactory.java:2124)
           at org.apache.orc.impl.TreeReaderFactory.createTreeReader(TreeReaderFactory.java:2331)
           at org.apache.orc.impl.TreeReaderFactory$StructTreeReader.<init>(TreeReaderFactory.java:1961)
           at org.apache.orc.impl.TreeReaderFactory.createTreeReader(TreeReaderFactory.java:2371)
           at org.apache.orc.impl.RecordReaderImpl.<init>(RecordReaderImpl.java:227)
           at org.apache.orc.impl.ReaderImpl.rows(ReaderImpl.java:752)
           at org.apache.iceberg.orc.OrcIterable.newOrcIterator(OrcIterable.java:80)
           at org.apache.iceberg.orc.OrcIterable.iterator(OrcIterable.java:65)
           at com.google.common.collect.Iterables.getOnlyElement(Iterables.java:254)
           at org.apache.iceberg.data.orc.TestGenericReadProjection.writeAndRead(TestGenericReadProjection.java:53)
   ```
   And I vaguely remember we fixed a similar bug before in ORC




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [incubator-iceberg] rdsr commented on a change in pull request #989: [WIP] Orc nested Identity partition support

Posted by GitBox <gi...@apache.org>.

rdsr commented on a change in pull request #989:
URL: https://github.com/apache/incubator-iceberg/pull/989#discussion_r418578291



##########
File path: spark/src/main/java/org/apache/iceberg/spark/data/SparkOrcValueReaders.java
##########
@@ -0,0 +1,206 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.spark.data;
+
+import com.google.common.collect.Lists;
+import java.math.BigDecimal;
+import java.util.List;
+import java.util.Map;
+import org.apache.iceberg.orc.OrcValReader;
+import org.apache.iceberg.orc.OrcValueReaders;
+import org.apache.iceberg.types.Types;
+import org.apache.orc.storage.ql.exec.vector.BytesColumnVector;
+import org.apache.orc.storage.ql.exec.vector.ColumnVector;
+import org.apache.orc.storage.ql.exec.vector.DecimalColumnVector;
+import org.apache.orc.storage.ql.exec.vector.ListColumnVector;
+import org.apache.orc.storage.ql.exec.vector.MapColumnVector;
+import org.apache.orc.storage.ql.exec.vector.TimestampColumnVector;
+import org.apache.orc.storage.serde2.io.HiveDecimalWritable;
+import org.apache.spark.sql.catalyst.InternalRow;
+import org.apache.spark.sql.catalyst.expressions.GenericInternalRow;
+import org.apache.spark.sql.catalyst.util.ArrayBasedMapData;
+import org.apache.spark.sql.catalyst.util.ArrayData;
+import org.apache.spark.sql.catalyst.util.GenericArrayData;
+import org.apache.spark.sql.catalyst.util.MapData;
+import org.apache.spark.sql.types.Decimal;
+import org.apache.spark.unsafe.types.UTF8String;
+
+
+class SparkOrcValueReaders {
+  private SparkOrcValueReaders() {
+  }
+
+  static OrcValReader<UTF8String> strings() {
+    return StringReader.INSTANCE;
+  }
+
+  static OrcValReader<?> timestampTzs() {
+    return TimestampTzReader.INSTANCE;
+  }
+
+  static OrcValReader<?> struct(
+      List<OrcValReader<?>> readers, Types.StructType struct, Map<Integer, ?> idToConstant) {
+    return new StructReader(readers, struct, idToConstant);
+  }
+
+  static OrcValReader<?> array(OrcValReader<?> elementReader) {
+    return new ArrayReader(elementReader);
+  }
+
+  static OrcValReader<?> map(OrcValReader<?> keyReader, OrcValReader<?> valueReader) {
+    return new MapReader(keyReader, valueReader);
+  }
+
+  private static class ArrayReader implements OrcValReader<ArrayData> {
+    private final OrcValReader<?> elementReader;
+    private final List<Object> reusedList = Lists.newArrayList();
+
+    private ArrayReader(OrcValReader<?> elementReader) {
+      this.elementReader = elementReader;
+    }
+
+    @Override
+    public ArrayData nonNullRead(ColumnVector vector, int row) {
+      reusedList.clear();
+      ListColumnVector listVector = (ListColumnVector) vector;
+      int offset = (int) listVector.offsets[row];
+      int length = (int) listVector.lengths[row];
+      for (int c = 0; c < length; ++c) {
+        reusedList.add(elementReader.read(listVector.child, offset + c));
+      }
+      return new GenericArrayData(reusedList.toArray());
+    }
+  }
+
+  private static class MapReader implements OrcValReader<MapData> {
+    private final OrcValReader<?> keyReader;
+    private final OrcValReader<?> valueReader;
+
+    private final List<Object> reusedKeyList = Lists.newArrayList();
+    private final List<Object> reusedValueList = Lists.newArrayList();
+
+    private MapReader(OrcValReader<?> keyReader, OrcValReader<?> valueReader) {
+      this.keyReader = keyReader;
+      this.valueReader = valueReader;
+    }
+
+    @Override
+    public MapData nonNullRead(ColumnVector vector, int row) {
+      reusedKeyList.clear();
+      reusedValueList.clear();
+      MapColumnVector mapVector = (MapColumnVector) vector;
+      int offset = (int) mapVector.offsets[row];
+      long length = mapVector.lengths[row];
+      for (int c = 0; c < length; c++) {
+        reusedKeyList.add(keyReader.read(mapVector.keys, offset + c));
+        reusedValueList.add(valueReader.read(mapVector.values, offset + c));
+      }
+
+      return new ArrayBasedMapData(
+          new GenericArrayData(reusedKeyList.toArray()),
+          new GenericArrayData(reusedValueList.toArray()));
+    }
+  }
+
+  static class StructReader extends OrcValueReaders.StructReader<InternalRow> {
+    private final int numFields;
+    private final InternalRow internalRow;
+
+    protected StructReader(List<OrcValReader<?>> readers, Types.StructType struct, Map<Integer, ?> idToConstant) {
+      super(readers, struct, idToConstant);
+      this.numFields = readers.size();
+      internalRow = createOrReuse();
+    }
+
+    @Override
+    protected InternalRow createOrReuse() {
+      return new GenericInternalRow(numFields);

Review comment:
       I rerun the ORC jmh tests and do see any difference.

##########
File path: spark/src/main/java/org/apache/iceberg/spark/data/SparkOrcValueReaders.java
##########
@@ -0,0 +1,206 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.spark.data;
+
+import com.google.common.collect.Lists;
+import java.math.BigDecimal;
+import java.util.List;
+import java.util.Map;
+import org.apache.iceberg.orc.OrcValReader;
+import org.apache.iceberg.orc.OrcValueReaders;
+import org.apache.iceberg.types.Types;
+import org.apache.orc.storage.ql.exec.vector.BytesColumnVector;
+import org.apache.orc.storage.ql.exec.vector.ColumnVector;
+import org.apache.orc.storage.ql.exec.vector.DecimalColumnVector;
+import org.apache.orc.storage.ql.exec.vector.ListColumnVector;
+import org.apache.orc.storage.ql.exec.vector.MapColumnVector;
+import org.apache.orc.storage.ql.exec.vector.TimestampColumnVector;
+import org.apache.orc.storage.serde2.io.HiveDecimalWritable;
+import org.apache.spark.sql.catalyst.InternalRow;
+import org.apache.spark.sql.catalyst.expressions.GenericInternalRow;
+import org.apache.spark.sql.catalyst.util.ArrayBasedMapData;
+import org.apache.spark.sql.catalyst.util.ArrayData;
+import org.apache.spark.sql.catalyst.util.GenericArrayData;
+import org.apache.spark.sql.catalyst.util.MapData;
+import org.apache.spark.sql.types.Decimal;
+import org.apache.spark.unsafe.types.UTF8String;
+
+
+class SparkOrcValueReaders {
+  private SparkOrcValueReaders() {
+  }
+
+  static OrcValReader<UTF8String> strings() {
+    return StringReader.INSTANCE;
+  }
+
+  static OrcValReader<?> timestampTzs() {
+    return TimestampTzReader.INSTANCE;
+  }
+
+  static OrcValReader<?> struct(
+      List<OrcValReader<?>> readers, Types.StructType struct, Map<Integer, ?> idToConstant) {
+    return new StructReader(readers, struct, idToConstant);
+  }
+
+  static OrcValReader<?> array(OrcValReader<?> elementReader) {
+    return new ArrayReader(elementReader);
+  }
+
+  static OrcValReader<?> map(OrcValReader<?> keyReader, OrcValReader<?> valueReader) {
+    return new MapReader(keyReader, valueReader);
+  }
+
+  private static class ArrayReader implements OrcValReader<ArrayData> {
+    private final OrcValReader<?> elementReader;
+    private final List<Object> reusedList = Lists.newArrayList();
+
+    private ArrayReader(OrcValReader<?> elementReader) {
+      this.elementReader = elementReader;
+    }
+
+    @Override
+    public ArrayData nonNullRead(ColumnVector vector, int row) {
+      reusedList.clear();
+      ListColumnVector listVector = (ListColumnVector) vector;
+      int offset = (int) listVector.offsets[row];
+      int length = (int) listVector.lengths[row];
+      for (int c = 0; c < length; ++c) {
+        reusedList.add(elementReader.read(listVector.child, offset + c));
+      }
+      return new GenericArrayData(reusedList.toArray());
+    }
+  }
+
+  private static class MapReader implements OrcValReader<MapData> {
+    private final OrcValReader<?> keyReader;
+    private final OrcValReader<?> valueReader;
+
+    private final List<Object> reusedKeyList = Lists.newArrayList();
+    private final List<Object> reusedValueList = Lists.newArrayList();
+
+    private MapReader(OrcValReader<?> keyReader, OrcValReader<?> valueReader) {
+      this.keyReader = keyReader;
+      this.valueReader = valueReader;
+    }
+
+    @Override
+    public MapData nonNullRead(ColumnVector vector, int row) {
+      reusedKeyList.clear();
+      reusedValueList.clear();
+      MapColumnVector mapVector = (MapColumnVector) vector;
+      int offset = (int) mapVector.offsets[row];
+      long length = mapVector.lengths[row];
+      for (int c = 0; c < length; c++) {
+        reusedKeyList.add(keyReader.read(mapVector.keys, offset + c));
+        reusedValueList.add(valueReader.read(mapVector.values, offset + c));
+      }
+
+      return new ArrayBasedMapData(
+          new GenericArrayData(reusedKeyList.toArray()),
+          new GenericArrayData(reusedValueList.toArray()));
+    }
+  }
+
+  static class StructReader extends OrcValueReaders.StructReader<InternalRow> {
+    private final int numFields;
+    private final InternalRow internalRow;
+
+    protected StructReader(List<OrcValReader<?>> readers, Types.StructType struct, Map<Integer, ?> idToConstant) {
+      super(readers, struct, idToConstant);
+      this.numFields = readers.size();
+      internalRow = createOrReuse();
+    }
+
+    @Override
+    protected InternalRow createOrReuse() {
+      return new GenericInternalRow(numFields);

Review comment:
       I reran the ORC jmh tests and do see any difference.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #989: [WIP] Orc nested Identity partition support

Posted by GitBox <gi...@apache.org>.

rdblue commented on a change in pull request #989:
URL: https://github.com/apache/incubator-iceberg/pull/989#discussion_r421673301



##########
File path: spark/src/main/java/org/apache/iceberg/spark/data/SparkOrcValueReaders.java
##########
@@ -0,0 +1,213 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.spark.data;
+
+import com.google.common.collect.Lists;
+import java.math.BigDecimal;
+import java.util.List;
+import java.util.Map;
+import org.apache.iceberg.orc.OrcValReader;
+import org.apache.iceberg.orc.OrcValueReaders;
+import org.apache.iceberg.types.Types;
+import org.apache.orc.storage.ql.exec.vector.BytesColumnVector;
+import org.apache.orc.storage.ql.exec.vector.ColumnVector;
+import org.apache.orc.storage.ql.exec.vector.DecimalColumnVector;
+import org.apache.orc.storage.ql.exec.vector.ListColumnVector;
+import org.apache.orc.storage.ql.exec.vector.MapColumnVector;
+import org.apache.orc.storage.ql.exec.vector.TimestampColumnVector;
+import org.apache.orc.storage.serde2.io.HiveDecimalWritable;
+import org.apache.spark.sql.catalyst.InternalRow;
+import org.apache.spark.sql.catalyst.expressions.GenericInternalRow;
+import org.apache.spark.sql.catalyst.util.ArrayBasedMapData;
+import org.apache.spark.sql.catalyst.util.ArrayData;
+import org.apache.spark.sql.catalyst.util.GenericArrayData;
+import org.apache.spark.sql.catalyst.util.MapData;
+import org.apache.spark.sql.types.Decimal;
+import org.apache.spark.unsafe.types.UTF8String;
+
+
+class SparkOrcValueReaders {
+  private SparkOrcValueReaders() {
+  }
+
+  static OrcValReader<UTF8String> strings() {
+    return StringReader.INSTANCE;
+  }
+
+  static OrcValReader<?> timestampTzs() {
+    return TimestampTzReader.INSTANCE;
+  }
+
+  static OrcValReader<?> struct(
+      List<OrcValReader<?>> readers, Types.StructType struct, Map<Integer, ?> idToConstant) {
+    return new StructReader(readers, struct, idToConstant);
+  }
+
+  static OrcValReader<?> array(OrcValReader<?> elementReader) {
+    return new ArrayReader(elementReader);
+  }
+
+  static OrcValReader<?> map(OrcValReader<?> keyReader, OrcValReader<?> valueReader) {
+    return new MapReader(keyReader, valueReader);
+  }
+
+  private static class ArrayReader implements OrcValReader<ArrayData> {
+    private final OrcValReader<?> elementReader;
+    private final List<Object> reusedList = Lists.newArrayList();
+
+    private ArrayReader(OrcValReader<?> elementReader) {
+      this.elementReader = elementReader;
+    }
+
+    @Override
+    public ArrayData nonNullRead(ColumnVector vector, int row) {
+      reusedList.clear();
+      ListColumnVector listVector = (ListColumnVector) vector;
+      int offset = (int) listVector.offsets[row];
+      int length = (int) listVector.lengths[row];
+      for (int c = 0; c < length; ++c) {
+        reusedList.add(elementReader.read(listVector.child, offset + c));
+      }
+      return new GenericArrayData(reusedList.toArray());
+    }
+  }
+
+  private static class MapReader implements OrcValReader<MapData> {
+    private final OrcValReader<?> keyReader;
+    private final OrcValReader<?> valueReader;
+
+    private final List<Object> reusedKeyList = Lists.newArrayList();
+    private final List<Object> reusedValueList = Lists.newArrayList();
+
+    private MapReader(OrcValReader<?> keyReader, OrcValReader<?> valueReader) {
+      this.keyReader = keyReader;
+      this.valueReader = valueReader;
+    }
+
+    @Override
+    public MapData nonNullRead(ColumnVector vector, int row) {
+      reusedKeyList.clear();
+      reusedValueList.clear();
+      MapColumnVector mapVector = (MapColumnVector) vector;
+      int offset = (int) mapVector.offsets[row];
+      long length = mapVector.lengths[row];
+      for (int c = 0; c < length; c++) {
+        reusedKeyList.add(keyReader.read(mapVector.keys, offset + c));
+        reusedValueList.add(valueReader.read(mapVector.values, offset + c));
+      }
+
+      return new ArrayBasedMapData(
+          new GenericArrayData(reusedKeyList.toArray()),
+          new GenericArrayData(reusedValueList.toArray()));
+    }
+  }
+
+  static class StructReader extends OrcValueReaders.StructReader<InternalRow> {
+    private final int numFields;
+    private InternalRow internalRow;

Review comment:
       As I noted above, the reader can't keep track of a row like this.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [incubator-iceberg] rdsr commented on pull request #989: [WIP] ORC nested Identity partition support

Posted by GitBox <gi...@apache.org>.

rdsr commented on pull request #989:
URL: https://github.com/apache/incubator-iceberg/pull/989#issuecomment-632779817


   Weird. I thought I did resolve the conflicts.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [incubator-iceberg] rdsr edited a comment on pull request #989: [WIP] Orc nested Identity partition support

Posted by GitBox <gi...@apache.org>.

rdsr edited a comment on pull request #989:
URL: https://github.com/apache/incubator-iceberg/pull/989#issuecomment-622246098


   I reran the Spark jmh tests which @shardulm94  had written on a fresh checkout of iceberg and against my patch. Here's the output
   
   Fresh checkout of Iceberg
   ```
   Benchmark                                                                          Mode  Cnt   Score   Error  Units
   IcebergSourceNestedORCDataReadBenchmark.readFileSourceNonVectorized                  ss    5  11.004 ± 0.330   s/op
   IcebergSourceNestedORCDataReadBenchmark.readFileSourceVectorized                     ss    5  10.920 ± 1.424   s/op
   IcebergSourceNestedORCDataReadBenchmark.readIceberg                                  ss    5   2.028 ± 0.103   s/op
   IcebergSourceNestedORCDataReadBenchmark.readWithProjectionFileSourceNonVectorized    ss    5  10.695 ± 0.203   s/op
   IcebergSourceNestedORCDataReadBenchmark.readWithProjectionFileSourceVectorized       ss    5  10.293 ± 0.236   s/op
   IcebergSourceNestedORCDataReadBenchmark.readWithProjectionIceberg                    ss    5   1.850 ± 0.211   s/op
   ```
   
   With the ORC nested partition patch applied
   ```
   Benchmark                                                                          Mode  Cnt   Score   Error  Units
   IcebergSourceNestedORCDataReadBenchmark.readFileSourceNonVectorized                  ss    5  13.984 ± 0.440   s/op
   IcebergSourceNestedORCDataReadBenchmark.readFileSourceVectorized                     ss    5  10.454 ± 0.438   s/op
   IcebergSourceNestedORCDataReadBenchmark.readIceberg                                  ss    5   2.264 ± 0.091   s/op
   IcebergSourceNestedORCDataReadBenchmark.readWithProjectionFileSourceNonVectorized    ss    5  10.037 ± 0.241   s/op
   IcebergSourceNestedORCDataReadBenchmark.readWithProjectionFileSourceVectorized       ss    5  10.615 ± 0.336   s/op
   IcebergSourceNestedORCDataReadBenchmark.readWithProjectionIceberg                    ss    5   1.980 ± 0.167   s/op
   ```
   
   Seems like there isn't much difference.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [incubator-iceberg] rdsr commented on a change in pull request #989: [WIP] Orc nested partition support

Posted by GitBox <gi...@apache.org>.

rdsr commented on a change in pull request #989:
URL: https://github.com/apache/incubator-iceberg/pull/989#discussion_r417797286



##########
File path: orc/src/main/java/org/apache/iceberg/orc/OrcValReader.java
##########
@@ -0,0 +1,36 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.orc;
+
+import org.apache.orc.storage.ql.exec.vector.ColumnVector;
+
+
+public interface OrcValReader<T> {

Review comment:
       I want to rename this to OrcValueReader, but an interface with that name already exists. I though I'd rename the other interface OrcRowReader instead. If folks are OK doing this as part of this RB, let me know.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #989: [WIP] Orc nested Identity partition support

Posted by GitBox <gi...@apache.org>.

rdblue commented on a change in pull request #989:
URL: https://github.com/apache/incubator-iceberg/pull/989#discussion_r421671989



##########
File path: spark/src/main/java/org/apache/iceberg/spark/data/SparkOrcValueReaders.java
##########
@@ -0,0 +1,213 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.spark.data;
+
+import com.google.common.collect.Lists;
+import java.math.BigDecimal;
+import java.util.List;
+import java.util.Map;
+import org.apache.iceberg.orc.OrcValReader;
+import org.apache.iceberg.orc.OrcValueReaders;
+import org.apache.iceberg.types.Types;
+import org.apache.orc.storage.ql.exec.vector.BytesColumnVector;
+import org.apache.orc.storage.ql.exec.vector.ColumnVector;
+import org.apache.orc.storage.ql.exec.vector.DecimalColumnVector;
+import org.apache.orc.storage.ql.exec.vector.ListColumnVector;
+import org.apache.orc.storage.ql.exec.vector.MapColumnVector;
+import org.apache.orc.storage.ql.exec.vector.TimestampColumnVector;
+import org.apache.orc.storage.serde2.io.HiveDecimalWritable;
+import org.apache.spark.sql.catalyst.InternalRow;
+import org.apache.spark.sql.catalyst.expressions.GenericInternalRow;
+import org.apache.spark.sql.catalyst.util.ArrayBasedMapData;
+import org.apache.spark.sql.catalyst.util.ArrayData;
+import org.apache.spark.sql.catalyst.util.GenericArrayData;
+import org.apache.spark.sql.catalyst.util.MapData;
+import org.apache.spark.sql.types.Decimal;
+import org.apache.spark.unsafe.types.UTF8String;
+
+
+class SparkOrcValueReaders {
+  private SparkOrcValueReaders() {
+  }
+
+  static OrcValReader<UTF8String> strings() {
+    return StringReader.INSTANCE;
+  }
+
+  static OrcValReader<?> timestampTzs() {
+    return TimestampTzReader.INSTANCE;
+  }
+
+  static OrcValReader<?> struct(
+      List<OrcValReader<?>> readers, Types.StructType struct, Map<Integer, ?> idToConstant) {
+    return new StructReader(readers, struct, idToConstant);
+  }
+
+  static OrcValReader<?> array(OrcValReader<?> elementReader) {
+    return new ArrayReader(elementReader);
+  }
+
+  static OrcValReader<?> map(OrcValReader<?> keyReader, OrcValReader<?> valueReader) {
+    return new MapReader(keyReader, valueReader);
+  }
+
+  private static class ArrayReader implements OrcValReader<ArrayData> {
+    private final OrcValReader<?> elementReader;
+    private final List<Object> reusedList = Lists.newArrayList();

Review comment:
       Container reuse requires passing objects into the read method for correctness.
   
   The contract for container reuse is that the caller will consume the record before asking for the next one. For example, Spark might copy the entire row into an UnsafeRow.
   
   The problem with keeping a reused list or struct in the reader itself is that the reader might be called more than once to produce a value before the row is consumed. For example, a list of structs `locations list<struct<lat:double,long:double>>` will call the inner reader for each location struct in the list before returning the record. If the reader reuses the struct, then the same struct will be added to the list multiple times and all of them will have the last values set in the reused struct.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #989: [WIP] Orc nested Identity partition support

Posted by GitBox <gi...@apache.org>.

rdblue commented on a change in pull request #989:
URL: https://github.com/apache/incubator-iceberg/pull/989#discussion_r421665455



##########
File path: orc/src/main/java/org/apache/iceberg/orc/OrcValueReaders.java
##########
@@ -0,0 +1,231 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.orc;
+
+import com.google.common.collect.Lists;
+import java.util.Arrays;
+import java.util.List;
+import java.util.Map;
+import org.apache.iceberg.types.Types;
+import org.apache.orc.storage.ql.exec.vector.BytesColumnVector;
+import org.apache.orc.storage.ql.exec.vector.ColumnVector;
+import org.apache.orc.storage.ql.exec.vector.DoubleColumnVector;
+import org.apache.orc.storage.ql.exec.vector.LongColumnVector;
+import org.apache.orc.storage.ql.exec.vector.StructColumnVector;
+import org.apache.orc.storage.ql.exec.vector.VectorizedRowBatch;
+
+
+public class OrcValueReaders {
+  private OrcValueReaders() {
+  }
+
+  public static OrcValReader<?> booleans() {
+    return BooleanReader.INSTANCE;
+  }
+
+  public static OrcValReader<?> shorts() {
+    return ShortReader.INSTANCE;
+  }
+
+  public static OrcValReader<?> ints() {
+    return IntegerReader.INSTANCE;
+  }
+
+  public static OrcValReader<?> longs() {
+    return LongReader.INSTANCE;
+  }
+
+  public static OrcValReader<?> floats() {
+    return FloatReader.INSTANCE;
+  }
+
+  public static OrcValReader<?> doubles() {
+    return DoubleReader.INSTANCE;
+  }
+
+  public static OrcValReader<?> bytes() {
+    return BytesReader.INSTANCE;
+  }
+
+  public static OrcValReader<?> byteReader() {
+    return ByteReader.INSTANCE;
+  }
+
+  private static class BooleanReader implements OrcValReader<Boolean> {
+    static final OrcValReader<?> INSTANCE = new BooleanReader();
+
+    private BooleanReader() {
+    }
+
+    @Override
+    public Boolean nonNullRead(ColumnVector vector, int row) {
+      return ((LongColumnVector) vector).vector[row] != 0;
+    }
+  }
+
+  private static class ShortReader implements OrcValReader<Short> {
+    static final OrcValReader<?> INSTANCE = new ShortReader();
+
+    private ShortReader() {
+    }
+
+    @Override
+    public Short nonNullRead(ColumnVector vector, int row) {
+      return (short) ((LongColumnVector) vector).vector[row];
+    }
+  }
+
+  private static class IntegerReader implements OrcValReader<Integer> {
+    static final OrcValReader<?> INSTANCE = new IntegerReader();
+
+    private IntegerReader() {
+    }
+
+    @Override
+    public Integer nonNullRead(ColumnVector vector, int row) {
+      return (int) ((LongColumnVector) vector).vector[row];
+    }
+  }
+
+  private static class LongReader implements OrcValReader<Long> {
+    static final OrcValReader<?> INSTANCE = new LongReader();
+
+    private LongReader() {
+    }
+
+    @Override
+    public Long nonNullRead(ColumnVector vector, int row) {
+      return ((LongColumnVector) vector).vector[row];
+    }
+  }
+
+  private static class FloatReader implements OrcValReader<Float> {
+    private static final FloatReader INSTANCE = new FloatReader();
+
+    private FloatReader() {
+    }
+
+    @Override
+    public Float nonNullRead(ColumnVector vector, int row) {
+      return (float) ((DoubleColumnVector) vector).vector[row];
+    }
+  }
+
+  private static class DoubleReader implements OrcValReader<Double> {
+    private static final DoubleReader INSTANCE = new DoubleReader();
+
+    private DoubleReader() {
+    }
+
+    @Override
+    public Double nonNullRead(ColumnVector vector, int row) {
+      return ((DoubleColumnVector) vector).vector[row];
+    }
+  }
+
+  private static class ByteReader implements OrcValReader<Byte> {
+    private static final ByteReader INSTANCE = new ByteReader();
+
+    private ByteReader() {
+    }
+
+    @Override
+    public Byte nonNullRead(ColumnVector vector, int row) {
+      return (byte) ((LongColumnVector) vector).vector[row];
+    }
+  }
+
+  private static class BytesReader implements OrcValReader<byte[]> {
+    private static final BytesReader INSTANCE = new BytesReader();
+
+    private BytesReader() {
+    }
+
+    @Override
+    public byte[] nonNullRead(ColumnVector vector, int row) {
+      BytesColumnVector bytesVector = (BytesColumnVector) vector;
+
+      return Arrays.copyOfRange(
+          bytesVector.vector[row], bytesVector.start[row], bytesVector.start[row] + bytesVector.length[row]);
+    }
+  }
+
+  public abstract static class StructReader<T> implements OrcValReader<T> {
+    private final OrcValReader<?>[] readers;
+    private final int[] positions;
+    private final Object[] constants;
+
+    protected StructReader(List<OrcValReader<?>> readers) {
+      this.readers = readers.toArray(new OrcValReader[0]);
+      this.positions = new int[0];
+      this.constants = new Object[0];
+    }
+
+    protected StructReader(List<OrcValReader<?>> readers, Types.StructType struct, Map<Integer, ?> idToConstant) {
+      this.readers = readers.toArray(new OrcValReader[0]);
+      List<Types.NestedField> fields = struct.fields();
+      List<Integer> positionList = Lists.newArrayListWithCapacity(fields.size());
+      List<Object> constantList = Lists.newArrayListWithCapacity(fields.size());
+      for (int pos = 0; pos < fields.size(); pos += 1) {
+        Types.NestedField field = fields.get(pos);
+        Object constant = idToConstant.get(field.fieldId());
+        if (constant != null) {
+          positionList.add(pos);
+          constantList.add(idToConstant.get(field.fieldId()));
+        }
+      }
+
+      this.positions = positionList.stream().mapToInt(Integer::intValue).toArray();
+      this.constants = constantList.toArray();
+    }
+
+    protected abstract T create();
+
+    protected abstract T reuseOrCreate();
+
+    protected abstract void set(T struct, int pos, Object value);
+
+    public OrcValReader<?> reader(int pos) {
+      return readers[pos];
+    }
+
+    @Override
+    public T nonNullRead(ColumnVector vector, int row) {
+      StructColumnVector structVector = (StructColumnVector) vector;
+      return readInternal(create(), structVector.fields, row);
+    }
+
+    public T read(VectorizedRowBatch batch, int row) {
+      return readInternal(reuseOrCreate(), batch.cols, row);
+    }
+
+    private T readInternal(T struct, ColumnVector[] columnVectors, int row) {
+      for (int c = 0; c < readers.length; ++c) {
+        set(struct, c, reader(c).read(columnVectors[c], row));

Review comment:
       You might consider a different approach. This currently mirrors what happens in Avro, where the constants are set after reading a record. That is done because Avro can't skip fields easily -- it needs to read through a value even if the value won't be used.
   
   But columnar formats can easily skip. That's why in Parquet, we replace the column reader with a constant reader. So the struct reader behaves exactly like normal and reads a value from every child reader. But some of those children might ignore what's in the data file and return a constant. That should be more efficient because you're not materializing columns you don't need to.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #989: [WIP] Orc nested Identity partition support

Posted by GitBox <gi...@apache.org>.

rdblue commented on a change in pull request #989:
URL: https://github.com/apache/incubator-iceberg/pull/989#discussion_r421653711



##########
File path: orc/src/main/java/org/apache/iceberg/orc/OrcValReader.java
##########
@@ -0,0 +1,36 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.orc;
+
+import org.apache.orc.storage.ql.exec.vector.ColumnVector;
+
+
+public interface OrcValReader<T> {

Review comment:
       That works for me.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [incubator-iceberg] rdsr commented on pull request #989: [WIP] Orc nested Identity partition support

Posted by GitBox <gi...@apache.org>.

rdsr commented on pull request #989:
URL: https://github.com/apache/incubator-iceberg/pull/989#issuecomment-622246098


   I reran the Spark jmh tests which @shardulm94  had written on a fresh checkout of iceberg and against my patch. Here's the output
   
   Fresh checkout of Iceberg
   ```
   Benchmark                                                                          Mode  Cnt   Score   Error  Units
   IcebergSourceNestedORCDataReadBenchmark.readFileSourceNonVectorized                  ss    5  11.004 ± 0.330   s/op
   IcebergSourceNestedORCDataReadBenchmark.readFileSourceVectorized                     ss    5  10.920 ± 1.424   s/op
   IcebergSourceNestedORCDataReadBenchmark.readIceberg                                  ss    5   2.028 ± 0.103   s/op
   IcebergSourceNestedORCDataReadBenchmark.readWithProjectionFileSourceNonVectorized    ss    5  10.695 ± 0.203   s/op
   IcebergSourceNestedORCDataReadBenchmark.readWithProjectionFileSourceVectorized       ss    5  10.293 ± 0.236   s/op
   IcebergSourceNestedORCDataReadBenchmark.readWithProjectionIceberg                    ss    5   1.850 ± 0.211   s/op
   ```
   
   With the ORC nested partition patch applied
   ```
   Benchmark                                                                          Mode  Cnt   Score   Error  Units
   IcebergSourceNestedORCDataReadBenchmark.readFileSourceNonVectorized                  ss    5  13.984 ± 0.440   s/op
   IcebergSourceNestedORCDataReadBenchmark.readFileSourceVectorized                     ss    5  10.454 ± 0.438   s/op
   IcebergSourceNestedORCDataReadBenchmark.readIceberg                                  ss    5   2.264 ± 0.091   s/op
   IcebergSourceNestedORCDataReadBenchmark.readWithProjectionFileSourceNonVectorized    ss    5  10.037 ± 0.241   s/op
   IcebergSourceNestedORCDataReadBenchmark.readWithProjectionFileSourceVectorized       ss    5  10.615 ± 0.336   s/op
   IcebergSourceNestedORCDataReadBenchmark.readWithProjectionIceberg                    ss    5   1.980 ± 0.167   s/op
   ```
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [incubator-iceberg] rdsr commented on pull request #989: [WIP] ORC nested Identity partition support

Posted by GitBox <gi...@apache.org>.

rdsr commented on pull request #989:
URL: https://github.com/apache/incubator-iceberg/pull/989#issuecomment-632787691


   @rdblue . Fixed conflicts and filed all the necessary followups.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #989: [WIP] Orc nested Identity partition support

Posted by GitBox <gi...@apache.org>.

rdblue commented on a change in pull request #989:
URL: https://github.com/apache/incubator-iceberg/pull/989#discussion_r419017570



##########
File path: spark/src/main/java/org/apache/iceberg/spark/data/SparkOrcValueReaders.java
##########
@@ -0,0 +1,206 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.spark.data;
+
+import com.google.common.collect.Lists;
+import java.math.BigDecimal;
+import java.util.List;
+import java.util.Map;
+import org.apache.iceberg.orc.OrcValReader;
+import org.apache.iceberg.orc.OrcValueReaders;
+import org.apache.iceberg.types.Types;
+import org.apache.orc.storage.ql.exec.vector.BytesColumnVector;
+import org.apache.orc.storage.ql.exec.vector.ColumnVector;
+import org.apache.orc.storage.ql.exec.vector.DecimalColumnVector;
+import org.apache.orc.storage.ql.exec.vector.ListColumnVector;
+import org.apache.orc.storage.ql.exec.vector.MapColumnVector;
+import org.apache.orc.storage.ql.exec.vector.TimestampColumnVector;
+import org.apache.orc.storage.serde2.io.HiveDecimalWritable;
+import org.apache.spark.sql.catalyst.InternalRow;
+import org.apache.spark.sql.catalyst.expressions.GenericInternalRow;
+import org.apache.spark.sql.catalyst.util.ArrayBasedMapData;
+import org.apache.spark.sql.catalyst.util.ArrayData;
+import org.apache.spark.sql.catalyst.util.GenericArrayData;
+import org.apache.spark.sql.catalyst.util.MapData;
+import org.apache.spark.sql.types.Decimal;
+import org.apache.spark.unsafe.types.UTF8String;
+
+
+class SparkOrcValueReaders {
+  private SparkOrcValueReaders() {
+  }
+
+  static OrcValReader<UTF8String> strings() {
+    return StringReader.INSTANCE;
+  }
+
+  static OrcValReader<?> timestampTzs() {
+    return TimestampTzReader.INSTANCE;
+  }
+
+  static OrcValReader<?> struct(
+      List<OrcValReader<?>> readers, Types.StructType struct, Map<Integer, ?> idToConstant) {
+    return new StructReader(readers, struct, idToConstant);
+  }
+
+  static OrcValReader<?> array(OrcValReader<?> elementReader) {
+    return new ArrayReader(elementReader);
+  }
+
+  static OrcValReader<?> map(OrcValReader<?> keyReader, OrcValReader<?> valueReader) {
+    return new MapReader(keyReader, valueReader);
+  }
+
+  private static class ArrayReader implements OrcValReader<ArrayData> {
+    private final OrcValReader<?> elementReader;
+    private final List<Object> reusedList = Lists.newArrayList();
+
+    private ArrayReader(OrcValReader<?> elementReader) {
+      this.elementReader = elementReader;
+    }
+
+    @Override
+    public ArrayData nonNullRead(ColumnVector vector, int row) {
+      reusedList.clear();
+      ListColumnVector listVector = (ListColumnVector) vector;
+      int offset = (int) listVector.offsets[row];
+      int length = (int) listVector.lengths[row];
+      for (int c = 0; c < length; ++c) {
+        reusedList.add(elementReader.read(listVector.child, offset + c));
+      }
+      return new GenericArrayData(reusedList.toArray());
+    }
+  }
+
+  private static class MapReader implements OrcValReader<MapData> {
+    private final OrcValReader<?> keyReader;
+    private final OrcValReader<?> valueReader;
+
+    private final List<Object> reusedKeyList = Lists.newArrayList();
+    private final List<Object> reusedValueList = Lists.newArrayList();
+
+    private MapReader(OrcValReader<?> keyReader, OrcValReader<?> valueReader) {
+      this.keyReader = keyReader;
+      this.valueReader = valueReader;
+    }
+
+    @Override
+    public MapData nonNullRead(ColumnVector vector, int row) {
+      reusedKeyList.clear();
+      reusedValueList.clear();
+      MapColumnVector mapVector = (MapColumnVector) vector;
+      int offset = (int) mapVector.offsets[row];
+      long length = mapVector.lengths[row];
+      for (int c = 0; c < length; c++) {
+        reusedKeyList.add(keyReader.read(mapVector.keys, offset + c));
+        reusedValueList.add(valueReader.read(mapVector.values, offset + c));
+      }
+
+      return new ArrayBasedMapData(
+          new GenericArrayData(reusedKeyList.toArray()),
+          new GenericArrayData(reusedValueList.toArray()));
+    }
+  }
+
+  static class StructReader extends OrcValueReaders.StructReader<InternalRow> {
+    private final int numFields;
+    private final InternalRow internalRow;
+
+    protected StructReader(List<OrcValReader<?>> readers, Types.StructType struct, Map<Integer, ?> idToConstant) {
+      super(readers, struct, idToConstant);
+      this.numFields = readers.size();
+      internalRow = createOrReuse();
+    }
+
+    @Override
+    protected InternalRow createOrReuse() {
+      return new GenericInternalRow(numFields);

Review comment:
       Do or do not?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #989: [WIP] Orc nested Identity partition support

Posted by GitBox <gi...@apache.org>.

rdblue commented on a change in pull request #989:
URL: https://github.com/apache/incubator-iceberg/pull/989#discussion_r421655358



##########
File path: orc/src/main/java/org/apache/iceberg/orc/OrcValueReaders.java
##########
@@ -0,0 +1,231 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.orc;
+
+import com.google.common.collect.Lists;
+import java.util.Arrays;
+import java.util.List;
+import java.util.Map;
+import org.apache.iceberg.types.Types;
+import org.apache.orc.storage.ql.exec.vector.BytesColumnVector;
+import org.apache.orc.storage.ql.exec.vector.ColumnVector;
+import org.apache.orc.storage.ql.exec.vector.DoubleColumnVector;
+import org.apache.orc.storage.ql.exec.vector.LongColumnVector;
+import org.apache.orc.storage.ql.exec.vector.StructColumnVector;
+import org.apache.orc.storage.ql.exec.vector.VectorizedRowBatch;
+
+
+public class OrcValueReaders {
+  private OrcValueReaders() {
+  }
+
+  public static OrcValReader<?> booleans() {
+    return BooleanReader.INSTANCE;
+  }
+
+  public static OrcValReader<?> shorts() {
+    return ShortReader.INSTANCE;
+  }
+
+  public static OrcValReader<?> ints() {
+    return IntegerReader.INSTANCE;
+  }
+
+  public static OrcValReader<?> longs() {
+    return LongReader.INSTANCE;
+  }
+
+  public static OrcValReader<?> floats() {
+    return FloatReader.INSTANCE;
+  }
+
+  public static OrcValReader<?> doubles() {
+    return DoubleReader.INSTANCE;
+  }
+
+  public static OrcValReader<?> bytes() {
+    return BytesReader.INSTANCE;
+  }
+
+  public static OrcValReader<?> byteReader() {
+    return ByteReader.INSTANCE;
+  }
+
+  private static class BooleanReader implements OrcValReader<Boolean> {
+    static final OrcValReader<?> INSTANCE = new BooleanReader();
+
+    private BooleanReader() {
+    }
+
+    @Override
+    public Boolean nonNullRead(ColumnVector vector, int row) {
+      return ((LongColumnVector) vector).vector[row] != 0;
+    }
+  }
+
+  private static class ShortReader implements OrcValReader<Short> {
+    static final OrcValReader<?> INSTANCE = new ShortReader();
+
+    private ShortReader() {
+    }
+
+    @Override
+    public Short nonNullRead(ColumnVector vector, int row) {
+      return (short) ((LongColumnVector) vector).vector[row];
+    }
+  }
+
+  private static class IntegerReader implements OrcValReader<Integer> {
+    static final OrcValReader<?> INSTANCE = new IntegerReader();
+
+    private IntegerReader() {
+    }
+
+    @Override
+    public Integer nonNullRead(ColumnVector vector, int row) {
+      return (int) ((LongColumnVector) vector).vector[row];
+    }
+  }
+
+  private static class LongReader implements OrcValReader<Long> {
+    static final OrcValReader<?> INSTANCE = new LongReader();
+
+    private LongReader() {
+    }
+
+    @Override
+    public Long nonNullRead(ColumnVector vector, int row) {
+      return ((LongColumnVector) vector).vector[row];
+    }
+  }
+
+  private static class FloatReader implements OrcValReader<Float> {
+    private static final FloatReader INSTANCE = new FloatReader();
+
+    private FloatReader() {
+    }
+
+    @Override
+    public Float nonNullRead(ColumnVector vector, int row) {
+      return (float) ((DoubleColumnVector) vector).vector[row];
+    }
+  }
+
+  private static class DoubleReader implements OrcValReader<Double> {
+    private static final DoubleReader INSTANCE = new DoubleReader();
+
+    private DoubleReader() {
+    }
+
+    @Override
+    public Double nonNullRead(ColumnVector vector, int row) {
+      return ((DoubleColumnVector) vector).vector[row];
+    }
+  }
+
+  private static class ByteReader implements OrcValReader<Byte> {

Review comment:
       Similar to short, do we need this reader?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #989: [WIP] Orc nested Identity partition support

Posted by GitBox <gi...@apache.org>.

rdblue commented on a change in pull request #989:
URL: https://github.com/apache/incubator-iceberg/pull/989#discussion_r421662660



##########
File path: orc/src/main/java/org/apache/iceberg/orc/OrcValueReaders.java
##########
@@ -0,0 +1,231 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.orc;
+
+import com.google.common.collect.Lists;
+import java.util.Arrays;
+import java.util.List;
+import java.util.Map;
+import org.apache.iceberg.types.Types;
+import org.apache.orc.storage.ql.exec.vector.BytesColumnVector;
+import org.apache.orc.storage.ql.exec.vector.ColumnVector;
+import org.apache.orc.storage.ql.exec.vector.DoubleColumnVector;
+import org.apache.orc.storage.ql.exec.vector.LongColumnVector;
+import org.apache.orc.storage.ql.exec.vector.StructColumnVector;
+import org.apache.orc.storage.ql.exec.vector.VectorizedRowBatch;
+
+
+public class OrcValueReaders {
+  private OrcValueReaders() {
+  }
+
+  public static OrcValReader<?> booleans() {
+    return BooleanReader.INSTANCE;
+  }
+
+  public static OrcValReader<?> shorts() {
+    return ShortReader.INSTANCE;
+  }
+
+  public static OrcValReader<?> ints() {
+    return IntegerReader.INSTANCE;
+  }
+
+  public static OrcValReader<?> longs() {
+    return LongReader.INSTANCE;
+  }
+
+  public static OrcValReader<?> floats() {
+    return FloatReader.INSTANCE;
+  }
+
+  public static OrcValReader<?> doubles() {
+    return DoubleReader.INSTANCE;
+  }
+
+  public static OrcValReader<?> bytes() {

Review comment:
       These factory methods can declare the type they return, like `OrcValueReader<byte[]>`.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [incubator-iceberg] rdsr commented on a change in pull request #989: [WIP] ORC nested Identity partition support

Posted by GitBox <gi...@apache.org>.

rdsr commented on a change in pull request #989:
URL: https://github.com/apache/incubator-iceberg/pull/989#discussion_r428299279



##########
File path: orc/src/main/java/org/apache/iceberg/orc/ORCSchemaUtil.java
##########
@@ -214,6 +214,7 @@ public static Schema convert(TypeDescription orcSchema) {
         "Error in ORC file, children fields and names do not match.");
 
     List<Types.NestedField> icebergFields = Lists.newArrayListWithExpectedSize(children.size());
+    // TODO how we get field ids from ORC schema

Review comment:
       Should we use another PR to fix this?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [incubator-iceberg] rdsr removed a comment on pull request #989: [WIP] Orc nested Identity partition support

Posted by GitBox <gi...@apache.org>.

rdsr removed a comment on pull request #989:
URL: https://github.com/apache/incubator-iceberg/pull/989#issuecomment-622412537


   > I reran the Spark jmh tests which @shardulm94 had written on a fresh checkout of iceberg and against my patch. Here's the output
   > 
   > Fresh checkout of Iceberg
   > 
   > ```
   > Benchmark                                                                          Mode  Cnt   Score   Error  Units
   > IcebergSourceNestedORCDataReadBenchmark.readFileSourceNonVectorized                  ss    5  11.004 ± 0.330   s/op
   > IcebergSourceNestedORCDataReadBenchmark.readFileSourceVectorized                     ss    5  10.920 ± 1.424   s/op
   > IcebergSourceNestedORCDataReadBenchmark.readIceberg                                  ss    5   2.028 ± 0.103   s/op
   > IcebergSourceNestedORCDataReadBenchmark.readWithProjectionFileSourceNonVectorized    ss    5  10.695 ± 0.203   s/op
   > IcebergSourceNestedORCDataReadBenchmark.readWithProjectionFileSourceVectorized       ss    5  10.293 ± 0.236   s/op
   > IcebergSourceNestedORCDataReadBenchmark.readWithProjectionIceberg                    ss    5   1.850 ± 0.211   s/op
   > ```
   > 
   > With the ORC nested partition patch applied
   > 
   > ```
   > Benchmark                                                                          Mode  Cnt   Score   Error  Units
   > IcebergSourceNestedORCDataReadBenchmark.readFileSourceNonVectorized                  ss    5  13.984 ± 0.440   s/op
   > IcebergSourceNestedORCDataReadBenchmark.readFileSourceVectorized                     ss    5  10.454 ± 0.438   s/op
   > IcebergSourceNestedORCDataReadBenchmark.readIceberg                                  ss    5   2.264 ± 0.091   s/op
   > IcebergSourceNestedORCDataReadBenchmark.readWithProjectionFileSourceNonVectorized    ss    5  10.037 ± 0.241   s/op
   > IcebergSourceNestedORCDataReadBenchmark.readWithProjectionFileSourceVectorized       ss    5  10.615 ± 0.336   s/op
   > IcebergSourceNestedORCDataReadBenchmark.readWithProjectionIceberg                    ss    5   1.980 ± 0.167   s/op
   > ```
   > 
   > Seems like there isn't much difference.
   
   cc @shardulm94 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #989: [WIP] Orc nested Identity partition support

Posted by GitBox <gi...@apache.org>.

rdblue commented on a change in pull request #989:
URL: https://github.com/apache/incubator-iceberg/pull/989#discussion_r421658103



##########
File path: orc/src/main/java/org/apache/iceberg/orc/OrcValueReaders.java
##########
@@ -0,0 +1,231 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.orc;
+
+import com.google.common.collect.Lists;
+import java.util.Arrays;
+import java.util.List;
+import java.util.Map;
+import org.apache.iceberg.types.Types;
+import org.apache.orc.storage.ql.exec.vector.BytesColumnVector;
+import org.apache.orc.storage.ql.exec.vector.ColumnVector;
+import org.apache.orc.storage.ql.exec.vector.DoubleColumnVector;
+import org.apache.orc.storage.ql.exec.vector.LongColumnVector;
+import org.apache.orc.storage.ql.exec.vector.StructColumnVector;
+import org.apache.orc.storage.ql.exec.vector.VectorizedRowBatch;
+
+
+public class OrcValueReaders {
+  private OrcValueReaders() {
+  }
+
+  public static OrcValReader<?> booleans() {
+    return BooleanReader.INSTANCE;
+  }
+
+  public static OrcValReader<?> shorts() {
+    return ShortReader.INSTANCE;
+  }
+
+  public static OrcValReader<?> ints() {
+    return IntegerReader.INSTANCE;
+  }
+
+  public static OrcValReader<?> longs() {
+    return LongReader.INSTANCE;
+  }
+
+  public static OrcValReader<?> floats() {
+    return FloatReader.INSTANCE;
+  }
+
+  public static OrcValReader<?> doubles() {
+    return DoubleReader.INSTANCE;
+  }
+
+  public static OrcValReader<?> bytes() {
+    return BytesReader.INSTANCE;
+  }
+
+  public static OrcValReader<?> byteReader() {
+    return ByteReader.INSTANCE;
+  }
+
+  private static class BooleanReader implements OrcValReader<Boolean> {
+    static final OrcValReader<?> INSTANCE = new BooleanReader();
+
+    private BooleanReader() {
+    }
+
+    @Override
+    public Boolean nonNullRead(ColumnVector vector, int row) {
+      return ((LongColumnVector) vector).vector[row] != 0;
+    }
+  }
+
+  private static class ShortReader implements OrcValReader<Short> {
+    static final OrcValReader<?> INSTANCE = new ShortReader();
+
+    private ShortReader() {
+    }
+
+    @Override
+    public Short nonNullRead(ColumnVector vector, int row) {
+      return (short) ((LongColumnVector) vector).vector[row];
+    }
+  }
+
+  private static class IntegerReader implements OrcValReader<Integer> {
+    static final OrcValReader<?> INSTANCE = new IntegerReader();
+
+    private IntegerReader() {
+    }
+
+    @Override
+    public Integer nonNullRead(ColumnVector vector, int row) {
+      return (int) ((LongColumnVector) vector).vector[row];
+    }
+  }
+
+  private static class LongReader implements OrcValReader<Long> {
+    static final OrcValReader<?> INSTANCE = new LongReader();
+
+    private LongReader() {
+    }
+
+    @Override
+    public Long nonNullRead(ColumnVector vector, int row) {
+      return ((LongColumnVector) vector).vector[row];
+    }
+  }
+
+  private static class FloatReader implements OrcValReader<Float> {
+    private static final FloatReader INSTANCE = new FloatReader();
+
+    private FloatReader() {
+    }
+
+    @Override
+    public Float nonNullRead(ColumnVector vector, int row) {
+      return (float) ((DoubleColumnVector) vector).vector[row];
+    }
+  }
+
+  private static class DoubleReader implements OrcValReader<Double> {
+    private static final DoubleReader INSTANCE = new DoubleReader();
+
+    private DoubleReader() {
+    }
+
+    @Override
+    public Double nonNullRead(ColumnVector vector, int row) {
+      return ((DoubleColumnVector) vector).vector[row];
+    }
+  }
+
+  private static class ByteReader implements OrcValReader<Byte> {
+    private static final ByteReader INSTANCE = new ByteReader();
+
+    private ByteReader() {
+    }
+
+    @Override
+    public Byte nonNullRead(ColumnVector vector, int row) {
+      return (byte) ((LongColumnVector) vector).vector[row];
+    }
+  }
+
+  private static class BytesReader implements OrcValReader<byte[]> {
+    private static final BytesReader INSTANCE = new BytesReader();
+
+    private BytesReader() {
+    }
+
+    @Override
+    public byte[] nonNullRead(ColumnVector vector, int row) {
+      BytesColumnVector bytesVector = (BytesColumnVector) vector;
+
+      return Arrays.copyOfRange(
+          bytesVector.vector[row], bytesVector.start[row], bytesVector.start[row] + bytesVector.length[row]);
+    }
+  }
+
+  public abstract static class StructReader<T> implements OrcValReader<T> {
+    private final OrcValReader<?>[] readers;
+    private final int[] positions;
+    private final Object[] constants;
+
+    protected StructReader(List<OrcValReader<?>> readers) {
+      this.readers = readers.toArray(new OrcValReader[0]);
+      this.positions = new int[0];
+      this.constants = new Object[0];
+    }
+
+    protected StructReader(List<OrcValReader<?>> readers, Types.StructType struct, Map<Integer, ?> idToConstant) {
+      this.readers = readers.toArray(new OrcValReader[0]);
+      List<Types.NestedField> fields = struct.fields();
+      List<Integer> positionList = Lists.newArrayListWithCapacity(fields.size());
+      List<Object> constantList = Lists.newArrayListWithCapacity(fields.size());
+      for (int pos = 0; pos < fields.size(); pos += 1) {
+        Types.NestedField field = fields.get(pos);
+        Object constant = idToConstant.get(field.fieldId());
+        if (constant != null) {
+          positionList.add(pos);
+          constantList.add(idToConstant.get(field.fieldId()));
+        }
+      }
+
+      this.positions = positionList.stream().mapToInt(Integer::intValue).toArray();
+      this.constants = constantList.toArray();
+    }
+
+    protected abstract T create();
+
+    protected abstract T reuseOrCreate();

Review comment:
       Why not pass the possibly reused object in here?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [incubator-iceberg] rdsr commented on a change in pull request #989: [WIP] Orc nested Identity partition support

Posted by GitBox <gi...@apache.org>.

rdsr commented on a change in pull request #989:
URL: https://github.com/apache/incubator-iceberg/pull/989#discussion_r417797286



##########
File path: orc/src/main/java/org/apache/iceberg/orc/OrcValReader.java
##########
@@ -0,0 +1,36 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.orc;
+
+import org.apache.orc.storage.ql.exec.vector.ColumnVector;
+
+
+public interface OrcValReader<T> {

Review comment:
       I want to rename this to OrcValueReader, but an interface with that name already exists. I can rename the other interface `OrcRowReader` instead. If folks are OK doing this as part of this RB, let me know.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [incubator-iceberg] rdsr commented on pull request #989: [WIP] Orc nested partition support

Posted by GitBox <gi...@apache.org>.

rdsr commented on pull request #989:
URL: https://github.com/apache/incubator-iceberg/pull/989#issuecomment-621867088


   cc @edgarRd 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #989: [WIP] Orc nested Identity partition support

Posted by GitBox <gi...@apache.org>.

rdblue commented on a change in pull request #989:
URL: https://github.com/apache/incubator-iceberg/pull/989#discussion_r421665947



##########
File path: spark/src/main/java/org/apache/iceberg/spark/data/SparkOrcReader.java
##########
@@ -53,425 +40,83 @@
  * It minimizes allocations by reusing most of the objects in the implementation.
  */
 public class SparkOrcReader implements OrcValueReader<InternalRow> {
-  private static final int INITIAL_SIZE = 128 * 1024;
-  private final List<TypeDescription> columns;
-  private final Converter[] converters;
-  private final UnsafeRowWriter rowWriter;
+  private final SparkOrcValueReaders.StructReader reader;
 
-  public SparkOrcReader(TypeDescription readOrcSchema) {
-    columns = readOrcSchema.getChildren();
-    converters = buildConverters();
-    rowWriter = new UnsafeRowWriter(columns.size(), INITIAL_SIZE);
+  public SparkOrcReader(org.apache.iceberg.Schema expectedSchema, TypeDescription readSchema) {
+    this(expectedSchema, readSchema, ImmutableMap.of());
   }
 
-  private Converter[] buildConverters() {
-    final Converter[] newConverters = new Converter[columns.size()];
-    for (int c = 0; c < newConverters.length; ++c) {
-      newConverters[c] = buildConverter(columns.get(c));
-    }
-    return newConverters;
+  @SuppressWarnings("unchecked")
+  public SparkOrcReader(
+      org.apache.iceberg.Schema expectedSchema, TypeDescription readOrcSchema, Map<Integer, ?> idToConstant) {
+    reader = (SparkOrcValueReaders.StructReader) OrcSchemaWithTypeVisitor.visit(

Review comment:
       Minor: we prefer to use `this.reader =` when assigning to instance fields so it's clear that it is setting a field.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #989: [WIP] Orc nested Identity partition support

Posted by GitBox <gi...@apache.org>.

rdblue commented on a change in pull request #989:
URL: https://github.com/apache/incubator-iceberg/pull/989#discussion_r420996718



##########
File path: orc/src/main/java/org/apache/iceberg/orc/ORCSchemaUtil.java
##########
@@ -214,6 +214,7 @@ public static Schema convert(TypeDescription orcSchema) {
         "Error in ORC file, children fields and names do not match.");
 
     List<Types.NestedField> icebergFields = Lists.newArrayListWithExpectedSize(children.size());
+    // TODO how we get field ids from ORC schema

Review comment:
       I just noticed the logic here and it's a correctness bug. ORC should not assign column IDs when one is missing. Instead, it should ignore the field.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [incubator-iceberg] rdsr edited a comment on pull request #989: [WIP] Orc nested Identity partition support

Posted by GitBox <gi...@apache.org>.

rdsr edited a comment on pull request #989:
URL: https://github.com/apache/incubator-iceberg/pull/989#issuecomment-622246098


   I reran the Spark jmh tests which @shardulm94  had written on a fresh checkout of iceberg and against my patch. Here's the output
   
   Fresh checkout of Iceberg
   ```
   Benchmark                                                                          Mode  Cnt   Score   Error  Units
   IcebergSourceNestedORCDataReadBenchmark.readFileSourceNonVectorized                  ss    5  11.004 ± 0.330   s/op
   IcebergSourceNestedORCDataReadBenchmark.readFileSourceVectorized                     ss    5  10.920 ± 1.424   s/op
   IcebergSourceNestedORCDataReadBenchmark.readIceberg                                  ss    5   2.028 ± 0.103   s/op
   IcebergSourceNestedORCDataReadBenchmark.readWithProjectionFileSourceNonVectorized    ss    5  10.695 ± 0.203   s/op
   IcebergSourceNestedORCDataReadBenchmark.readWithProjectionFileSourceVectorized       ss    5  10.293 ± 0.236   s/op
   IcebergSourceNestedORCDataReadBenchmark.readWithProjectionIceberg                    ss    5   1.850 ± 0.211   s/op
   ```
   
   With the ORC nested partition patch applied
   ```
   Benchmark                                                                          Mode  Cnt   Score   Error  Units
   IcebergSourceNestedORCDataReadBenchmark.readFileSourceNonVectorized                  ss    5  13.984 ± 0.440   s/op
   IcebergSourceNestedORCDataReadBenchmark.readFileSourceVectorized                     ss    5  10.454 ± 0.438   s/op
   IcebergSourceNestedORCDataReadBenchmark.readIceberg                                  ss    5   2.264 ± 0.091   s/op
   IcebergSourceNestedORCDataReadBenchmark.readWithProjectionFileSourceNonVectorized    ss    5  10.037 ± 0.241   s/op
   IcebergSourceNestedORCDataReadBenchmark.readWithProjectionFileSourceVectorized       ss    5  10.615 ± 0.336   s/op
   IcebergSourceNestedORCDataReadBenchmark.readWithProjectionIceberg                    ss    5   1.980 ± 0.167   s/op
   ```
   
   Seems like there isn't much difference. cc @shardulm94 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [incubator-iceberg] rdsr commented on pull request #989: [WIP] ORC nested Identity partition support

Posted by GitBox <gi...@apache.org>.

rdsr commented on pull request #989:
URL: https://github.com/apache/incubator-iceberg/pull/989#issuecomment-631688885


   thanks @rdblue . Picking this up again. Should address ur comments soon


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [incubator-iceberg] rdsr commented on a change in pull request #989: [WIP] ORC nested Identity partition support

Posted by GitBox <gi...@apache.org>.

rdsr commented on a change in pull request #989:
URL: https://github.com/apache/incubator-iceberg/pull/989#discussion_r429082606



##########
File path: orc/src/main/java/org/apache/iceberg/orc/OrcValueReaders.java
##########
@@ -0,0 +1,231 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.orc;
+
+import com.google.common.collect.Lists;
+import java.util.Arrays;
+import java.util.List;
+import java.util.Map;
+import org.apache.iceberg.types.Types;
+import org.apache.orc.storage.ql.exec.vector.BytesColumnVector;
+import org.apache.orc.storage.ql.exec.vector.ColumnVector;
+import org.apache.orc.storage.ql.exec.vector.DoubleColumnVector;
+import org.apache.orc.storage.ql.exec.vector.LongColumnVector;
+import org.apache.orc.storage.ql.exec.vector.StructColumnVector;
+import org.apache.orc.storage.ql.exec.vector.VectorizedRowBatch;
+
+
+public class OrcValueReaders {
+  private OrcValueReaders() {
+  }
+
+  public static OrcValReader<?> booleans() {
+    return BooleanReader.INSTANCE;
+  }
+
+  public static OrcValReader<?> shorts() {
+    return ShortReader.INSTANCE;
+  }
+
+  public static OrcValReader<?> ints() {
+    return IntegerReader.INSTANCE;
+  }
+
+  public static OrcValReader<?> longs() {
+    return LongReader.INSTANCE;
+  }
+
+  public static OrcValReader<?> floats() {
+    return FloatReader.INSTANCE;
+  }
+
+  public static OrcValReader<?> doubles() {
+    return DoubleReader.INSTANCE;
+  }
+
+  public static OrcValReader<?> bytes() {
+    return BytesReader.INSTANCE;
+  }
+
+  public static OrcValReader<?> byteReader() {
+    return ByteReader.INSTANCE;
+  }
+
+  private static class BooleanReader implements OrcValReader<Boolean> {
+    static final OrcValReader<?> INSTANCE = new BooleanReader();
+
+    private BooleanReader() {
+    }
+
+    @Override
+    public Boolean nonNullRead(ColumnVector vector, int row) {
+      return ((LongColumnVector) vector).vector[row] != 0;
+    }
+  }
+
+  private static class ShortReader implements OrcValReader<Short> {
+    static final OrcValReader<?> INSTANCE = new ShortReader();
+
+    private ShortReader() {
+    }
+
+    @Override
+    public Short nonNullRead(ColumnVector vector, int row) {
+      return (short) ((LongColumnVector) vector).vector[row];
+    }
+  }
+
+  private static class IntegerReader implements OrcValReader<Integer> {
+    static final OrcValReader<?> INSTANCE = new IntegerReader();
+
+    private IntegerReader() {
+    }
+
+    @Override
+    public Integer nonNullRead(ColumnVector vector, int row) {
+      return (int) ((LongColumnVector) vector).vector[row];
+    }
+  }
+
+  private static class LongReader implements OrcValReader<Long> {
+    static final OrcValReader<?> INSTANCE = new LongReader();
+
+    private LongReader() {
+    }
+
+    @Override
+    public Long nonNullRead(ColumnVector vector, int row) {
+      return ((LongColumnVector) vector).vector[row];
+    }
+  }
+
+  private static class FloatReader implements OrcValReader<Float> {
+    private static final FloatReader INSTANCE = new FloatReader();
+
+    private FloatReader() {
+    }
+
+    @Override
+    public Float nonNullRead(ColumnVector vector, int row) {
+      return (float) ((DoubleColumnVector) vector).vector[row];
+    }
+  }
+
+  private static class DoubleReader implements OrcValReader<Double> {
+    private static final DoubleReader INSTANCE = new DoubleReader();
+
+    private DoubleReader() {
+    }
+
+    @Override
+    public Double nonNullRead(ColumnVector vector, int row) {
+      return ((DoubleColumnVector) vector).vector[row];
+    }
+  }
+
+  private static class ByteReader implements OrcValReader<Byte> {
+    private static final ByteReader INSTANCE = new ByteReader();
+
+    private ByteReader() {
+    }
+
+    @Override
+    public Byte nonNullRead(ColumnVector vector, int row) {
+      return (byte) ((LongColumnVector) vector).vector[row];
+    }
+  }
+
+  private static class BytesReader implements OrcValReader<byte[]> {
+    private static final BytesReader INSTANCE = new BytesReader();
+
+    private BytesReader() {
+    }
+
+    @Override
+    public byte[] nonNullRead(ColumnVector vector, int row) {
+      BytesColumnVector bytesVector = (BytesColumnVector) vector;
+
+      return Arrays.copyOfRange(
+          bytesVector.vector[row], bytesVector.start[row], bytesVector.start[row] + bytesVector.length[row]);
+    }
+  }
+
+  public abstract static class StructReader<T> implements OrcValReader<T> {
+    private final OrcValReader<?>[] readers;
+    private final int[] positions;
+    private final Object[] constants;
+
+    protected StructReader(List<OrcValReader<?>> readers) {
+      this.readers = readers.toArray(new OrcValReader[0]);
+      this.positions = new int[0];
+      this.constants = new Object[0];
+    }
+
+    protected StructReader(List<OrcValReader<?>> readers, Types.StructType struct, Map<Integer, ?> idToConstant) {
+      this.readers = readers.toArray(new OrcValReader[0]);
+      List<Types.NestedField> fields = struct.fields();
+      List<Integer> positionList = Lists.newArrayListWithCapacity(fields.size());
+      List<Object> constantList = Lists.newArrayListWithCapacity(fields.size());
+      for (int pos = 0; pos < fields.size(); pos += 1) {
+        Types.NestedField field = fields.get(pos);
+        Object constant = idToConstant.get(field.fieldId());
+        if (constant != null) {
+          positionList.add(pos);
+          constantList.add(idToConstant.get(field.fieldId()));
+        }
+      }
+
+      this.positions = positionList.stream().mapToInt(Integer::intValue).toArray();
+      this.constants = constantList.toArray();
+    }
+
+    protected abstract T create();
+
+    protected abstract T reuseOrCreate();
+
+    protected abstract void set(T struct, int pos, Object value);
+
+    public OrcValReader<?> reader(int pos) {
+      return readers[pos];
+    }
+
+    @Override
+    public T nonNullRead(ColumnVector vector, int row) {
+      StructColumnVector structVector = (StructColumnVector) vector;
+      return readInternal(create(), structVector.fields, row);
+    }
+
+    public T read(VectorizedRowBatch batch, int row) {
+      return readInternal(reuseOrCreate(), batch.cols, row);
+    }
+
+    private T readInternal(T struct, ColumnVector[] columnVectors, int row) {
+      for (int c = 0; c < readers.length; ++c) {
+        set(struct, c, reader(c).read(columnVectors[c], row));

Review comment:
       Is is ok if I tackle this in followup?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org