You are viewing a plain text version of this content. The canonical link for it is here.

Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2022/02/24 10:51:52 UTC

[GitHub] [iceberg] pvary opened a new pull request #4218: Core: Improve GenericReader performance

pvary opened a new pull request #4218:
URL: https://github.com/apache/iceberg/pull/4218


   @rbalamohan identified that:
   > the computation of "get" in "Caffeine" cache itself is expensive with operations like lookups, afterreads, stats updates, fork-join for purging etc.
   
   This causes running `GenericRecord.create()` for every record expensive.
   Sometimes we can not reuse the containers in readers, but we still need better performance. If we create a template record in readers and copy this template record then we can avoid the cache retrieval of the `nameToPos` map.
   
   Created some basic jmh performance tests, and here are the results:
   ```
   benchmark-result.txt.orc.base:  mean =      2.571 ±(99.9%) 0.172 s/op
   benchmark-result.txt.orc.fix:  mean =      2.297 ±(99.9%) 0.129 s/op
   benchmark-result.txt.parquet.base:  mean =      2.649 ±(99.9%) 0.136 s/op
   benchmark-result.txt.parquet.fix:  mean =      2.209 ±(99.9%) 0.093 s/op
   ```
   
   We can see that the change gained ~10-20% performance for the generic readers.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] pvary commented on a change in pull request #4218: Core: Improve GenericReader performance

Posted by GitBox <gi...@apache.org>.

pvary commented on a change in pull request #4218:
URL: https://github.com/apache/iceberg/pull/4218#discussion_r817580893



##########
File path: data/src/jmh/java/org/apache/iceberg/ReaderBenchmark.java
##########
@@ -0,0 +1,100 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg;
+
+import java.io.File;
+import java.io.IOException;
+import org.apache.iceberg.data.RandomGenericData;
+import org.apache.iceberg.data.Record;
+import org.apache.iceberg.io.CloseableIterable;
+import org.apache.iceberg.io.FileAppender;
+import org.apache.iceberg.types.Types;
+import org.openjdk.jmh.annotations.Benchmark;
+import org.openjdk.jmh.annotations.BenchmarkMode;
+import org.openjdk.jmh.annotations.Fork;
+import org.openjdk.jmh.annotations.Measurement;
+import org.openjdk.jmh.annotations.Mode;
+import org.openjdk.jmh.annotations.Scope;
+import org.openjdk.jmh.annotations.Setup;
+import org.openjdk.jmh.annotations.State;
+import org.openjdk.jmh.annotations.TearDown;
+import org.openjdk.jmh.annotations.Threads;
+import org.openjdk.jmh.annotations.Warmup;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import static org.apache.iceberg.types.Types.NestedField.optional;
+import static org.apache.iceberg.types.Types.NestedField.required;
+
+@Fork(1)
+@State(Scope.Benchmark)
+@Warmup(iterations = 3)
+@Measurement(iterations = 20)
+@BenchmarkMode(Mode.SingleShotTime)
+public abstract class ReaderBenchmark {
+  private static final Logger LOG = LoggerFactory.getLogger(ReaderBenchmark.class);
+
+  private static final Schema TEST_SCHEMA = new Schema(
+      required(1, "longCol", Types.LongType.get()),
+      required(2, "intCol", Types.IntegerType.get()),
+      required(3, "floatCol", Types.FloatType.get()),
+      optional(4, "doubleCol", Types.DoubleType.get()),
+      optional(5, "decimalCol", Types.DecimalType.of(20, 5)),
+      optional(6, "dateCol", Types.DateType.get()),
+      optional(7, "timestampCol", Types.TimestampType.withZone()),
+      optional(8, "stringCol", Types.StringType.get()));
+
+  private static final int NUM_ROWS = 2500000;
+  private static final int SEED = -1;
+
+  private File testFile;
+
+  @Setup
+  public void setupBenchmark() throws IOException {
+    testFile = java.nio.file.Files.createTempFile("perf-bench", null).toFile();

Review comment:
       Good catch!
   I was using `iceberg.Files` in previous version of the test, so fully qualified import was needed, but not anymore.
   fixed




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] marton-bod commented on a change in pull request #4218: Core: Improve GenericReader performance

Posted by GitBox <gi...@apache.org>.

marton-bod commented on a change in pull request #4218:
URL: https://github.com/apache/iceberg/pull/4218#discussion_r813956289



##########
File path: mr/src/main/java/org/apache/iceberg/mr/hive/Deserializer.java
##########
@@ -125,13 +125,14 @@ public FieldDeserializer primitive(PrimitiveType type, ObjectInspectorPair pair)
 
     @Override
     public FieldDeserializer struct(StructType type, ObjectInspectorPair pair, List<FieldDeserializer> deserializers) {
+      GenericRecord template = type != null ? GenericRecord.create(type) : null;
       return o -> {
         if (o == null) {
           return null;
         }
 
         List<Object> data = ((StructObjectInspector) pair.sourceInspector()).getStructFieldsDataAsList(o);
-        Record result = GenericRecord.create(type);
+        Record result = template.copy();

Review comment:
       Can this result in NPE?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] pvary commented on a change in pull request #4218: Core: Improve GenericReader performance

Posted by GitBox <gi...@apache.org>.

pvary commented on a change in pull request #4218:
URL: https://github.com/apache/iceberg/pull/4218#discussion_r813961425



##########
File path: mr/src/main/java/org/apache/iceberg/mr/hive/Deserializer.java
##########
@@ -125,13 +125,14 @@ public FieldDeserializer primitive(PrimitiveType type, ObjectInspectorPair pair)
 
     @Override
     public FieldDeserializer struct(StructType type, ObjectInspectorPair pair, List<FieldDeserializer> deserializers) {
+      GenericRecord template = type != null ? GenericRecord.create(type) : null;
       return o -> {
         if (o == null) {
           return null;
         }
 
         List<Object> data = ((StructObjectInspector) pair.sourceInspector()).getStructFieldsDataAsList(o);
-        Record result = GenericRecord.create(type);
+        Record result = template.copy();

Review comment:
       Then there were an NPE with the original code as well. The template is only null, when the type is null.
   ```
     private GenericRecord(StructType struct) {
       this.struct = struct;
       this.size = struct.fields().size();         <-- NPE here when the struct is null
       this.values = new Object[size];
       this.nameToPos = NAME_MAP_CACHE.get(struct);
     }
   ```
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] pvary commented on pull request #4218: Core: Improve GenericReader performance

Posted by GitBox <gi...@apache.org>.

pvary commented on pull request #4218:
URL: https://github.com/apache/iceberg/pull/4218#issuecomment-1051728519


   Seems like @aokolnychyi is occupied otherwise. @RussellSpitzer, or @shardulm94, any chance of taking a look?
   Thanks, Peter 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] pvary commented on a change in pull request #4218: Core: Improve GenericReader performance

Posted by GitBox <gi...@apache.org>.

pvary commented on a change in pull request #4218:
URL: https://github.com/apache/iceberg/pull/4218#discussion_r817599665



##########
File path: mr/src/main/java/org/apache/iceberg/mr/hive/Deserializer.java
##########
@@ -125,13 +125,14 @@ public FieldDeserializer primitive(PrimitiveType type, ObjectInspectorPair pair)
 
     @Override
     public FieldDeserializer struct(StructType type, ObjectInspectorPair pair, List<FieldDeserializer> deserializers) {
+      GenericRecord template = type != null ? GenericRecord.create(type) : null;
       return o -> {
         if (o == null) {
           return null;
         }
 
         List<Object> data = ((StructObjectInspector) pair.sourceInspector()).getStructFieldsDataAsList(o);
-        Record result = GenericRecord.create(type);
+        Record result = template.copy();

Review comment:
       There were an issue with positional deletes where the row was not present in the file. The code below still created reader for the empty struct.
   ```
         Types.StructType struct = iType != null ? iType.asStructType() : null;
         return visitor.struct(struct, group, visitFields(struct, group, visitor));
   ```
   
   Changed it to this, let's see if it breaks something:
   ```
         return iType != null ?
                 visitor.struct(iType.asStructType(), group, visitFields(iType.asStructType(), group, visitor)) : null;
   ```




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] pvary commented on a change in pull request #4218: Core: Improve GenericReader performance

Posted by GitBox <gi...@apache.org>.

pvary commented on a change in pull request #4218:
URL: https://github.com/apache/iceberg/pull/4218#discussion_r817600031



##########
File path: data/benchmark/.gitkeep
##########
@@ -0,0 +1,18 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one

Review comment:
       Shall we remove this file? Or even the one with the spark performance tests?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] openinx commented on a change in pull request #4218: Core: Improve GenericReader performance

Posted by GitBox <gi...@apache.org>.

openinx commented on a change in pull request #4218:
URL: https://github.com/apache/iceberg/pull/4218#discussion_r818257615



##########
File path: data/benchmark/.gitkeep
##########
@@ -0,0 +1,18 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one

Review comment:
       I'm pretty sure we don't need to add this `benchmark` and `benchmark/.gitkeep` to git now.  I just removed this directory and run the jmh command from @pvary , everything is okay and in fact it will just generate the new benchmark directory for the testing report.
   
   @pvary I think we will need to add this `data/benchmark` to the .gitignore file. 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] pvary commented on pull request #4218: Core: Improve GenericReader performance

Posted by GitBox <gi...@apache.org>.

pvary commented on pull request #4218:
URL: https://github.com/apache/iceberg/pull/4218#issuecomment-1059132469


   Thanks everyone for the review and the merge!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] rdblue commented on a change in pull request #4218: Core: Improve GenericReader performance

Posted by GitBox <gi...@apache.org>.

rdblue commented on a change in pull request #4218:
URL: https://github.com/apache/iceberg/pull/4218#discussion_r818136288



##########
File path: parquet/src/main/java/org/apache/iceberg/data/parquet/GenericParquetReaders.java
##########
@@ -53,21 +53,23 @@ private GenericParquetReaders() {
   }
 
   private static class RecordReader extends StructReader<Record, Record> {
-    private final StructType structType;
+    private final GenericRecord template;
 
     RecordReader(List<Type> types,
                  List<ParquetValueReader<?>> readers,
                  StructType struct) {
       super(types, readers);
-      this.structType = struct;
+      this.template = struct != null ? GenericRecord.create(struct) : null;

Review comment:
       Should these also have preconditions that validate the struct is not null?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] openinx commented on a change in pull request #4218: Core: Improve GenericReader performance

Posted by GitBox <gi...@apache.org>.

openinx commented on a change in pull request #4218:
URL: https://github.com/apache/iceberg/pull/4218#discussion_r817498679



##########
File path: mr/src/main/java/org/apache/iceberg/mr/hive/Deserializer.java
##########
@@ -125,13 +125,14 @@ public FieldDeserializer primitive(PrimitiveType type, ObjectInspectorPair pair)
 
     @Override
     public FieldDeserializer struct(StructType type, ObjectInspectorPair pair, List<FieldDeserializer> deserializers) {
+      GenericRecord template = type != null ? GenericRecord.create(type) : null;
       return o -> {
         if (o == null) {
           return null;
         }
 
         List<Object> data = ((StructObjectInspector) pair.sourceInspector()).getStructFieldsDataAsList(o);
-        Record result = GenericRecord.create(type);
+        Record result = template.copy();

Review comment:
       Indeed, the `type` should never be null here because the `SchemaWithPartnerVistor` guarantee it. See: https://github.com/apache/iceberg/blob/90225d6c9413016d611e2ce5eff37db1bc1b4fc5/core/src/main/java/org/apache/iceberg/schema/SchemaWithPartnerVisitor.java#L52-L61
   
   I think we can just add a `Precondition.checkNotNull` in the line128 instead of checking the null case in a `if-else` block.

##########
File path: data/src/jmh/java/org/apache/iceberg/ReaderBenchmark.java
##########
@@ -0,0 +1,100 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg;
+
+import java.io.File;
+import java.io.IOException;
+import org.apache.iceberg.data.RandomGenericData;
+import org.apache.iceberg.data.Record;
+import org.apache.iceberg.io.CloseableIterable;
+import org.apache.iceberg.io.FileAppender;
+import org.apache.iceberg.types.Types;
+import org.openjdk.jmh.annotations.Benchmark;
+import org.openjdk.jmh.annotations.BenchmarkMode;
+import org.openjdk.jmh.annotations.Fork;
+import org.openjdk.jmh.annotations.Measurement;
+import org.openjdk.jmh.annotations.Mode;
+import org.openjdk.jmh.annotations.Scope;
+import org.openjdk.jmh.annotations.Setup;
+import org.openjdk.jmh.annotations.State;
+import org.openjdk.jmh.annotations.TearDown;
+import org.openjdk.jmh.annotations.Threads;
+import org.openjdk.jmh.annotations.Warmup;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import static org.apache.iceberg.types.Types.NestedField.optional;
+import static org.apache.iceberg.types.Types.NestedField.required;
+
+@Fork(1)
+@State(Scope.Benchmark)
+@Warmup(iterations = 3)
+@Measurement(iterations = 20)
+@BenchmarkMode(Mode.SingleShotTime)
+public abstract class ReaderBenchmark {
+  private static final Logger LOG = LoggerFactory.getLogger(ReaderBenchmark.class);
+
+  private static final Schema TEST_SCHEMA = new Schema(
+      required(1, "longCol", Types.LongType.get()),
+      required(2, "intCol", Types.IntegerType.get()),
+      required(3, "floatCol", Types.FloatType.get()),
+      optional(4, "doubleCol", Types.DoubleType.get()),
+      optional(5, "decimalCol", Types.DecimalType.of(20, 5)),
+      optional(6, "dateCol", Types.DateType.get()),
+      optional(7, "timestampCol", Types.TimestampType.withZone()),
+      optional(8, "stringCol", Types.StringType.get()));
+
+  private static final int NUM_ROWS = 2500000;
+  private static final int SEED = -1;
+
+  private File testFile;
+
+  @Setup
+  public void setupBenchmark() throws IOException {
+    testFile = java.nio.file.Files.createTempFile("perf-bench", null).toFile();

Review comment:
       Nit:  Could we import this as a package, instead of using the full qualifier name here ? 

##########
File path: mr/src/main/java/org/apache/iceberg/mr/hive/Deserializer.java
##########
@@ -125,13 +125,14 @@ public FieldDeserializer primitive(PrimitiveType type, ObjectInspectorPair pair)
 
     @Override
     public FieldDeserializer struct(StructType type, ObjectInspectorPair pair, List<FieldDeserializer> deserializers) {
+      GenericRecord template = type != null ? GenericRecord.create(type) : null;
       return o -> {
         if (o == null) {
           return null;
         }
 
         List<Object> data = ((StructObjectInspector) pair.sourceInspector()).getStructFieldsDataAsList(o);
-        Record result = GenericRecord.create(type);
+        Record result = template.copy();

Review comment:
       So here we use `GenericRecord(GenericRecord toCopy)` constructor rather than `GenericRecord(StructType struct)` to eliminate the access to `NAME_MAP_CACHE` ?  I'm afraid the further PR will easily fallback to use the other because this tiny different is hard for the next developer and reviewer to notice.  Is possible to add few doc ?
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] pvary commented on a change in pull request #4218: Core: Improve GenericReader performance

Posted by GitBox <gi...@apache.org>.

pvary commented on a change in pull request #4218:
URL: https://github.com/apache/iceberg/pull/4218#discussion_r818508385



##########
File path: parquet/src/main/java/org/apache/iceberg/data/parquet/GenericParquetReaders.java
##########
@@ -53,21 +53,23 @@ private GenericParquetReaders() {
   }
 
   private static class RecordReader extends StructReader<Record, Record> {
-    private final StructType structType;
+    private final GenericRecord template;
 
     RecordReader(List<Type> types,
                  List<ParquetValueReader<?>> readers,
                  StructType struct) {
       super(types, readers);
-      this.structType = struct;
+      this.template = struct != null ? GenericRecord.create(struct) : null;

Review comment:
       My original change did not contain the null check, and there were failing tests (org.apache.iceberg.TestGenericAppenderFactory > testPosDeleteWriterWithRowSchema[FileFormat=parquet, Partitioned=true] FAILED) with the exception below.
   
   In nutshell, when we are reading the positional delete files we either have `row` in the data or we don't. I have tried to fix this by returning a `null` reader instead of creating a reader for a `null` struct. But this caused issues with Spark projection where we create the reader for the full schema, and having a `null` or missing reader caused issues when we tried to match the readers to the projected schema.
   Fixing this is not trivial so I decided to move forward with the null checks to keep the backward compatibility:
   - It should possible to create a reader with a `null` schema
   - Fail only when we try to actually read from this reader
   
   ```
       java.lang.NullPointerException
           at org.apache.iceberg.data.GenericRecord.<init>(GenericRecord.java:63)
           at org.apache.iceberg.data.GenericRecord.create(GenericRecord.java:53)
           at org.apache.iceberg.data.parquet.GenericParquetReaders$RecordReader.<init>(GenericParquetReaders.java:62)
           at org.apache.iceberg.data.parquet.GenericParquetReaders.createStructReader(GenericParquetReaders.java:52)
           at org.apache.iceberg.data.parquet.BaseParquetReaders$ReadBuilder.struct(BaseParquetReaders.java:176)
           at org.apache.iceberg.data.parquet.BaseParquetReaders$ReadBuilder.struct(BaseParquetReaders.java:114)
           at org.apache.iceberg.parquet.TypeWithSchemaVisitor.visit(TypeWithSchemaVisitor.java:141)
           at org.apache.iceberg.parquet.TypeWithSchemaVisitor.visitField(TypeWithSchemaVisitor.java:169)
           at org.apache.iceberg.parquet.TypeWithSchemaVisitor.visitFields(TypeWithSchemaVisitor.java:183)
           at org.apache.iceberg.parquet.TypeWithSchemaVisitor.visit(TypeWithSchemaVisitor.java:47)
           at org.apache.iceberg.data.parquet.BaseParquetReaders.createReader(BaseParquetReaders.java:67)
           at org.apache.iceberg.data.parquet.BaseParquetReaders.createReader(BaseParquetReaders.java:58)
           at org.apache.iceberg.data.parquet.GenericParquetReaders.buildReader(GenericParquetReaders.java:41)
           at org.apache.iceberg.data.DeleteFilter.lambda$openDeletes$6(DeleteFilter.java:255)
           at org.apache.iceberg.parquet.ReadConf.<init>(ReadConf.java:118)
           at org.apache.iceberg.parquet.ParquetReader.init(ParquetReader.java:66)
           at org.apache.iceberg.parquet.ParquetReader.iterator(ParquetReader.java:77)
           at org.apache.iceberg.parquet.ParquetReader.iterator(ParquetReader.java:38)
           at org.apache.iceberg.util.Filter.lambda$filter$0(Filter.java:35)
           at org.apache.iceberg.io.CloseableIterable$2.iterator(CloseableIterable.java:73)
           at org.apache.iceberg.io.CloseableIterable$4$1.<init>(CloseableIterable.java:99)
           at org.apache.iceberg.io.CloseableIterable$4.iterator(CloseableIterable.java:98)
           at org.apache.iceberg.io.CloseableIterable$ConcatCloseableIterable$ConcatCloseableIterator.<init>(CloseableIterable.java:152)
           at org.apache.iceberg.io.CloseableIterable$ConcatCloseableIterable$ConcatCloseableIterator.<init>(CloseableIterable.java:143)
           at org.apache.iceberg.io.CloseableIterable$ConcatCloseableIterable.iterator(CloseableIterable.java:138)
           at org.apache.iceberg.io.CloseableIterable$ConcatCloseableIterable.iterator(CloseableIterable.java:129)
           at java.lang.Iterable.forEach(Iterable.java:74)
           at org.apache.iceberg.deletes.Deletes.toPositionIndex(Deletes.java:97)
           at org.apache.iceberg.deletes.Deletes.toPositionIndex(Deletes.java:91)
           at org.apache.iceberg.data.DeleteFilter.applyPosDeletes(DeleteFilter.java:231)
           at org.apache.iceberg.data.DeleteFilter.filter(DeleteFilter.java:126)
           at org.apache.iceberg.data.GenericReader.open(GenericReader.java:77)
           at org.apache.iceberg.relocated.com.google.common.collect.Iterators$6.transform(Iterators.java:826)
           at org.apache.iceberg.relocated.com.google.common.collect.TransformedIterator.next(TransformedIterator.java:52)
           at org.apache.iceberg.io.CloseableIterable$ConcatCloseableIterable$ConcatCloseableIterator.<init>(CloseableIterable.java:151)
           at org.apache.iceberg.io.CloseableIterable$ConcatCloseableIterable$ConcatCloseableIterator.<init>(CloseableIterable.java:143)
           at org.apache.iceberg.io.CloseableIterable$ConcatCloseableIterable.iterator(CloseableIterable.java:138)
           at org.apache.iceberg.data.GenericReader.open(GenericReader.java:65)
           at org.apache.iceberg.data.TableScanIterable.iterator(TableScanIterable.java:41)
           at org.apache.iceberg.data.TableScanIterable.iterator(TableScanIterable.java:29)
           at java.lang.Iterable.forEach(Iterable.java:74)
   ```
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] pvary commented on a change in pull request #4218: Core: Improve GenericReader performance

Posted by GitBox <gi...@apache.org>.

pvary commented on a change in pull request #4218:
URL: https://github.com/apache/iceberg/pull/4218#discussion_r817619563



##########
File path: data/benchmark/.gitkeep
##########
@@ -0,0 +1,18 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one

Review comment:
       I am not sure why we want to create and keep this directory 😢
   I have seen that we run the tests like this:
   ```
   ./gradlew cleanTest :iceberg-data:jmh -PjmhIncludeRegex=GenericOrcReaderBenchmark -PjmhOutputPath=benchmark/benchmark-result.txt.orc
   ```
   
   But I am not sure that using the directory is needed for some CI, or other tool, or not




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] pvary commented on a change in pull request #4218: Core: Improve GenericReader performance

Posted by GitBox <gi...@apache.org>.

pvary commented on a change in pull request #4218:
URL: https://github.com/apache/iceberg/pull/4218#discussion_r817477319



##########
File path: data/benchmark/.gitkeep
##########
@@ -0,0 +1,18 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one

Review comment:
       I copied what we have here:
   https://github.com/apache/iceberg/blob/master/spark/v3.2/spark/benchmark/.gitkeep
   
   The file is to have a directory when checking out the files, but I am not really sure why we do this.
   
   We might want to delete this from both places - I was hoping someone with more knowledge about the current test could comment on this.
   
   Thanks for the review!




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] rdblue commented on a change in pull request #4218: Core: Improve GenericReader performance

Posted by GitBox <gi...@apache.org>.

rdblue commented on a change in pull request #4218:
URL: https://github.com/apache/iceberg/pull/4218#discussion_r818134335



##########
File path: data/benchmark/.gitkeep
##########
@@ -0,0 +1,18 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one

Review comment:
       Yeah, this seems like a temporary directory that we shouldn't need to have in git.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] openinx merged pull request #4218: Core: Improve GenericReader performance

Posted by GitBox <gi...@apache.org>.

openinx merged pull request #4218:
URL: https://github.com/apache/iceberg/pull/4218


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] pvary commented on pull request #4218: Core: Improve GenericReader performance

Posted by GitBox <gi...@apache.org>.

pvary commented on pull request #4218:
URL: https://github.com/apache/iceberg/pull/4218#issuecomment-1049806840


   @aokolnychyi: Could you please review when you have some time? I see that you created the other jmh tests, so you might be the best person to take a look.
   Thanks,
   Peter


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] pvary commented on pull request #4218: Core: Improve GenericReader performance

Posted by GitBox <gi...@apache.org>.

pvary commented on pull request #4218:
URL: https://github.com/apache/iceberg/pull/4218#issuecomment-1049732294


   CC: @rbalamohan


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] Voyager-Wang commented on a change in pull request #4218: Core: Improve GenericReader performance

Posted by GitBox <gi...@apache.org>.

Voyager-Wang commented on a change in pull request #4218:
URL: https://github.com/apache/iceberg/pull/4218#discussion_r817617904



##########
File path: data/benchmark/.gitkeep
##########
@@ -0,0 +1,18 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one

Review comment:
       The .gitkeep file is to track the empty directory, since a completely empty directory can't be added into Git.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] ben-manes commented on pull request #4218: Core: Improve GenericReader performance

Posted by GitBox <gi...@apache.org>.

ben-manes commented on pull request #4218:
URL: https://github.com/apache/iceberg/pull/4218#issuecomment-1057240621

https://github.com/apache/iceberg/blob/c45b8b4c6e4d3e0ff5e7ce8ea4ed7296d6e1b12d/core/src/main/java/org/apache/iceberg/data/GenericRecord.java#L36-L46

Note that this uses identity (`==`) equality for key lookup. If you are using equivalent (`.equals(o)`) instances instead then this will be a cache miss and the cache won't be providing much value. Since it is weak keys this seems like a likely mistake and of course lets the GC collect aggressively, so you might not get a lot of reuse. It is often a helpful feature for cases like caching request-scoped data, and I'm not familiar enough with the code base to assess if this is a problematic usage. Typically a bounded cache is better and more explicit, but does require some mental effort to decide on a decent maximum size.

Otherwise if should perform similarly to a ConcurrentHashMap from the caller's perspective.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] pvary commented on a change in pull request #4218: Core: Improve GenericReader performance

Posted by GitBox <gi...@apache.org>.

pvary commented on a change in pull request #4218:
URL: https://github.com/apache/iceberg/pull/4218#discussion_r817600703



##########
File path: data/src/jmh/java/org/apache/iceberg/ReaderBenchmark.java
##########
@@ -0,0 +1,100 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg;
+
+import java.io.File;
+import java.io.IOException;
+import org.apache.iceberg.data.RandomGenericData;
+import org.apache.iceberg.data.Record;
+import org.apache.iceberg.io.CloseableIterable;
+import org.apache.iceberg.io.FileAppender;
+import org.apache.iceberg.types.Types;
+import org.openjdk.jmh.annotations.Benchmark;
+import org.openjdk.jmh.annotations.BenchmarkMode;
+import org.openjdk.jmh.annotations.Fork;
+import org.openjdk.jmh.annotations.Measurement;
+import org.openjdk.jmh.annotations.Mode;
+import org.openjdk.jmh.annotations.Scope;
+import org.openjdk.jmh.annotations.Setup;
+import org.openjdk.jmh.annotations.State;
+import org.openjdk.jmh.annotations.TearDown;
+import org.openjdk.jmh.annotations.Threads;
+import org.openjdk.jmh.annotations.Warmup;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import static org.apache.iceberg.types.Types.NestedField.optional;
+import static org.apache.iceberg.types.Types.NestedField.required;
+
+@Fork(1)
+@State(Scope.Benchmark)
+@Warmup(iterations = 3)
+@Measurement(iterations = 20)
+@BenchmarkMode(Mode.SingleShotTime)
+public abstract class ReaderBenchmark {
+  private static final Logger LOG = LoggerFactory.getLogger(ReaderBenchmark.class);
+
+  private static final Schema TEST_SCHEMA = new Schema(
+      required(1, "longCol", Types.LongType.get()),
+      required(2, "intCol", Types.IntegerType.get()),
+      required(3, "floatCol", Types.FloatType.get()),
+      optional(4, "doubleCol", Types.DoubleType.get()),
+      optional(5, "decimalCol", Types.DecimalType.of(20, 5)),
+      optional(6, "dateCol", Types.DateType.get()),
+      optional(7, "timestampCol", Types.TimestampType.withZone()),
+      optional(8, "stringCol", Types.StringType.get()));
+
+  private static final int NUM_ROWS = 2500000;
+  private static final int SEED = -1;
+
+  private File testFile;
+
+  @Setup
+  public void setupBenchmark() throws IOException {
+    testFile = java.nio.file.Files.createTempFile("perf-bench", null).toFile();

Review comment:
       Done




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] pvary commented on pull request #4218: Core: Improve GenericReader performance

Posted by GitBox <gi...@apache.org>.

pvary commented on pull request #4218:
URL: https://github.com/apache/iceberg/pull/4218#issuecomment-1055157212


   @openinx or @rdblue? 
   This is a small change and it would save ~10-20% performance for the generic readers.
   I would like to have a review, so I can merge.
   
   Thanks,
   Peter


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] pvary commented on a change in pull request #4218: Core: Improve GenericReader performance

Posted by GitBox <gi...@apache.org>.

pvary commented on a change in pull request #4218:
URL: https://github.com/apache/iceberg/pull/4218#discussion_r817587138



##########
File path: mr/src/main/java/org/apache/iceberg/mr/hive/Deserializer.java
##########
@@ -125,13 +125,14 @@ public FieldDeserializer primitive(PrimitiveType type, ObjectInspectorPair pair)
 
     @Override
     public FieldDeserializer struct(StructType type, ObjectInspectorPair pair, List<FieldDeserializer> deserializers) {
+      GenericRecord template = type != null ? GenericRecord.create(type) : null;
       return o -> {
         if (o == null) {
           return null;
         }
 
         List<Object> data = ((StructObjectInspector) pair.sourceInspector()).getStructFieldsDataAsList(o);
-        Record result = GenericRecord.create(type);
+        Record result = template.copy();

Review comment:
       Done




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] openinx commented on a change in pull request #4218: Core: Improve GenericReader performance

Posted by GitBox <gi...@apache.org>.

openinx commented on a change in pull request #4218:
URL: https://github.com/apache/iceberg/pull/4218#discussion_r817474210



##########
File path: data/benchmark/.gitkeep
##########
@@ -0,0 +1,18 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one

Review comment:
       What's the purpose to add this file ?  Actually, I don't know what's the usage of the .gitkeep file.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] pvary commented on pull request #4218: Core: Improve GenericReader performance

Posted by GitBox <gi...@apache.org>.

pvary commented on pull request #4218:
URL: https://github.com/apache/iceberg/pull/4218#issuecomment-1057005763


   @openinx: Fixed the null check where I was able to, added the comments, and the tests are green. If you have some time, I think the PR is ready for another review. Thanks again for taking the time to check this out!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] pvary commented on pull request #4218: Core: Improve GenericReader performance

Posted by GitBox <gi...@apache.org>.

pvary commented on pull request #4218:
URL: https://github.com/apache/iceberg/pull/4218#issuecomment-1057879315


   @ben-manes:
   > Note that this uses identity (`==`) equality for key lookup.
   
   We are reusing the `struct` objects, so `==` should be fine. @rbalamohan checked and he wrote:
   > Probed further:   It doesn't look like a cache miss, but the computation of "get" in "Caffeine" cache itself is expensive with operations like lookups, afterreads, stats updates, fork-join for purging etc.  
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] pvary commented on a change in pull request #4218: Core: Improve GenericReader performance

Posted by GitBox <gi...@apache.org>.

pvary commented on a change in pull request #4218:
URL: https://github.com/apache/iceberg/pull/4218#discussion_r818494174



##########
File path: data/benchmark/.gitkeep
##########
@@ -0,0 +1,18 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one

Review comment:
       Removed the `.gitkeep` files, and fixed the `.gitignore` accordingly




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org