You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2021/03/15 19:40:42 UTC

[GitHub] [iceberg] omaraloraini opened a new pull request #2336: Reduce boxing in ParquetValueReaders.StructReader

omaraloraini opened a new pull request #2336:
URL: https://github.com/apache/iceberg/pull/2336


   `UnboxedReader` conflates two things: 1) the source type of the column. 2) the target type. You can't distinguish between its methods as to which one should the client code use. In a later PR I will migrate other usages of `UnboxedReader`.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] omaraloraini commented on pull request #2336: Reduce boxing in ParquetValueReaders.StructReader

Posted by GitBox <gi...@apache.org>.
omaraloraini commented on pull request #2336:
URL: https://github.com/apache/iceberg/pull/2336#issuecomment-809826433


   @rdblue, I did some other test and things seems to have regressed, I'm not confident in the change, and introducing more public API surface without a lot of added value is making me hesitant.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] omaraloraini edited a comment on pull request #2336: Reduce boxing in ParquetValueReaders.StructReader

Posted by GitBox <gi...@apache.org>.
omaraloraini edited a comment on pull request #2336:
URL: https://github.com/apache/iceberg/pull/2336#issuecomment-804491609


   Old Ran 10 trials: mean time: 5781.100 ms, stddev: 78.045 ms
   New without setter: Ran 10 trials: mean time: 5166.300 ms, stddev: 61.861 ms. [commit](https://github.com/omaraloraini/iceberg/commit/69a11fd7d99e161fc4e3a997caecef0f601ba114).
   
   I update the schema in the test with more required and primitive columns, new schema:
   ```
       Schema structSchema = new Schema(
               required(1, "circumvent", Types.LongType.get()),
               optional(2, "antarctica", Types.StringType.get()),
               optional(3, "fluent", Types.DoubleType.get()),
               required(4, "quell", Types.StructType.of(
                       required(5, "operator", Types.BooleanType.get()),
                       optional(6, "fanta", Types.IntegerType.get()),
                       optional(7, "cable", Types.FloatType.get())
               )),
               required(8, "chimney", Types.TimestampType.withZone()),
               required(9, "wool", Types.DateType.get()),
               required(10, "c1", Types.LongType.get()),
               required(11, "c2", Types.LongType.get()),
               required(12, "c3", Types.LongType.get()),
               required(13, "c4", Types.IntegerType.get()),
               required(14, "c5", Types.IntegerType.get()),
               required(15, "c6", Types.IntegerType.get()),
               required(16, "c7", Types.FloatType.get()),
               required(17, "c8", Types.FloatType.get()),
               required(18, "c9", Types.DoubleType.get()),
               required(19, "c10", Types.DoubleType.get())
       );
   ```
   Result with new schema:
   Old: Ran 10 trials: mean time: 7528.300 ms, stddev: 97.378 ms
   New without setter: Ran 10 trials: mean time: 9742.600 ms, stddev: 97.785 ms
   New with setter!: Ran 10 trials: mean time: 3939.400 ms, stddev: 98.521 ms


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on pull request #2336: Reduce boxing in ParquetValueReaders.StructReader

Posted by GitBox <gi...@apache.org>.
rdblue commented on pull request #2336:
URL: https://github.com/apache/iceberg/pull/2336#issuecomment-805246901


   @omaraloraini, nice work! I think that the "new with setter" option looks promising! While it's true that Spark and Flink will currently store the data in `Object[]` that doesn't mean that we won't support an object model that handles primitives. And even then, the Parquet/Avro code will use Avro generics that also store values in `Object[]` and that appears to be much faster!
   
   What do you think about re-opening this and updating it to "new with setter" so we can take advantage of any performance improvements we might see? We can also copy the micro-benchmark you're running over to Spark and Flink to see if it does work there as well.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] omaraloraini commented on pull request #2336: Reduce boxing in ParquetValueReaders.StructReader

Posted by GitBox <gi...@apache.org>.
omaraloraini commented on pull request #2336:
URL: https://github.com/apache/iceberg/pull/2336#issuecomment-804491609


   
   Old Ran 10 trials: mean time: 5781.100 ms, stddev: 78.045 ms
   New without setter: Ran 10 trials: mean time: 5166.300 ms, stddev: 61.861 ms. [commit](https://github.com/omaraloraini/iceberg/commit/69a11fd7d99e161fc4e3a997caecef0f601ba114).
   
   I update the schema in the test to contain with more required and primitive columns, new schema:
   ```
       Schema structSchema = new Schema(
               required(1, "circumvent", Types.LongType.get()),
               optional(2, "antarctica", Types.StringType.get()),
               optional(3, "fluent", Types.DoubleType.get()),
               required(4, "quell", Types.StructType.of(
                       required(5, "operator", Types.BooleanType.get()),
                       optional(6, "fanta", Types.IntegerType.get()),
                       optional(7, "cable", Types.FloatType.get())
               )),
               required(8, "chimney", Types.TimestampType.withZone()),
               required(9, "wool", Types.DateType.get()),
               required(10, "c1", Types.LongType.get()),
               required(11, "c2", Types.LongType.get()),
               required(12, "c3", Types.LongType.get()),
               required(13, "c4", Types.IntegerType.get()),
               required(14, "c5", Types.IntegerType.get()),
               required(15, "c6", Types.IntegerType.get()),
               required(16, "c7", Types.FloatType.get()),
               required(17, "c8", Types.FloatType.get()),
               required(18, "c9", Types.DoubleType.get()),
               required(19, "c10", Types.DoubleType.get())
       );
   ```
   Result with new schema:
   Old: Ran 10 trials: mean time: 7528.300 ms, stddev: 97.378 ms
   New without setter: Ran 10 trials: mean time: 9742.600 ms, stddev: 97.785 ms
   New with setter!: Ran 10 trials: mean time: 3939.400 ms, stddev: 98.521 ms


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] omaraloraini commented on pull request #2336: Reduce boxing in ParquetValueReaders.StructReader

Posted by GitBox <gi...@apache.org>.
omaraloraini commented on pull request #2336:
URL: https://github.com/apache/iceberg/pull/2336#issuecomment-804521418


   The difference is not worth the complexity as all readers(Spark, Flink) store data in an `Object[]`.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] omaraloraini commented on pull request #2336: Reduce boxing in ParquetValueReaders.StructReader

Posted by GitBox <gi...@apache.org>.
omaraloraini commented on pull request #2336:
URL: https://github.com/apache/iceberg/pull/2336#issuecomment-800421401


   I ran the failing test locally and it succeeded, not sure why it's failing. 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] omaraloraini commented on pull request #2336: Reduce boxing in ParquetValueReaders.StructReader

Posted by GitBox <gi...@apache.org>.
omaraloraini commented on pull request #2336:
URL: https://github.com/apache/iceberg/pull/2336#issuecomment-804451815


   @rdblue, I really could not experiment with any change, because I don't have any numbers to compare it to. I'm still trying to figure out how to test it, I don't think I can conclude from profiling alone that it performs better using  the number of samples. I tried to run some of JMH benchmarks in Iceberg repo, sometimes they work, other times I get an error related to shadowJar and jmh plugins, some error with `META-INF`. I tried to use JMH in my code base, but for some reason I could not get it to work.
   
   Note that non of the existing frameworks in the code base use it, so if you ran some test you would get the same numbers, in particular the following files use the `UnboxedReader`:
   
   - FlinkParquetReaders.ReadBuilder
   - BaseParquetReaders.ReadBuilder
   - ParquetAvroValueReaders.ReadBuilder
   - PigParquetReader.ReadBuilder
   - SparkParquetReaders.ReadBuilder
   
   My plan was to get the API first, then update them afterwards. I will update them soon.
   
   Regarding the flame graphs, this is the old one with two methods underlined.
   ![before](https://user-images.githubusercontent.com/5780138/112067786-b80b8b00-8b79-11eb-9d15-e8cb1f64db83.png)
   
   From the left `UnboxedReader.read` indicate that there are no benefits in `UnboxedReader`, as all calls end up going to the interface method, which returns an object. In the far right there is `get` which in case of a Java primitive, you would get a boxed primitive, the result of `get`  is passed to the `UnboxedReader` and it ignores it. The signature of read method `public T read(T ignored)`. After getting the result from the reader it's passed to the method `set` in `StructReader`.
   
   I will update other readers to use the new implementation and run the test you mentioned. If you have any suggestion regarding benchmarking, let me know, as it is a new ground for me.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] omaraloraini edited a comment on pull request #2336: Reduce boxing in ParquetValueReaders.StructReader

Posted by GitBox <gi...@apache.org>.
omaraloraini edited a comment on pull request #2336:
URL: https://github.com/apache/iceberg/pull/2336#issuecomment-804491609


   Old Ran 10 trials: mean time: 5781.100 ms, stddev: 78.045 ms
   New without setter: Ran 10 trials: mean time: 5166.300 ms, stddev: 61.861 ms. [commit](https://github.com/omaraloraini/iceberg/commit/69a11fd7d99e161fc4e3a997caecef0f601ba114).
   
   I update the schema in the test with more required and primitive columns, new schema:
   ```
       Schema structSchema = new Schema(
               required(1, "circumvent", Types.LongType.get()),
               optional(2, "antarctica", Types.StringType.get()),
               optional(3, "fluent", Types.DoubleType.get()),
               required(4, "quell", Types.StructType.of(
                       required(5, "operator", Types.BooleanType.get()),
                       optional(6, "fanta", Types.IntegerType.get()),
                       optional(7, "cable", Types.FloatType.get())
               )),
               required(8, "chimney", Types.TimestampType.withZone()),
               required(9, "wool", Types.DateType.get()),
               required(10, "c1", Types.LongType.get()),
               required(11, "c2", Types.LongType.get()),
               required(12, "c3", Types.LongType.get()),
               required(13, "c4", Types.IntegerType.get()),
               required(14, "c5", Types.IntegerType.get()),
               required(15, "c6", Types.IntegerType.get()),
               required(16, "c7", Types.FloatType.get()),
               required(17, "c8", Types.FloatType.get()),
               required(18, "c9", Types.DoubleType.get()),
               required(19, "c10", Types.DoubleType.get())
       );
   ```
   Result with new schema:
   Old: Ran 10 trials: mean time: 7528.300 ms, stddev: 97.378 ms
   New without setter: Ran 10 trials: mean time: 9742.600 ms, stddev: 97.785 ms
   New with setter!: Ran 10 trials: mean time: 3939.400 ms, stddev: 98.521 ms. [commit](https://github.com/omaraloraini/iceberg/commit/7b345df2f885615af3cf3733fc0d9c89380d31ba)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on pull request #2336: Reduce boxing in ParquetValueReaders.StructReader

Posted by GitBox <gi...@apache.org>.
rdblue commented on pull request #2336:
URL: https://github.com/apache/iceberg/pull/2336#issuecomment-809822338


   @omaraloraini, just want to check with you to see if you're willing to re-open and update this? I think this could be a good performance improvement after all!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] omaraloraini edited a comment on pull request #2336: Reduce boxing in ParquetValueReaders.StructReader

Posted by GitBox <gi...@apache.org>.
omaraloraini edited a comment on pull request #2336:
URL: https://github.com/apache/iceberg/pull/2336#issuecomment-800687074


   Hi Ryan, what kind of benchmark fits here? JMH? or should I run one on our jobs(at work)?
   I did a profiling run to validate the change, and as far as I can tell it seems it does the right thing.
   ![before](https://user-images.githubusercontent.com/5780138/111393621-20fe8900-86ca-11eb-8c68-4976478a6cc0.png)
   ![after](https://user-images.githubusercontent.com/5780138/111393626-23f97980-86ca-11eb-9789-c9830d16291e.png)
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on pull request #2336: Reduce boxing in ParquetValueReaders.StructReader

Posted by GitBox <gi...@apache.org>.
rdblue commented on pull request #2336:
URL: https://github.com/apache/iceberg/pull/2336#issuecomment-800674902


   Thanks for working on this, @omaraloraini! Could you run some of the benchmarks to help us understand the performance after this change?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] omaraloraini closed pull request #2336: Reduce boxing in ParquetValueReaders.StructReader

Posted by GitBox <gi...@apache.org>.
omaraloraini closed pull request #2336:
URL: https://github.com/apache/iceberg/pull/2336


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #2336: Reduce boxing in ParquetValueReaders.StructReader

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #2336:
URL: https://github.com/apache/iceberg/pull/2336#discussion_r595605786



##########
File path: parquet/src/main/java/org/apache/iceberg/parquet/ParquetValueReaders.java
##########
@@ -696,8 +738,20 @@ public final T read(T reuse) {
       I intermediate = newStructData(reuse);
 
       for (int i = 0; i < readers.length; i += 1) {
-        set(intermediate, i, readers[i].read(get(intermediate, i)));
-        // setters[i].set(intermediate, i, get(intermediate, i));
+        final ParquetValueReader<?> reader = readers[i];
+        if (reader instanceof ParquetValueReader.OfBoolean) {
+          setBoolean(intermediate, i, ((OfBoolean) reader).readBoolean());
+        } else if (reader instanceof ParquetValueReader.OfInt) {
+          setInteger(intermediate, i, ((OfInt) reader).readInteger());
+        } else if (reader instanceof ParquetValueReader.OfLong) {
+          setLong(intermediate, i, ((OfLong) reader).readLong());
+        } else if (reader instanceof ParquetValueReader.OfFloat) {
+          setFloat(intermediate, i, ((OfFloat) reader).readFloat());
+        } else if (reader instanceof ParquetValueReader.OfDouble) {
+          setDouble(intermediate, i, ((OfDouble) reader).readDouble());
+        } else {
+          set(intermediate, i, reader.read(get(intermediate, i)));
+        }

Review comment:
       I think it would be better to continue using the `setters` produced by `newSetter` here because this logic will test each reader several times. I think that avoiding `if` statements by using a `Setter` implementation will be faster in the long run because it will incur invocation cost but not have near as many branches.
   
   Feel free to run the benchmarks and find which one is better, though. I'd be interested in testing the `setter` approach (uncomment the previous line 700) against this.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] omaraloraini closed pull request #2336: Reduce boxing in ParquetValueReaders.StructReader

Posted by GitBox <gi...@apache.org>.
omaraloraini closed pull request #2336:
URL: https://github.com/apache/iceberg/pull/2336


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] omaraloraini commented on pull request #2336: Reduce boxing in ParquetValueReaders.StructReader

Posted by GitBox <gi...@apache.org>.
omaraloraini commented on pull request #2336:
URL: https://github.com/apache/iceberg/pull/2336#issuecomment-800687074


   Hi Ryan, what kind of benchmark fits here? JMH? or should I run one our jobs(at work)?
   I did a profiling run to validate the change, and as far as I can tell it seems it does the right thing.
   ![before](https://user-images.githubusercontent.com/5780138/111393621-20fe8900-86ca-11eb-8c68-4976478a6cc0.png)
   ![after](https://user-images.githubusercontent.com/5780138/111393626-23f97980-86ca-11eb-9789-c9830d16291e.png)
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] omaraloraini closed pull request #2336: Reduce boxing in ParquetValueReaders.StructReader

Posted by GitBox <gi...@apache.org>.
omaraloraini closed pull request #2336:
URL: https://github.com/apache/iceberg/pull/2336


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on pull request #2336: Reduce boxing in ParquetValueReaders.StructReader

Posted by GitBox <gi...@apache.org>.
rdblue commented on pull request #2336:
URL: https://github.com/apache/iceberg/pull/2336#issuecomment-804387286


   @omaraloraini, there are JMH benchmarks you can use, but I also wrote a quick test to compare time when writing the initial Parquet support. That's here if you want to try it out: https://github.com/apache/iceberg/blob/master/spark/src/test/java/org/apache/iceberg/spark/data/TestParquetAvroReader.java#L76-L134
   
   I'm also not quite sure how to read those flame graphs, I think I'm missing some context like which one is which. I'd also like to compare the result of making the change I suggested above.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org