Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2020/10/12 15:48:08 UTC
[GitHub] [iceberg] chenjunjiedada edited a comment on pull request #1522: reuse container when reading parquet records

chenjunjiedada edited a comment on pull request #1522:
URL: https://github.com/apache/iceberg/pull/1522#issuecomment-707164336


   I ran some Spark JMH benchmarks with `NUM_RECORDS = 5000000`. (I didn't use the `10000000` from the current code because it causes an OOM on my machine with jmh `jvmArgs=-Xmx4096m`; I'm not sure how it passed before.) The results are as follows:
   
   When not reusing the container
   ```
   Benchmark                                                                          Mode  Cnt  Score   Error  Units
   SparkParquetReadersNestedDataBenchmark.readUsingIcebergReader                        ss    5  2.801 ±0.089   s/op
   SparkParquetReadersNestedDataBenchmark.readUsingIcebergReaderUnsafe                  ss    5  3.383 ±0.090   s/op
   SparkParquetReadersNestedDataBenchmark.readUsingSparkReader                          ss    5  4.353 ±0.162   s/op
   SparkParquetReadersNestedDataBenchmark.readWithProjectionUsingIcebergReader          ss    5  1.488 ±0.051   s/op
   SparkParquetReadersNestedDataBenchmark.readWithProjectionUsingIcebergReaderUnsafe    ss    5  1.886 ±0.250   s/op
   SparkParquetReadersNestedDataBenchmark.readWithProjectionUsingSparkReader            ss    5  2.078 ±0.227   s/op
   ```
   
   When reusing the container
   ```
   Benchmark                                                                          Mode  Cnt  Score   Error  Units
   SparkParquetReadersNestedDataBenchmark.readUsingIcebergReader                        ss    5  2.707 ±0.053   s/op
   SparkParquetReadersNestedDataBenchmark.readUsingIcebergReaderUnsafe                  ss    5  3.149 ±0.144   s/op
   SparkParquetReadersNestedDataBenchmark.readUsingSparkReader                          ss    5  4.344 ±0.168   s/op
   SparkParquetReadersNestedDataBenchmark.readWithProjectionUsingIcebergReader          ss    5  1.360 ±0.155   s/op
   SparkParquetReadersNestedDataBenchmark.readWithProjectionUsingIcebergReaderUnsafe    ss    5  1.863 ±0.180   s/op
   SparkParquetReadersNestedDataBenchmark.readWithProjectionUsingSparkReader            ss    5  2.072 ±0.147   s/op
   ```
   
   The results show a slight benefit from reusing the container. @rdblue, does that make sense for the Spark-side change? I will write a JMH benchmark for the Flink input format and run the comparison again.
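   
   To put a rough number on the benefit, the relative change per case can be computed from the two tables above. This is only a minimal sketch: the class and method names (`ReuseSpeedup`, `percentFaster`) are made up for illustration, and the scores are copied by hand from the JMH output.
   
   ```java
   import java.util.LinkedHashMap;
   import java.util.Map;
   
   // Hypothetical helper, not part of Iceberg: compares the s/op scores from the two runs above.
   class ReuseSpeedup {
   
     // Percent reduction in s/op; a positive value means the reuse run was faster.
     static double percentFaster(double withoutReuse, double withReuse) {
       return (withoutReuse - withReuse) / withoutReuse * 100.0;
     }
   
     public static void main(String[] args) {
       // Each value holds {score without reuse, score with reuse}, copied from the tables.
       Map<String, double[]> scores = new LinkedHashMap<>();
       scores.put("readUsingIcebergReader", new double[] {2.801, 2.707});
       scores.put("readUsingIcebergReaderUnsafe", new double[] {3.383, 3.149});
       scores.put("readUsingSparkReader", new double[] {4.353, 4.344});
       scores.put("readWithProjectionUsingIcebergReader", new double[] {1.488, 1.360});
       scores.put("readWithProjectionUsingIcebergReaderUnsafe", new double[] {1.886, 1.863});
       scores.put("readWithProjectionUsingSparkReader", new double[] {2.078, 2.072});
   
       for (Map.Entry<String, double[]> entry : scores.entrySet()) {
         double[] pair = entry.getValue();
         System.out.printf("%-45s %5.1f%%%n", entry.getKey(), percentFaster(pair[0], pair[1]));
       }
     }
   }
   ```
   
   Several of these differences are smaller than the reported error (for example `readUsingSparkReader`, 4.353 ±0.162 vs 4.344 ±0.168), so the percentages should be read as rough indications only.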
   
   I also found two issues:
   1. The JMH benchmarks were moved to the spark2 module, but the comments haven't been updated.
   2. The JMH benchmark cases throw an exception like the one below:
   ```
   org.apache.iceberg.exceptions.AlreadyExistsException: File already exists: /tmp/parquet-nested-data-benchmark3999980592702894424.parquet
           at org.apache.iceberg.Files$LocalOutputFile.create(Files.java:58)
           at org.apache.iceberg.parquet.ParquetIO$ParquetOutputFile.create(ParquetIO.java:148)
           at org.apache.parquet.hadoop.ParquetFileWriter.<init>(ParquetFileWriter.java:295)
           at org.apache.parquet.hadoop.ParquetWriter.<init>(ParquetWriter.java:283)
           at org.apache.parquet.hadoop.ParquetWriter$Builder.build(ParquetWriter.java:564)
           at org.apache.iceberg.parquet.Parquet$WriteBuilder.build(Parquet.java:265)
           at org.apache.iceberg.spark.data.parquet.SparkParquetReadersNestedDataBenchmark.setupBenchmark(SparkParquetReadersNestedDataBenchmark.java:102)
           at org.apache.iceberg.spark.data.parquet.generated.SparkParquetReadersNestedDataBenchmark_readUsingSparkReader_jmhTest._jmh_tryInit_f_sparkparquetreadersnesteddatabenchmark0_G(SparkParquetReadersNestedDataBenchmark_readUsingSparkReader_jmhTest.java:438)
           at org.apache.iceberg.spark.data.parquet.generated.SparkParquetReadersNestedDataBenchmark_readUsingSparkReader_jmhTest.readUsingSparkReader_SingleShotTime(SparkParquetReadersNestedDataBenchmark_readUsingSparkReader_jmhTest.java:363)
           at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
           at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
           at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
           at java.lang.reflect.Method.invoke(Method.java:498)
           at org.openjdk.jmh.runner.BenchmarkHandler$BenchmarkTask.call(BenchmarkHandler.java:453)
           at org.openjdk.jmh.runner.BenchmarkHandler$BenchmarkTask.call(BenchmarkHandler.java:437)
           at java.util.concurrent.FutureTask.run(FutureTask.java:266)
           at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
           at java.util.concurrent.FutureTask.run(FutureTask.java:266)
           at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
           at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
           at java.lang.Thread.run(Thread.java:748)
   ```
   
   We need to delete the created temp file first.
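   
   A minimal sketch of that fix follows. It assumes the benchmark obtains its path from `File.createTempFile`, which the file name in the trace suggests but which I have not confirmed here. `createStrict` is a hypothetical stand-in for any writer that refuses to overwrite an existing file; it is not the Iceberg API.
   
   ```java
   import java.io.File;
   import java.io.IOException;
   import java.nio.file.FileAlreadyExistsException;
   import java.nio.file.Files;
   
   // Illustrative only: shows why the temp file must be deleted before the writer creates it.
   class TempFileSetupSketch {
   
     // Hypothetical stand-in for a strict create: Files.createFile throws
     // FileAlreadyExistsException if the path already exists.
     static void createStrict(File file) throws IOException {
       Files.createFile(file.toPath());
     }
   
     // File.createTempFile reserves a unique name by creating an empty file,
     // so delete it and hand back only the (now free) path.
     static File newDataFilePath(String prefix) throws IOException {
       File file = File.createTempFile(prefix, ".parquet");
       if (!file.delete()) {
         throw new IOException("Could not delete temp file: " + file);
       }
       return file;
     }
   
     public static void main(String[] args) throws IOException {
       // Without the delete, the strict create fails because the file is already there.
       File existing = File.createTempFile("parquet-nested-data-benchmark", ".parquet");
       try {
         createStrict(existing);
       } catch (FileAlreadyExistsException e) {
         System.out.println("create failed: file already exists");
       } finally {
         existing.delete();
       }
   
       // With the delete, the same create succeeds.
       File fresh = newDataFilePath("parquet-nested-data-benchmark");
       createStrict(fresh);
       System.out.println("create succeeded");
       fresh.delete();
     }
   }
   ```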
   
   I will fix these issues tomorrow.

