Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2022/01/17 12:30:01 UTC

[GitHub] [iceberg] zhongyujiang opened a new pull request #3910: Spark 3.2: Fix jmh Spark parquet benchmark

zhongyujiang opened a new pull request #3910:
URL: https://github.com/apache/iceberg/pull/3910


   This PR fixes the JMH Spark Parquet benchmark. Currently, `SparkParquetWritersFlatDataBenchmark` throws the following exception:
   ```
   # Warmup Iteration   2: <failure>
   
   org.apache.iceberg.exceptions.AlreadyExistsException: File already exists: /Users/zhongyujiang/projects/github/iceberg/spark/v3.2/spark/build/tmp/jmh/parquet-flat-data-benchmark6286636693589332554.parquet
    at org.apache.iceberg.Files$LocalOutputFile.create(Files.java:58)
    at org.apache.iceberg.parquet.ParquetIO$ParquetOutputFile.create(ParquetIO.java:148)
    at org.apache.parquet.hadoop.ParquetFileWriter.<init>(ParquetFileWriter.java:329)
    at org.apache.parquet.hadoop.ParquetFileWriter.<init>(ParquetFileWriter.java:305)
    at org.apache.parquet.hadoop.ParquetFileWriter.<init>(ParquetFileWriter.java:284)
    at org.apache.iceberg.parquet.ParquetWriter.ensureWriterInitialized(ParquetWriter.java:114)
    at org.apache.iceberg.parquet.ParquetWriter.flushRowGroup(ParquetWriter.java:200)
    at org.apache.iceberg.parquet.ParquetWriter.close(ParquetWriter.java:233)
    at org.apache.iceberg.spark.data.parquet.SparkParquetWritersFlatDataBenchmark.writeUsingIcebergWriter(SparkParquetWritersFlatDataBenchmark.java:105)
    at org.apache.iceberg.spark.data.parquet.jmh_generated.SparkParquetWritersFlatDataBenchmark_writeUsingIcebergWriter_jmhTest.writeUsingIcebergWriter_ss_jmhStub(SparkParquetWritersFlatDataBenchmark_writeUsingIcebergWriter_jmhTest.java:416)
    at org.apache.iceberg.spark.data.parquet.jmh_generated.SparkParquetWritersFlatDataBenchmark_writeUsingIcebergWriter_jmhTest.writeUsingIcebergWriter_SingleShotTime(SparkParquetWritersFlatDataBenchmark_writeUsingIcebergWriter_jmhTest.java:371)
   ```
   This PR sets the level of `TearDown` to `Level.Invocation` so that `dataFile` is cleaned up after each benchmark method execution.
   @aokolnychyi could you take a look? thx.
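
The failure mode in the stack trace above can be reproduced in miniature without JMH or Iceberg. The following is only a sketch under stated assumptions: `ExclusiveCreateSketch`, `writeOnce`, and `run` are hypothetical names, and `java.nio.file.Files.createFile` stands in for the exclusive create that raises `AlreadyExistsException` in the trace. It shows that repeated writes to the same path succeed only if the file is removed after each write.

```java
import java.io.IOException;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical illustration; not the benchmark code from the PR.
public class ExclusiveCreateSketch {

  // Stand-in for one benchmark invocation: create the output file exclusively, then write to it.
  static void writeOnce(Path dataFile) throws IOException {
    Files.createFile(dataFile); // throws FileAlreadyExistsException if the file is already there
    Files.write(dataFile, new byte[] {1, 2, 3});
  }

  // Runs `invocations` writes against the same path and returns how many succeeded.
  static int run(Path dataFile, int invocations, boolean cleanUpAfterEach) throws IOException {
    int succeeded = 0;
    for (int i = 0; i < invocations; i++) {
      try {
        writeOnce(dataFile);
        succeeded++;
      } catch (FileAlreadyExistsException e) {
        // Analogous to the "File already exists" failure in the second warmup iteration.
      } finally {
        if (cleanUpAfterEach) {
          Files.deleteIfExists(dataFile); // plays the role of a per-invocation teardown
        }
      }
    }
    Files.deleteIfExists(dataFile); // plays the role of a teardown that runs once at the very end
    return succeeded;
  }

  public static void main(String[] args) throws IOException {
    Path dir = Files.createTempDirectory("jmh-sketch");
    Path dataFile = dir.resolve("parquet-flat-data-benchmark.parquet");
    System.out.println("without cleanup: " + run(dataFile, 3, false) + " of 3 writes succeeded");
    System.out.println("with cleanup:    " + run(dataFile, 3, true) + " of 3 writes succeeded");
    Files.delete(dir);
  }
}
```

In this model, cleanup that runs only once at the end lets the first write through and rejects the rest, which matches the failure appearing at the second warmup iteration.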


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] zhongyujiang commented on a change in pull request #3910: Spark 3.2: Fix jmh Spark parquet benchmark

Posted by GitBox <gi...@apache.org>.
zhongyujiang commented on a change in pull request #3910:
URL: https://github.com/apache/iceberg/pull/3910#discussion_r790155478



##########
File path: .github/workflows/jmh-bechmarks.yml
##########
@@ -76,6 +76,9 @@ jobs:
       SPARK_LOCAL_IP: localhost
     steps:
     - uses: actions/checkout@v2
+      with:
+        repository: ${{ github.event.inputs.repo }}
+        ref: ${{ github.event.inputs.ref }}

Review comment:
       This [run](https://github.com/zhongyujiang/iceberg/actions/runs/1733299718) shows the benchmarks are working.






[GitHub] [iceberg] nastra commented on a change in pull request #3910: Spark 3.2: Fix jmh Spark parquet benchmark

Posted by GitBox <gi...@apache.org>.
nastra commented on a change in pull request #3910:
URL: https://github.com/apache/iceberg/pull/3910#discussion_r789806099



##########
File path: spark/v3.2/spark/src/jmh/java/org/apache/iceberg/spark/data/parquet/SparkParquetWritersFlatDataBenchmark.java
##########
@@ -85,7 +86,7 @@ public void setupBenchmark() throws IOException {
     dataFile.delete();
   }
 
-  @TearDown
+  @TearDown(Level.Invocation)

Review comment:
       Not sure, but I think both should be Iteration rather than Invocation.






[GitHub] [iceberg] nastra commented on pull request #3910: Spark 3.2: Fix jmh Spark parquet benchmark

Posted by GitBox <gi...@apache.org>.
nastra commented on pull request #3910:
URL: https://github.com/apache/iceberg/pull/3910#issuecomment-1018614580


   @zhongyujiang can you please run https://github.com/zhongyujiang/iceberg/actions/workflows/jmh-bechmarks.yml in your fork and give it `"SparkParquetWritersFlatDataBenchmark","SparkParquetWritersNestedDataBenchmark"` as parameters to make sure the Benchmarks are working?




[GitHub] [iceberg] rdblue merged pull request #3910: Spark: Fix jmh Spark parquet benchmark

Posted by GitBox <gi...@apache.org>.
rdblue merged pull request #3910:
URL: https://github.com/apache/iceberg/pull/3910


   




[GitHub] [iceberg] rdblue commented on pull request #3910: Spark 3.2: Fix jmh Spark parquet benchmark

Posted by GitBox <gi...@apache.org>.
rdblue commented on pull request #3910:
URL: https://github.com/apache/iceberg/pull/3910#issuecomment-1019610699


   Running CI. @nastra, do you think this is ready to commit?




[GitHub] [iceberg] zhongyujiang commented on a change in pull request #3910: Spark 3.2: Fix jmh Spark parquet benchmark

Posted by GitBox <gi...@apache.org>.
zhongyujiang commented on a change in pull request #3910:
URL: https://github.com/apache/iceberg/pull/3910#discussion_r790154698



##########
File path: .github/workflows/jmh-bechmarks.yml
##########
@@ -76,6 +76,9 @@ jobs:
       SPARK_LOCAL_IP: localhost
     steps:
     - uses: actions/checkout@v2
+      with:
+        repository: ${{ github.event.inputs.repo }}
+        ref: ${{ github.event.inputs.ref }}

Review comment:
       @nastra I ran this in my fork, but the [result](https://github.com/zhongyujiang/iceberg/actions/runs/1733298769) shows it didn't use the ref I chose. I think we should specify repo and ref here.






[GitHub] [iceberg] rdblue commented on pull request #3910: Spark 3.2: Fix jmh Spark parquet benchmark

Posted by GitBox <gi...@apache.org>.
rdblue commented on pull request #3910:
URL: https://github.com/apache/iceberg/pull/3910#issuecomment-1047268745


   @zhongyujiang, @nastra, does this need to be rebased now that #4149 is in? I think that fixed some of the same issues.




[GitHub] [iceberg] zhongyujiang commented on pull request #3910: Spark: Fix jmh Spark parquet benchmark

Posted by GitBox <gi...@apache.org>.
zhongyujiang commented on pull request #3910:
URL: https://github.com/apache/iceberg/pull/3910#issuecomment-1047820650


   Thanks for the reminder, updated.




[GitHub] [iceberg] zhongyujiang commented on pull request #3910: Spark 3.2: Fix jmh Spark parquet benchmark

Posted by GitBox <gi...@apache.org>.
zhongyujiang commented on pull request #3910:
URL: https://github.com/apache/iceberg/pull/3910#issuecomment-1032584470


   Hi @rdblue, can you help merge this? Or do you have any other comments?




[GitHub] [iceberg] nastra commented on pull request #3910: Spark 3.2: Fix jmh Spark parquet benchmark

Posted by GitBox <gi...@apache.org>.
nastra commented on pull request #3910:
URL: https://github.com/apache/iceberg/pull/3910#issuecomment-1018628020


   I noticed that running benchmarks is currently broken and will be fixed by https://github.com/apache/iceberg/pull/3946




[GitHub] [iceberg] zhongyujiang commented on pull request #3910: Spark 3.2: Fix jmh Spark parquet benchmark

Posted by GitBox <gi...@apache.org>.
zhongyujiang commented on pull request #3910:
URL: https://github.com/apache/iceberg/pull/3910#issuecomment-1047774739


   I have rebased and backported this fix, and changed the level from `Level.Invocation` to `Level.Iteration` to be consistent with #4149; both have the same effect here. Please check again. @nastra @rdblue




[GitHub] [iceberg] nastra commented on a change in pull request #3910: Spark 3.2: Fix jmh Spark parquet benchmark

Posted by GitBox <gi...@apache.org>.
nastra commented on a change in pull request #3910:
URL: https://github.com/apache/iceberg/pull/3910#discussion_r790468238



##########
File path: .github/workflows/jmh-bechmarks.yml
##########
@@ -76,6 +76,9 @@ jobs:
       SPARK_LOCAL_IP: localhost
     steps:
     - uses: actions/checkout@v2
+      with:
+        repository: ${{ github.event.inputs.repo }}
+        ref: ${{ github.event.inputs.ref }}

Review comment:
       good catch!






[GitHub] [iceberg] zhongyujiang commented on a change in pull request #3910: Spark 3.2: Fix jmh Spark parquet benchmark

Posted by GitBox <gi...@apache.org>.
zhongyujiang commented on a change in pull request #3910:
URL: https://github.com/apache/iceberg/pull/3910#discussion_r790154011



##########
File path: spark/v3.2/spark/src/jmh/java/org/apache/iceberg/spark/data/parquet/SparkParquetWritersFlatDataBenchmark.java
##########
@@ -85,7 +86,7 @@ public void setupBenchmark() throws IOException {
     dataFile.delete();
   }
 
-  @TearDown
+  @TearDown(Level.Invocation)

Review comment:
       Iteration and Invocation should have the same effect since the BenchmarkMode here is SingleShotTime, but I think it should be Invocation since each invocation produces a file.
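
The equivalence claimed above can be checked with a simplified lifecycle model. This is a sketch rather than JMH itself: the `Level` enum below is a local, hypothetical stand-in that only mirrors the names of JMH's fixture levels, and the model assumes that SingleShotTime with the default batch size runs exactly one invocation per iteration.

```java
// Hypothetical model of fixture levels; it does not use or depend on JMH.
public class TearDownLevelSketch {

  // Local stand-in mirroring the names of JMH's fixture levels.
  enum Level { TRIAL, ITERATION, INVOCATION }

  // Counts how often a teardown hook at the given level fires in one trial
  // made of `iterations` iterations with `invocationsPerIteration` calls each.
  static int tearDownCalls(Level level, int iterations, int invocationsPerIteration) {
    int calls = 0;
    for (int it = 0; it < iterations; it++) {
      for (int inv = 0; inv < invocationsPerIteration; inv++) {
        // The benchmark method body would run here.
        if (level == Level.INVOCATION) {
          calls++;
        }
      }
      if (level == Level.ITERATION) {
        calls++;
      }
    }
    if (level == Level.TRIAL) {
      calls++;
    }
    return calls;
  }

  public static void main(String[] args) {
    // One invocation per iteration, as assumed for SingleShotTime.
    System.out.println("iteration level:  " + tearDownCalls(Level.ITERATION, 5, 1));
    System.out.println("invocation level: " + tearDownCalls(Level.INVOCATION, 5, 1));
    System.out.println("trial level:      " + tearDownCalls(Level.TRIAL, 5, 1));
  }
}
```

With one invocation per iteration, the iteration-level and invocation-level hooks fire the same number of times, while a trial-level hook fires once. The two levels diverge only when an iteration contains more than one invocation.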






[GitHub] [iceberg] zhongyujiang commented on a change in pull request #3910: Spark 3.2: Fix jmh Spark parquet benchmark

Posted by GitBox <gi...@apache.org>.
zhongyujiang commented on a change in pull request #3910:
URL: https://github.com/apache/iceberg/pull/3910#discussion_r790154698



##########
File path: .github/workflows/jmh-bechmarks.yml
##########
@@ -76,6 +76,9 @@ jobs:
       SPARK_LOCAL_IP: localhost
     steps:
     - uses: actions/checkout@v2
+      with:
+        repository: ${{ github.event.inputs.repo }}
+        ref: ${{ github.event.inputs.ref }}

Review comment:
       @nastra I ran this in my fork, but the [result](https://github.com/zhongyujiang/iceberg/actions/runs/1733298769) shows it didn't use the ref I chose. I also found that the `run-benchmark` job doesn't specify repo and ref here.






[GitHub] [iceberg] rdblue commented on pull request #3910: Spark: Fix jmh Spark parquet benchmark

Posted by GitBox <gi...@apache.org>.
rdblue commented on pull request #3910:
URL: https://github.com/apache/iceberg/pull/3910#issuecomment-1048042278


   Thanks, @zhongyujiang!

