Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2022/10/14 18:39:05 UTC
[GitHub] [iceberg] dramaticlly opened a new pull request, #5991: Spark: Fix DATE_ADD expression in IcebergSourceFlatParquetDataWriteBenchmark
dramaticlly opened a new pull request, #5991:
URL: https://github.com/apache/iceberg/pull/5991
Fixes https://github.com/apache/iceberg/issues/5990.
## Verification
After my change, I now see a correct report generated:
```
# JMH version: 1.32
# VM version: JDK 1.8.0_312, OpenJDK 64-Bit Server VM, 25.312-b07
# VM invoker: /Users/stevezhang/workspace/jdk8/applejdk-8.0.312.7.1.jdk/Contents/Home/jre/bin/java
# VM options: -Dfile.encoding=UTF-8 -Djava.io.tmpdir=/Users/stevezhang/workspace/iceberg/spark/v3.3/spark/build/tmp/jmh -Duser.country=US -Duser.language=en -Duser.variant
# Blackhole mode: full + dont-inline hint
# Warmup: 3 iterations, single-shot each
# Measurement: 5 iterations, single-shot each
# Timeout: 10 min per iteration
# Threads: 1 thread
# Benchmark mode: Single shot invocation time
# Benchmark: org.apache.iceberg.spark.source.parquet.IcebergSourceFlatParquetDataWriteBenchmark.writeFileSource
# Run progress: 0.00% complete, ETA 00:00:00
# Fork: 1 of 1
# Warmup Iteration 1: 25.867 s/op
# Warmup Iteration 2: 19.778 s/op
# Warmup Iteration 3: 18.966 s/op
Iteration 1: 19.017 s/op
Iteration 2: 18.209 s/op
Iteration 3: 19.078 s/op
Iteration 4: 22.087 s/op
Iteration 5: 18.014 s/op
Result "org.apache.iceberg.spark.source.parquet.IcebergSourceFlatParquetDataWriteBenchmark.writeFileSource":
N = 5
mean = 19.281 ±(99.9%) 6.310 s/op
Histogram, s/op:
[18.000, 18.500) = 2
[18.500, 19.000) = 0
[19.000, 19.500) = 2
[19.500, 20.000) = 0
[20.000, 20.500) = 0
[20.500, 21.000) = 0
[21.000, 21.500) = 0
[21.500, 22.000) = 0
[22.000, 22.500) = 1
Percentiles, s/op:
p(0.0000) = 18.014 s/op
p(50.0000) = 19.017 s/op
p(90.0000) = 22.087 s/op
p(95.0000) = 22.087 s/op
p(99.0000) = 22.087 s/op
p(99.9000) = 22.087 s/op
p(99.9900) = 22.087 s/op
p(99.9990) = 22.087 s/op
p(99.9999) = 22.087 s/op
p(100.0000) = 22.087 s/op
# JMH version: 1.32
# VM version: JDK 1.8.0_312, OpenJDK 64-Bit Server VM, 25.312-b07
# VM invoker: /Users/stevezhang/workspace/jdk8/applejdk-8.0.312.7.1.jdk/Contents/Home/jre/bin/java
# VM options: -Dfile.encoding=UTF-8 -Djava.io.tmpdir=/Users/stevezhang/workspace/iceberg/spark/v3.3/spark/build/tmp/jmh -Duser.country=US -Duser.language=en -Duser.variant
# Blackhole mode: full + dont-inline hint
# Warmup: 3 iterations, single-shot each
# Measurement: 5 iterations, single-shot each
# Timeout: 10 min per iteration
# Threads: 1 thread
# Benchmark mode: Single shot invocation time
# Benchmark: org.apache.iceberg.spark.source.parquet.IcebergSourceFlatParquetDataWriteBenchmark.writeIceberg
# Run progress: 50.00% complete, ETA 00:02:44
# Fork: 1 of 1
# Warmup Iteration 1: 23.999 s/op
# Warmup Iteration 2: 19.151 s/op
# Warmup Iteration 3: 19.056 s/op
Iteration 1: 22.485 s/op
Iteration 2: 19.256 s/op
Iteration 3: 19.343 s/op
Iteration 4: 21.488 s/op
Iteration 5: 20.735 s/op
Result "org.apache.iceberg.spark.source.parquet.IcebergSourceFlatParquetDataWriteBenchmark.writeIceberg":
N = 5
mean = 20.661 ±(99.9%) 5.352 s/op
Histogram, s/op:
[19.000, 19.250) = 0
[19.250, 19.500) = 2
[19.500, 19.750) = 0
[19.750, 20.000) = 0
[20.000, 20.250) = 0
[20.250, 20.500) = 0
[20.500, 20.750) = 1
[20.750, 21.000) = 0
[21.000, 21.250) = 0
[21.250, 21.500) = 1
[21.500, 21.750) = 0
[21.750, 22.000) = 0
[22.000, 22.250) = 0
[22.250, 22.500) = 1
[22.500, 22.750) = 0
Percentiles, s/op:
p(0.0000) = 19.256 s/op
p(50.0000) = 20.735 s/op
p(90.0000) = 22.485 s/op
p(95.0000) = 22.485 s/op
p(99.0000) = 22.485 s/op
p(99.9000) = 22.485 s/op
p(99.9900) = 22.485 s/op
p(99.9990) = 22.485 s/op
p(99.9999) = 22.485 s/op
p(100.0000) = 22.485 s/op
# Run complete. Total time: 00:05:33
REMEMBER: The numbers below are just data. To gain reusable insights, you need to follow up on
why the numbers are the way they are. Use profilers (see -prof, -lprof), design factorial
experiments, perform baseline and negative tests that provide experimental control, make sure
the benchmarking environment is safe on JVM/OS/HW level, ask for reviews from the domain experts.
Do not assume the numbers tell you what you want them to tell.
Benchmark Mode Cnt Score Error Units
IcebergSourceFlatParquetDataWriteBenchmark.writeFileSource ss 5 19.281 ± 6.310 s/op
IcebergSourceFlatParquetDataWriteBenchmark.writeIceberg ss 5 20.661 ± 5.352 s/op
Benchmark result is saved to /Users/stevezhang/workspace/iceberg/spark/v3.3/spark/build/results/jmh/results.txt
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org
[GitHub] [iceberg] kbendick commented on pull request #5991: Spark: Fix DATE_ADD expression in IcebergSourceFlatParquetDataWriteBenchmark
kbendick commented on PR #5991:
URL: https://github.com/apache/iceberg/pull/5991#issuecomment-1279888693
> Hm, not familiar with Benchmark, did this ever work for previous sparks?
>
> Looks like date_add does take in int though https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L3157 so it looks good to me.
I'm not sure this was ever working, but it currently fails on master with the following error in the output file (tried Spark 3.2).
```
# JMH version: 1.32
# VM version: JDK 11.0.15, OpenJDK 64-Bit Server VM, 11.0.15+9-LTS
# VM invoker: /Library/Java/JavaVirtualMachines/amazon-corretto-11.jdk/Contents/Home/bin/java
# VM options: -Dfile.encoding=UTF-8 -Djava.io.tmpdir=/Users/kylebendickson/repos/iceberg/spark/v3.2/spark/build/tmp/jmh -Duser.country=US -Duser.language=en -Duser.variant
# Blackhole mode: full + dont-inline hint
# Warmup: 3 iterations, single-shot each
# Measurement: 5 iterations, single-shot each
# Timeout: 10 min per iteration
# Threads: 1 thread
# Benchmark mode: Single shot invocation time
# Benchmark: org.apache.iceberg.spark.source.parquet.IcebergSourceFlatParquetDataWriteBenchmark.writeIceberg
# Run progress: 50.00% complete, ETA 00:00:02
# Fork: 1 of 1
# Warmup Iteration 1: WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform (file:/Users/kylebendickson/repos/iceberg/spark/v3.2/spark/build/libs/iceberg-spark-3.2_2.12-0.15.0-SNAPSHOT-jmh.jar) to constructor java.nio.DirectByteBuffer(long,int)
WARNING: Please consider reporting this to the maintainers of org.apache.spark.unsafe.Platform
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
<failure>
org.apache.spark.sql.AnalysisException: cannot resolve 'date_add(current_date(), (longCol % CAST(20 AS BIGINT)))' due to data type mismatch: argument 2 requires (int or smallint or tinyint) type, however, '(longCol % CAST(20 AS BIGINT))' is of bigint type.; line 1 pos 0;
'Project [longCol#2L, intCol#4, floatCol#7, doubleCol#11, decimalCol#16, date_add(current_date(Some(America/Los_Angeles)), (longCol#2L % cast(20 as bigint))) AS dateCol#22]
+- Project [longCol#2L, intCol#4, floatCol#7, doubleCol#11, cast(longCol#2L as decimal(20,5)) AS decimalCol#16]
+- Project [longCol#2L, intCol#4, floatCol#7, cast(longCol#2L as double) AS doubleCol#11]
+- Project [longCol#2L, intCol#4, cast(longCol#2L as float) AS floatCol#7]
+- Project [longCol#2L, cast(longCol#2L as int) AS intCol#4]
+- Project [id#0L AS longCol#2L]
+- Range (0, 5000000, step=1, splits=Some(1))
at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$$nestedInanonfun$checkAnalysis$1$2.applyOrElse(CheckAnalysis.scala:193)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$$nestedInanonfun$checkAnalysis$1$2.applyOrElse(CheckAnalysis.scala:178)
at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformUpWithPruning$2(TreeNode.scala:535)
at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:82)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformUpWithPruning(TreeNode.scala:535)
at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformUpWithPruning$1(TreeNode.scala:532)
at org.apache.spark.sql.catalyst.trees.UnaryLike.mapChildren(TreeNode.scala:1122)
at org.apache.spark.sql.catalyst.trees.UnaryLike.mapChildren$(TreeNode.scala:1121)
at org.apache.spark.sql.catalyst.expressions.UnaryExpression.mapChildren(Expression.scala:467)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformUpWithPruning(TreeNode.scala:532)
at org.apache.spark.sql.catalyst.plans.QueryPlan.$anonfun$transformExpressionsUpWithPruning$1(QueryPlan.scala:181)
at org.apache.spark.sql.catalyst.plans.QueryPlan.$anonfun$mapExpressions$1(QueryPlan.scala:193)
at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:82)
at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpression$1(QueryPlan.scala:193)
at org.apache.spark.sql.catalyst.plans.QueryPlan.recursiveTransform$1(QueryPlan.scala:204)
at org.apache.spark.sql.catalyst.plans.QueryPlan.$anonfun$mapExpressions$3(QueryPlan.scala:209)
at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286)
at scala.collection.immutable.List.foreach(List.scala:431)
at scala.collection.TraversableLike.map(TraversableLike.scala:286)
at scala.collection.TraversableLike.map$(TraversableLike.scala:279)
at scala.collection.immutable.List.map(List.scala:305)
at org.apache.spark.sql.catalyst.plans.QueryPlan.recursiveTransform$1(QueryPlan.scala:209)
at org.apache.spark.sql.catalyst.plans.QueryPlan.$anonfun$mapExpressions$4(QueryPlan.scala:214)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:323)
at org.apache.spark.sql.catalyst.plans.QueryPlan.mapExpressions(QueryPlan.scala:214)
at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsUpWithPruning(QueryPlan.scala:181)
at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsUp(QueryPlan.scala:161)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$1(CheckAnalysis.scala:178)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$1$adapted(CheckAnalysis.scala:97)
at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:263)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkAnalysis(CheckAnalysis.scala:97)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkAnalysis$(CheckAnalysis.scala:92)
at org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:182)
at org.apache.spark.sql.catalyst.analysis.Analyzer.$anonfun$executeAndCheck$1(Analyzer.scala:205)
at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.markInAnalyzer(AnalysisHelper.scala:330)
at org.apache.spark.sql.catalyst.analysis.Analyzer.executeAndCheck(Analyzer.scala:202)
at org.apache.spark.sql.execution.QueryExecution.$anonfun$analyzed$1(QueryExecution.scala:75)
at org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:111)
at org.apache.spark.sql.execution.QueryExecution.$anonfun$executePhase$1(QueryExecution.scala:183)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
at org.apache.spark.sql.execution.QueryExecution.executePhase(QueryExecution.scala:183)
at org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:75)
at org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:73)
at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:65)
at org.apache.spark.sql.Dataset$.$anonfun$ofRows$1(Dataset.scala:90)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:88)
at org.apache.spark.sql.Dataset.withPlan(Dataset.scala:3734)
at org.apache.spark.sql.Dataset.select(Dataset.scala:1454)
at org.apache.spark.sql.Dataset.withColumns(Dataset.scala:2417)
at org.apache.spark.sql.Dataset.withColumn(Dataset.scala:2384)
at org.apache.iceberg.spark.source.parquet.IcebergSourceFlatParquetDataWriteBenchmark.benchmarkData(IcebergSourceFlatParquetDataWriteBenchmark.java:84)
at org.apache.iceberg.spark.source.parquet.IcebergSourceFlatParquetDataWriteBenchmark.writeIceberg(IcebergSourceFlatParquetDataWriteBenchmark.java:65)
at org.apache.iceberg.spark.source.parquet.jmh_generated.IcebergSourceFlatParquetDataWriteBenchmark_writeIceberg_jmhTest.writeIceberg_ss_jmhStub(IcebergSourceFlatParquetDataWriteBenchmark_writeIceberg_jmhTest.java:416)
at org.apache.iceberg.spark.source.parquet.jmh_generated.IcebergSourceFlatParquetDataWriteBenchmark_writeIceberg_jmhTest.writeIceberg_SingleShotTime(IcebergSourceFlatParquetDataWriteBenchmark_writeIceberg_jmhTest.java:371)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:566)
at org.openjdk.jmh.runner.BenchmarkHandler$BenchmarkTask.call(BenchmarkHandler.java:470)
at org.openjdk.jmh.runner.BenchmarkHandler$BenchmarkTask.call(BenchmarkHandler.java:453)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:829)
# Run complete. Total time: 00:00:06
REMEMBER: The numbers below are just data. To gain reusable insights, you need to follow up on
why the numbers are the way they are. Use profilers (see -prof, -lprof), design factorial
experiments, perform baseline and negative tests that provide experimental control, make sure
the benchmarking environment is safe on JVM/OS/HW level, ask for reviews from the domain experts.
Do not assume the numbers tell you what you want them to tell.
Benchmark Mode Cnt Score Error Units
Benchmark result is saved to /Users/kylebendickson/repos/iceberg/spark/v3.2/spark/build/results/jmh/results.txt
```
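The key line in the output above is the `AnalysisException`: `date_add`'s second argument must be int (or smallint/tinyint), but `longCol % 20` is bigint because `longCol` is a bigint column. Judging from the error, the fix needs to narrow the modulo result to int before passing it to `date_add`. A minimal plain-Python sketch of the intended `dateCol` semantics (the function name and the base-date parameter here are illustrative, not from the patch):

```python
from datetime import date, timedelta

def date_col(long_col: int, base: date) -> date:
    """Sketch of the benchmark's dateCol expression:
    date_add(current_date(), cast(longCol % 20 as int))."""
    # Spark's date_add requires an int day count; int() mirrors the
    # cast that narrows the bigint modulo result in the corrected SQL.
    days = int(long_col % 20)
    return base + timedelta(days=days)

# For example, row 45 lands 5 days after the base date:
print(date_col(45, date(2022, 10, 14)))  # 2022-10-19
```

The cast is safe because the modulo result is always in [0, 19], well within int range.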
This patch resolves the issue though (at least on 3.2).
Thanks @dramaticlly!
[GitHub] [iceberg] szehon-ho commented on pull request #5991: Spark: Fix DATE_ADD expression in IcebergSourceFlatParquetDataWriteBenchmark
szehon-ho commented on PR #5991:
URL: https://github.com/apache/iceberg/pull/5991#issuecomment-1281607715
Merged, thanks @dramaticlly and thanks @kbendick for confirming!
[GitHub] [iceberg] szehon-ho commented on pull request #5991: Spark: Fix DATE_ADD expression in IcebergSourceFlatParquetDataWriteBenchmark
szehon-ho commented on PR #5991:
URL: https://github.com/apache/iceberg/pull/5991#issuecomment-1281669251
The build works locally for me; the post-commit build failure may be a transient error.
[GitHub] [iceberg] szehon-ho merged pull request #5991: Spark: Fix DATE_ADD expression in IcebergSourceFlatParquetDataWriteBenchmark
szehon-ho merged PR #5991:
URL: https://github.com/apache/iceberg/pull/5991