Posted to reviews@spark.apache.org by maropu <gi...@git.apache.org> on 2018/05/10 04:48:03 UTC
[GitHub] spark pull request #21288: [SPARK-24206][SQL] Improve FilterPushdownBenchmar...
GitHub user maropu opened a pull request:
https://github.com/apache/spark/pull/21288
[SPARK-24206][SQL] Improve FilterPushdownBenchmark benchmark code
## What changes were proposed in this pull request?
This PR adds benchmark code to `FilterPushdownBenchmark` for string pushdown and updates the performance results.
## How was this patch tested?
N/A
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/maropu/spark UpdateParquetBenchmark
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/21288.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #21288
----
commit 223bf2008abfe5fd41c3b5e741dc525ab3864977
Author: Takeshi Yamamuro <ya...@...>
Date: 2018-05-03T00:17:21Z
Fix
----
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #21288: [SPARK-24206][SQL] Improve FilterPushdownBenchmar...
Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:
https://github.com/apache/spark/pull/21288#discussion_r190382044
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/FilterPushdownBenchmark.scala ---
@@ -0,0 +1,437 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.benchmark
+
+import java.io.File
+
+import scala.util.{Random, Try}
+
+import org.apache.spark.SparkConf
+import org.apache.spark.sql.{DataFrame, SparkSession}
+import org.apache.spark.sql.functions.monotonically_increasing_id
+import org.apache.spark.sql.internal.SQLConf
+import org.apache.spark.util.{Benchmark, Utils}
+
+
+/**
+ * Benchmark to measure read performance with Filter pushdown.
+ * To run this:
+ * spark-submit --class <this class> <spark sql test jar>
+ */
+object FilterPushdownBenchmark {
+ val conf = new SparkConf()
+ .setAppName("FilterPushdownBenchmark")
+ .setIfMissing("spark.master", "local[1]")
+ .setIfMissing("spark.driver.memory", "3g")
+ .setIfMissing("spark.executor.memory", "3g")
+ .setIfMissing("orc.compression", "snappy")
+ .setIfMissing("spark.sql.parquet.compression.codec", "snappy")
+
+ private val spark = SparkSession.builder().config(conf).getOrCreate()
+
+ def withTempPath(f: File => Unit): Unit = {
+ val path = Utils.createTempDir()
+ path.delete()
+ try f(path) finally Utils.deleteRecursively(path)
+ }
+
+ def withTempTable(tableNames: String*)(f: => Unit): Unit = {
+ try f finally tableNames.foreach(spark.catalog.dropTempView)
+ }
+
+ def withSQLConf(pairs: (String, String)*)(f: => Unit): Unit = {
+ val (keys, values) = pairs.unzip
+ val currentValues = keys.map(key => Try(spark.conf.get(key)).toOption)
+ (keys, values).zipped.foreach(spark.conf.set)
+ try f finally {
+ keys.zip(currentValues).foreach {
+ case (key, Some(value)) => spark.conf.set(key, value)
+ case (key, None) => spark.conf.unset(key)
+ }
+ }
+ }
+
+ private def prepareTable(
+ dir: File, numRows: Int, width: Int, useStringForValue: Boolean): Unit = {
+ import spark.implicits._
+ val selectExpr = (1 to width).map(i => s"CAST(value AS STRING) c$i")
+ val valueCol = if (useStringForValue) {
+ monotonically_increasing_id().cast("string")
+ } else {
+ monotonically_increasing_id()
+ }
+ val df = spark.range(numRows).map(_ => Random.nextLong).selectExpr(selectExpr: _*)
+ .withColumn("value", valueCol)
+ .sort("value")
+
+ saveAsOrcTable(df, dir.getCanonicalPath + "/orc")
+ saveAsParquetTable(df, dir.getCanonicalPath + "/parquet")
+ }
+
+ private def prepareStringDictTable(
+ dir: File, numRows: Int, numDistinctValues: Int, width: Int): Unit = {
+ val selectExpr = (0 to width).map {
+ case 0 => s"CAST(id % $numDistinctValues AS STRING) AS value"
+ case i => s"CAST(rand() AS STRING) c$i"
+ }
+ val df = spark.range(numRows).selectExpr(selectExpr: _*).sort("value")
+
+ saveAsOrcTable(df, dir.getCanonicalPath + "/orc")
+ saveAsParquetTable(df, dir.getCanonicalPath + "/parquet")
+ }
+
+ private def saveAsOrcTable(df: DataFrame, dir: String): Unit = {
+ df.write.mode("overwrite").orc(dir)
+ spark.read.orc(dir).createOrReplaceTempView("orcTable")
+ }
+
+ private def saveAsParquetTable(df: DataFrame, dir: String): Unit = {
+ df.write.mode("overwrite").parquet(dir)
+ spark.read.parquet(dir).createOrReplaceTempView("parquetTable")
+ }
+
+ def filterPushDownBenchmark(
+ values: Int,
+ title: String,
+ whereExpr: String,
+ selectExpr: String = "*"): Unit = {
+ val benchmark = new Benchmark(title, values, minNumIters = 5)
+
+ Seq(false, true).foreach { pushDownEnabled =>
+ val name = s"Parquet Vectorized ${if (pushDownEnabled) s"(Pushdown)" else ""}"
+ benchmark.addCase(name) { _ =>
+ withSQLConf(SQLConf.PARQUET_FILTER_PUSHDOWN_ENABLED.key -> s"$pushDownEnabled") {
+ spark.sql(s"SELECT $selectExpr FROM parquetTable WHERE $whereExpr").collect()
+ }
+ }
+ }
+
+ Seq(false, true).foreach { pushDownEnabled =>
+ val name = s"Native ORC Vectorized ${if (pushDownEnabled) s"(Pushdown)" else ""}"
+ benchmark.addCase(name) { _ =>
+ withSQLConf(SQLConf.ORC_FILTER_PUSHDOWN_ENABLED.key -> s"$pushDownEnabled") {
+ spark.sql(s"SELECT $selectExpr FROM orcTable WHERE $whereExpr").collect()
+ }
+ }
+ }
+
+ /*
+ Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz
+ Select 0 string row (value IS NULL): Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
+ ------------------------------------------------------------------------------------------------
+ Parquet Vectorized 8452 / 8504 1.9 537.3 1.0X
+ Parquet Vectorized (Pushdown) 274 / 281 57.3 17.4 30.8X
+ Native ORC Vectorized 8167 / 8185 1.9 519.3 1.0X
+ Native ORC Vectorized (Pushdown) 365 / 379 43.1 23.2 23.1X
+
+
+ Select 0 string row
+ ('7864320' < value < '7864320'): Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
+ ------------------------------------------------------------------------------------------------
+ Parquet Vectorized 8532 / 8564 1.8 542.4 1.0X
+ Parquet Vectorized (Pushdown) 366 / 386 43.0 23.3 23.3X
+ Native ORC Vectorized 8289 / 8300 1.9 527.0 1.0X
+ Native ORC Vectorized (Pushdown) 378 / 385 41.6 24.0 22.6X
+
+
+ Select 1 string row (value = '7864320'): Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
+ ------------------------------------------------------------------------------------------------
+ Parquet Vectorized 8547 / 8564 1.8 543.4 1.0X
+ Parquet Vectorized (Pushdown) 351 / 356 44.9 22.3 24.4X
+ Native ORC Vectorized 8310 / 8323 1.9 528.3 1.0X
+ Native ORC Vectorized (Pushdown) 370 / 375 42.5 23.5 23.1X
+
+
+ Select 1 string row
+ (value <=> '7864320'): Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
+ ------------------------------------------------------------------------------------------------
+ Parquet Vectorized 8537 / 8563 1.8 542.8 1.0X
+ Parquet Vectorized (Pushdown) 310 / 319 50.7 19.7 27.5X
+ Native ORC Vectorized 8316 / 8335 1.9 528.7 1.0X
+ Native ORC Vectorized (Pushdown) 364 / 367 43.2 23.1 23.5X
+
+
+ Select 1 string row
+ ('7864320' <= value <= '7864320'): Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
+ ------------------------------------------------------------------------------------------------
+ Parquet Vectorized 8594 / 8607 1.8 546.4 1.0X
+ Parquet Vectorized (Pushdown) 370 / 374 42.5 23.5 23.2X
+ Native ORC Vectorized 8350 / 8358 1.9 530.9 1.0X
+ Native ORC Vectorized (Pushdown) 371 / 374 42.4 23.6 23.2X
+
+
+ Select all string rows
+ (value IS NOT NULL): Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
+ ------------------------------------------------------------------------------------------------
+ Parquet Vectorized 19601 / 19625 0.8 1246.2 1.0X
+ Parquet Vectorized (Pushdown) 19698 / 19703 0.8 1252.3 1.0X
+ Native ORC Vectorized 19435 / 19470 0.8 1235.6 1.0X
+ Native ORC Vectorized (Pushdown) 19568 / 19590 0.8 1244.1 1.0X
+
+
+ Select 0 int row (value IS NULL): Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
+ ------------------------------------------------------------------------------------------------
+ Parquet Vectorized 7815 / 7824 2.0 496.9 1.0X
+ Parquet Vectorized (Pushdown) 245 / 251 64.2 15.6 31.9X
+ Native ORC Vectorized 7436 / 7460 2.1 472.8 1.1X
+ Native ORC Vectorized (Pushdown) 344 / 351 45.7 21.9 22.7X
+
+
+ Select 0 int row
+ (7864320 < value < 7864320): Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
+ ------------------------------------------------------------------------------------------------
+ Parquet Vectorized 7792 / 7807 2.0 495.4 1.0X
+ Parquet Vectorized (Pushdown) 349 / 353 45.1 22.2 22.3X
+ Native ORC Vectorized 7451 / 7465 2.1 473.7 1.0X
+ Native ORC Vectorized (Pushdown) 365 / 368 43.0 23.2 21.3X
+
+
+ Select 1 int row (value = 7864320): Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
+ ------------------------------------------------------------------------------------------------
+ Parquet Vectorized 7836 / 7872 2.0 498.2 1.0X
+ Parquet Vectorized (Pushdown) 322 / 327 48.8 20.5 24.3X
+ Native ORC Vectorized 7533 / 7540 2.1 478.9 1.0X
+ Native ORC Vectorized (Pushdown) 358 / 363 43.9 22.8 21.9X
+
+
+ Select 1 int row (value <=> 7864320): Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
+ ------------------------------------------------------------------------------------------------
+ Parquet Vectorized 7855 / 7870 2.0 499.4 1.0X
+ Parquet Vectorized (Pushdown) 286 / 297 54.9 18.2 27.4X
+ Native ORC Vectorized 7511 / 7557 2.1 477.5 1.0X
+ Native ORC Vectorized (Pushdown) 358 / 361 43.9 22.8 21.9X
+
+
+ Select 1 int row
+ (7864320 <= value <= 7864320): Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
+ ------------------------------------------------------------------------------------------------
+ Parquet Vectorized 7851 / 7870 2.0 499.2 1.0X
+ Parquet Vectorized (Pushdown) 345 / 347 45.6 21.9 22.8X
+ Native ORC Vectorized 7543 / 7554 2.1 479.6 1.0X
+ Native ORC Vectorized (Pushdown) 364 / 374 43.2 23.1 21.6X
+
+
+ Select 1 int row
+ (7864319 < value < 7864321): Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
+ ------------------------------------------------------------------------------------------------
+ Parquet Vectorized 7837 / 7840 2.0 498.2 1.0X
+ Parquet Vectorized (Pushdown) 338 / 339 46.6 21.5 23.2X
+ Native ORC Vectorized 7524 / 7541 2.1 478.3 1.0X
+ Native ORC Vectorized (Pushdown) 361 / 364 43.6 22.9 21.7X
+
+
+ Select 10% int rows (value < 1572864): Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
+ ------------------------------------------------------------------------------------------------
+ Parquet Vectorized 8864 / 8900 1.8 563.5 1.0X
+ Parquet Vectorized (Pushdown) 2088 / 2095 7.5 132.7 4.2X
+ Native ORC Vectorized 8562 / 8579 1.8 544.3 1.0X
+ Native ORC Vectorized (Pushdown) 2127 / 2131 7.4 135.2 4.2X
+
+
+ Select 50% int rows (value < 7864320): Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
+ ------------------------------------------------------------------------------------------------
+ Parquet Vectorized 12671 / 12684 1.2 805.6 1.0X
+ Parquet Vectorized (Pushdown) 9032 / 9041 1.7 574.2 1.4X
+ Native ORC Vectorized 12388 / 12411 1.3 787.6 1.0X
+ Native ORC Vectorized (Pushdown) 8873 / 8884 1.8 564.1 1.4X
+
+
+ Select 90% int rows (value < 14155776): Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
+ ------------------------------------------------------------------------------------------------
+ Parquet Vectorized 16481 / 16495 1.0 1047.8 1.0X
+ Parquet Vectorized (Pushdown) 15906 / 15919 1.0 1011.3 1.0X
+ Native ORC Vectorized 16224 / 16254 1.0 1031.5 1.0X
+ Native ORC Vectorized (Pushdown) 15632 / 15661 1.0 993.9 1.1X
+
+
+ Select all int rows (value IS NOT NULL): Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
+ ------------------------------------------------------------------------------------------------
+ Parquet Vectorized 17341 / 17354 0.9 1102.5 1.0X
+ Parquet Vectorized (Pushdown) 17463 / 17481 0.9 1110.2 1.0X
+ Native ORC Vectorized 17073 / 17089 0.9 1085.4 1.0X
+ Native ORC Vectorized (Pushdown) 17194 / 17232 0.9 1093.2 1.0X
+
+
+ Select all int rows (value > -1): Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
+ ------------------------------------------------------------------------------------------------
+ Parquet Vectorized 17452 / 17467 0.9 1109.6 1.0X
+ Parquet Vectorized (Pushdown) 17613 / 17630 0.9 1119.8 1.0X
+ Native ORC Vectorized 17259 / 17271 0.9 1097.3 1.0X
+ Native ORC Vectorized (Pushdown) 17385 / 17429 0.9 1105.3 1.0X
+
+
+ Select all int rows (value != -1): Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
+ ------------------------------------------------------------------------------------------------
+ Parquet Vectorized 17363 / 17372 0.9 1103.9 1.0X
+ Parquet Vectorized (Pushdown) 17526 / 17535 0.9 1114.2 1.0X
+ Native ORC Vectorized 17052 / 17089 0.9 1084.2 1.0X
+ Native ORC Vectorized (Pushdown) 17209 / 17229 0.9 1094.1 1.0X
+
+
+ Select 0 distinct string row
+ (value IS NULL): Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
+ ------------------------------------------------------------------------------------------------
+ Parquet Vectorized 7697 / 7751 2.0 489.4 1.0X
+ Parquet Vectorized (Pushdown) 264 / 284 59.5 16.8 29.1X
+ Native ORC Vectorized 6942 / 6970 2.3 441.4 1.1X
+ Native ORC Vectorized (Pushdown) 372 / 381 42.3 23.7 20.7X
+
+
+ Select 0 distinct string row
+ ('100' < value < '100'): Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
+ ------------------------------------------------------------------------------------------------
+ Parquet Vectorized 7983 / 8018 2.0 507.5 1.0X
+ Parquet Vectorized (Pushdown) 334 / 337 47.0 21.3 23.9X
+ Native ORC Vectorized 7307 / 7313 2.2 464.5 1.1X
+ Native ORC Vectorized (Pushdown) 363 / 371 43.3 23.1 22.0X
+
+
+ Select 1 distinct string row
+ (value = '100'): Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
+ ------------------------------------------------------------------------------------------------
+ Parquet Vectorized 7882 / 7915 2.0 501.1 1.0X
+ Parquet Vectorized (Pushdown) 504 / 522 31.2 32.1 15.6X
+ Native ORC Vectorized 7143 / 7155 2.2 454.1 1.1X
+ Native ORC Vectorized (Pushdown) 555 / 573 28.4 35.3 14.2X
+
+
+ Select 1 distinct string row
+ (value <=> '100'): Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
+ ------------------------------------------------------------------------------------------------
+ Parquet Vectorized 7898 / 7912 2.0 502.1 1.0X
+ Parquet Vectorized (Pushdown) 470 / 481 33.5 29.9 16.8X
+ Native ORC Vectorized 7135 / 7149 2.2 453.6 1.1X
+ Native ORC Vectorized (Pushdown) 552 / 557 28.5 35.1 14.3X
+
+
+ Select 1 distinct string row
+ ('100' <= value <= '100'): Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
+ ------------------------------------------------------------------------------------------------
+ Parquet Vectorized 8189 / 8213 1.9 520.7 1.0X
+ Parquet Vectorized (Pushdown) 527 / 534 29.9 33.5 15.5X
+ Native ORC Vectorized 7477 / 7498 2.1 475.3 1.1X
+ Native ORC Vectorized (Pushdown) 558 / 566 28.2 35.5 14.7X
+
+
+ Select all distinct string rows
+ (value IS NOT NULL): Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
+ ------------------------------------------------------------------------------------------------
+ Parquet Vectorized 19462 / 19476 0.8 1237.4 1.0X
+ Parquet Vectorized (Pushdown) 19570 / 19582 0.8 1244.2 1.0X
+ Native ORC Vectorized 18577 / 18604 0.8 1181.1 1.0X
+ Native ORC Vectorized (Pushdown) 18701 / 18742 0.8 1189.0 1.0X
+ */
+ benchmark.run()
+ }
+
+ private def runIntBenchmark(numRows: Int, width: Int, mid: Int): Unit = {
+ Seq("value IS NULL", s"$mid < value AND value < $mid").foreach { whereExpr =>
+ val title = s"Select 0 int row ($whereExpr)".replace("value AND value", "value")
+ filterPushDownBenchmark(numRows, title, whereExpr)
+ }
+
+ Seq(
+ s"value = $mid",
+ s"value <=> $mid",
+ s"$mid <= value AND value <= $mid",
+ s"${mid - 1} < value AND value < ${mid + 1}"
+ ).foreach { whereExpr =>
+ val title = s"Select 1 int row ($whereExpr)".replace("value AND value", "value")
+ filterPushDownBenchmark(numRows, title, whereExpr)
+ }
+
+ val selectExpr = (1 to width).map(i => s"MAX(c$i)").mkString("", ",", ", MAX(value)")
+
+ Seq(10, 50, 90).foreach { percent =>
+ filterPushDownBenchmark(
+ numRows,
+ s"Select $percent% int rows (value < ${numRows * percent / 100})",
+ s"value < ${numRows * percent / 100}",
+ selectExpr
+ )
+ }
+
+ Seq("value IS NOT NULL", "value > -1", "value != -1").foreach { whereExpr =>
+ filterPushDownBenchmark(
+ numRows,
+ s"Select all int rows ($whereExpr)",
+ whereExpr,
+ selectExpr)
+ }
+ }
+
+ private def runStringBenchmark(
+ numRows: Int, width: Int, searchValue: Int, colType: String): Unit = {
+ Seq("value IS NULL", s"'$searchValue' < value AND value < '$searchValue'")
+ .foreach { whereExpr =>
+ val title = s"Select 0 $colType row ($whereExpr)".replace("value AND value", "value")
+ filterPushDownBenchmark(numRows, title, whereExpr)
+ }
+
+ Seq(
+ s"value = '$searchValue'",
+ s"value <=> '$searchValue'",
+ s"'$searchValue' <= value AND value <= '$searchValue'"
+ ).foreach { whereExpr =>
+ val title = s"Select 1 $colType row ($whereExpr)".replace("value AND value", "value")
+ filterPushDownBenchmark(numRows, title, whereExpr)
+ }
+
+ val selectExpr = (1 to width).map(i => s"MAX(c$i)").mkString("", ",", ", MAX(value)")
+
+ Seq("value IS NOT NULL").foreach { whereExpr =>
+ filterPushDownBenchmark(
+ numRows,
+ s"Select all $colType rows ($whereExpr)",
+ whereExpr,
+ selectExpr)
+ }
+ }
+
+ def main(args: Array[String]): Unit = {
+ val numRows = 1024 * 1024 * 15
+ val width = 5
+
+ // Pushdown for many distinct value case
+ withTempPath { dir =>
+ val mid = numRows / 2
+
+ withTempTable("orcTable", "parquetTable") {
+ Seq(true, false).foreach { useStringForValue =>
+ prepareTable(dir, numRows, width, useStringForValue)
+ if (useStringForValue) {
+ runStringBenchmark(numRows, width, mid, "string")
+ } else {
+ runIntBenchmark(numRows, width, mid)
+ }
+ }
+ }
+ }
+
+ // Pushdown for few distinct value case (use dictionary encoding)
--- End diff --
Let us add a comment and also change the conf?
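For readers checking the result tables in the diff above: the `Rate(M/s)`, `Per Row(ns)`, and `Relative` columns appear to follow from the best time of each case and the row count (`numRows = 1024 * 1024 * 15` in `main`), with `Relative` measured against the first case in each table. The sketch below is only an illustration of that arithmetic, not Spark's `Benchmark` implementation; the object and method names are hypothetical. Because the tables show times rounded to whole milliseconds, recomputed values can differ from the printed ones in the last digit.

```scala
object BenchmarkColumns {
  // Row count used by main() in the quoted benchmark.
  val numRows: Long = 1024L * 1024L * 15L

  /** Millions of rows processed per second, given the best time in ms. */
  def rateMPerSec(bestMs: Double): Double = numRows / (bestMs / 1000.0) / 1e6

  /** Nanoseconds spent per row, given the best time in ms. */
  def perRowNs(bestMs: Double): Double = bestMs * 1e6 / numRows

  /** Speed-up of a case relative to the baseline (first) case. */
  def relative(baselineBestMs: Double, bestMs: Double): Double = baselineBestMs / bestMs

  def main(args: Array[String]): Unit = {
    // Best times (ms) taken from the "Select 0 string row (value IS NULL)" table.
    val baseline = 8452.0 // Parquet Vectorized
    val pushdown = 274.0  // Parquet Vectorized (Pushdown)
    println(f"baseline: ${rateMPerSec(baseline)}%.1f M/s, ${perRowNs(baseline)}%.1f ns/row")
    println(f"pushdown: ${rateMPerSec(pushdown)}%.1f M/s, ${perRowNs(pushdown)}%.1f ns/row")
    println(f"relative: ${relative(baseline, pushdown)}%.1fX")
  }
}
```

Read this way, a `Relative` value near 1.0X on a pushdown row means the filter did not reduce the scan cost for that query.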
---
[GitHub] spark issue #21288: [SPARK-24206][SQL] Improve FilterPushdownBenchmark bench...
Posted by maropu <gi...@git.apache.org>.
Github user maropu commented on the issue:
https://github.com/apache/spark/pull/21288
I've checked the metrics and found that GC happened in the case of `--driver-memory 3g`.
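One way to check for this kind of GC activity from inside the JVM is to read the standard garbage-collector MXBeans before and after a workload. The following is a minimal sketch, not the method used in this PR; `GcProbe`, `gcTotals`, and `withGcReport` are hypothetical names, and only JDK APIs from `java.lang.management` are assumed.

```scala
import java.lang.management.ManagementFactory

// Deprecated in Scala 2.13 in favour of scala.jdk.CollectionConverters, but still available.
import scala.collection.JavaConverters._

object GcProbe {
  /** Total collection count and accumulated GC time (ms) across all collectors. */
  def gcTotals(): (Long, Long) = {
    val beans = ManagementFactory.getGarbageCollectorMXBeans.asScala
    // Both getters return -1 when the value is undefined for a collector; clamp to 0.
    val count = beans.map(b => math.max(b.getCollectionCount, 0L)).sum
    val timeMs = beans.map(b => math.max(b.getCollectionTime, 0L)).sum
    (count, timeMs)
  }

  /** Runs `body` and prints how many collections happened while it ran. */
  def withGcReport[T](label: String)(body: => T): T = {
    val (countBefore, timeBefore) = gcTotals()
    val result = body
    val (countAfter, timeAfter) = gcTotals()
    println(s"$label: ${countAfter - countBefore} GC runs, ${timeAfter - timeBefore} ms in GC")
    result
  }

  def main(args: Array[String]): Unit = {
    // Allocate some short-lived arrays so the collectors have work to do.
    withGcReport("allocation loop") {
      (1 to 200).foreach(_ => new Array[Byte](1024 * 1024))
    }
  }
}
```

Wrapping a benchmark case in such a helper would show whether a slow run coincides with collections, which could then be compared across different driver memory settings.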
---
[GitHub] spark issue #21288: [SPARK-24206][SQL] Improve FilterPushdownBenchmark bench...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/21288
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/4112/
Test PASSed.
---
[GitHub] spark pull request #21288: [SPARK-24206][SQL] Improve FilterPushdownBenchmar...
Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:
https://github.com/apache/spark/pull/21288#discussion_r191650406
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/FilterPushdownBenchmark.scala ---
@@ -131,211 +132,214 @@ object FilterPushdownBenchmark {
}
/*
+ OpenJDK 64-Bit Server VM 1.8.0_171-b10 on Linux 4.14.26-46.32.amzn1.x86_64
Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz
Select 0 string row (value IS NULL): Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------
- Parquet Vectorized 8452 / 8504 1.9 537.3 1.0X
- Parquet Vectorized (Pushdown) 274 / 281 57.3 17.4 30.8X
- Native ORC Vectorized 8167 / 8185 1.9 519.3 1.0X
- Native ORC Vectorized (Pushdown) 365 / 379 43.1 23.2 23.1X
+ Parquet Vectorized 2961 / 3123 5.3 188.3 1.0X
+ Parquet Vectorized (Pushdown) 3057 / 3121 5.1 194.4 1.0X
--- End diff --
How about 2.3?
---
[GitHub] spark issue #21288: [SPARK-24206][SQL] Improve FilterPushdownBenchmark bench...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/21288
**[Test build #90440 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90440/testReport)** for PR 21288 at commit [`223bf20`](https://github.com/apache/spark/commit/223bf2008abfe5fd41c3b5e741dc525ab3864977).
* This patch **fails due to an unknown error code, -9**.
* This patch merges cleanly.
* This patch adds no public classes.
---
[GitHub] spark issue #21288: [SPARK-24206][SQL] Improve FilterPushdownBenchmark bench...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/21288
Merged build finished. Test FAILed.
---
[GitHub] spark issue #21288: [SPARK-24206][SQL] Improve FilterPushdownBenchmark bench...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/21288
**[Test build #91795 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91795/testReport)** for PR 21288 at commit [`d41e689`](https://github.com/apache/spark/commit/d41e68914e00a7ba6734b3fdbe839b130fbbd42e).
---
[GitHub] spark pull request #21288: [SPARK-24206][SQL] Improve FilterPushdownBenchmar...
Posted by dongjoon-hyun <gi...@git.apache.org>.
Github user dongjoon-hyun commented on a diff in the pull request:
https://github.com/apache/spark/pull/21288#discussion_r189638131
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/FilterPushdownBenchmark.scala ---
@@ -0,0 +1,437 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.benchmark
+
+import java.io.File
+
+import scala.util.{Random, Try}
+
+import org.apache.spark.SparkConf
+import org.apache.spark.sql.{DataFrame, SparkSession}
+import org.apache.spark.sql.functions.monotonically_increasing_id
+import org.apache.spark.sql.internal.SQLConf
+import org.apache.spark.util.{Benchmark, Utils}
+
+
+/**
+ * Benchmark to measure read performance with Filter pushdown.
+ * To run this:
+ * spark-submit --class <this class> <spark sql test jar>
+ */
+object FilterPushdownBenchmark {
+ val conf = new SparkConf()
+ .setAppName("FilterPushdownBenchmark")
+ .setIfMissing("spark.master", "local[1]")
+ .setIfMissing("spark.driver.memory", "3g")
+ .setIfMissing("spark.executor.memory", "3g")
+ .setIfMissing("orc.compression", "snappy")
+ .setIfMissing("spark.sql.parquet.compression.codec", "snappy")
+
+ private val spark = SparkSession.builder().config(conf).getOrCreate()
+
+ def withTempPath(f: File => Unit): Unit = {
+ val path = Utils.createTempDir()
+ path.delete()
+ try f(path) finally Utils.deleteRecursively(path)
+ }
+
+ def withTempTable(tableNames: String*)(f: => Unit): Unit = {
+ try f finally tableNames.foreach(spark.catalog.dropTempView)
+ }
+
+ def withSQLConf(pairs: (String, String)*)(f: => Unit): Unit = {
+ val (keys, values) = pairs.unzip
+ val currentValues = keys.map(key => Try(spark.conf.get(key)).toOption)
+ (keys, values).zipped.foreach(spark.conf.set)
+ try f finally {
+ keys.zip(currentValues).foreach {
+ case (key, Some(value)) => spark.conf.set(key, value)
+ case (key, None) => spark.conf.unset(key)
+ }
+ }
+ }
+
+ private def prepareTable(
+ dir: File, numRows: Int, width: Int, useStringForValue: Boolean): Unit = {
+ import spark.implicits._
+ val selectExpr = (1 to width).map(i => s"CAST(value AS STRING) c$i")
+ val valueCol = if (useStringForValue) {
+ monotonically_increasing_id().cast("string")
+ } else {
+ monotonically_increasing_id()
+ }
+ val df = spark.range(numRows).map(_ => Random.nextLong).selectExpr(selectExpr: _*)
+ .withColumn("value", valueCol)
+ .sort("value")
+
+ saveAsOrcTable(df, dir.getCanonicalPath + "/orc")
+ saveAsParquetTable(df, dir.getCanonicalPath + "/parquet")
+ }
+
+ private def prepareStringDictTable(
+ dir: File, numRows: Int, numDistinctValues: Int, width: Int): Unit = {
+ val selectExpr = (0 to width).map {
+ case 0 => s"CAST(id % $numDistinctValues AS STRING) AS value"
+ case i => s"CAST(rand() AS STRING) c$i"
+ }
+ val df = spark.range(numRows).selectExpr(selectExpr: _*).sort("value")
+
+ saveAsOrcTable(df, dir.getCanonicalPath + "/orc")
+ saveAsParquetTable(df, dir.getCanonicalPath + "/parquet")
+ }
+
+ private def saveAsOrcTable(df: DataFrame, dir: String): Unit = {
+ df.write.mode("overwrite").orc(dir)
+ spark.read.orc(dir).createOrReplaceTempView("orcTable")
+ }
+
+ private def saveAsParquetTable(df: DataFrame, dir: String): Unit = {
+ df.write.mode("overwrite").parquet(dir)
+ spark.read.parquet(dir).createOrReplaceTempView("parquetTable")
+ }
+
+ def filterPushDownBenchmark(
+ values: Int,
+ title: String,
+ whereExpr: String,
+ selectExpr: String = "*"): Unit = {
+ val benchmark = new Benchmark(title, values, minNumIters = 5)
+
+ Seq(false, true).foreach { pushDownEnabled =>
+ val name = s"Parquet Vectorized ${if (pushDownEnabled) s"(Pushdown)" else ""}"
+ benchmark.addCase(name) { _ =>
+ withSQLConf(SQLConf.PARQUET_FILTER_PUSHDOWN_ENABLED.key -> s"$pushDownEnabled") {
+ spark.sql(s"SELECT $selectExpr FROM parquetTable WHERE $whereExpr").collect()
+ }
+ }
+ }
+
+ Seq(false, true).foreach { pushDownEnabled =>
+ val name = s"Native ORC Vectorized ${if (pushDownEnabled) s"(Pushdown)" else ""}"
+ benchmark.addCase(name) { _ =>
+ withSQLConf(SQLConf.ORC_FILTER_PUSHDOWN_ENABLED.key -> s"$pushDownEnabled") {
+ spark.sql(s"SELECT $selectExpr FROM orcTable WHERE $whereExpr").collect()
+ }
+ }
+ }
+
+ /*
+ Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz
+ Select 0 string row (value IS NULL): Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
+ ------------------------------------------------------------------------------------------------
+ Parquet Vectorized 8452 / 8504 1.9 537.3 1.0X
+ Parquet Vectorized (Pushdown) 274 / 281 57.3 17.4 30.8X
+ Native ORC Vectorized 8167 / 8185 1.9 519.3 1.0X
+ Native ORC Vectorized (Pushdown) 365 / 379 43.1 23.2 23.1X
+
+
+ Select 0 string row
+ ('7864320' < value < '7864320'): Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
+ ------------------------------------------------------------------------------------------------
+ Parquet Vectorized 8532 / 8564 1.8 542.4 1.0X
+ Parquet Vectorized (Pushdown) 366 / 386 43.0 23.3 23.3X
+ Native ORC Vectorized 8289 / 8300 1.9 527.0 1.0X
+ Native ORC Vectorized (Pushdown) 378 / 385 41.6 24.0 22.6X
+
+
+ Select 1 string row (value = '7864320'): Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
+ ------------------------------------------------------------------------------------------------
+ Parquet Vectorized 8547 / 8564 1.8 543.4 1.0X
+ Parquet Vectorized (Pushdown) 351 / 356 44.9 22.3 24.4X
+ Native ORC Vectorized 8310 / 8323 1.9 528.3 1.0X
+ Native ORC Vectorized (Pushdown) 370 / 375 42.5 23.5 23.1X
+
+
+ Select 1 string row
+ (value <=> '7864320'): Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
+ ------------------------------------------------------------------------------------------------
+ Parquet Vectorized 8537 / 8563 1.8 542.8 1.0X
+ Parquet Vectorized (Pushdown) 310 / 319 50.7 19.7 27.5X
+ Native ORC Vectorized 8316 / 8335 1.9 528.7 1.0X
+ Native ORC Vectorized (Pushdown) 364 / 367 43.2 23.1 23.5X
+
+
+ Select 1 string row
+ ('7864320' <= value <= '7864320'): Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
+ ------------------------------------------------------------------------------------------------
+ Parquet Vectorized 8594 / 8607 1.8 546.4 1.0X
+ Parquet Vectorized (Pushdown) 370 / 374 42.5 23.5 23.2X
+ Native ORC Vectorized 8350 / 8358 1.9 530.9 1.0X
+ Native ORC Vectorized (Pushdown) 371 / 374 42.4 23.6 23.2X
+
+
+ Select all string rows
+ (value IS NOT NULL): Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
+ ------------------------------------------------------------------------------------------------
+ Parquet Vectorized 19601 / 19625 0.8 1246.2 1.0X
+ Parquet Vectorized (Pushdown) 19698 / 19703 0.8 1252.3 1.0X
+ Native ORC Vectorized 19435 / 19470 0.8 1235.6 1.0X
+ Native ORC Vectorized (Pushdown) 19568 / 19590 0.8 1244.1 1.0X
+
+
+ Select 0 int row (value IS NULL): Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
+ ------------------------------------------------------------------------------------------------
+ Parquet Vectorized 7815 / 7824 2.0 496.9 1.0X
+ Parquet Vectorized (Pushdown) 245 / 251 64.2 15.6 31.9X
+ Native ORC Vectorized 7436 / 7460 2.1 472.8 1.1X
+ Native ORC Vectorized (Pushdown) 344 / 351 45.7 21.9 22.7X
+
+
+ Select 0 int row
+ (7864320 < value < 7864320): Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
+ ------------------------------------------------------------------------------------------------
+ Parquet Vectorized 7792 / 7807 2.0 495.4 1.0X
+ Parquet Vectorized (Pushdown) 349 / 353 45.1 22.2 22.3X
+ Native ORC Vectorized 7451 / 7465 2.1 473.7 1.0X
+ Native ORC Vectorized (Pushdown) 365 / 368 43.0 23.2 21.3X
+
+
+ Select 1 int row (value = 7864320): Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
+ ------------------------------------------------------------------------------------------------
+ Parquet Vectorized 7836 / 7872 2.0 498.2 1.0X
+ Parquet Vectorized (Pushdown) 322 / 327 48.8 20.5 24.3X
+ Native ORC Vectorized 7533 / 7540 2.1 478.9 1.0X
+ Native ORC Vectorized (Pushdown) 358 / 363 43.9 22.8 21.9X
+
+
+ Select 1 int row (value <=> 7864320): Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
+ ------------------------------------------------------------------------------------------------
+ Parquet Vectorized 7855 / 7870 2.0 499.4 1.0X
+ Parquet Vectorized (Pushdown) 286 / 297 54.9 18.2 27.4X
+ Native ORC Vectorized 7511 / 7557 2.1 477.5 1.0X
+ Native ORC Vectorized (Pushdown) 358 / 361 43.9 22.8 21.9X
+
+
+ Select 1 int row
+ (7864320 <= value <= 7864320): Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
+ ------------------------------------------------------------------------------------------------
+ Parquet Vectorized 7851 / 7870 2.0 499.2 1.0X
+ Parquet Vectorized (Pushdown) 345 / 347 45.6 21.9 22.8X
+ Native ORC Vectorized 7543 / 7554 2.1 479.6 1.0X
+ Native ORC Vectorized (Pushdown) 364 / 374 43.2 23.1 21.6X
+
+
+ Select 1 int row
+ (7864319 < value < 7864321): Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
+ ------------------------------------------------------------------------------------------------
+ Parquet Vectorized 7837 / 7840 2.0 498.2 1.0X
+ Parquet Vectorized (Pushdown) 338 / 339 46.6 21.5 23.2X
+ Native ORC Vectorized 7524 / 7541 2.1 478.3 1.0X
+ Native ORC Vectorized (Pushdown) 361 / 364 43.6 22.9 21.7X
+
+
+ Select 10% int rows (value < 1572864): Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
+ ------------------------------------------------------------------------------------------------
+ Parquet Vectorized 8864 / 8900 1.8 563.5 1.0X
+ Parquet Vectorized (Pushdown) 2088 / 2095 7.5 132.7 4.2X
+ Native ORC Vectorized 8562 / 8579 1.8 544.3 1.0X
+ Native ORC Vectorized (Pushdown) 2127 / 2131 7.4 135.2 4.2X
+
+
+ Select 50% int rows (value < 7864320): Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
+ ------------------------------------------------------------------------------------------------
+ Parquet Vectorized 12671 / 12684 1.2 805.6 1.0X
+ Parquet Vectorized (Pushdown) 9032 / 9041 1.7 574.2 1.4X
+ Native ORC Vectorized 12388 / 12411 1.3 787.6 1.0X
+ Native ORC Vectorized (Pushdown) 8873 / 8884 1.8 564.1 1.4X
+
+
+ Select 90% int rows (value < 14155776): Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
+ ------------------------------------------------------------------------------------------------
+ Parquet Vectorized 16481 / 16495 1.0 1047.8 1.0X
+ Parquet Vectorized (Pushdown) 15906 / 15919 1.0 1011.3 1.0X
+ Native ORC Vectorized 16224 / 16254 1.0 1031.5 1.0X
+ Native ORC Vectorized (Pushdown) 15632 / 15661 1.0 993.9 1.1X
+
+
+ Select all int rows (value IS NOT NULL): Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
+ ------------------------------------------------------------------------------------------------
+ Parquet Vectorized 17341 / 17354 0.9 1102.5 1.0X
+ Parquet Vectorized (Pushdown) 17463 / 17481 0.9 1110.2 1.0X
+ Native ORC Vectorized 17073 / 17089 0.9 1085.4 1.0X
+ Native ORC Vectorized (Pushdown) 17194 / 17232 0.9 1093.2 1.0X
+
+
+ Select all int rows (value > -1): Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
+ ------------------------------------------------------------------------------------------------
+ Parquet Vectorized 17452 / 17467 0.9 1109.6 1.0X
+ Parquet Vectorized (Pushdown) 17613 / 17630 0.9 1119.8 1.0X
+ Native ORC Vectorized 17259 / 17271 0.9 1097.3 1.0X
+ Native ORC Vectorized (Pushdown) 17385 / 17429 0.9 1105.3 1.0X
+
+
+ Select all int rows (value != -1): Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
+ ------------------------------------------------------------------------------------------------
+ Parquet Vectorized 17363 / 17372 0.9 1103.9 1.0X
+ Parquet Vectorized (Pushdown) 17526 / 17535 0.9 1114.2 1.0X
+ Native ORC Vectorized 17052 / 17089 0.9 1084.2 1.0X
+ Native ORC Vectorized (Pushdown) 17209 / 17229 0.9 1094.1 1.0X
+
+
+ Select 0 distinct string row
+ (value IS NULL): Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
+ ------------------------------------------------------------------------------------------------
+ Parquet Vectorized 7697 / 7751 2.0 489.4 1.0X
+ Parquet Vectorized (Pushdown) 264 / 284 59.5 16.8 29.1X
+ Native ORC Vectorized 6942 / 6970 2.3 441.4 1.1X
+ Native ORC Vectorized (Pushdown) 372 / 381 42.3 23.7 20.7X
+
+
+ Select 0 distinct string row
+ ('100' < value < '100'): Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
+ ------------------------------------------------------------------------------------------------
+ Parquet Vectorized 7983 / 8018 2.0 507.5 1.0X
+ Parquet Vectorized (Pushdown) 334 / 337 47.0 21.3 23.9X
+ Native ORC Vectorized 7307 / 7313 2.2 464.5 1.1X
+ Native ORC Vectorized (Pushdown) 363 / 371 43.3 23.1 22.0X
+
+
+ Select 1 distinct string row
+ (value = '100'): Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
+ ------------------------------------------------------------------------------------------------
+ Parquet Vectorized 7882 / 7915 2.0 501.1 1.0X
+ Parquet Vectorized (Pushdown) 504 / 522 31.2 32.1 15.6X
+ Native ORC Vectorized 7143 / 7155 2.2 454.1 1.1X
+ Native ORC Vectorized (Pushdown) 555 / 573 28.4 35.3 14.2X
+
+
+ Select 1 distinct string row
+ (value <=> '100'): Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
+ ------------------------------------------------------------------------------------------------
+ Parquet Vectorized 7898 / 7912 2.0 502.1 1.0X
+ Parquet Vectorized (Pushdown) 470 / 481 33.5 29.9 16.8X
+ Native ORC Vectorized 7135 / 7149 2.2 453.6 1.1X
+ Native ORC Vectorized (Pushdown) 552 / 557 28.5 35.1 14.3X
+
+
+ Select 1 distinct string row
+ ('100' <= value <= '100'): Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
+ ------------------------------------------------------------------------------------------------
+ Parquet Vectorized 8189 / 8213 1.9 520.7 1.0X
+ Parquet Vectorized (Pushdown) 527 / 534 29.9 33.5 15.5X
+ Native ORC Vectorized 7477 / 7498 2.1 475.3 1.1X
+ Native ORC Vectorized (Pushdown) 558 / 566 28.2 35.5 14.7X
+
+
+ Select all distinct string rows
+ (value IS NOT NULL): Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
+ ------------------------------------------------------------------------------------------------
+ Parquet Vectorized 19462 / 19476 0.8 1237.4 1.0X
+ Parquet Vectorized (Pushdown) 19570 / 19582 0.8 1244.2 1.0X
+ Native ORC Vectorized 18577 / 18604 0.8 1181.1 1.0X
+ Native ORC Vectorized (Pushdown) 18701 / 18742 0.8 1189.0 1.0X
+ */
+ benchmark.run()
+ }
+
+ private def runIntBenchmark(numRows: Int, width: Int, mid: Int): Unit = {
+ Seq("value IS NULL", s"$mid < value AND value < $mid").foreach { whereExpr =>
+ val title = s"Select 0 int row ($whereExpr)".replace("value AND value", "value")
+ filterPushDownBenchmark(numRows, title, whereExpr)
+ }
+
+ Seq(
+ s"value = $mid",
+ s"value <=> $mid",
+ s"$mid <= value AND value <= $mid",
+ s"${mid - 1} < value AND value < ${mid + 1}"
+ ).foreach { whereExpr =>
+ val title = s"Select 1 int row ($whereExpr)".replace("value AND value", "value")
+ filterPushDownBenchmark(numRows, title, whereExpr)
+ }
+
+ val selectExpr = (1 to width).map(i => s"MAX(c$i)").mkString("", ",", ", MAX(value)")
+
+ Seq(10, 50, 90).foreach { percent =>
+ filterPushDownBenchmark(
+ numRows,
+ s"Select $percent% int rows (value < ${numRows * percent / 100})",
+ s"value < ${numRows * percent / 100}",
+ selectExpr
+ )
+ }
+
+ Seq("value IS NOT NULL", "value > -1", "value != -1").foreach { whereExpr =>
+ filterPushDownBenchmark(
+ numRows,
+ s"Select all int rows ($whereExpr)",
+ whereExpr,
+ selectExpr)
+ }
+ }
+
+ private def runStringBenchmark(
+ numRows: Int, width: Int, searchValue: Int, colType: String): Unit = {
+ Seq("value IS NULL", s"'$searchValue' < value AND value < '$searchValue'")
+ .foreach { whereExpr =>
+ val title = s"Select 0 $colType row ($whereExpr)".replace("value AND value", "value")
+ filterPushDownBenchmark(numRows, title, whereExpr)
+ }
+
+ Seq(
+ s"value = '$searchValue'",
+ s"value <=> '$searchValue'",
+ s"'$searchValue' <= value AND value <= '$searchValue'"
+ ).foreach { whereExpr =>
+ val title = s"Select 1 $colType row ($whereExpr)".replace("value AND value", "value")
+ filterPushDownBenchmark(numRows, title, whereExpr)
+ }
+
+ val selectExpr = (1 to width).map(i => s"MAX(c$i)").mkString("", ",", ", MAX(value)")
+
+ Seq("value IS NOT NULL").foreach { whereExpr =>
+ filterPushDownBenchmark(
+ numRows,
+ s"Select all $colType rows ($whereExpr)",
+ whereExpr,
+ selectExpr)
+ }
+ }
+
+ def main(args: Array[String]): Unit = {
+ val numRows = 1024 * 1024 * 15
+ val width = 5
+
+ // Pushdown for many distinct value case
+ withTempPath { dir =>
+ val mid = numRows / 2
+
+ withTempTable("orcTable", "parquetTable") {
+ Seq(true, false).foreach { useStringForValue =>
+ prepareTable(dir, numRows, width, useStringForValue)
+ if (useStringForValue) {
+ runStringBenchmark(numRows, width, mid, "string")
+ } else {
+ runIntBenchmark(numRows, width, mid)
+ }
+ }
+ }
+ }
+
+ // Pushdown for few distinct value case (use dictionary encoding)
--- End diff --
So far, in the Apache Spark project, we have been testing with only **default** configurations. `snappy` is the only exception, because it's Spark's default compression and it makes the Parquet/ORC comparison easy to follow.
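For reading the quoted result tables in a Parquet/ORC comparison, the columns appear to be derived from the best time and the row count (`numRows = 1024 * 1024 * 15` in `main`). The following is a minimal sketch under that assumption; `TableColumns` and its method names are hypothetical and are not part of Spark's `Benchmark` utility.

```scala
// Hypothetical helper, not part of Spark: recomputes the table columns
// from the best time, assuming Rate, Per Row and Relative all derive from it.
object TableColumns {
  val numRows: Long = 1024L * 1024L * 15L // same value as numRows in main()

  // Rate(M/s): millions of rows processed per second.
  def rateMPerSec(bestMs: Double): Double = numRows / (bestMs / 1000.0) / 1e6

  // Per Row(ns): nanoseconds spent per row.
  def perRowNs(bestMs: Double): Double = bestMs * 1e6 / numRows

  // Relative: speed-up over the first case listed in the same table.
  def relative(baselineBestMs: Double, bestMs: Double): Double = baselineBestMs / bestMs

  def main(args: Array[String]): Unit = {
    // "Select 1 string row (value = '7864320')":
    // Parquet Vectorized 8547 ms, Parquet Vectorized (Pushdown) 351 ms.
    val rate = rateMPerSec(351)
    val perRow = perRowNs(351)
    val rel = relative(8547, 351)
    println(f"rate=$rate%.1f M/s, perRow=$perRow%.1f ns, relative=$rel%.1fX")
  }
}
```

The recomputed values land close to the reported row, with small differences because the table shows times rounded to whole milliseconds. If the assumption holds, the first row of each table is the 1.0X baseline, so the ORC rows are relative to Parquet without pushdown.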
---
[GitHub] spark issue #21288: [SPARK-24206][SQL] Improve FilterPushdownBenchmark bench...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/21288
**[Test build #90441 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90441/testReport)** for PR 21288 at commit [`8f60902`](https://github.com/apache/spark/commit/8f609023174c9f97bddc46bebe98f4ce3caf08c5).
---
[GitHub] spark issue #21288: [SPARK-24206][SQL] Improve FilterPushdownBenchmark bench...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/21288
Merged build finished. Test PASSed.
---
[GitHub] spark issue #21288: [SPARK-24206][SQL] Improve FilterPushdownBenchmark bench...
Posted by maropu <gi...@git.apache.org>.
Github user maropu commented on the issue:
https://github.com/apache/spark/pull/21288
retest this please
---
[GitHub] spark pull request #21288: [SPARK-24206][SQL] Improve FilterPushdownBenchmar...
Posted by maropu <gi...@git.apache.org>.
Github user maropu commented on a diff in the pull request:
https://github.com/apache/spark/pull/21288#discussion_r189780637
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/FilterPushdownBenchmark.scala ---
@@ -0,0 +1,437 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.benchmark
+
+import java.io.File
+
+import scala.util.{Random, Try}
+
+import org.apache.spark.SparkConf
+import org.apache.spark.sql.{DataFrame, SparkSession}
+import org.apache.spark.sql.functions.monotonically_increasing_id
+import org.apache.spark.sql.internal.SQLConf
+import org.apache.spark.util.{Benchmark, Utils}
+
+
+/**
+ * Benchmark to measure read performance with Filter pushdown.
+ * To run this:
+ * spark-submit --class <this class> <spark sql test jar>
+ */
+object FilterPushdownBenchmark {
+ val conf = new SparkConf()
+ .setAppName("FilterPushdownBenchmark")
+ .setIfMissing("spark.master", "local[1]")
+ .setIfMissing("spark.driver.memory", "3g")
+ .setIfMissing("spark.executor.memory", "3g")
+ .setIfMissing("orc.compression", "snappy")
+ .setIfMissing("spark.sql.parquet.compression.codec", "snappy")
+
+ private val spark = SparkSession.builder().config(conf).getOrCreate()
+
+ def withTempPath(f: File => Unit): Unit = {
+ val path = Utils.createTempDir()
+ path.delete()
+ try f(path) finally Utils.deleteRecursively(path)
+ }
+
+ def withTempTable(tableNames: String*)(f: => Unit): Unit = {
+ try f finally tableNames.foreach(spark.catalog.dropTempView)
+ }
+
+ def withSQLConf(pairs: (String, String)*)(f: => Unit): Unit = {
+ val (keys, values) = pairs.unzip
+ val currentValues = keys.map(key => Try(spark.conf.get(key)).toOption)
+ (keys, values).zipped.foreach(spark.conf.set)
+ try f finally {
+ keys.zip(currentValues).foreach {
+ case (key, Some(value)) => spark.conf.set(key, value)
+ case (key, None) => spark.conf.unset(key)
+ }
+ }
+ }
+
+ private def prepareTable(
+ dir: File, numRows: Int, width: Int, useStringForValue: Boolean): Unit = {
+ import spark.implicits._
+ val selectExpr = (1 to width).map(i => s"CAST(value AS STRING) c$i")
+ val valueCol = if (useStringForValue) {
+ monotonically_increasing_id().cast("string")
+ } else {
+ monotonically_increasing_id()
+ }
+ val df = spark.range(numRows).map(_ => Random.nextLong).selectExpr(selectExpr: _*)
+ .withColumn("value", valueCol)
+ .sort("value")
+
+ saveAsOrcTable(df, dir.getCanonicalPath + "/orc")
+ saveAsParquetTable(df, dir.getCanonicalPath + "/parquet")
+ }
+
+ private def prepareStringDictTable(
+ dir: File, numRows: Int, numDistinctValues: Int, width: Int): Unit = {
+ val selectExpr = (0 to width).map {
+ case 0 => s"CAST(id % $numDistinctValues AS STRING) AS value"
+ case i => s"CAST(rand() AS STRING) c$i"
+ }
+ val df = spark.range(numRows).selectExpr(selectExpr: _*).sort("value")
+
+ saveAsOrcTable(df, dir.getCanonicalPath + "/orc")
+ saveAsParquetTable(df, dir.getCanonicalPath + "/parquet")
+ }
+
+ private def saveAsOrcTable(df: DataFrame, dir: String): Unit = {
+ df.write.mode("overwrite").orc(dir)
+ spark.read.orc(dir).createOrReplaceTempView("orcTable")
+ }
+
+ private def saveAsParquetTable(df: DataFrame, dir: String): Unit = {
+ df.write.mode("overwrite").parquet(dir)
+ spark.read.parquet(dir).createOrReplaceTempView("parquetTable")
+ }
+
+ def filterPushDownBenchmark(
+ values: Int,
+ title: String,
+ whereExpr: String,
+ selectExpr: String = "*"): Unit = {
+ val benchmark = new Benchmark(title, values, minNumIters = 5)
+
+ Seq(false, true).foreach { pushDownEnabled =>
+ val name = s"Parquet Vectorized ${if (pushDownEnabled) s"(Pushdown)" else ""}"
+ benchmark.addCase(name) { _ =>
+ withSQLConf(SQLConf.PARQUET_FILTER_PUSHDOWN_ENABLED.key -> s"$pushDownEnabled") {
+ spark.sql(s"SELECT $selectExpr FROM parquetTable WHERE $whereExpr").collect()
+ }
+ }
+ }
+
+ Seq(false, true).foreach { pushDownEnabled =>
+ val name = s"Native ORC Vectorized ${if (pushDownEnabled) s"(Pushdown)" else ""}"
+ benchmark.addCase(name) { _ =>
+ withSQLConf(SQLConf.ORC_FILTER_PUSHDOWN_ENABLED.key -> s"$pushDownEnabled") {
+ spark.sql(s"SELECT $selectExpr FROM orcTable WHERE $whereExpr").collect()
+ }
+ }
+ }
+
+ /*
+ Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz
+ Select 0 string row (value IS NULL): Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
+ ------------------------------------------------------------------------------------------------
+ Parquet Vectorized 8452 / 8504 1.9 537.3 1.0X
+ Parquet Vectorized (Pushdown) 274 / 281 57.3 17.4 30.8X
+ Native ORC Vectorized 8167 / 8185 1.9 519.3 1.0X
+ Native ORC Vectorized (Pushdown) 365 / 379 43.1 23.2 23.1X
+
+
+ Select 0 string row
+ ('7864320' < value < '7864320'): Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
+ ------------------------------------------------------------------------------------------------
+ Parquet Vectorized 8532 / 8564 1.8 542.4 1.0X
+ Parquet Vectorized (Pushdown) 366 / 386 43.0 23.3 23.3X
+ Native ORC Vectorized 8289 / 8300 1.9 527.0 1.0X
+ Native ORC Vectorized (Pushdown) 378 / 385 41.6 24.0 22.6X
+
+
+ Select 1 string row (value = '7864320'): Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
+ ------------------------------------------------------------------------------------------------
+ Parquet Vectorized 8547 / 8564 1.8 543.4 1.0X
+ Parquet Vectorized (Pushdown) 351 / 356 44.9 22.3 24.4X
+ Native ORC Vectorized 8310 / 8323 1.9 528.3 1.0X
+ Native ORC Vectorized (Pushdown) 370 / 375 42.5 23.5 23.1X
+
+
+ Select 1 string row
+ (value <=> '7864320'): Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
+ ------------------------------------------------------------------------------------------------
+ Parquet Vectorized 8537 / 8563 1.8 542.8 1.0X
+ Parquet Vectorized (Pushdown) 310 / 319 50.7 19.7 27.5X
+ Native ORC Vectorized 8316 / 8335 1.9 528.7 1.0X
+ Native ORC Vectorized (Pushdown) 364 / 367 43.2 23.1 23.5X
+
+
+ Select 1 string row
+ ('7864320' <= value <= '7864320'): Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
+ ------------------------------------------------------------------------------------------------
+ Parquet Vectorized 8594 / 8607 1.8 546.4 1.0X
+ Parquet Vectorized (Pushdown) 370 / 374 42.5 23.5 23.2X
+ Native ORC Vectorized 8350 / 8358 1.9 530.9 1.0X
+ Native ORC Vectorized (Pushdown) 371 / 374 42.4 23.6 23.2X
+
+
+ Select all string rows
+ (value IS NOT NULL): Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
+ ------------------------------------------------------------------------------------------------
+ Parquet Vectorized 19601 / 19625 0.8 1246.2 1.0X
+ Parquet Vectorized (Pushdown) 19698 / 19703 0.8 1252.3 1.0X
+ Native ORC Vectorized 19435 / 19470 0.8 1235.6 1.0X
+ Native ORC Vectorized (Pushdown) 19568 / 19590 0.8 1244.1 1.0X
+
+
+ Select 0 int row (value IS NULL): Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
+ ------------------------------------------------------------------------------------------------
+ Parquet Vectorized 7815 / 7824 2.0 496.9 1.0X
+ Parquet Vectorized (Pushdown) 245 / 251 64.2 15.6 31.9X
+ Native ORC Vectorized 7436 / 7460 2.1 472.8 1.1X
+ Native ORC Vectorized (Pushdown) 344 / 351 45.7 21.9 22.7X
+
+
+ Select 0 int row
+ (7864320 < value < 7864320): Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
+ ------------------------------------------------------------------------------------------------
+ Parquet Vectorized 7792 / 7807 2.0 495.4 1.0X
+ Parquet Vectorized (Pushdown) 349 / 353 45.1 22.2 22.3X
+ Native ORC Vectorized 7451 / 7465 2.1 473.7 1.0X
+ Native ORC Vectorized (Pushdown) 365 / 368 43.0 23.2 21.3X
+
+
+ Select 1 int row (value = 7864320): Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
+ ------------------------------------------------------------------------------------------------
+ Parquet Vectorized 7836 / 7872 2.0 498.2 1.0X
+ Parquet Vectorized (Pushdown) 322 / 327 48.8 20.5 24.3X
+ Native ORC Vectorized 7533 / 7540 2.1 478.9 1.0X
+ Native ORC Vectorized (Pushdown) 358 / 363 43.9 22.8 21.9X
+
+
+ Select 1 int row (value <=> 7864320): Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
+ ------------------------------------------------------------------------------------------------
+ Parquet Vectorized 7855 / 7870 2.0 499.4 1.0X
+ Parquet Vectorized (Pushdown) 286 / 297 54.9 18.2 27.4X
+ Native ORC Vectorized 7511 / 7557 2.1 477.5 1.0X
+ Native ORC Vectorized (Pushdown) 358 / 361 43.9 22.8 21.9X
+
+
+ Select 1 int row
+ (7864320 <= value <= 7864320): Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
+ ------------------------------------------------------------------------------------------------
+ Parquet Vectorized 7851 / 7870 2.0 499.2 1.0X
+ Parquet Vectorized (Pushdown) 345 / 347 45.6 21.9 22.8X
+ Native ORC Vectorized 7543 / 7554 2.1 479.6 1.0X
+ Native ORC Vectorized (Pushdown) 364 / 374 43.2 23.1 21.6X
+
+
+ Select 1 int row
+ (7864319 < value < 7864321): Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
+ ------------------------------------------------------------------------------------------------
+ Parquet Vectorized 7837 / 7840 2.0 498.2 1.0X
+ Parquet Vectorized (Pushdown) 338 / 339 46.6 21.5 23.2X
+ Native ORC Vectorized 7524 / 7541 2.1 478.3 1.0X
+ Native ORC Vectorized (Pushdown) 361 / 364 43.6 22.9 21.7X
+
+
+ Select 10% int rows (value < 1572864): Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
+ ------------------------------------------------------------------------------------------------
+ Parquet Vectorized 8864 / 8900 1.8 563.5 1.0X
+ Parquet Vectorized (Pushdown) 2088 / 2095 7.5 132.7 4.2X
+ Native ORC Vectorized 8562 / 8579 1.8 544.3 1.0X
+ Native ORC Vectorized (Pushdown) 2127 / 2131 7.4 135.2 4.2X
+
+
+ Select 50% int rows (value < 7864320): Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
+ ------------------------------------------------------------------------------------------------
+ Parquet Vectorized 12671 / 12684 1.2 805.6 1.0X
+ Parquet Vectorized (Pushdown) 9032 / 9041 1.7 574.2 1.4X
+ Native ORC Vectorized 12388 / 12411 1.3 787.6 1.0X
+ Native ORC Vectorized (Pushdown) 8873 / 8884 1.8 564.1 1.4X
+
+
+ Select 90% int rows (value < 14155776): Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
+ ------------------------------------------------------------------------------------------------
+ Parquet Vectorized 16481 / 16495 1.0 1047.8 1.0X
+ Parquet Vectorized (Pushdown) 15906 / 15919 1.0 1011.3 1.0X
+ Native ORC Vectorized 16224 / 16254 1.0 1031.5 1.0X
+ Native ORC Vectorized (Pushdown) 15632 / 15661 1.0 993.9 1.1X
+
+
+ Select all int rows (value IS NOT NULL): Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
+ ------------------------------------------------------------------------------------------------
+ Parquet Vectorized 17341 / 17354 0.9 1102.5 1.0X
+ Parquet Vectorized (Pushdown) 17463 / 17481 0.9 1110.2 1.0X
+ Native ORC Vectorized 17073 / 17089 0.9 1085.4 1.0X
+ Native ORC Vectorized (Pushdown) 17194 / 17232 0.9 1093.2 1.0X
+
+
+ Select all int rows (value > -1): Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
+ ------------------------------------------------------------------------------------------------
+ Parquet Vectorized 17452 / 17467 0.9 1109.6 1.0X
+ Parquet Vectorized (Pushdown) 17613 / 17630 0.9 1119.8 1.0X
+ Native ORC Vectorized 17259 / 17271 0.9 1097.3 1.0X
+ Native ORC Vectorized (Pushdown) 17385 / 17429 0.9 1105.3 1.0X
+
+
+ Select all int rows (value != -1): Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
+ ------------------------------------------------------------------------------------------------
+ Parquet Vectorized 17363 / 17372 0.9 1103.9 1.0X
+ Parquet Vectorized (Pushdown) 17526 / 17535 0.9 1114.2 1.0X
+ Native ORC Vectorized 17052 / 17089 0.9 1084.2 1.0X
+ Native ORC Vectorized (Pushdown) 17209 / 17229 0.9 1094.1 1.0X
+
+
+ Select 0 distinct string row
+ (value IS NULL): Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
+ ------------------------------------------------------------------------------------------------
+ Parquet Vectorized 7697 / 7751 2.0 489.4 1.0X
+ Parquet Vectorized (Pushdown) 264 / 284 59.5 16.8 29.1X
+ Native ORC Vectorized 6942 / 6970 2.3 441.4 1.1X
+ Native ORC Vectorized (Pushdown) 372 / 381 42.3 23.7 20.7X
+
+
+ Select 0 distinct string row
+ ('100' < value < '100'): Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
+ ------------------------------------------------------------------------------------------------
+ Parquet Vectorized 7983 / 8018 2.0 507.5 1.0X
+ Parquet Vectorized (Pushdown) 334 / 337 47.0 21.3 23.9X
+ Native ORC Vectorized 7307 / 7313 2.2 464.5 1.1X
+ Native ORC Vectorized (Pushdown) 363 / 371 43.3 23.1 22.0X
+
+
+ Select 1 distinct string row
+ (value = '100'): Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
+ ------------------------------------------------------------------------------------------------
+ Parquet Vectorized 7882 / 7915 2.0 501.1 1.0X
+ Parquet Vectorized (Pushdown) 504 / 522 31.2 32.1 15.6X
+ Native ORC Vectorized 7143 / 7155 2.2 454.1 1.1X
+ Native ORC Vectorized (Pushdown) 555 / 573 28.4 35.3 14.2X
+
+
+ Select 1 distinct string row
+ (value <=> '100'): Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
+ ------------------------------------------------------------------------------------------------
+ Parquet Vectorized 7898 / 7912 2.0 502.1 1.0X
+ Parquet Vectorized (Pushdown) 470 / 481 33.5 29.9 16.8X
+ Native ORC Vectorized 7135 / 7149 2.2 453.6 1.1X
+ Native ORC Vectorized (Pushdown) 552 / 557 28.5 35.1 14.3X
+
+
+ Select 1 distinct string row
+ ('100' <= value <= '100'): Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
+ ------------------------------------------------------------------------------------------------
+ Parquet Vectorized 8189 / 8213 1.9 520.7 1.0X
+ Parquet Vectorized (Pushdown) 527 / 534 29.9 33.5 15.5X
+ Native ORC Vectorized 7477 / 7498 2.1 475.3 1.1X
+ Native ORC Vectorized (Pushdown) 558 / 566 28.2 35.5 14.7X
+
+
+ Select all distinct string rows
+ (value IS NOT NULL): Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
+ ------------------------------------------------------------------------------------------------
+ Parquet Vectorized 19462 / 19476 0.8 1237.4 1.0X
+ Parquet Vectorized (Pushdown) 19570 / 19582 0.8 1244.2 1.0X
+ Native ORC Vectorized 18577 / 18604 0.8 1181.1 1.0X
+ Native ORC Vectorized (Pushdown) 18701 / 18742 0.8 1189.0 1.0X
+ */
+ benchmark.run()
+ }
+
+ private def runIntBenchmark(numRows: Int, width: Int, mid: Int): Unit = {
+ Seq("value IS NULL", s"$mid < value AND value < $mid").foreach { whereExpr =>
+ val title = s"Select 0 int row ($whereExpr)".replace("value AND value", "value")
+ filterPushDownBenchmark(numRows, title, whereExpr)
+ }
+
+ Seq(
+ s"value = $mid",
+ s"value <=> $mid",
+ s"$mid <= value AND value <= $mid",
+ s"${mid - 1} < value AND value < ${mid + 1}"
+ ).foreach { whereExpr =>
+ val title = s"Select 1 int row ($whereExpr)".replace("value AND value", "value")
+ filterPushDownBenchmark(numRows, title, whereExpr)
+ }
+
+ val selectExpr = (1 to width).map(i => s"MAX(c$i)").mkString("", ",", ", MAX(value)")
+
+ Seq(10, 50, 90).foreach { percent =>
+ filterPushDownBenchmark(
+ numRows,
+ s"Select $percent% int rows (value < ${numRows * percent / 100})",
+ s"value < ${numRows * percent / 100}",
+ selectExpr
+ )
+ }
+
+ Seq("value IS NOT NULL", "value > -1", "value != -1").foreach { whereExpr =>
+ filterPushDownBenchmark(
+ numRows,
+ s"Select all int rows ($whereExpr)",
+ whereExpr,
+ selectExpr)
+ }
+ }
+
+ private def runStringBenchmark(
+ numRows: Int, width: Int, searchValue: Int, colType: String): Unit = {
+ Seq("value IS NULL", s"'$searchValue' < value AND value < '$searchValue'")
+ .foreach { whereExpr =>
+ val title = s"Select 0 $colType row ($whereExpr)".replace("value AND value", "value")
+ filterPushDownBenchmark(numRows, title, whereExpr)
+ }
+
+ Seq(
+ s"value = '$searchValue'",
+ s"value <=> '$searchValue'",
+ s"'$searchValue' <= value AND value <= '$searchValue'"
+ ).foreach { whereExpr =>
+ val title = s"Select 1 $colType row ($whereExpr)".replace("value AND value", "value")
+ filterPushDownBenchmark(numRows, title, whereExpr)
+ }
+
+ val selectExpr = (1 to width).map(i => s"MAX(c$i)").mkString("", ",", ", MAX(value)")
+
+ Seq("value IS NOT NULL").foreach { whereExpr =>
+ filterPushDownBenchmark(
+ numRows,
+ s"Select all $colType rows ($whereExpr)",
+ whereExpr,
+ selectExpr)
+ }
+ }
+
+ def main(args: Array[String]): Unit = {
+ val numRows = 1024 * 1024 * 15
+ val width = 5
+
+ // Pushdown for many distinct value case
+ withTempPath { dir =>
+ val mid = numRows / 2
+
+ withTempTable("orcTable", "parquetTable") {
+ Seq(true, false).foreach { useStringForValue =>
+ prepareTable(dir, numRows, width, useStringForValue)
+ if (useStringForValue) {
+ runStringBenchmark(numRows, width, mid, "string")
+ } else {
+ runIntBenchmark(numRows, width, mid)
+ }
+ }
+ }
+ }
+
+ // Pushdown for few distinct value case (use dictionary encoding)
--- End diff --
I feel it'd be better to set the option to 1.0 for safety, too.
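As a reading aid for the quoted `runIntBenchmark`, the sketch below reproduces how the `WHERE` expressions and case titles are derived, without needing a Spark session. `PredicateTitles` and its helper names are hypothetical; only the string templates and `numRows` are copied from the diff.

```scala
// Hypothetical standalone sketch; only the string templates come from the diff.
object PredicateTitles {
  val numRows: Int = 1024 * 1024 * 15
  val mid: Int = numRows / 2

  // Titles shorten "a < value AND value < b" to "a < value < b".
  def title(prefix: String, whereExpr: String): String =
    s"$prefix ($whereExpr)".replace("value AND value", "value")

  // Upper bound used for the "Select N% int rows" cases.
  def threshold(percent: Int): Int = numRows * percent / 100

  def main(args: Array[String]): Unit = {
    println(title("Select 0 int row", s"$mid < value AND value < $mid"))
    println(title("Select 1 int row", s"${mid - 1} < value AND value < ${mid + 1}"))
    Seq(10, 50, 90).foreach(p => println(s"value < ${threshold(p)}"))
  }
}
```

The thresholds come out as 1572864, 7864320 and 14155776, which matches the titles in the quoted results. The intermediate product `numRows * percent` stays within `Int` range for this `numRows`; a much larger row count would need `Long` arithmetic.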
---
[GitHub] spark issue #21288: [SPARK-24206][SQL] Improve FilterPushdownBenchmark bench...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/21288
**[Test build #91795 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91795/testReport)** for PR 21288 at commit [`d41e689`](https://github.com/apache/spark/commit/d41e68914e00a7ba6734b3fdbe839b130fbbd42e).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
---
[GitHub] spark issue #21288: [SPARK-24206][SQL] Improve FilterPushdownBenchmark bench...
Posted by maropu <gi...@git.apache.org>.
Github user maropu commented on the issue:
https://github.com/apache/spark/pull/21288
retest this please
---
[GitHub] spark issue #21288: [SPARK-24206][SQL] Improve FilterPushdownBenchmark bench...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/21288
Merged build finished. Test FAILed.
---
[GitHub] spark issue #21288: [SPARK-24206][SQL] Improve FilterPushdownBenchmark bench...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/21288
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/3418/
Test PASSed.
---
[GitHub] spark issue #21288: [SPARK-24206][SQL] Improve FilterPushdownBenchmark bench...
Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the issue:
https://github.com/apache/spark/pull/21288
LGTM
Thanks! Merged to master.
---
[GitHub] spark issue #21288: [SPARK-24206][SQL] Improve FilterPushdownBenchmark bench...
Posted by dongjoon-hyun <gi...@git.apache.org>.
Github user dongjoon-hyun commented on the issue:
https://github.com/apache/spark/pull/21288
@gatorsmile and @maropu . I really appreciate this effort. Thanks.
Since this is a cloud benchmark, I have one thing to recommend. Can we use `r3.xlarge` for all benchmarks **consistently**? As we know, it's difficult to compare the result from different machines.
There are three reasons.
1. `r3.xlarge` is cheaper than `m4.2xlarge`.
2. The previous benchmark results came from a MacBook (SSD). `r3.xlarge` also provides SSD.
3. `r3.xlarge` is used at [Databricks TPCDS benchmark](https://databricks.com/blog/2017/07/12/benchmarking-big-data-sql-platforms-in-the-cloud.html), too.
The following is the result on `r3.xlarge`. I launched the machine, built this PR on the latest master, and ran `bin/spark-submit --master local[1] --driver-memory 10G --conf spark.ui.enabled=false --class org.apache.spark.sql.execution.benchmark.FilterPushdownBenchmark sql/core/target/scala-2.11/spark-sql_2.11-2.0-SNAPSHOT-tests.jar`. (There is no hadoop installation. I guess @maropu also does.)
```
OpenJDK 64-Bit Server VM 1.8.0_171-b10 on Linux 3.10.0-693.5.2.el7.x86_64
Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
Select 0 string row (value IS NULL):                    Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
---------------------------------------------------------------------------------------------------------------
Parquet Vectorized                                            9133 / 9275          1.7         580.6       1.0X
Parquet Vectorized (Pushdown)                                    85 / 100        185.2           5.4     107.6X
Native ORC Vectorized                                         8760 / 8843          1.8         556.9       1.0X
Native ORC Vectorized (Pushdown)                                115 / 130        136.4           7.3      79.2X

OpenJDK 64-Bit Server VM 1.8.0_171-b10 on Linux 3.10.0-693.5.2.el7.x86_64
Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
Select 0 string row ('7864320' < value < '7864320'):    Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
---------------------------------------------------------------------------------------------------------------
Parquet Vectorized                                            9254 / 9276          1.7         588.4       1.0X
Parquet Vectorized (Pushdown)                                   912 / 922         17.2          58.0      10.1X
Native ORC Vectorized                                         8966 / 9013          1.8         570.1       1.0X
Native ORC Vectorized (Pushdown)                                254 / 276         61.8          16.2      36.4X

OpenJDK 64-Bit Server VM 1.8.0_171-b10 on Linux 3.10.0-693.5.2.el7.x86_64
Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
Select 1 string row (value = '7864320'):                Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
---------------------------------------------------------------------------------------------------------------
Parquet Vectorized                                            9106 / 9136          1.7         578.9       1.0X
Parquet Vectorized (Pushdown)                                   897 / 910         17.5          57.0      10.2X
Native ORC Vectorized                                         8846 / 8889          1.8         562.4       1.0X
Native ORC Vectorized (Pushdown)                                254 / 267         61.9          16.2      35.8X

OpenJDK 64-Bit Server VM 1.8.0_171-b10 on Linux 3.10.0-693.5.2.el7.x86_64
Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
Select 1 string row (value <=> '7864320'):              Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
---------------------------------------------------------------------------------------------------------------
Parquet Vectorized                                            9095 / 9124          1.7         578.3       1.0X
Parquet Vectorized (Pushdown)                                   891 / 899         17.7          56.6      10.2X
Native ORC Vectorized                                         8853 / 8941          1.8         562.8       1.0X
Native ORC Vectorized (Pushdown)                                246 / 254         64.0          15.6      37.0X

OpenJDK 64-Bit Server VM 1.8.0_171-b10 on Linux 3.10.0-693.5.2.el7.x86_64
Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
Select 1 string row ('7864320' <= value <= '7864320'):  Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
---------------------------------------------------------------------------------------------------------------
Parquet Vectorized                                            9236 / 9273          1.7         587.2       1.0X
Parquet Vectorized (Pushdown)                                   902 / 910         17.4          57.4      10.2X
Native ORC Vectorized                                         8944 / 8965          1.8         568.6       1.0X
Native ORC Vectorized (Pushdown)                                248 / 262         63.4          15.8      37.2X

OpenJDK 64-Bit Server VM 1.8.0_171-b10 on Linux 3.10.0-693.5.2.el7.x86_64
Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
Select all string rows (value IS NOT NULL):             Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
---------------------------------------------------------------------------------------------------------------
Parquet Vectorized                                          20309 / 20381          0.8        1291.2       1.0X
Parquet Vectorized (Pushdown)                               20437 / 20477          0.8        1299.3       1.0X
Native ORC Vectorized                                       24929 / 24999          0.6        1585.0       0.8X
Native ORC Vectorized (Pushdown)                            24918 / 25040          0.6        1584.3       0.8X
```
As you can see, this result is more consistent with the previous one and differs from the numbers in this PR. Actually, I was reluctant to say this, but we had better have a standard way to generate benchmark results on the cloud. If possible, I'd like to use `r3.xlarge`.
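For readers comparing tables across machines, here is a minimal sketch of how the derived columns relate to the best time. It assumes `Rate(M/s)`, `Per Row(ns)`, and `Relative` are all computed from the best time and the row count; `BenchmarkColumns` and its methods are hypothetical names for illustration, not part of Spark.

```scala
// Hypothetical helper, not part of Spark: recomputes the derived columns of a
// benchmark table from the row count and the best time in milliseconds.
object BenchmarkColumns {
  // Rows per microsecond, which equals millions of rows per second.
  def rateMillionsPerSec(numRows: Long, bestMs: Double): Double =
    numRows / (bestMs * 1000.0)

  // Time spent per row, in nanoseconds.
  def perRowNs(numRows: Long, bestMs: Double): Double =
    bestMs * 1e6 / numRows

  // Speed-up of a case relative to the baseline (first) case of the table.
  def relative(baselineBestMs: Double, caseBestMs: Double): Double =
    baselineBestMs / caseBestMs

  def main(args: Array[String]): Unit = {
    val numRows = 1024L * 1024 * 15 // same numRows as the benchmark's main()
    // Parquet rows of "Select 0 string row (value IS NULL)": 9133 ms vs. 85 ms.
    println(f"rate:     ${rateMillionsPerSec(numRows, 9133)}%.1f M/s")
    println(f"per row:  ${perRowNs(numRows, 9133)}%.1f ns")
    println(f"relative: ${relative(9133, 85)}%.1fX")
  }
}
```

The recomputed values land close to the printed ones but need not match in the last digit, presumably because the table shows the times rounded to whole milliseconds.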
---
[GitHub] spark issue #21288: [SPARK-24206][SQL] Improve FilterPushdownBenchmark bench...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/21288
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/91795/
Test PASSed.
---
[GitHub] spark issue #21288: [SPARK-24206][SQL] Improve FilterPushdownBenchmark bench...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/21288
**[Test build #91228 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91228/testReport)** for PR 21288 at commit [`d41e689`](https://github.com/apache/spark/commit/d41e68914e00a7ba6734b3fdbe839b130fbbd42e).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
---
[GitHub] spark issue #21288: [SPARK-24206][SQL] Improve FilterPushdownBenchmark bench...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/21288
**[Test build #91219 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91219/testReport)** for PR 21288 at commit [`2c0d5cb`](https://github.com/apache/spark/commit/2c0d5cbf51268540653543b96de135a6923c6cef).
* This patch **fails Spark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.
---
[GitHub] spark pull request #21288: [SPARK-24206][SQL] Improve FilterPushdownBenchmar...
Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:
https://github.com/apache/spark/pull/21288#discussion_r191650442
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/FilterPushdownBenchmark.scala ---
@@ -131,211 +132,214 @@ object FilterPushdownBenchmark {
}
/*
+ OpenJDK 64-Bit Server VM 1.8.0_171-b10 on Linux 4.14.26-46.32.amzn1.x86_64
Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz
Select 0 string row (value IS NULL): Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------
- Parquet Vectorized 8452 / 8504 1.9 537.3 1.0X
- Parquet Vectorized (Pushdown) 274 / 281 57.3 17.4 30.8X
- Native ORC Vectorized 8167 / 8185 1.9 519.3 1.0X
- Native ORC Vectorized (Pushdown) 365 / 379 43.1 23.2 23.1X
+ Parquet Vectorized 2961 / 3123 5.3 188.3 1.0X
+ Parquet Vectorized (Pushdown) 3057 / 3121 5.1 194.4 1.0X
--- End diff --
Is it a regression?
---
[GitHub] spark issue #21288: [SPARK-24206][SQL] Improve FilterPushdownBenchmark bench...
Posted by dongjoon-hyun <gi...@git.apache.org>.
Github user dongjoon-hyun commented on the issue:
https://github.com/apache/spark/pull/21288
Yep. Thank you for progressing this, @maropu !
---
[GitHub] spark pull request #21288: [SPARK-24206][SQL] Improve FilterPushdownBenchmar...
Posted by dongjoon-hyun <gi...@git.apache.org>.
Github user dongjoon-hyun commented on a diff in the pull request:
https://github.com/apache/spark/pull/21288#discussion_r189158065
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/FilterPushdownBenchmark.scala ---
@@ -105,138 +128,306 @@ object FilterPushdownBenchmark {
}
/*
- Java HotSpot(TM) 64-Bit Server VM 1.8.0_152-b16 on Mac OS X 10.13.2
- Intel(R) Core(TM) i7-4770HQ CPU @ 2.20GHz
-
- Select 0 row (id IS NULL): Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
- -----------------------------------------------------------------------------------------------
- Parquet Vectorized 7882 / 7957 2.0 501.1 1.0X
- Parquet Vectorized (Pushdown) 55 / 60 285.2 3.5 142.9X
- Native ORC Vectorized 5592 / 5627 2.8 355.5 1.4X
- Native ORC Vectorized (Pushdown) 66 / 70 237.2 4.2 118.9X
-
- Select 0 row (7864320 < id < 7864320): Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
- -----------------------------------------------------------------------------------------------
- Parquet Vectorized 7884 / 7909 2.0 501.2 1.0X
- Parquet Vectorized (Pushdown) 739 / 752 21.3 47.0 10.7X
- Native ORC Vectorized 5614 / 5646 2.8 356.9 1.4X
- Native ORC Vectorized (Pushdown) 81 / 83 195.2 5.1 97.8X
-
- Select 1 row (id = 7864320): Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
- -----------------------------------------------------------------------------------------------
- Parquet Vectorized 7905 / 8027 2.0 502.6 1.0X
- Parquet Vectorized (Pushdown) 740 / 766 21.2 47.1 10.7X
- Native ORC Vectorized 5684 / 5738 2.8 361.4 1.4X
- Native ORC Vectorized (Pushdown) 78 / 81 202.4 4.9 101.7X
-
- Select 1 row (id <=> 7864320): Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
- -----------------------------------------------------------------------------------------------
- Parquet Vectorized 7928 / 7993 2.0 504.1 1.0X
- Parquet Vectorized (Pushdown) 747 / 772 21.0 47.5 10.6X
- Native ORC Vectorized 5728 / 5753 2.7 364.2 1.4X
- Native ORC Vectorized (Pushdown) 76 / 78 207.9 4.8 104.8X
-
- Select 1 row (7864320 <= id <= 7864320):Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
- -----------------------------------------------------------------------------------------------
- Parquet Vectorized 7939 / 8021 2.0 504.8 1.0X
- Parquet Vectorized (Pushdown) 746 / 770 21.1 47.4 10.6X
- Native ORC Vectorized 5690 / 5734 2.8 361.7 1.4X
- Native ORC Vectorized (Pushdown) 76 / 79 206.7 4.8 104.3X
-
- Select 1 row (7864319 < id < 7864321): Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
- -----------------------------------------------------------------------------------------------
- Parquet Vectorized 7972 / 8019 2.0 506.9 1.0X
- Parquet Vectorized (Pushdown) 742 / 764 21.2 47.2 10.7X
- Native ORC Vectorized 5704 / 5743 2.8 362.6 1.4X
- Native ORC Vectorized (Pushdown) 76 / 78 207.9 4.8 105.4X
-
- Select 10% rows (id < 1572864): Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
- -----------------------------------------------------------------------------------------------
- Parquet Vectorized 8733 / 8808 1.8 555.2 1.0X
- Parquet Vectorized (Pushdown) 2213 / 2267 7.1 140.7 3.9X
- Native ORC Vectorized 6420 / 6463 2.4 408.2 1.4X
- Native ORC Vectorized (Pushdown) 1313 / 1331 12.0 83.5 6.7X
-
- Select 50% rows (id < 7864320): Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
- -----------------------------------------------------------------------------------------------
- Parquet Vectorized 11518 / 11591 1.4 732.3 1.0X
- Parquet Vectorized (Pushdown) 7962 / 7991 2.0 506.2 1.4X
- Native ORC Vectorized 8927 / 8985 1.8 567.6 1.3X
- Native ORC Vectorized (Pushdown) 6102 / 6160 2.6 387.9 1.9X
-
- Select 90% rows (id < 14155776): Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
- -----------------------------------------------------------------------------------------------
- Parquet Vectorized 14255 / 14389 1.1 906.3 1.0X
- Parquet Vectorized (Pushdown) 13564 / 13594 1.2 862.4 1.1X
- Native ORC Vectorized 11442 / 11608 1.4 727.5 1.2X
- Native ORC Vectorized (Pushdown) 10991 / 11029 1.4 698.8 1.3X
-
- Select all rows (id IS NOT NULL): Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
- -----------------------------------------------------------------------------------------------
- Parquet Vectorized 14917 / 14938 1.1 948.4 1.0X
- Parquet Vectorized (Pushdown) 14910 / 14964 1.1 948.0 1.0X
- Native ORC Vectorized 11986 / 12069 1.3 762.0 1.2X
- Native ORC Vectorized (Pushdown) 12037 / 12123 1.3 765.3 1.2X
-
- Select all rows (id > -1): Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
- -----------------------------------------------------------------------------------------------
- Parquet Vectorized 14951 / 14976 1.1 950.6 1.0X
- Parquet Vectorized (Pushdown) 14934 / 15016 1.1 949.5 1.0X
- Native ORC Vectorized 12000 / 12156 1.3 763.0 1.2X
- Native ORC Vectorized (Pushdown) 12079 / 12113 1.3 767.9 1.2X
-
- Select all rows (id != -1): Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
- -----------------------------------------------------------------------------------------------
- Parquet Vectorized 14930 / 14972 1.1 949.3 1.0X
- Parquet Vectorized (Pushdown) 15015 / 15047 1.0 954.6 1.0X
- Native ORC Vectorized 12090 / 12259 1.3 768.7 1.2X
- Native ORC Vectorized (Pushdown) 12021 / 12096 1.3 764.2 1.2X
+ Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz
--- End diff --
Thanks!
---
[GitHub] spark issue #21288: [SPARK-24206][SQL] Improve FilterPushdownBenchmark bench...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/21288
Merged build finished. Test PASSed.
---
[GitHub] spark issue #21288: [SPARK-24206][SQL] Improve FilterPushdownBenchmark bench...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/21288
Merged build finished. Test FAILed.
---
[GitHub] spark issue #21288: [SPARK-24206][SQL] Improve FilterPushdownBenchmark bench...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/21288
**[Test build #91946 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91946/testReport)** for PR 21288 at commit [`4a9cec9`](https://github.com/apache/spark/commit/4a9cec91f9446161d4dde0cac20ccdccb9a112e7).
---
[GitHub] spark issue #21288: [SPARK-24206][SQL] Improve FilterPushdownBenchmark bench...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/21288
**[Test build #91210 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91210/testReport)** for PR 21288 at commit [`b7859ed`](https://github.com/apache/spark/commit/b7859ed0905ce3e0476e5d327f65798acc7aba8c).
* This patch **fails due to an unknown error code, -9**.
* This patch merges cleanly.
* This patch adds no public classes.
---
[GitHub] spark issue #21288: [SPARK-24206][SQL] Improve FilterPushdownBenchmark bench...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/21288
Merged build finished. Test PASSed.
---
[GitHub] spark issue #21288: [SPARK-24206][SQL] Improve FilterPushdownBenchmark bench...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/21288
**[Test build #90440 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90440/testReport)** for PR 21288 at commit [`223bf20`](https://github.com/apache/spark/commit/223bf2008abfe5fd41c3b5e741dc525ab3864977).
---
[GitHub] spark issue #21288: [SPARK-24206][SQL] Improve FilterPushdownBenchmark bench...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/21288
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/191/
Test PASSed.
---
[GitHub] spark pull request #21288: [SPARK-24206][SQL] Improve FilterPushdownBenchmar...
Posted by maropu <gi...@git.apache.org>.
Github user maropu commented on a diff in the pull request:
https://github.com/apache/spark/pull/21288#discussion_r191620766
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/FilterPushdownBenchmark.scala ---
@@ -131,211 +132,214 @@ object FilterPushdownBenchmark {
}
/*
+ OpenJDK 64-Bit Server VM 1.8.0_171-b10 on Linux 4.14.26-46.32.amzn1.x86_64
Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz
Select 0 string row (value IS NULL): Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------
- Parquet Vectorized 8452 / 8504 1.9 537.3 1.0X
- Parquet Vectorized (Pushdown) 274 / 281 57.3 17.4 30.8X
- Native ORC Vectorized 8167 / 8185 1.9 519.3 1.0X
- Native ORC Vectorized (Pushdown) 365 / 379 43.1 23.2 23.1X
+ Parquet Vectorized 2961 / 3123 5.3 188.3 1.0X
+ Parquet Vectorized (Pushdown) 3057 / 3121 5.1 194.4 1.0X
--- End diff --
That might be, but the change looks too big to me... I suspect I made some mistake in the last benchmark runs, though I haven't found the cause yet.
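To make the size of the change concrete, here is a minimal sketch that compares the pushdown speed-up of the two Parquet measurements quoted in the diff above. `PushdownSpeedup` and `Run` are hypothetical names for illustration only; the numbers are the best times in milliseconds from the removed and added lines.

```scala
// Hypothetical helper, not part of Spark: compares the pushdown speed-up of two runs.
object PushdownSpeedup {
  final case class Run(label: String, baselineBestMs: Double, pushdownBestMs: Double) {
    // How many times faster the pushdown case is than the non-pushdown case.
    def speedup: Double = baselineBestMs / pushdownBestMs
  }

  def main(args: Array[String]): Unit = {
    // Parquet rows of "Select 0 string row (value IS NULL)" from the diff.
    val previous = Run("previous", 8452, 274)
    val updated = Run("updated", 2961, 3057)
    Seq(previous, updated).foreach { r =>
      println(f"${r.label}%-8s speed-up: ${r.speedup}%.1fX")
    }
  }
}
```

A query that selects zero rows is where pushdown would be expected to help most, so a speed-up near 1.0X in the updated numbers may point to a problem with that run rather than a plain slowdown.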
---
[GitHub] spark issue #21288: [SPARK-24206][SQL] Improve FilterPushdownBenchmark bench...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/21288
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/3095/
Test FAILed.
---
[GitHub] spark issue #21288: [SPARK-24206][SQL] Improve FilterPushdownBenchmark bench...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/21288
Merged build finished. Test FAILed.
---
[GitHub] spark issue #21288: [SPARK-24206][SQL] Improve FilterPushdownBenchmark bench...
Posted by maropu <gi...@git.apache.org>.
Github user maropu commented on the issue:
https://github.com/apache/spark/pull/21288
Yea, I agree; we'd better run the benchmarks on the same machine type.
I'll re-run the benchmark on `r3.xlarge` to check whether I get the same result.
> There is no hadoop installation. I guess @maropu also does
Yea, I had no Hadoop installation either.
---
[GitHub] spark pull request #21288: [SPARK-24206][SQL] Improve FilterPushdownBenchmar...
Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:
https://github.com/apache/spark/pull/21288#discussion_r191280132
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/FilterPushdownBenchmark.scala ---
@@ -131,211 +132,214 @@ object FilterPushdownBenchmark {
}
/*
+ OpenJDK 64-Bit Server VM 1.8.0_171-b10 on Linux 4.14.26-46.32.amzn1.x86_64
Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz
Select 0 string row (value IS NULL): Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------
- Parquet Vectorized 8452 / 8504 1.9 537.3 1.0X
- Parquet Vectorized (Pushdown) 274 / 281 57.3 17.4 30.8X
- Native ORC Vectorized 8167 / 8185 1.9 519.3 1.0X
- Native ORC Vectorized (Pushdown) 365 / 379 43.1 23.2 23.1X
+ Parquet Vectorized 2961 / 3123 5.3 188.3 1.0X
+ Parquet Vectorized (Pushdown) 3057 / 3121 5.1 194.4 1.0X
--- End diff --
The difference is huge. What happened?
---
[GitHub] spark issue #21288: [SPARK-24206][SQL] Improve FilterPushdownBenchmark bench...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/21288
Merged build finished. Test FAILed.
---
[GitHub] spark issue #21288: [SPARK-24206][SQL] Improve FilterPushdownBenchmark bench...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/21288
**[Test build #91815 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91815/testReport)** for PR 21288 at commit [`fa53156`](https://github.com/apache/spark/commit/fa53156599812adc94f089b8c163224fb2e4935f).
* This patch **fails Scala style tests**.
* This patch merges cleanly.
* This patch adds no public classes.
---
[GitHub] spark issue #21288: [SPARK-24206][SQL] Improve FilterPushdownBenchmark bench...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/21288
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/3191/
Test PASSed.
---
[GitHub] spark issue #21288: [SPARK-24206][SQL] Improve FilterPushdownBenchmark bench...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/21288
Merged build finished. Test PASSed.
---
[GitHub] spark pull request #21288: [SPARK-24206][SQL] Improve FilterPushdownBenchmar...
Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:
https://github.com/apache/spark/pull/21288#discussion_r189635143
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/FilterPushdownBenchmark.scala ---
@@ -0,0 +1,437 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.benchmark
+
+import java.io.File
+
+import scala.util.{Random, Try}
+
+import org.apache.spark.SparkConf
+import org.apache.spark.sql.{DataFrame, SparkSession}
+import org.apache.spark.sql.functions.monotonically_increasing_id
+import org.apache.spark.sql.internal.SQLConf
+import org.apache.spark.util.{Benchmark, Utils}
+
+
+/**
+ * Benchmark to measure read performance with Filter pushdown.
+ * To run this:
+ * spark-submit --class <this class> <spark sql test jar>
+ */
+object FilterPushdownBenchmark {
+ val conf = new SparkConf()
+ .setAppName("FilterPushdownBenchmark")
+ .setIfMissing("spark.master", "local[1]")
+ .setIfMissing("spark.driver.memory", "3g")
+ .setIfMissing("spark.executor.memory", "3g")
+ .setIfMissing("orc.compression", "snappy")
+ .setIfMissing("spark.sql.parquet.compression.codec", "snappy")
+
+ private val spark = SparkSession.builder().config(conf).getOrCreate()
+
+ def withTempPath(f: File => Unit): Unit = {
+ val path = Utils.createTempDir()
+ path.delete()
+ try f(path) finally Utils.deleteRecursively(path)
+ }
+
+ def withTempTable(tableNames: String*)(f: => Unit): Unit = {
+ try f finally tableNames.foreach(spark.catalog.dropTempView)
+ }
+
+ def withSQLConf(pairs: (String, String)*)(f: => Unit): Unit = {
+ val (keys, values) = pairs.unzip
+ val currentValues = keys.map(key => Try(spark.conf.get(key)).toOption)
+ (keys, values).zipped.foreach(spark.conf.set)
+ try f finally {
+ keys.zip(currentValues).foreach {
+ case (key, Some(value)) => spark.conf.set(key, value)
+ case (key, None) => spark.conf.unset(key)
+ }
+ }
+ }
+
+ private def prepareTable(
+ dir: File, numRows: Int, width: Int, useStringForValue: Boolean): Unit = {
+ import spark.implicits._
+ val selectExpr = (1 to width).map(i => s"CAST(value AS STRING) c$i")
+ val valueCol = if (useStringForValue) {
+ monotonically_increasing_id().cast("string")
+ } else {
+ monotonically_increasing_id()
+ }
+ val df = spark.range(numRows).map(_ => Random.nextLong).selectExpr(selectExpr: _*)
+ .withColumn("value", valueCol)
+ .sort("value")
+
+ saveAsOrcTable(df, dir.getCanonicalPath + "/orc")
+ saveAsParquetTable(df, dir.getCanonicalPath + "/parquet")
+ }
+
+ private def prepareStringDictTable(
+ dir: File, numRows: Int, numDistinctValues: Int, width: Int): Unit = {
+ val selectExpr = (0 to width).map {
+ case 0 => s"CAST(id % $numDistinctValues AS STRING) AS value"
+ case i => s"CAST(rand() AS STRING) c$i"
+ }
+ val df = spark.range(numRows).selectExpr(selectExpr: _*).sort("value")
+
+ saveAsOrcTable(df, dir.getCanonicalPath + "/orc")
+ saveAsParquetTable(df, dir.getCanonicalPath + "/parquet")
+ }
+
+ private def saveAsOrcTable(df: DataFrame, dir: String): Unit = {
+ df.write.mode("overwrite").orc(dir)
+ spark.read.orc(dir).createOrReplaceTempView("orcTable")
+ }
+
+ private def saveAsParquetTable(df: DataFrame, dir: String): Unit = {
+ df.write.mode("overwrite").parquet(dir)
+ spark.read.parquet(dir).createOrReplaceTempView("parquetTable")
+ }
+
+ def filterPushDownBenchmark(
+ values: Int,
+ title: String,
+ whereExpr: String,
+ selectExpr: String = "*"): Unit = {
+ val benchmark = new Benchmark(title, values, minNumIters = 5)
+
+ Seq(false, true).foreach { pushDownEnabled =>
+ val name = s"Parquet Vectorized ${if (pushDownEnabled) s"(Pushdown)" else ""}"
+ benchmark.addCase(name) { _ =>
+ withSQLConf(SQLConf.PARQUET_FILTER_PUSHDOWN_ENABLED.key -> s"$pushDownEnabled") {
+ spark.sql(s"SELECT $selectExpr FROM parquetTable WHERE $whereExpr").collect()
+ }
+ }
+ }
+
+ Seq(false, true).foreach { pushDownEnabled =>
+ val name = s"Native ORC Vectorized ${if (pushDownEnabled) s"(Pushdown)" else ""}"
+ benchmark.addCase(name) { _ =>
+ withSQLConf(SQLConf.ORC_FILTER_PUSHDOWN_ENABLED.key -> s"$pushDownEnabled") {
+ spark.sql(s"SELECT $selectExpr FROM orcTable WHERE $whereExpr").collect()
+ }
+ }
+ }
+
+ /*
+ Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz
+ Select 0 string row (value IS NULL): Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
+ ------------------------------------------------------------------------------------------------
+ Parquet Vectorized 8452 / 8504 1.9 537.3 1.0X
+ Parquet Vectorized (Pushdown) 274 / 281 57.3 17.4 30.8X
+ Native ORC Vectorized 8167 / 8185 1.9 519.3 1.0X
+ Native ORC Vectorized (Pushdown) 365 / 379 43.1 23.2 23.1X
+
+
+ Select 0 string row
+ ('7864320' < value < '7864320'): Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
+ ------------------------------------------------------------------------------------------------
+ Parquet Vectorized 8532 / 8564 1.8 542.4 1.0X
+ Parquet Vectorized (Pushdown) 366 / 386 43.0 23.3 23.3X
+ Native ORC Vectorized 8289 / 8300 1.9 527.0 1.0X
+ Native ORC Vectorized (Pushdown) 378 / 385 41.6 24.0 22.6X
+
+
+ Select 1 string row (value = '7864320'): Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
+ ------------------------------------------------------------------------------------------------
+ Parquet Vectorized 8547 / 8564 1.8 543.4 1.0X
+ Parquet Vectorized (Pushdown) 351 / 356 44.9 22.3 24.4X
+ Native ORC Vectorized 8310 / 8323 1.9 528.3 1.0X
+ Native ORC Vectorized (Pushdown) 370 / 375 42.5 23.5 23.1X
+
+
+ Select 1 string row
+ (value <=> '7864320'): Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
+ ------------------------------------------------------------------------------------------------
+ Parquet Vectorized 8537 / 8563 1.8 542.8 1.0X
+ Parquet Vectorized (Pushdown) 310 / 319 50.7 19.7 27.5X
+ Native ORC Vectorized 8316 / 8335 1.9 528.7 1.0X
+ Native ORC Vectorized (Pushdown) 364 / 367 43.2 23.1 23.5X
+
+
+ Select 1 string row
+ ('7864320' <= value <= '7864320'): Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
+ ------------------------------------------------------------------------------------------------
+ Parquet Vectorized 8594 / 8607 1.8 546.4 1.0X
+ Parquet Vectorized (Pushdown) 370 / 374 42.5 23.5 23.2X
+ Native ORC Vectorized 8350 / 8358 1.9 530.9 1.0X
+ Native ORC Vectorized (Pushdown) 371 / 374 42.4 23.6 23.2X
+
+
+ Select all string rows
+ (value IS NOT NULL): Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
+ ------------------------------------------------------------------------------------------------
+ Parquet Vectorized 19601 / 19625 0.8 1246.2 1.0X
+ Parquet Vectorized (Pushdown) 19698 / 19703 0.8 1252.3 1.0X
+ Native ORC Vectorized 19435 / 19470 0.8 1235.6 1.0X
+ Native ORC Vectorized (Pushdown) 19568 / 19590 0.8 1244.1 1.0X
+
+
+ Select 0 int row (value IS NULL): Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
+ ------------------------------------------------------------------------------------------------
+ Parquet Vectorized 7815 / 7824 2.0 496.9 1.0X
+ Parquet Vectorized (Pushdown) 245 / 251 64.2 15.6 31.9X
+ Native ORC Vectorized 7436 / 7460 2.1 472.8 1.1X
+ Native ORC Vectorized (Pushdown) 344 / 351 45.7 21.9 22.7X
+
+
+ Select 0 int row
+ (7864320 < value < 7864320): Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
+ ------------------------------------------------------------------------------------------------
+ Parquet Vectorized 7792 / 7807 2.0 495.4 1.0X
+ Parquet Vectorized (Pushdown) 349 / 353 45.1 22.2 22.3X
+ Native ORC Vectorized 7451 / 7465 2.1 473.7 1.0X
+ Native ORC Vectorized (Pushdown) 365 / 368 43.0 23.2 21.3X
+
+
+ Select 1 int row (value = 7864320): Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
+ ------------------------------------------------------------------------------------------------
+ Parquet Vectorized 7836 / 7872 2.0 498.2 1.0X
+ Parquet Vectorized (Pushdown) 322 / 327 48.8 20.5 24.3X
+ Native ORC Vectorized 7533 / 7540 2.1 478.9 1.0X
+ Native ORC Vectorized (Pushdown) 358 / 363 43.9 22.8 21.9X
+
+
+ Select 1 int row (value <=> 7864320): Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
+ ------------------------------------------------------------------------------------------------
+ Parquet Vectorized 7855 / 7870 2.0 499.4 1.0X
+ Parquet Vectorized (Pushdown) 286 / 297 54.9 18.2 27.4X
+ Native ORC Vectorized 7511 / 7557 2.1 477.5 1.0X
+ Native ORC Vectorized (Pushdown) 358 / 361 43.9 22.8 21.9X
+
+
+ Select 1 int row
+ (7864320 <= value <= 7864320): Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
+ ------------------------------------------------------------------------------------------------
+ Parquet Vectorized 7851 / 7870 2.0 499.2 1.0X
+ Parquet Vectorized (Pushdown) 345 / 347 45.6 21.9 22.8X
+ Native ORC Vectorized 7543 / 7554 2.1 479.6 1.0X
+ Native ORC Vectorized (Pushdown) 364 / 374 43.2 23.1 21.6X
+
+
+ Select 1 int row
+ (7864319 < value < 7864321): Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
+ ------------------------------------------------------------------------------------------------
+ Parquet Vectorized 7837 / 7840 2.0 498.2 1.0X
+ Parquet Vectorized (Pushdown) 338 / 339 46.6 21.5 23.2X
+ Native ORC Vectorized 7524 / 7541 2.1 478.3 1.0X
+ Native ORC Vectorized (Pushdown) 361 / 364 43.6 22.9 21.7X
+
+
+ Select 10% int rows (value < 1572864): Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
+ ------------------------------------------------------------------------------------------------
+ Parquet Vectorized 8864 / 8900 1.8 563.5 1.0X
+ Parquet Vectorized (Pushdown) 2088 / 2095 7.5 132.7 4.2X
+ Native ORC Vectorized 8562 / 8579 1.8 544.3 1.0X
+ Native ORC Vectorized (Pushdown) 2127 / 2131 7.4 135.2 4.2X
+
+
+ Select 50% int rows (value < 7864320): Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
+ ------------------------------------------------------------------------------------------------
+ Parquet Vectorized 12671 / 12684 1.2 805.6 1.0X
+ Parquet Vectorized (Pushdown) 9032 / 9041 1.7 574.2 1.4X
+ Native ORC Vectorized 12388 / 12411 1.3 787.6 1.0X
+ Native ORC Vectorized (Pushdown) 8873 / 8884 1.8 564.1 1.4X
+
+
+ Select 90% int rows (value < 14155776): Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
+ ------------------------------------------------------------------------------------------------
+ Parquet Vectorized 16481 / 16495 1.0 1047.8 1.0X
+ Parquet Vectorized (Pushdown) 15906 / 15919 1.0 1011.3 1.0X
+ Native ORC Vectorized 16224 / 16254 1.0 1031.5 1.0X
+ Native ORC Vectorized (Pushdown) 15632 / 15661 1.0 993.9 1.1X
+
+
+ Select all int rows (value IS NOT NULL): Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
+ ------------------------------------------------------------------------------------------------
+ Parquet Vectorized 17341 / 17354 0.9 1102.5 1.0X
+ Parquet Vectorized (Pushdown) 17463 / 17481 0.9 1110.2 1.0X
+ Native ORC Vectorized 17073 / 17089 0.9 1085.4 1.0X
+ Native ORC Vectorized (Pushdown) 17194 / 17232 0.9 1093.2 1.0X
+
+
+ Select all int rows (value > -1): Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
+ ------------------------------------------------------------------------------------------------
+ Parquet Vectorized 17452 / 17467 0.9 1109.6 1.0X
+ Parquet Vectorized (Pushdown) 17613 / 17630 0.9 1119.8 1.0X
+ Native ORC Vectorized 17259 / 17271 0.9 1097.3 1.0X
+ Native ORC Vectorized (Pushdown) 17385 / 17429 0.9 1105.3 1.0X
+
+
+ Select all int rows (value != -1): Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
+ ------------------------------------------------------------------------------------------------
+ Parquet Vectorized 17363 / 17372 0.9 1103.9 1.0X
+ Parquet Vectorized (Pushdown) 17526 / 17535 0.9 1114.2 1.0X
+ Native ORC Vectorized 17052 / 17089 0.9 1084.2 1.0X
+ Native ORC Vectorized (Pushdown) 17209 / 17229 0.9 1094.1 1.0X
+
+
+ Select 0 distinct string row
+ (value IS NULL): Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
+ ------------------------------------------------------------------------------------------------
+ Parquet Vectorized 7697 / 7751 2.0 489.4 1.0X
+ Parquet Vectorized (Pushdown) 264 / 284 59.5 16.8 29.1X
+ Native ORC Vectorized 6942 / 6970 2.3 441.4 1.1X
+ Native ORC Vectorized (Pushdown) 372 / 381 42.3 23.7 20.7X
+
+
+ Select 0 distinct string row
+ ('100' < value < '100'): Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
+ ------------------------------------------------------------------------------------------------
+ Parquet Vectorized 7983 / 8018 2.0 507.5 1.0X
+ Parquet Vectorized (Pushdown) 334 / 337 47.0 21.3 23.9X
+ Native ORC Vectorized 7307 / 7313 2.2 464.5 1.1X
+ Native ORC Vectorized (Pushdown) 363 / 371 43.3 23.1 22.0X
+
+
+ Select 1 distinct string row
+ (value = '100'): Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
+ ------------------------------------------------------------------------------------------------
+ Parquet Vectorized 7882 / 7915 2.0 501.1 1.0X
+ Parquet Vectorized (Pushdown) 504 / 522 31.2 32.1 15.6X
+ Native ORC Vectorized 7143 / 7155 2.2 454.1 1.1X
+ Native ORC Vectorized (Pushdown) 555 / 573 28.4 35.3 14.2X
+
+
+ Select 1 distinct string row
+ (value <=> '100'): Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
+ ------------------------------------------------------------------------------------------------
+ Parquet Vectorized 7898 / 7912 2.0 502.1 1.0X
+ Parquet Vectorized (Pushdown) 470 / 481 33.5 29.9 16.8X
+ Native ORC Vectorized 7135 / 7149 2.2 453.6 1.1X
+ Native ORC Vectorized (Pushdown) 552 / 557 28.5 35.1 14.3X
+
+
+ Select 1 distinct string row
+ ('100' <= value <= '100'): Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
+ ------------------------------------------------------------------------------------------------
+ Parquet Vectorized 8189 / 8213 1.9 520.7 1.0X
+ Parquet Vectorized (Pushdown) 527 / 534 29.9 33.5 15.5X
+ Native ORC Vectorized 7477 / 7498 2.1 475.3 1.1X
+ Native ORC Vectorized (Pushdown) 558 / 566 28.2 35.5 14.7X
+
+
+ Select all distinct string rows
+ (value IS NOT NULL): Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
+ ------------------------------------------------------------------------------------------------
+ Parquet Vectorized 19462 / 19476 0.8 1237.4 1.0X
+ Parquet Vectorized (Pushdown) 19570 / 19582 0.8 1244.2 1.0X
+ Native ORC Vectorized 18577 / 18604 0.8 1181.1 1.0X
+ Native ORC Vectorized (Pushdown) 18701 / 18742 0.8 1189.0 1.0X
+ */
+ benchmark.run()
+ }
+
+ private def runIntBenchmark(numRows: Int, width: Int, mid: Int): Unit = {
+ Seq("value IS NULL", s"$mid < value AND value < $mid").foreach { whereExpr =>
+ val title = s"Select 0 int row ($whereExpr)".replace("value AND value", "value")
+ filterPushDownBenchmark(numRows, title, whereExpr)
+ }
+
+ Seq(
+ s"value = $mid",
+ s"value <=> $mid",
+ s"$mid <= value AND value <= $mid",
+ s"${mid - 1} < value AND value < ${mid + 1}"
+ ).foreach { whereExpr =>
+ val title = s"Select 1 int row ($whereExpr)".replace("value AND value", "value")
+ filterPushDownBenchmark(numRows, title, whereExpr)
+ }
+
+ val selectExpr = (1 to width).map(i => s"MAX(c$i)").mkString("", ",", ", MAX(value)")
+
+ Seq(10, 50, 90).foreach { percent =>
+ filterPushDownBenchmark(
+ numRows,
+ s"Select $percent% int rows (value < ${numRows * percent / 100})",
+ s"value < ${numRows * percent / 100}",
+ selectExpr
+ )
+ }
+
+ Seq("value IS NOT NULL", "value > -1", "value != -1").foreach { whereExpr =>
+ filterPushDownBenchmark(
+ numRows,
+ s"Select all int rows ($whereExpr)",
+ whereExpr,
+ selectExpr)
+ }
+ }
+
+ private def runStringBenchmark(
+ numRows: Int, width: Int, searchValue: Int, colType: String): Unit = {
+ Seq("value IS NULL", s"'$searchValue' < value AND value < '$searchValue'")
+ .foreach { whereExpr =>
+ val title = s"Select 0 $colType row ($whereExpr)".replace("value AND value", "value")
+ filterPushDownBenchmark(numRows, title, whereExpr)
+ }
+
+ Seq(
+ s"value = '$searchValue'",
+ s"value <=> '$searchValue'",
+ s"'$searchValue' <= value AND value <= '$searchValue'"
+ ).foreach { whereExpr =>
+ val title = s"Select 1 $colType row ($whereExpr)".replace("value AND value", "value")
+ filterPushDownBenchmark(numRows, title, whereExpr)
+ }
+
+ val selectExpr = (1 to width).map(i => s"MAX(c$i)").mkString("", ",", ", MAX(value)")
+
+ Seq("value IS NOT NULL").foreach { whereExpr =>
+ filterPushDownBenchmark(
+ numRows,
+ s"Select all $colType rows ($whereExpr)",
+ whereExpr,
+ selectExpr)
+ }
+ }
+
+ def main(args: Array[String]): Unit = {
+ val numRows = 1024 * 1024 * 15
+ val width = 5
+
+ // Pushdown for many distinct value case
+ withTempPath { dir =>
+ val mid = numRows / 2
+
+ withTempTable("orcTable", "parquetTable") {
+ Seq(true, false).foreach { useStringForValue =>
+ prepareTable(dir, numRows, width, useStringForValue)
+ if (useStringForValue) {
+ runStringBenchmark(numRows, width, mid, "string")
+ } else {
+ runIntBenchmark(numRows, width, mid)
+ }
+ }
+ }
+ }
+
+ // Pushdown for few distinct value case (use dictionary encoding)
--- End diff --
ORC has a conf called `orc.dictionary.key.threshold`. Do we need to set it here? cc @dongjoon-hyun
```
DICTIONARY_KEY_SIZE_THRESHOLD("orc.dictionary.key.threshold",
"hive.exec.orc.dictionary.key.size.threshold",
0.8,
"If the number of distinct keys in a dictionary is greater than this\n" +
"fraction of the total number of non-null rows, turn off \n" +
"dictionary encoding. Use 1 to always use dictionary encoding.")
```
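To make the quoted rule concrete, here is a minimal Scala sketch of the threshold check as the description words it. It is a simplified reading of that text, not ORC's actual implementation; `keepsDictionary` and the distinct-value count of 200 are hypothetical, while the row count matches `numRows` in `main`.

```scala
object OrcDictionaryThresholdSketch {
  // Simplified reading of the quoted description (an assumption, not ORC's code):
  // dictionary encoding is turned off once the number of distinct keys
  // exceeds `threshold * nonNullRows`.
  def keepsDictionary(distinctKeys: Long, nonNullRows: Long, threshold: Double): Boolean =
    distinctKeys.toDouble <= threshold * nonNullRows.toDouble

  def main(args: Array[String]): Unit = {
    val numRows = 1024L * 1024L * 15L // same row count as the benchmark's `main`

    // Many-distinct case: every value is unique, so the ratio is 1.0 > 0.8.
    println(keepsDictionary(numRows, numRows, 0.8)) // false
    // Few-distinct case with a hypothetical 200 distinct values.
    println(keepsDictionary(200L, numRows, 0.8))    // true
    // A threshold of 1 always keeps the dictionary, as the description says.
    println(keepsDictionary(numRows, numRows, 1.0)) // true
  }
}
```

Under this reading, a column with few distinct values stays far below the default of 0.8. If the benchmark did need to force the setting, ORC writer properties can usually be passed as write options, for example `df.write.option("orc.dictionary.key.threshold", "1.0").orc(dir)`.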
---
[GitHub] spark issue #21288: [SPARK-24206][SQL] Improve FilterPushdownBenchmark bench...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/21288
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/90441/
Test FAILed.
---
[GitHub] spark pull request #21288: [SPARK-24206][SQL] Improve FilterPushdownBenchmar...
Posted by maropu <gi...@git.apache.org>.
Github user maropu commented on a diff in the pull request:
https://github.com/apache/spark/pull/21288#discussion_r191283013
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/FilterPushdownBenchmark.scala ---
@@ -131,211 +132,214 @@ object FilterPushdownBenchmark {
}
/*
+ OpenJDK 64-Bit Server VM 1.8.0_171-b10 on Linux 4.14.26-46.32.amzn1.x86_64
Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz
Select 0 string row (value IS NULL): Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------
- Parquet Vectorized 8452 / 8504 1.9 537.3 1.0X
- Parquet Vectorized (Pushdown) 274 / 281 57.3 17.4 30.8X
- Native ORC Vectorized 8167 / 8185 1.9 519.3 1.0X
- Native ORC Vectorized (Pushdown) 365 / 379 43.1 23.2 23.1X
+ Parquet Vectorized 2961 / 3123 5.3 188.3 1.0X
+ Parquet Vectorized (Pushdown) 3057 / 3121 5.1 194.4 1.0X
--- End diff --
Yea, I think so, but I'm not sure. I ran it multiple times, but I couldn't reproduce the old performance values...
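For reference, the size of the gap can be read off the `Relative` column. The sketch below assumes that column is the first case's best time divided by each case's best time, which matches the figures quoted in the diff; `relative` is a hypothetical helper, not part of `Benchmark`.

```scala
import java.util.Locale

object RelativeSpeedupSketch {
  // Assumption: Relative = best time of the first case / best time of this case.
  def relative(baselineBestMs: Double, caseBestMs: Double): String =
    String.format(Locale.ROOT, "%.1fX", Double.box(baselineBestMs / caseBestMs))

  def main(args: Array[String]): Unit = {
    // Old results for "Select 0 string row (value IS NULL)", Parquet with pushdown.
    println(relative(8452, 274))  // 30.8X
    // New results for the same case.
    println(relative(2961, 3057)) // 1.0X
  }
}
```

With the old numbers pushdown is about 30 times faster than the baseline; with the new numbers it is no faster at all.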
---
[GitHub] spark issue #21288: [SPARK-24206][SQL] Improve FilterPushdownBenchmark bench...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/21288
**[Test build #91228 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91228/testReport)** for PR 21288 at commit [`d41e689`](https://github.com/apache/spark/commit/d41e68914e00a7ba6734b3fdbe839b130fbbd42e).
---
[GitHub] spark issue #21288: [SPARK-24206][SQL] Improve FilterPushdownBenchmark bench...
Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the issue:
https://github.com/apache/spark/pull/21288
@maropu Could you fix the style?
BTW, based on the latest result, Parquet is generally faster than ORC. cc @dongjoon-hyun @rdblue
---
[GitHub] spark issue #21288: [SPARK-24206][SQL] Improve FilterPushdownBenchmark bench...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/21288
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/91228/
Test PASSed.
---
[GitHub] spark issue #21288: [SPARK-24206][SQL] Improve FilterPushdownBenchmark bench...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/21288
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/3405/
Test PASSed.
---
[GitHub] spark issue #21288: [SPARK-24206][SQL] Improve FilterPushdownBenchmark bench...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/21288
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/90454/
Test PASSed.
---
[GitHub] spark issue #21288: [SPARK-24206][SQL] Improve FilterPushdownBenchmark bench...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/21288
**[Test build #91211 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91211/testReport)** for PR 21288 at commit [`2c0d5cb`](https://github.com/apache/spark/commit/2c0d5cbf51268540653543b96de135a6923c6cef).
* This patch **fails due to an unknown error code, -9**.
* This patch merges cleanly.
* This patch adds no public classes.
---
[GitHub] spark issue #21288: [SPARK-24206][SQL] Improve FilterPushdownBenchmark bench...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/21288
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/91857/
Test PASSed.
---
[GitHub] spark issue #21288: [SPARK-24206][SQL] Improve FilterPushdownBenchmark bench...
Posted by maropu <gi...@git.apache.org>.
Github user maropu commented on the issue:
https://github.com/apache/spark/pull/21288
Thanks for the check! BTW, `DataSourceReadBenchmark` has the same issue (`spark.master` setup), so is it OK to fix that as a follow-up?
https://github.com/apache/spark/compare/master...maropu:FixDataSourceReadBenchmark
Also, I updated the benchmark results on `r3.xlarge`.
---
[GitHub] spark issue #21288: [SPARK-24206][SQL] Improve FilterPushdownBenchmark bench...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/21288
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/4082/
Test PASSed.
---
[GitHub] spark pull request #21288: [SPARK-24206][SQL] Improve FilterPushdownBenchmar...
Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:
https://github.com/apache/spark/pull/21288
---
[GitHub] spark issue #21288: [SPARK-24206][SQL] Improve FilterPushdownBenchmark bench...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/21288
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/3096/
Test PASSed.
---
[GitHub] spark issue #21288: [SPARK-24206][SQL] Improve FilterPushdownBenchmark bench...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/21288
**[Test build #91857 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91857/testReport)** for PR 21288 at commit [`d3dd504`](https://github.com/apache/spark/commit/d3dd50463c2b91ae8800dbcc811dcc52880a02ca).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
---
[GitHub] spark issue #21288: [SPARK-24206][SQL] Improve FilterPushdownBenchmark bench...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/21288
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/4014/
Test PASSed.
---
[GitHub] spark issue #21288: [SPARK-24206][SQL] Improve FilterPushdownBenchmark bench...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/21288
Merged build finished. Test PASSed.
---
[GitHub] spark issue #21288: [SPARK-24206][SQL] Improve FilterPushdownBenchmark bench...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/21288
Merged build finished. Test PASSed.
---
[GitHub] spark issue #21288: [SPARK-24206][SQL] Improve FilterPushdownBenchmark bench...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/21288
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/3628/
Test PASSed.
---
[GitHub] spark pull request #21288: [SPARK-24206][SQL] Improve FilterPushdownBenchmar...
Posted by dongjoon-hyun <gi...@git.apache.org>.
Github user dongjoon-hyun commented on a diff in the pull request:
https://github.com/apache/spark/pull/21288#discussion_r195940979
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/FilterPushdownBenchmark.scala ---
@@ -0,0 +1,442 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.benchmark
+
+import java.io.File
+
+import scala.util.{Random, Try}
+
+import org.apache.spark.SparkConf
+import org.apache.spark.sql.{DataFrame, SparkSession}
+import org.apache.spark.sql.functions.monotonically_increasing_id
+import org.apache.spark.sql.internal.SQLConf
+import org.apache.spark.util.{Benchmark, Utils}
+
+
+/**
+ * Benchmark to measure read performance with Filter pushdown.
+ * To run this:
+ * spark-submit --class <this class> <spark sql test jar>
+ */
+object FilterPushdownBenchmark {
+ val conf = new SparkConf()
+ .setAppName("FilterPushdownBenchmark")
+ // Since `spark.master` always exists, overrides this value
+ .set("spark.master", "local[1]")
--- End diff --
Could you update `m4.2xlarge` in the PR description and add `spark.master` at line 34, too?
---
[GitHub] spark issue #21288: [SPARK-24206][SQL] Improve FilterPushdownBenchmark bench...
Posted by maropu <gi...@git.apache.org>.
Github user maropu commented on the issue:
https://github.com/apache/spark/pull/21288
ok
---
[GitHub] spark issue #21288: [SPARK-24206][SQL] Improve FilterPushdownBenchmark bench...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/21288
**[Test build #90878 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90878/testReport)** for PR 21288 at commit [`39e5a50`](https://github.com/apache/spark/commit/39e5a507fe22cade6bed0613eefbccab15cf45ff).
---
[GitHub] spark issue #21288: [SPARK-24206][SQL] Improve FilterPushdownBenchmark bench...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/21288
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/91946/
Test PASSed.
---
[GitHub] spark issue #21288: [SPARK-24206][SQL] Improve FilterPushdownBenchmark bench...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/21288
Merged build finished. Test PASSed.
---
[GitHub] spark issue #21288: [SPARK-24206][SQL] Improve FilterPushdownBenchmark bench...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/21288
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/3627/
Test FAILed.
---
[GitHub] spark issue #21288: [SPARK-24206][SQL] Improve FilterPushdownBenchmark bench...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/21288
Merged build finished. Test FAILed.
---
[GitHub] spark issue #21288: [SPARK-24206][SQL] Improve FilterPushdownBenchmark bench...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/21288
Merged build finished. Test FAILed.
---
[GitHub] spark pull request #21288: [SPARK-24206][SQL] Improve FilterPushdownBenchmar...
Posted by maropu <gi...@git.apache.org>.
Github user maropu commented on a diff in the pull request:
https://github.com/apache/spark/pull/21288#discussion_r189490682
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/FilterPushdownBenchmark.scala ---
@@ -32,14 +32,14 @@ import org.apache.spark.util.{Benchmark, Utils}
*/
object FilterPushdownBenchmark {
val conf = new SparkConf()
- conf.set("orc.compression", "snappy")
- conf.set("spark.sql.parquet.compression.codec", "snappy")
+ .setMaster("local[1]")
--- End diff --
ok
---
[GitHub] spark issue #21288: [SPARK-24206][SQL] Improve FilterPushdownBenchmark bench...
Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the issue:
https://github.com/apache/spark/pull/21288
retest this please
---
[GitHub] spark issue #21288: [SPARK-24206][SQL] Improve FilterPushdownBenchmark bench...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/21288
Merged build finished. Test PASSed.
---
[GitHub] spark pull request #21288: [SPARK-24206][SQL] Improve FilterPushdownBenchmar...
Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on a diff in the pull request:
https://github.com/apache/spark/pull/21288#discussion_r189175140
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/FilterPushdownBenchmark.scala ---
@@ -32,14 +32,14 @@ import org.apache.spark.util.{Benchmark, Utils}
*/
object FilterPushdownBenchmark {
val conf = new SparkConf()
- conf.set("orc.compression", "snappy")
- conf.set("spark.sql.parquet.compression.codec", "snappy")
+ .setMaster("local[1]")
--- End diff --
I think you can do `.setIfMissing("spark.master", "local[1]")`;
that way, perhaps we could get this to run on different backends too.
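A minimal sketch of the difference under discussion, assuming `spark-core` is on the classpath. The object name `MasterConfSketch` and the helper `resolveMaster` are hypothetical and not part of the PR; `submitted` merely stands in for whatever value `--master` would supply.

```scala
import org.apache.spark.SparkConf

object MasterConfSketch {
  // Hypothetical helper: `submitted` models a master value given on the command line.
  def resolveMaster(submitted: Option[String], useSetIfMissing: Boolean): String = {
    // loadDefaults = false keeps the sketch independent of spark.* system properties.
    val conf = new SparkConf(false)
    submitted.foreach(m => conf.set("spark.master", m))
    if (useSetIfMissing) {
      // Only fills in a default when no master has been configured yet.
      conf.setIfMissing("spark.master", "local[1]")
    } else {
      // Unconditionally overwrites whatever was configured before.
      conf.set("spark.master", "local[1]")
    }
    conf.get("spark.master")
  }

  def main(args: Array[String]): Unit = {
    println(resolveMaster(Some("local[*]"), useSetIfMissing = false)) // local[1]
    println(resolveMaster(Some("local[*]"), useSetIfMissing = true))  // local[*]
    println(resolveMaster(None, useSetIfMissing = true))              // local[1]
  }
}
```

With `set`, a user-supplied master is silently replaced; with `setIfMissing`, `local[1]` acts only as a fallback.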
---
[GitHub] spark pull request #21288: [SPARK-24206][SQL] Improve FilterPushdownBenchmar...
Posted by maropu <gi...@git.apache.org>.
Github user maropu commented on a diff in the pull request:
https://github.com/apache/spark/pull/21288#discussion_r195949711
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/FilterPushdownBenchmark.scala ---
@@ -0,0 +1,442 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.benchmark
+
+import java.io.File
+
+import scala.util.{Random, Try}
+
+import org.apache.spark.SparkConf
+import org.apache.spark.sql.{DataFrame, SparkSession}
+import org.apache.spark.sql.functions.monotonically_increasing_id
+import org.apache.spark.sql.internal.SQLConf
+import org.apache.spark.util.{Benchmark, Utils}
+
+
+/**
+ * Benchmark to measure read performance with Filter pushdown.
+ * To run this:
+ * spark-submit --class <this class> <spark sql test jar>
+ */
+object FilterPushdownBenchmark {
+ val conf = new SparkConf()
+ .setAppName("FilterPushdownBenchmark")
+ // Since `spark.master` always exists, overrides this value
+ .set("spark.master", "local[1]")
--- End diff --
I'm afraid that other developers might misunderstand how to use this. For example:
```
spark-submit --master local[1] --class <this class> <spark sql test jar>
spark-submit --master local[*] --class <this class> <spark sql test jar>
```
In both cases, the benchmark always uses `local[1]`. Or are you making a different point?
---
[GitHub] spark issue #21288: [SPARK-24206][SQL] Improve FilterPushdownBenchmark bench...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/21288
**[Test build #90883 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90883/testReport)** for PR 21288 at commit [`39e5a50`](https://github.com/apache/spark/commit/39e5a507fe22cade6bed0613eefbccab15cf45ff).
---
[GitHub] spark pull request #21288: [SPARK-24206][SQL] Improve FilterPushdownBenchmar...
Posted by maropu <gi...@git.apache.org>.
Github user maropu commented on a diff in the pull request:
https://github.com/apache/spark/pull/21288#discussion_r187822729
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/FilterPushdownBenchmark.scala ---
@@ -32,14 +32,14 @@ import org.apache.spark.util.{Benchmark, Utils}
*/
object FilterPushdownBenchmark {
val conf = new SparkConf()
- conf.set("orc.compression", "snappy")
- conf.set("spark.sql.parquet.compression.codec", "snappy")
+ .setMaster("local[1]")
+ .setAppName("FilterPushdownBenchmark")
+ .set("spark.driver.memory", "3g")
--- End diff --
Aha, OK. Looks good to me.
I just added this to match other benchmark code, e.g., `TPCDSQueryBenchmark`.
If that's fine, I'll fix the other places in a follow-up.
---
[GitHub] spark issue #21288: [SPARK-24206][SQL] Improve FilterPushdownBenchmark bench...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/21288
**[Test build #90441 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90441/testReport)** for PR 21288 at commit [`8f60902`](https://github.com/apache/spark/commit/8f609023174c9f97bddc46bebe98f4ce3caf08c5).
* This patch **fails due to an unknown error code, -9**.
* This patch merges cleanly.
* This patch adds no public classes.
---
[GitHub] spark issue #21288: [SPARK-24206][SQL] Improve FilterPushdownBenchmark bench...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/21288
Merged build finished. Test FAILed.
---
[GitHub] spark pull request #21288: [SPARK-24206][SQL] Improve FilterPushdownBenchmar...
Posted by maropu <gi...@git.apache.org>.
Github user maropu commented on a diff in the pull request:
https://github.com/apache/spark/pull/21288#discussion_r195946683
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/FilterPushdownBenchmark.scala ---
@@ -0,0 +1,442 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.benchmark
+
+import java.io.File
+
+import scala.util.{Random, Try}
+
+import org.apache.spark.SparkConf
+import org.apache.spark.sql.{DataFrame, SparkSession}
+import org.apache.spark.sql.functions.monotonically_increasing_id
+import org.apache.spark.sql.internal.SQLConf
+import org.apache.spark.util.{Benchmark, Utils}
+
+
+/**
+ * Benchmark to measure read performance with Filter pushdown.
+ * To run this:
+ * spark-submit --class <this class> <spark sql test jar>
+ */
+object FilterPushdownBenchmark {
+ val conf = new SparkConf()
+ .setAppName("FilterPushdownBenchmark")
+ // Since `spark.master` always exists, overrides this value
+ .set("spark.master", "local[1]")
--- End diff --
btw, I updated the description. Thanks!
---
[GitHub] spark pull request #21288: [SPARK-24206][SQL] Improve FilterPushdownBenchmar...
Posted by maropu <gi...@git.apache.org>.
Github user maropu commented on a diff in the pull request:
https://github.com/apache/spark/pull/21288#discussion_r189120527
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/FilterPushdownBenchmark.scala ---
@@ -105,138 +128,306 @@ object FilterPushdownBenchmark {
}
/*
- Java HotSpot(TM) 64-Bit Server VM 1.8.0_152-b16 on Mac OS X 10.13.2
- Intel(R) Core(TM) i7-4770HQ CPU @ 2.20GHz
-
- Select 0 row (id IS NULL): Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
- -----------------------------------------------------------------------------------------------
- Parquet Vectorized 7882 / 7957 2.0 501.1 1.0X
- Parquet Vectorized (Pushdown) 55 / 60 285.2 3.5 142.9X
- Native ORC Vectorized 5592 / 5627 2.8 355.5 1.4X
- Native ORC Vectorized (Pushdown) 66 / 70 237.2 4.2 118.9X
-
- Select 0 row (7864320 < id < 7864320): Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
- -----------------------------------------------------------------------------------------------
- Parquet Vectorized 7884 / 7909 2.0 501.2 1.0X
- Parquet Vectorized (Pushdown) 739 / 752 21.3 47.0 10.7X
- Native ORC Vectorized 5614 / 5646 2.8 356.9 1.4X
- Native ORC Vectorized (Pushdown) 81 / 83 195.2 5.1 97.8X
-
- Select 1 row (id = 7864320): Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
- -----------------------------------------------------------------------------------------------
- Parquet Vectorized 7905 / 8027 2.0 502.6 1.0X
- Parquet Vectorized (Pushdown) 740 / 766 21.2 47.1 10.7X
- Native ORC Vectorized 5684 / 5738 2.8 361.4 1.4X
- Native ORC Vectorized (Pushdown) 78 / 81 202.4 4.9 101.7X
-
- Select 1 row (id <=> 7864320): Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
- -----------------------------------------------------------------------------------------------
- Parquet Vectorized 7928 / 7993 2.0 504.1 1.0X
- Parquet Vectorized (Pushdown) 747 / 772 21.0 47.5 10.6X
- Native ORC Vectorized 5728 / 5753 2.7 364.2 1.4X
- Native ORC Vectorized (Pushdown) 76 / 78 207.9 4.8 104.8X
-
- Select 1 row (7864320 <= id <= 7864320):Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
- -----------------------------------------------------------------------------------------------
- Parquet Vectorized 7939 / 8021 2.0 504.8 1.0X
- Parquet Vectorized (Pushdown) 746 / 770 21.1 47.4 10.6X
- Native ORC Vectorized 5690 / 5734 2.8 361.7 1.4X
- Native ORC Vectorized (Pushdown) 76 / 79 206.7 4.8 104.3X
-
- Select 1 row (7864319 < id < 7864321): Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
- -----------------------------------------------------------------------------------------------
- Parquet Vectorized 7972 / 8019 2.0 506.9 1.0X
- Parquet Vectorized (Pushdown) 742 / 764 21.2 47.2 10.7X
- Native ORC Vectorized 5704 / 5743 2.8 362.6 1.4X
- Native ORC Vectorized (Pushdown) 76 / 78 207.9 4.8 105.4X
-
- Select 10% rows (id < 1572864): Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
- -----------------------------------------------------------------------------------------------
- Parquet Vectorized 8733 / 8808 1.8 555.2 1.0X
- Parquet Vectorized (Pushdown) 2213 / 2267 7.1 140.7 3.9X
- Native ORC Vectorized 6420 / 6463 2.4 408.2 1.4X
- Native ORC Vectorized (Pushdown) 1313 / 1331 12.0 83.5 6.7X
-
- Select 50% rows (id < 7864320): Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
- -----------------------------------------------------------------------------------------------
- Parquet Vectorized 11518 / 11591 1.4 732.3 1.0X
- Parquet Vectorized (Pushdown) 7962 / 7991 2.0 506.2 1.4X
- Native ORC Vectorized 8927 / 8985 1.8 567.6 1.3X
- Native ORC Vectorized (Pushdown) 6102 / 6160 2.6 387.9 1.9X
-
- Select 90% rows (id < 14155776): Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
- -----------------------------------------------------------------------------------------------
- Parquet Vectorized 14255 / 14389 1.1 906.3 1.0X
- Parquet Vectorized (Pushdown) 13564 / 13594 1.2 862.4 1.1X
- Native ORC Vectorized 11442 / 11608 1.4 727.5 1.2X
- Native ORC Vectorized (Pushdown) 10991 / 11029 1.4 698.8 1.3X
-
- Select all rows (id IS NOT NULL): Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
- -----------------------------------------------------------------------------------------------
- Parquet Vectorized 14917 / 14938 1.1 948.4 1.0X
- Parquet Vectorized (Pushdown) 14910 / 14964 1.1 948.0 1.0X
- Native ORC Vectorized 11986 / 12069 1.3 762.0 1.2X
- Native ORC Vectorized (Pushdown) 12037 / 12123 1.3 765.3 1.2X
-
- Select all rows (id > -1): Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
- -----------------------------------------------------------------------------------------------
- Parquet Vectorized 14951 / 14976 1.1 950.6 1.0X
- Parquet Vectorized (Pushdown) 14934 / 15016 1.1 949.5 1.0X
- Native ORC Vectorized 12000 / 12156 1.3 763.0 1.2X
- Native ORC Vectorized (Pushdown) 12079 / 12113 1.3 767.9 1.2X
-
- Select all rows (id != -1): Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
- -----------------------------------------------------------------------------------------------
- Parquet Vectorized 14930 / 14972 1.1 949.3 1.0X
- Parquet Vectorized (Pushdown) 15015 / 15047 1.0 954.6 1.0X
- Native ORC Vectorized 12090 / 12259 1.3 768.7 1.2X
- Native ORC Vectorized (Pushdown) 12021 / 12096 1.3 764.2 1.2X
+ Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz
--- End diff --
ok, I used `m4.2xlarge`.
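For readers comparing the old and new result tables, here is a rough sketch of how the `Rate(M/s)`, `Per Row(ns)`, and `Relative` columns appear to relate to the best time. The row count of 15 * 1024 * 1024 is an assumption inferred from the predicates (e.g. "50% rows (id < 7864320)"); the object and method names are hypothetical. Because the printed best times are rounded to whole milliseconds, recomputed values can differ slightly from the table.

```scala
object BenchmarkColumnsSketch {
  // Assumed row count (15 * 1024 * 1024); not stated in the quoted diff.
  val assumedRows: Long = 15L * 1024 * 1024

  // Millions of rows processed per second, given the best time in milliseconds.
  def rateMPerSec(bestMs: Double, rows: Long = assumedRows): Double =
    rows / (bestMs * 1000.0)

  // Nanoseconds spent per row, given the best time in milliseconds.
  def perRowNs(bestMs: Double, rows: Long = assumedRows): Double =
    bestMs * 1e6 / rows

  // Speedup relative to the first (baseline) case in the same table.
  def relative(baselineMs: Double, bestMs: Double): Double =
    baselineMs / bestMs

  def main(args: Array[String]): Unit = {
    // "Parquet Vectorized" vs "Parquet Vectorized (Pushdown)" from the `id IS NULL` table.
    println(rateMPerSec(7882)) // close to the reported 2.0 M/s
    println(perRowNs(7882))    // close to the reported 501.1 ns
    println(relative(7882, 55)) // roughly 143, versus the reported 142.9X
  }
}
```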
---
[GitHub] spark issue #21288: [SPARK-24206][SQL] Improve FilterPushdownBenchmark bench...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/21288
**[Test build #90904 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90904/testReport)** for PR 21288 at commit [`39e5a50`](https://github.com/apache/spark/commit/39e5a507fe22cade6bed0613eefbccab15cf45ff).
---
[GitHub] spark issue #21288: [SPARK-24206][SQL] Improve FilterPushdownBenchmark bench...
Posted by maropu <gi...@git.apache.org>.
Github user maropu commented on the issue:
https://github.com/apache/spark/pull/21288
retest this please
---
[GitHub] spark issue #21288: [SPARK-24206][SQL] Improve FilterPushdownBenchmark bench...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/21288
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/90878/
Test FAILed.
---
[GitHub] spark issue #21288: [SPARK-24206][SQL] Improve FilterPushdownBenchmark bench...
Posted by dongjoon-hyun <gi...@git.apache.org>.
Github user dongjoon-hyun commented on the issue:
https://github.com/apache/spark/pull/21288
One more thing; I prefer Macbook performance tests because the cost of EC2 is always a barrier to developers.
---
[GitHub] spark issue #21288: [SPARK-24206][SQL] Improve FilterPushdownBenchmark bench...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/21288
**[Test build #91219 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91219/testReport)** for PR 21288 at commit [`2c0d5cb`](https://github.com/apache/spark/commit/2c0d5cbf51268540653543b96de135a6923c6cef).
---
[GitHub] spark issue #21288: [SPARK-24206][SQL] Improve FilterPushdownBenchmark bench...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/21288
Merged build finished. Test FAILed.
---
[GitHub] spark issue #21288: [SPARK-24206][SQL] Improve FilterPushdownBenchmark bench...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/21288
**[Test build #91821 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91821/testReport)** for PR 21288 at commit [`d3dd504`](https://github.com/apache/spark/commit/d3dd50463c2b91ae8800dbcc811dcc52880a02ca).
* This patch **fails due to an unknown error code, -9**.
* This patch merges cleanly.
* This patch adds no public classes.
---
[GitHub] spark issue #21288: [SPARK-24206][SQL] Improve FilterPushdownBenchmark bench...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/21288
**[Test build #90454 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90454/testReport)** for PR 21288 at commit [`8f60902`](https://github.com/apache/spark/commit/8f609023174c9f97bddc46bebe98f4ce3caf08c5).
---
[GitHub] spark issue #21288: [SPARK-24206][SQL] Improve FilterPushdownBenchmark bench...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/21288
**[Test build #91914 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91914/testReport)** for PR 21288 at commit [`4a9cec9`](https://github.com/apache/spark/commit/4a9cec91f9446161d4dde0cac20ccdccb9a112e7).
* This patch **fails Spark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.
---
[GitHub] spark issue #21288: [SPARK-24206][SQL] Improve FilterPushdownBenchmark bench...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/21288
**[Test build #91211 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91211/testReport)** for PR 21288 at commit [`2c0d5cb`](https://github.com/apache/spark/commit/2c0d5cbf51268540653543b96de135a6923c6cef).
---
[GitHub] spark issue #21288: [SPARK-24206][SQL] Improve FilterPushdownBenchmark bench...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/21288
**[Test build #91815 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91815/testReport)** for PR 21288 at commit [`fa53156`](https://github.com/apache/spark/commit/fa53156599812adc94f089b8c163224fb2e4935f).
---
[GitHub] spark issue #21288: [SPARK-24206][SQL] Improve FilterPushdownBenchmark bench...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/21288
Merged build finished. Test PASSed.
---
[GitHub] spark issue #21288: [SPARK-24206][SQL] Improve FilterPushdownBenchmark bench...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/21288
Merged build finished. Test PASSed.
---
[GitHub] spark issue #21288: [SPARK-24206][SQL] Improve FilterPushdownBenchmark bench...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/21288
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/4011/
Test PASSed.
---
[GitHub] spark issue #21288: [SPARK-24206][SQL] Improve FilterPushdownBenchmark bench...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/21288
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/3102/
Test PASSed.
---
[GitHub] spark issue #21288: [SPARK-24206][SQL] Improve FilterPushdownBenchmark bench...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/21288
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/90440/
Test FAILed.
---
[GitHub] spark issue #21288: [SPARK-24206][SQL] Improve FilterPushdownBenchmark bench...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/21288
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/125/
Test PASSed.
---
[GitHub] spark issue #21288: [SPARK-24206][SQL] Improve FilterPushdownBenchmark bench...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/21288
**[Test build #91914 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91914/testReport)** for PR 21288 at commit [`4a9cec9`](https://github.com/apache/spark/commit/4a9cec91f9446161d4dde0cac20ccdccb9a112e7).
---
[GitHub] spark issue #21288: [SPARK-24206][SQL] Improve FilterPushdownBenchmark bench...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/21288
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/91815/
Test FAILed.
---
[GitHub] spark pull request #21288: [SPARK-24206][SQL] Improve FilterPushdownBenchmar...
Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on a diff in the pull request:
https://github.com/apache/spark/pull/21288#discussion_r187764083
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/FilterPushdownBenchmark.scala ---
@@ -32,14 +32,14 @@ import org.apache.spark.util.{Benchmark, Utils}
*/
object FilterPushdownBenchmark {
val conf = new SparkConf()
- conf.set("orc.compression", "snappy")
- conf.set("spark.sql.parquet.compression.codec", "snappy")
+ .setMaster("local[1]")
+ .setAppName("FilterPushdownBenchmark")
+ .set("spark.driver.memory", "3g")
--- End diff --
These and master: change to `setIfMissing()`? I think it would be great if these could be set via config.
---
[GitHub] spark issue #21288: [SPARK-24206][SQL] Improve FilterPushdownBenchmark bench...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/21288
Merged build finished. Test PASSed.
---
[GitHub] spark issue #21288: [SPARK-24206][SQL] Improve FilterPushdownBenchmark bench...
Posted by maropu <gi...@git.apache.org>.
Github user maropu commented on the issue:
https://github.com/apache/spark/pull/21288
retest this please
---
[GitHub] spark issue #21288: [SPARK-24206][SQL] Improve FilterPushdownBenchmark bench...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/21288
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/147/
Test PASSed.
---
[GitHub] spark issue #21288: [SPARK-24206][SQL] Improve FilterPushdownBenchmark bench...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/21288
Merged build finished. Test PASSed.
---
[GitHub] spark issue #21288: [SPARK-24206][SQL] Improve FilterPushdownBenchmark bench...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/21288
Merged build finished. Test PASSed.
---
[GitHub] spark pull request #21288: [SPARK-24206][SQL] Improve FilterPushdownBenchmar...
Posted by maropu <gi...@git.apache.org>.
Github user maropu commented on a diff in the pull request:
https://github.com/apache/spark/pull/21288#discussion_r195946600
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/FilterPushdownBenchmark.scala ---
@@ -0,0 +1,442 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.benchmark
+
+import java.io.File
+
+import scala.util.{Random, Try}
+
+import org.apache.spark.SparkConf
+import org.apache.spark.sql.{DataFrame, SparkSession}
+import org.apache.spark.sql.functions.monotonically_increasing_id
+import org.apache.spark.sql.internal.SQLConf
+import org.apache.spark.util.{Benchmark, Utils}
+
+
+/**
+ * Benchmark to measure read performance with Filter pushdown.
+ * To run this:
+ * spark-submit --class <this class> <spark sql test jar>
+ */
+object FilterPushdownBenchmark {
+ val conf = new SparkConf()
+ .setAppName("FilterPushdownBenchmark")
+ // Since `spark.master` always exists, overrides this value
+ .set("spark.master", "local[1]")
--- End diff --
In the current PR, `spark.master` cannot be set through command-line options because the code always overrides it. Are you suggesting that we drop `.set("spark.master", "local[1]")` and always pass `spark.master` as an option when running this benchmark?
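To make the trade-off concrete, here is a minimal sketch of how `set` and `setIfMissing` differ when a master has already been supplied. It assumes the `--master` option reaches the driver JVM as the `spark.master` system property, which the sketch simulates with `System.setProperty`; the object and helper names are hypothetical, and only `SparkConf`, `set`, `setIfMissing`, and `get` are Spark API.

```scala
import org.apache.spark.SparkConf

object MasterConfSketch {
  // Hard-coded value: replaces whatever master was supplied from outside.
  def hardCoded(): String =
    new SparkConf().set("spark.master", "local[1]").get("spark.master")

  // Default only: keeps an externally supplied master when one exists.
  def defaultOnly(): String =
    new SparkConf().setIfMissing("spark.master", "local[1]").get("spark.master")

  def main(args: Array[String]): Unit = {
    // Stand-in for `spark-submit --master local[4]` (assumption: the option
    // is visible to the driver as a `spark.*` system property).
    System.setProperty("spark.master", "local[4]")
    println(hardCoded())   // local[1]
    println(defaultOnly()) // local[4]
  }
}
```

With `set`, the benchmark always runs single-threaded regardless of the options; with `setIfMissing`, the command line wins and `local[1]` is only a fallback.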
---
[GitHub] spark issue #21288: [SPARK-24206][SQL] Improve FilterPushdownBenchmark bench...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/21288
**[Test build #90878 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90878/testReport)** for PR 21288 at commit [`39e5a50`](https://github.com/apache/spark/commit/39e5a507fe22cade6bed0613eefbccab15cf45ff).
* This patch **fails due to an unknown error code, -9**.
* This patch merges cleanly.
* This patch adds no public classes.
---
[GitHub] spark issue #21288: [SPARK-24206][SQL] Improve FilterPushdownBenchmark bench...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/21288
**[Test build #90904 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90904/testReport)** for PR 21288 at commit [`39e5a50`](https://github.com/apache/spark/commit/39e5a507fe22cade6bed0613eefbccab15cf45ff).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
---
[GitHub] spark pull request #21288: [SPARK-24206][SQL] Improve FilterPushdownBenchmar...
Posted by maropu <gi...@git.apache.org>.
Github user maropu commented on a diff in the pull request:
https://github.com/apache/spark/pull/21288#discussion_r195304544
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/FilterPushdownBenchmark.scala ---
@@ -131,211 +132,214 @@ object FilterPushdownBenchmark {
}
/*
+ OpenJDK 64-Bit Server VM 1.8.0_171-b10 on Linux 4.14.26-46.32.amzn1.x86_64
Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz
Select 0 string row (value IS NULL): Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------
- Parquet Vectorized 8452 / 8504 1.9 537.3 1.0X
- Parquet Vectorized (Pushdown) 274 / 281 57.3 17.4 30.8X
- Native ORC Vectorized 8167 / 8185 1.9 519.3 1.0X
- Native ORC Vectorized (Pushdown) 365 / 379 43.1 23.2 23.1X
+ Parquet Vectorized 2961 / 3123 5.3 188.3 1.0X
+ Parquet Vectorized (Pushdown) 3057 / 3121 5.1 194.4 1.0X
--- End diff --
The result in v2.3.1: https://gist.github.com/maropu/88627246b7143ede5ab73c7183ab2128
This is not a regression; I probably ran the benchmark on the wrong branch or commit.
I re-ran the benchmark on the current master and updated the PR.
How to run: I created a new `m4.2xlarge` instance, fetched this PR, rebased it onto master, and ran the benchmark.
---
[GitHub] spark issue #21288: [SPARK-24206][SQL] Improve FilterPushdownBenchmark bench...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/21288
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/91219/
Test FAILed.
---
[GitHub] spark issue #21288: [SPARK-24206][SQL] Improve FilterPushdownBenchmark bench...
Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the issue:
https://github.com/apache/spark/pull/21288
Sure
---
[GitHub] spark issue #21288: [SPARK-24206][SQL] Improve FilterPushdownBenchmark bench...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/21288
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/91821/
Test FAILed.
---
[GitHub] spark issue #21288: [SPARK-24206][SQL] Improve FilterPushdownBenchmark bench...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/21288
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/90571/
Test PASSed.
---
[GitHub] spark issue #21288: [SPARK-24206][SQL] Improve FilterPushdownBenchmark bench...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/21288
**[Test build #90883 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90883/testReport)** for PR 21288 at commit [`39e5a50`](https://github.com/apache/spark/commit/39e5a507fe22cade6bed0613eefbccab15cf45ff).
* This patch **fails Spark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.
---
[GitHub] spark issue #21288: [SPARK-24206][SQL] Improve FilterPushdownBenchmark bench...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/21288
Merged build finished. Test FAILed.
---
[GitHub] spark issue #21288: [SPARK-24206][SQL] Improve FilterPushdownBenchmark bench...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/21288
Merged build finished. Test FAILed.
---
[GitHub] spark pull request #21288: [SPARK-24206][SQL] Improve FilterPushdownBenchmar...
Posted by maropu <gi...@git.apache.org>.
Github user maropu commented on a diff in the pull request:
https://github.com/apache/spark/pull/21288#discussion_r195262751
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/FilterPushdownBenchmark.scala ---
@@ -131,211 +132,214 @@ object FilterPushdownBenchmark {
}
/*
+ OpenJDK 64-Bit Server VM 1.8.0_171-b10 on Linux 4.14.26-46.32.amzn1.x86_64
Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz
Select 0 string row (value IS NULL): Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------
- Parquet Vectorized 8452 / 8504 1.9 537.3 1.0X
- Parquet Vectorized (Pushdown) 274 / 281 57.3 17.4 30.8X
- Native ORC Vectorized 8167 / 8185 1.9 519.3 1.0X
- Native ORC Vectorized (Pushdown) 365 / 379 43.1 23.2 23.1X
+ Parquet Vectorized 2961 / 3123 5.3 188.3 1.0X
+ Parquet Vectorized (Pushdown) 3057 / 3121 5.1 194.4 1.0X
--- End diff --
I have time today, so I'll check v2.3.
---
[GitHub] spark issue #21288: [SPARK-24206][SQL] Improve FilterPushdownBenchmark bench...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/21288
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/3409/
Test PASSed.
---
[GitHub] spark issue #21288: [SPARK-24206][SQL] Improve FilterPushdownBenchmark bench...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/21288
**[Test build #91857 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91857/testReport)** for PR 21288 at commit [`d3dd504`](https://github.com/apache/spark/commit/d3dd50463c2b91ae8800dbcc811dcc52880a02ca).
---
[GitHub] spark issue #21288: [SPARK-24206][SQL] Improve FilterPushdownBenchmark bench...
Posted by maropu <gi...@git.apache.org>.
Github user maropu commented on the issue:
https://github.com/apache/spark/pull/21288
I found why the performance numbers changed so much in https://github.com/apache/spark/pull/21288#discussion_r191280132: [the commit](https://github.com/apache/spark/pull/21288/commits/39e5a507fe22cade6bed0613eefbccab15cf45ff) wrongly set `spark.master` to `local[*]` instead of `local[1]`:
```
// Performance results on r3.xlarge
// --master local[1] --driver-memory 10G --conf spark.ui.enabled=false
OpenJDK 64-Bit Server VM 1.8.0_171-b10 on Linux 4.14.33-51.37.amzn1.x86_64
Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
Select 0 string row ('7864320' < value < '7864320'):  Best/Avg Time(ms)  Rate(M/s)  Per Row(ns)  Relative
------------------------------------------------------------------------------------------------
Parquet Vectorized                                          9292 / 9315        1.7        590.8      1.0X
Parquet Vectorized (Pushdown)                                 921 / 933       17.1         58.6     10.1X
Native ORC Vectorized                                       9001 / 9021        1.7        572.3      1.0X
Native ORC Vectorized (Pushdown)                              257 / 265       61.2         16.3     36.2X

Select 1 string row (value = '7864320'):              Best/Avg Time(ms)  Rate(M/s)  Per Row(ns)  Relative
------------------------------------------------------------------------------------------------
Parquet Vectorized                                          9151 / 9162        1.7        581.8      1.0X
Parquet Vectorized (Pushdown)                                 902 / 917       17.4         57.3     10.1X
Native ORC Vectorized                                       8870 / 8882        1.8        564.0      1.0X
Native ORC Vectorized (Pushdown)                              254 / 268       61.9         16.1     36.0X
...
// --master local[*] --driver-memory 10G --conf spark.ui.enabled=false
OpenJDK 64-Bit Server VM 1.8.0_171-b10 on Linux 4.14.33-51.37.amzn1.x86_64
Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
Select 0 string row (value IS NULL):                  Best/Avg Time(ms)  Rate(M/s)  Per Row(ns)  Relative
------------------------------------------------------------------------------------------------
Parquet Vectorized                                          3959 / 4067        4.0        251.7      1.0X
Parquet Vectorized (Pushdown)                                 202 / 245       77.7         12.9     19.6X
Native ORC Vectorized                                       3973 / 4055        4.0        252.6      1.0X
Native ORC Vectorized (Pushdown)                              286 / 345       55.0         18.2     13.8X

OpenJDK 64-Bit Server VM 1.8.0_171-b10 on Linux 4.14.33-51.37.amzn1.x86_64
Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
Select 0 string row ('7864320' < value < '7864320'):  Best/Avg Time(ms)  Rate(M/s)  Per Row(ns)  Relative
------------------------------------------------------------------------------------------------
Parquet Vectorized                                          3985 / 4022        3.9        253.4      1.0X
Parquet Vectorized (Pushdown)                                 249 / 274       63.3         15.8     16.0X
Native ORC Vectorized                                       4066 / 4122        3.9        258.5      1.0X
Native ORC Vectorized (Pushdown)                              257 / 310       61.3         16.3     15.5X
```
I'll fix the bug and update the results in follow-up PRs. Sorry, all.
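For anyone cross-checking the tables above, the derived columns appear to follow from the best time and the row count (`1024 * 1024 * 15` in this benchmark's `main`). The following is a minimal sketch of that arithmetic, not the `Benchmark` utility's own code; the object and function names are hypothetical.

```scala
import java.util.Locale

object BenchmarkColumnsSketch {
  /** Rows processed per second, in millions, from the best wall-clock time. */
  def rateMPerSec(numRows: Long, bestMs: Double): Double =
    numRows / (bestMs / 1000.0) / 1e6

  /** Nanoseconds spent per row at the best time. */
  def perRowNs(numRows: Long, bestMs: Double): Double =
    bestMs * 1e6 / numRows

  /** Speed-up relative to the first (baseline) case in the same table. */
  def relative(baselineBestMs: Double, bestMs: Double): Double =
    baselineBestMs / bestMs

  /** One decimal place, locale-independent, as in the tables. */
  def fmt(x: Double): String = "%.1f".formatLocal(Locale.ROOT, x)

  def main(args: Array[String]): Unit = {
    val numRows = 1024L * 1024 * 15
    // Values taken from the local[1] table for ('7864320' < value < '7864320').
    println(fmt(rateMPerSec(numRows, 9292)))  // 1.7
    println(fmt(perRowNs(numRows, 9292)))     // 590.8
    println(fmt(relative(9292, 921)) + "X")   // 10.1X
  }
}
```

The same arithmetic puts the non-pushdown Parquet case at 9292 / 3985, about 2.3X faster under `local[*]` than under `local[1]`, which is why results from the two settings cannot be compared directly.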
---
[GitHub] spark issue #21288: [SPARK-24206][SQL] Improve FilterPushdownBenchmark bench...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/21288
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/3635/
Test PASSed.
---
[GitHub] spark issue #21288: [SPARK-24206][SQL] Improve FilterPushdownBenchmark bench...
Posted by maropu <gi...@git.apache.org>.
Github user maropu commented on the issue:
https://github.com/apache/spark/pull/21288
retest this please
---
[GitHub] spark pull request #21288: [SPARK-24206][SQL] Improve FilterPushdownBenchmar...
Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:
https://github.com/apache/spark/pull/21288#discussion_r189639582
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/FilterPushdownBenchmark.scala ---
@@ -0,0 +1,437 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.benchmark
+
+import java.io.File
+
+import scala.util.{Random, Try}
+
+import org.apache.spark.SparkConf
+import org.apache.spark.sql.{DataFrame, SparkSession}
+import org.apache.spark.sql.functions.monotonically_increasing_id
+import org.apache.spark.sql.internal.SQLConf
+import org.apache.spark.util.{Benchmark, Utils}
+
+
+/**
+ * Benchmark to measure read performance with Filter pushdown.
+ * To run this:
+ * spark-submit --class <this class> <spark sql test jar>
+ */
+object FilterPushdownBenchmark {
+ val conf = new SparkConf()
+ .setAppName("FilterPushdownBenchmark")
+ .setIfMissing("spark.master", "local[1]")
+ .setIfMissing("spark.driver.memory", "3g")
+ .setIfMissing("spark.executor.memory", "3g")
+ .setIfMissing("orc.compression", "snappy")
+ .setIfMissing("spark.sql.parquet.compression.codec", "snappy")
+
+ private val spark = SparkSession.builder().config(conf).getOrCreate()
+
+ def withTempPath(f: File => Unit): Unit = {
+ val path = Utils.createTempDir()
+ path.delete()
+ try f(path) finally Utils.deleteRecursively(path)
+ }
+
+ def withTempTable(tableNames: String*)(f: => Unit): Unit = {
+ try f finally tableNames.foreach(spark.catalog.dropTempView)
+ }
+
+ def withSQLConf(pairs: (String, String)*)(f: => Unit): Unit = {
+ val (keys, values) = pairs.unzip
+ val currentValues = keys.map(key => Try(spark.conf.get(key)).toOption)
+ (keys, values).zipped.foreach(spark.conf.set)
+ try f finally {
+ keys.zip(currentValues).foreach {
+ case (key, Some(value)) => spark.conf.set(key, value)
+ case (key, None) => spark.conf.unset(key)
+ }
+ }
+ }
+
+ private def prepareTable(
+ dir: File, numRows: Int, width: Int, useStringForValue: Boolean): Unit = {
+ import spark.implicits._
+ val selectExpr = (1 to width).map(i => s"CAST(value AS STRING) c$i")
+ val valueCol = if (useStringForValue) {
+ monotonically_increasing_id().cast("string")
+ } else {
+ monotonically_increasing_id()
+ }
+ val df = spark.range(numRows).map(_ => Random.nextLong).selectExpr(selectExpr: _*)
+ .withColumn("value", valueCol)
+ .sort("value")
+
+ saveAsOrcTable(df, dir.getCanonicalPath + "/orc")
+ saveAsParquetTable(df, dir.getCanonicalPath + "/parquet")
+ }
+
+ private def prepareStringDictTable(
+ dir: File, numRows: Int, numDistinctValues: Int, width: Int): Unit = {
+ val selectExpr = (0 to width).map {
+ case 0 => s"CAST(id % $numDistinctValues AS STRING) AS value"
+ case i => s"CAST(rand() AS STRING) c$i"
+ }
+ val df = spark.range(numRows).selectExpr(selectExpr: _*).sort("value")
+
+ saveAsOrcTable(df, dir.getCanonicalPath + "/orc")
+ saveAsParquetTable(df, dir.getCanonicalPath + "/parquet")
+ }
+
+ private def saveAsOrcTable(df: DataFrame, dir: String): Unit = {
+ df.write.mode("overwrite").orc(dir)
+ spark.read.orc(dir).createOrReplaceTempView("orcTable")
+ }
+
+ private def saveAsParquetTable(df: DataFrame, dir: String): Unit = {
+ df.write.mode("overwrite").parquet(dir)
+ spark.read.parquet(dir).createOrReplaceTempView("parquetTable")
+ }
+
+ def filterPushDownBenchmark(
+ values: Int,
+ title: String,
+ whereExpr: String,
+ selectExpr: String = "*"): Unit = {
+ val benchmark = new Benchmark(title, values, minNumIters = 5)
+
+ Seq(false, true).foreach { pushDownEnabled =>
+ val name = s"Parquet Vectorized ${if (pushDownEnabled) s"(Pushdown)" else ""}"
+ benchmark.addCase(name) { _ =>
+ withSQLConf(SQLConf.PARQUET_FILTER_PUSHDOWN_ENABLED.key -> s"$pushDownEnabled") {
+ spark.sql(s"SELECT $selectExpr FROM parquetTable WHERE $whereExpr").collect()
+ }
+ }
+ }
+
+ Seq(false, true).foreach { pushDownEnabled =>
+ val name = s"Native ORC Vectorized ${if (pushDownEnabled) s"(Pushdown)" else ""}"
+ benchmark.addCase(name) { _ =>
+ withSQLConf(SQLConf.ORC_FILTER_PUSHDOWN_ENABLED.key -> s"$pushDownEnabled") {
+ spark.sql(s"SELECT $selectExpr FROM orcTable WHERE $whereExpr").collect()
+ }
+ }
+ }
+
+ /*
+ Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz
+ Select 0 string row (value IS NULL): Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
+ ------------------------------------------------------------------------------------------------
+ Parquet Vectorized 8452 / 8504 1.9 537.3 1.0X
+ Parquet Vectorized (Pushdown) 274 / 281 57.3 17.4 30.8X
+ Native ORC Vectorized 8167 / 8185 1.9 519.3 1.0X
+ Native ORC Vectorized (Pushdown) 365 / 379 43.1 23.2 23.1X
+
+
+ Select 0 string row
+ ('7864320' < value < '7864320'): Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
+ ------------------------------------------------------------------------------------------------
+ Parquet Vectorized 8532 / 8564 1.8 542.4 1.0X
+ Parquet Vectorized (Pushdown) 366 / 386 43.0 23.3 23.3X
+ Native ORC Vectorized 8289 / 8300 1.9 527.0 1.0X
+ Native ORC Vectorized (Pushdown) 378 / 385 41.6 24.0 22.6X
+
+
+ Select 1 string row (value = '7864320'): Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
+ ------------------------------------------------------------------------------------------------
+ Parquet Vectorized 8547 / 8564 1.8 543.4 1.0X
+ Parquet Vectorized (Pushdown) 351 / 356 44.9 22.3 24.4X
+ Native ORC Vectorized 8310 / 8323 1.9 528.3 1.0X
+ Native ORC Vectorized (Pushdown) 370 / 375 42.5 23.5 23.1X
+
+
+ Select 1 string row
+ (value <=> '7864320'): Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
+ ------------------------------------------------------------------------------------------------
+ Parquet Vectorized 8537 / 8563 1.8 542.8 1.0X
+ Parquet Vectorized (Pushdown) 310 / 319 50.7 19.7 27.5X
+ Native ORC Vectorized 8316 / 8335 1.9 528.7 1.0X
+ Native ORC Vectorized (Pushdown) 364 / 367 43.2 23.1 23.5X
+
+
+ Select 1 string row
+ ('7864320' <= value <= '7864320'): Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
+ ------------------------------------------------------------------------------------------------
+ Parquet Vectorized 8594 / 8607 1.8 546.4 1.0X
+ Parquet Vectorized (Pushdown) 370 / 374 42.5 23.5 23.2X
+ Native ORC Vectorized 8350 / 8358 1.9 530.9 1.0X
+ Native ORC Vectorized (Pushdown) 371 / 374 42.4 23.6 23.2X
+
+
+ Select all string rows
+ (value IS NOT NULL): Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
+ ------------------------------------------------------------------------------------------------
+ Parquet Vectorized 19601 / 19625 0.8 1246.2 1.0X
+ Parquet Vectorized (Pushdown) 19698 / 19703 0.8 1252.3 1.0X
+ Native ORC Vectorized 19435 / 19470 0.8 1235.6 1.0X
+ Native ORC Vectorized (Pushdown) 19568 / 19590 0.8 1244.1 1.0X
+
+
+ Select 0 int row (value IS NULL): Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
+ ------------------------------------------------------------------------------------------------
+ Parquet Vectorized 7815 / 7824 2.0 496.9 1.0X
+ Parquet Vectorized (Pushdown) 245 / 251 64.2 15.6 31.9X
+ Native ORC Vectorized 7436 / 7460 2.1 472.8 1.1X
+ Native ORC Vectorized (Pushdown) 344 / 351 45.7 21.9 22.7X
+
+
+ Select 0 int row
+ (7864320 < value < 7864320): Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
+ ------------------------------------------------------------------------------------------------
+ Parquet Vectorized 7792 / 7807 2.0 495.4 1.0X
+ Parquet Vectorized (Pushdown) 349 / 353 45.1 22.2 22.3X
+ Native ORC Vectorized 7451 / 7465 2.1 473.7 1.0X
+ Native ORC Vectorized (Pushdown) 365 / 368 43.0 23.2 21.3X
+
+
+ Select 1 int row (value = 7864320): Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
+ ------------------------------------------------------------------------------------------------
+ Parquet Vectorized 7836 / 7872 2.0 498.2 1.0X
+ Parquet Vectorized (Pushdown) 322 / 327 48.8 20.5 24.3X
+ Native ORC Vectorized 7533 / 7540 2.1 478.9 1.0X
+ Native ORC Vectorized (Pushdown) 358 / 363 43.9 22.8 21.9X
+
+
+ Select 1 int row (value <=> 7864320): Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
+ ------------------------------------------------------------------------------------------------
+ Parquet Vectorized 7855 / 7870 2.0 499.4 1.0X
+ Parquet Vectorized (Pushdown) 286 / 297 54.9 18.2 27.4X
+ Native ORC Vectorized 7511 / 7557 2.1 477.5 1.0X
+ Native ORC Vectorized (Pushdown) 358 / 361 43.9 22.8 21.9X
+
+
+ Select 1 int row
+ (7864320 <= value <= 7864320): Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
+ ------------------------------------------------------------------------------------------------
+ Parquet Vectorized 7851 / 7870 2.0 499.2 1.0X
+ Parquet Vectorized (Pushdown) 345 / 347 45.6 21.9 22.8X
+ Native ORC Vectorized 7543 / 7554 2.1 479.6 1.0X
+ Native ORC Vectorized (Pushdown) 364 / 374 43.2 23.1 21.6X
+
+
+ Select 1 int row
+ (7864319 < value < 7864321): Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
+ ------------------------------------------------------------------------------------------------
+ Parquet Vectorized 7837 / 7840 2.0 498.2 1.0X
+ Parquet Vectorized (Pushdown) 338 / 339 46.6 21.5 23.2X
+ Native ORC Vectorized 7524 / 7541 2.1 478.3 1.0X
+ Native ORC Vectorized (Pushdown) 361 / 364 43.6 22.9 21.7X
+
+
+ Select 10% int rows (value < 1572864): Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
+ ------------------------------------------------------------------------------------------------
+ Parquet Vectorized 8864 / 8900 1.8 563.5 1.0X
+ Parquet Vectorized (Pushdown) 2088 / 2095 7.5 132.7 4.2X
+ Native ORC Vectorized 8562 / 8579 1.8 544.3 1.0X
+ Native ORC Vectorized (Pushdown) 2127 / 2131 7.4 135.2 4.2X
+
+
+ Select 50% int rows (value < 7864320): Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
+ ------------------------------------------------------------------------------------------------
+ Parquet Vectorized 12671 / 12684 1.2 805.6 1.0X
+ Parquet Vectorized (Pushdown) 9032 / 9041 1.7 574.2 1.4X
+ Native ORC Vectorized 12388 / 12411 1.3 787.6 1.0X
+ Native ORC Vectorized (Pushdown) 8873 / 8884 1.8 564.1 1.4X
+
+
+ Select 90% int rows (value < 14155776): Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
+ ------------------------------------------------------------------------------------------------
+ Parquet Vectorized 16481 / 16495 1.0 1047.8 1.0X
+ Parquet Vectorized (Pushdown) 15906 / 15919 1.0 1011.3 1.0X
+ Native ORC Vectorized 16224 / 16254 1.0 1031.5 1.0X
+ Native ORC Vectorized (Pushdown) 15632 / 15661 1.0 993.9 1.1X
+
+
+ Select all int rows (value IS NOT NULL): Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
+ ------------------------------------------------------------------------------------------------
+ Parquet Vectorized 17341 / 17354 0.9 1102.5 1.0X
+ Parquet Vectorized (Pushdown) 17463 / 17481 0.9 1110.2 1.0X
+ Native ORC Vectorized 17073 / 17089 0.9 1085.4 1.0X
+ Native ORC Vectorized (Pushdown) 17194 / 17232 0.9 1093.2 1.0X
+
+
+ Select all int rows (value > -1): Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
+ ------------------------------------------------------------------------------------------------
+ Parquet Vectorized 17452 / 17467 0.9 1109.6 1.0X
+ Parquet Vectorized (Pushdown) 17613 / 17630 0.9 1119.8 1.0X
+ Native ORC Vectorized 17259 / 17271 0.9 1097.3 1.0X
+ Native ORC Vectorized (Pushdown) 17385 / 17429 0.9 1105.3 1.0X
+
+
+ Select all int rows (value != -1): Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
+ ------------------------------------------------------------------------------------------------
+ Parquet Vectorized 17363 / 17372 0.9 1103.9 1.0X
+ Parquet Vectorized (Pushdown) 17526 / 17535 0.9 1114.2 1.0X
+ Native ORC Vectorized 17052 / 17089 0.9 1084.2 1.0X
+ Native ORC Vectorized (Pushdown) 17209 / 17229 0.9 1094.1 1.0X
+
+
+ Select 0 distinct string row
+ (value IS NULL): Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
+ ------------------------------------------------------------------------------------------------
+ Parquet Vectorized 7697 / 7751 2.0 489.4 1.0X
+ Parquet Vectorized (Pushdown) 264 / 284 59.5 16.8 29.1X
+ Native ORC Vectorized 6942 / 6970 2.3 441.4 1.1X
+ Native ORC Vectorized (Pushdown) 372 / 381 42.3 23.7 20.7X
+
+
+ Select 0 distinct string row
+ ('100' < value < '100'): Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
+ ------------------------------------------------------------------------------------------------
+ Parquet Vectorized 7983 / 8018 2.0 507.5 1.0X
+ Parquet Vectorized (Pushdown) 334 / 337 47.0 21.3 23.9X
+ Native ORC Vectorized 7307 / 7313 2.2 464.5 1.1X
+ Native ORC Vectorized (Pushdown) 363 / 371 43.3 23.1 22.0X
+
+
+ Select 1 distinct string row
+ (value = '100'): Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
+ ------------------------------------------------------------------------------------------------
+ Parquet Vectorized 7882 / 7915 2.0 501.1 1.0X
+ Parquet Vectorized (Pushdown) 504 / 522 31.2 32.1 15.6X
+ Native ORC Vectorized 7143 / 7155 2.2 454.1 1.1X
+ Native ORC Vectorized (Pushdown) 555 / 573 28.4 35.3 14.2X
+
+
+ Select 1 distinct string row
+ (value <=> '100'): Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
+ ------------------------------------------------------------------------------------------------
+ Parquet Vectorized 7898 / 7912 2.0 502.1 1.0X
+ Parquet Vectorized (Pushdown) 470 / 481 33.5 29.9 16.8X
+ Native ORC Vectorized 7135 / 7149 2.2 453.6 1.1X
+ Native ORC Vectorized (Pushdown) 552 / 557 28.5 35.1 14.3X
+
+
+ Select 1 distinct string row
+ ('100' <= value <= '100'): Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
+ ------------------------------------------------------------------------------------------------
+ Parquet Vectorized 8189 / 8213 1.9 520.7 1.0X
+ Parquet Vectorized (Pushdown) 527 / 534 29.9 33.5 15.5X
+ Native ORC Vectorized 7477 / 7498 2.1 475.3 1.1X
+ Native ORC Vectorized (Pushdown) 558 / 566 28.2 35.5 14.7X
+
+
+ Select all distinct string rows
+ (value IS NOT NULL): Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
+ ------------------------------------------------------------------------------------------------
+ Parquet Vectorized 19462 / 19476 0.8 1237.4 1.0X
+ Parquet Vectorized (Pushdown) 19570 / 19582 0.8 1244.2 1.0X
+ Native ORC Vectorized 18577 / 18604 0.8 1181.1 1.0X
+ Native ORC Vectorized (Pushdown) 18701 / 18742 0.8 1189.0 1.0X
+ */
+ benchmark.run()
+ }
+
+ private def runIntBenchmark(numRows: Int, width: Int, mid: Int): Unit = {
+ Seq("value IS NULL", s"$mid < value AND value < $mid").foreach { whereExpr =>
+ val title = s"Select 0 int row ($whereExpr)".replace("value AND value", "value")
+ filterPushDownBenchmark(numRows, title, whereExpr)
+ }
+
+ Seq(
+ s"value = $mid",
+ s"value <=> $mid",
+ s"$mid <= value AND value <= $mid",
+ s"${mid - 1} < value AND value < ${mid + 1}"
+ ).foreach { whereExpr =>
+ val title = s"Select 1 int row ($whereExpr)".replace("value AND value", "value")
+ filterPushDownBenchmark(numRows, title, whereExpr)
+ }
+
+ val selectExpr = (1 to width).map(i => s"MAX(c$i)").mkString("", ",", ", MAX(value)")
+
+ Seq(10, 50, 90).foreach { percent =>
+ filterPushDownBenchmark(
+ numRows,
+ s"Select $percent% int rows (value < ${numRows * percent / 100})",
+ s"value < ${numRows * percent / 100}",
+ selectExpr
+ )
+ }
+
+ Seq("value IS NOT NULL", "value > -1", "value != -1").foreach { whereExpr =>
+ filterPushDownBenchmark(
+ numRows,
+ s"Select all int rows ($whereExpr)",
+ whereExpr,
+ selectExpr)
+ }
+ }
+
+ private def runStringBenchmark(
+ numRows: Int, width: Int, searchValue: Int, colType: String): Unit = {
+ Seq("value IS NULL", s"'$searchValue' < value AND value < '$searchValue'")
+ .foreach { whereExpr =>
+ val title = s"Select 0 $colType row ($whereExpr)".replace("value AND value", "value")
+ filterPushDownBenchmark(numRows, title, whereExpr)
+ }
+
+ Seq(
+ s"value = '$searchValue'",
+ s"value <=> '$searchValue'",
+ s"'$searchValue' <= value AND value <= '$searchValue'"
+ ).foreach { whereExpr =>
+ val title = s"Select 1 $colType row ($whereExpr)".replace("value AND value", "value")
+ filterPushDownBenchmark(numRows, title, whereExpr)
+ }
+
+ val selectExpr = (1 to width).map(i => s"MAX(c$i)").mkString("", ",", ", MAX(value)")
+
+ Seq("value IS NOT NULL").foreach { whereExpr =>
+ filterPushDownBenchmark(
+ numRows,
+ s"Select all $colType rows ($whereExpr)",
+ whereExpr,
+ selectExpr)
+ }
+ }
+
+ def main(args: Array[String]): Unit = {
+ val numRows = 1024 * 1024 * 15
+ val width = 5
+
+ // Pushdown for many distinct value case
+ withTempPath { dir =>
+ val mid = numRows / 2
+
+ withTempTable("orcTable", "parquetTable") {
+ Seq(true, false).foreach { useStringForValue =>
+ prepareTable(dir, numRows, width, useStringForValue)
+ if (useStringForValue) {
+ runStringBenchmark(numRows, width, mid, "string")
+ } else {
+ runIntBenchmark(numRows, width, mid)
+ }
+ }
+ }
+ }
+
+ // Pushdown for few distinct value case (use dictionary encoding)
--- End diff --
The current data stays within the dictionary-encoding threshold. My concern is that the comment becomes invalid if the underlying files are not actually using dictionary encoding. Even if we do not change the format, we still need to update the comment.
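To make the threshold concern concrete, here is a minimal sketch, not taken from the PR, that estimates how large a plain string dictionary would be for a given number of distinct values and compares it with an assumed limit. The 1 MiB limit mirrors what is, as far as I know, the parquet-mr default for `parquet.dictionary.page.size`. The 4-byte length prefix per entry, the object name `DictionarySizeSketch`, and the distinct-value count of 200 are all assumptions for illustration. ORC decides on dictionary encoding by a different rule that this sketch does not model.

```scala
// Rough estimate only: sizes a plain dictionary of distinct string values and
// compares it with an assumed dictionary page limit. Not part of the PR.
object DictionarySizeSketch {
  // Assumption: 1 MiB limit; check the writer configuration actually in use.
  val assumedDictPageLimitBytes: Long = 1024L * 1024L

  // Each entry is costed as an assumed 4-byte length prefix plus its UTF-8 bytes.
  // The values are the strings "0", "1", ..., matching CAST(id % n AS STRING).
  def estimatedDictBytes(numDistinctValues: Int): Long =
    (0 until numDistinctValues).iterator
      .map(v => 4L + v.toString.getBytes("UTF-8").length)
      .sum

  // True when the estimated dictionary stays within the assumed limit.
  def likelyDictionaryEncoded(numDistinctValues: Int): Boolean =
    estimatedDictBytes(numDistinctValues) <= assumedDictPageLimitBytes

  def main(args: Array[String]): Unit = {
    // 200 is a hypothetical distinct-value count; 1024 * 1024 * 15 is the
    // benchmark's numRows, i.e. the "many distinct values" case.
    Seq(200, 1024 * 1024 * 15).foreach { n =>
      println(s"distinct=$n bytes=${estimatedDictBytes(n)} " +
        s"withinLimit=${likelyDictionaryEncoded(n)}")
    }
  }
}
```

If the estimate exceeds the limit for the benchmark's data, the writer may fall back to plain encoding, and the `(use dictionary encoding)` comment would no longer describe the files being read.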
---
[GitHub] spark pull request #21288: [SPARK-24206][SQL] Improve FilterPushdownBenchmar...
Posted by dongjoon-hyun <gi...@git.apache.org>.
Github user dongjoon-hyun commented on a diff in the pull request:
https://github.com/apache/spark/pull/21288#discussion_r189018667
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/FilterPushdownBenchmark.scala ---
@@ -105,138 +128,306 @@ object FilterPushdownBenchmark {
}
/*
- Java HotSpot(TM) 64-Bit Server VM 1.8.0_152-b16 on Mac OS X 10.13.2
- Intel(R) Core(TM) i7-4770HQ CPU @ 2.20GHz
-
- Select 0 row (id IS NULL): Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
- -----------------------------------------------------------------------------------------------
- Parquet Vectorized 7882 / 7957 2.0 501.1 1.0X
- Parquet Vectorized (Pushdown) 55 / 60 285.2 3.5 142.9X
- Native ORC Vectorized 5592 / 5627 2.8 355.5 1.4X
- Native ORC Vectorized (Pushdown) 66 / 70 237.2 4.2 118.9X
-
- Select 0 row (7864320 < id < 7864320): Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
- -----------------------------------------------------------------------------------------------
- Parquet Vectorized 7884 / 7909 2.0 501.2 1.0X
- Parquet Vectorized (Pushdown) 739 / 752 21.3 47.0 10.7X
- Native ORC Vectorized 5614 / 5646 2.8 356.9 1.4X
- Native ORC Vectorized (Pushdown) 81 / 83 195.2 5.1 97.8X
-
- Select 1 row (id = 7864320): Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
- -----------------------------------------------------------------------------------------------
- Parquet Vectorized 7905 / 8027 2.0 502.6 1.0X
- Parquet Vectorized (Pushdown) 740 / 766 21.2 47.1 10.7X
- Native ORC Vectorized 5684 / 5738 2.8 361.4 1.4X
- Native ORC Vectorized (Pushdown) 78 / 81 202.4 4.9 101.7X
-
- Select 1 row (id <=> 7864320): Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
- -----------------------------------------------------------------------------------------------
- Parquet Vectorized 7928 / 7993 2.0 504.1 1.0X
- Parquet Vectorized (Pushdown) 747 / 772 21.0 47.5 10.6X
- Native ORC Vectorized 5728 / 5753 2.7 364.2 1.4X
- Native ORC Vectorized (Pushdown) 76 / 78 207.9 4.8 104.8X
-
- Select 1 row (7864320 <= id <= 7864320):Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
- -----------------------------------------------------------------------------------------------
- Parquet Vectorized 7939 / 8021 2.0 504.8 1.0X
- Parquet Vectorized (Pushdown) 746 / 770 21.1 47.4 10.6X
- Native ORC Vectorized 5690 / 5734 2.8 361.7 1.4X
- Native ORC Vectorized (Pushdown) 76 / 79 206.7 4.8 104.3X
-
- Select 1 row (7864319 < id < 7864321): Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
- -----------------------------------------------------------------------------------------------
- Parquet Vectorized 7972 / 8019 2.0 506.9 1.0X
- Parquet Vectorized (Pushdown) 742 / 764 21.2 47.2 10.7X
- Native ORC Vectorized 5704 / 5743 2.8 362.6 1.4X
- Native ORC Vectorized (Pushdown) 76 / 78 207.9 4.8 105.4X
-
- Select 10% rows (id < 1572864): Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
- -----------------------------------------------------------------------------------------------
- Parquet Vectorized 8733 / 8808 1.8 555.2 1.0X
- Parquet Vectorized (Pushdown) 2213 / 2267 7.1 140.7 3.9X
- Native ORC Vectorized 6420 / 6463 2.4 408.2 1.4X
- Native ORC Vectorized (Pushdown) 1313 / 1331 12.0 83.5 6.7X
-
- Select 50% rows (id < 7864320): Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
- -----------------------------------------------------------------------------------------------
- Parquet Vectorized 11518 / 11591 1.4 732.3 1.0X
- Parquet Vectorized (Pushdown) 7962 / 7991 2.0 506.2 1.4X
- Native ORC Vectorized 8927 / 8985 1.8 567.6 1.3X
- Native ORC Vectorized (Pushdown) 6102 / 6160 2.6 387.9 1.9X
-
- Select 90% rows (id < 14155776): Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
- -----------------------------------------------------------------------------------------------
- Parquet Vectorized 14255 / 14389 1.1 906.3 1.0X
- Parquet Vectorized (Pushdown) 13564 / 13594 1.2 862.4 1.1X
- Native ORC Vectorized 11442 / 11608 1.4 727.5 1.2X
- Native ORC Vectorized (Pushdown) 10991 / 11029 1.4 698.8 1.3X
-
- Select all rows (id IS NOT NULL): Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
- -----------------------------------------------------------------------------------------------
- Parquet Vectorized 14917 / 14938 1.1 948.4 1.0X
- Parquet Vectorized (Pushdown) 14910 / 14964 1.1 948.0 1.0X
- Native ORC Vectorized 11986 / 12069 1.3 762.0 1.2X
- Native ORC Vectorized (Pushdown) 12037 / 12123 1.3 765.3 1.2X
-
- Select all rows (id > -1): Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
- -----------------------------------------------------------------------------------------------
- Parquet Vectorized 14951 / 14976 1.1 950.6 1.0X
- Parquet Vectorized (Pushdown) 14934 / 15016 1.1 949.5 1.0X
- Native ORC Vectorized 12000 / 12156 1.3 763.0 1.2X
- Native ORC Vectorized (Pushdown) 12079 / 12113 1.3 767.9 1.2X
-
- Select all rows (id != -1): Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
- -----------------------------------------------------------------------------------------------
- Parquet Vectorized 14930 / 14972 1.1 949.3 1.0X
- Parquet Vectorized (Pushdown) 15015 / 15047 1.0 954.6 1.0X
- Native ORC Vectorized 12090 / 12259 1.3 768.7 1.2X
- Native ORC Vectorized (Pushdown) 12021 / 12096 1.3 764.2 1.2X
+ Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz
+ Select 0 string row (value IS NULL): Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
+ ------------------------------------------------------------------------------------------------
+ Parquet Vectorized 8452 / 8504 1.9 537.3 1.0X
+ Parquet Vectorized (Pushdown) 274 / 281 57.3 17.4 30.8X
--- End diff --
Hi, @maropu.
Thank you for updating the results with the new Parquet 1.10.
Could you describe your EC2 environment and the steps you ran in more detail in the PR description?
I'm trying to reproduce this, but on my Mac the results don't show the same pattern.
---
[GitHub] spark issue #21288: [SPARK-24206][SQL] Improve FilterPushdownBenchmark bench...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/21288
**[Test build #90571 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90571/testReport)** for PR 21288 at commit [`4520044`](https://github.com/apache/spark/commit/4520044d3be40ba8bf963a151db2dd9769c0f59a).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
---
[GitHub] spark issue #21288: [SPARK-24206][SQL] Improve FilterPushdownBenchmark bench...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/21288
Merged build finished. Test PASSed.
---
[GitHub] spark pull request #21288: [SPARK-24206][SQL] Improve FilterPushdownBenchmar...
Posted by maropu <gi...@git.apache.org>.
Github user maropu commented on a diff in the pull request:
https://github.com/apache/spark/pull/21288#discussion_r191109472
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/FilterPushdownBenchmark.scala ---
@@ -0,0 +1,437 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.benchmark
+
+import java.io.File
+
+import scala.util.{Random, Try}
+
+import org.apache.spark.SparkConf
+import org.apache.spark.sql.{DataFrame, SparkSession}
+import org.apache.spark.sql.functions.monotonically_increasing_id
+import org.apache.spark.sql.internal.SQLConf
+import org.apache.spark.util.{Benchmark, Utils}
+
+
+/**
+ * Benchmark to measure read performance with Filter pushdown.
+ * To run this:
+ * spark-submit --class <this class> <spark sql test jar>
+ */
+object FilterPushdownBenchmark {
+ val conf = new SparkConf()
+ .setAppName("FilterPushdownBenchmark")
+ .setIfMissing("spark.master", "local[1]")
+ .setIfMissing("spark.driver.memory", "3g")
+ .setIfMissing("spark.executor.memory", "3g")
+ .setIfMissing("orc.compression", "snappy")
+ .setIfMissing("spark.sql.parquet.compression.codec", "snappy")
+
+ private val spark = SparkSession.builder().config(conf).getOrCreate()
+
+ def withTempPath(f: File => Unit): Unit = {
+ val path = Utils.createTempDir()
+ path.delete()
+ try f(path) finally Utils.deleteRecursively(path)
+ }
+
+ def withTempTable(tableNames: String*)(f: => Unit): Unit = {
+ try f finally tableNames.foreach(spark.catalog.dropTempView)
+ }
+
+ def withSQLConf(pairs: (String, String)*)(f: => Unit): Unit = {
+ val (keys, values) = pairs.unzip
+ val currentValues = keys.map(key => Try(spark.conf.get(key)).toOption)
+ (keys, values).zipped.foreach(spark.conf.set)
+ try f finally {
+ keys.zip(currentValues).foreach {
+ case (key, Some(value)) => spark.conf.set(key, value)
+ case (key, None) => spark.conf.unset(key)
+ }
+ }
+ }
+
+ private def prepareTable(
+ dir: File, numRows: Int, width: Int, useStringForValue: Boolean): Unit = {
+ import spark.implicits._
+ val selectExpr = (1 to width).map(i => s"CAST(value AS STRING) c$i")
+ val valueCol = if (useStringForValue) {
+ monotonically_increasing_id().cast("string")
+ } else {
+ monotonically_increasing_id()
+ }
+ val df = spark.range(numRows).map(_ => Random.nextLong).selectExpr(selectExpr: _*)
+ .withColumn("value", valueCol)
+ .sort("value")
+
+ saveAsOrcTable(df, dir.getCanonicalPath + "/orc")
+ saveAsParquetTable(df, dir.getCanonicalPath + "/parquet")
+ }
+
+ private def prepareStringDictTable(
+ dir: File, numRows: Int, numDistinctValues: Int, width: Int): Unit = {
+ val selectExpr = (0 to width).map {
+ case 0 => s"CAST(id % $numDistinctValues AS STRING) AS value"
+ case i => s"CAST(rand() AS STRING) c$i"
+ }
+ val df = spark.range(numRows).selectExpr(selectExpr: _*).sort("value")
+
+ saveAsOrcTable(df, dir.getCanonicalPath + "/orc")
+ saveAsParquetTable(df, dir.getCanonicalPath + "/parquet")
+ }
+
+ private def saveAsOrcTable(df: DataFrame, dir: String): Unit = {
+ df.write.mode("overwrite").orc(dir)
+ spark.read.orc(dir).createOrReplaceTempView("orcTable")
+ }
+
+ private def saveAsParquetTable(df: DataFrame, dir: String): Unit = {
+ df.write.mode("overwrite").parquet(dir)
+ spark.read.parquet(dir).createOrReplaceTempView("parquetTable")
+ }
+
+ def filterPushDownBenchmark(
+ values: Int,
+ title: String,
+ whereExpr: String,
+ selectExpr: String = "*"): Unit = {
+ val benchmark = new Benchmark(title, values, minNumIters = 5)
+
+ Seq(false, true).foreach { pushDownEnabled =>
+ val name = s"Parquet Vectorized ${if (pushDownEnabled) s"(Pushdown)" else ""}"
+ benchmark.addCase(name) { _ =>
+ withSQLConf(SQLConf.PARQUET_FILTER_PUSHDOWN_ENABLED.key -> s"$pushDownEnabled") {
+ spark.sql(s"SELECT $selectExpr FROM parquetTable WHERE $whereExpr").collect()
+ }
+ }
+ }
+
+ Seq(false, true).foreach { pushDownEnabled =>
+ val name = s"Native ORC Vectorized ${if (pushDownEnabled) s"(Pushdown)" else ""}"
+ benchmark.addCase(name) { _ =>
+ withSQLConf(SQLConf.ORC_FILTER_PUSHDOWN_ENABLED.key -> s"$pushDownEnabled") {
+ spark.sql(s"SELECT $selectExpr FROM orcTable WHERE $whereExpr").collect()
+ }
+ }
+ }
+
+ /*
+ Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz
+ Select 0 string row (value IS NULL): Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
+ ------------------------------------------------------------------------------------------------
+ Parquet Vectorized 8452 / 8504 1.9 537.3 1.0X
+ Parquet Vectorized (Pushdown) 274 / 281 57.3 17.4 30.8X
+ Native ORC Vectorized 8167 / 8185 1.9 519.3 1.0X
+ Native ORC Vectorized (Pushdown) 365 / 379 43.1 23.2 23.1X
+
+
+ Select 0 string row
+ ('7864320' < value < '7864320'): Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
+ ------------------------------------------------------------------------------------------------
+ Parquet Vectorized 8532 / 8564 1.8 542.4 1.0X
+ Parquet Vectorized (Pushdown) 366 / 386 43.0 23.3 23.3X
+ Native ORC Vectorized 8289 / 8300 1.9 527.0 1.0X
+ Native ORC Vectorized (Pushdown) 378 / 385 41.6 24.0 22.6X
+
+
+ Select 1 string row (value = '7864320'): Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
+ ------------------------------------------------------------------------------------------------
+ Parquet Vectorized 8547 / 8564 1.8 543.4 1.0X
+ Parquet Vectorized (Pushdown) 351 / 356 44.9 22.3 24.4X
+ Native ORC Vectorized 8310 / 8323 1.9 528.3 1.0X
+ Native ORC Vectorized (Pushdown) 370 / 375 42.5 23.5 23.1X
+
+
+ Select 1 string row
+ (value <=> '7864320'): Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
+ ------------------------------------------------------------------------------------------------
+ Parquet Vectorized 8537 / 8563 1.8 542.8 1.0X
+ Parquet Vectorized (Pushdown) 310 / 319 50.7 19.7 27.5X
+ Native ORC Vectorized 8316 / 8335 1.9 528.7 1.0X
+ Native ORC Vectorized (Pushdown) 364 / 367 43.2 23.1 23.5X
+
+
+ Select 1 string row
+ ('7864320' <= value <= '7864320'): Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
+ ------------------------------------------------------------------------------------------------
+ Parquet Vectorized 8594 / 8607 1.8 546.4 1.0X
+ Parquet Vectorized (Pushdown) 370 / 374 42.5 23.5 23.2X
+ Native ORC Vectorized 8350 / 8358 1.9 530.9 1.0X
+ Native ORC Vectorized (Pushdown) 371 / 374 42.4 23.6 23.2X
+
+
+ Select all string rows
+ (value IS NOT NULL): Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
+ ------------------------------------------------------------------------------------------------
+ Parquet Vectorized 19601 / 19625 0.8 1246.2 1.0X
+ Parquet Vectorized (Pushdown) 19698 / 19703 0.8 1252.3 1.0X
+ Native ORC Vectorized 19435 / 19470 0.8 1235.6 1.0X
+ Native ORC Vectorized (Pushdown) 19568 / 19590 0.8 1244.1 1.0X
+
+
+ Select 0 int row (value IS NULL): Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
+ ------------------------------------------------------------------------------------------------
+ Parquet Vectorized 7815 / 7824 2.0 496.9 1.0X
+ Parquet Vectorized (Pushdown) 245 / 251 64.2 15.6 31.9X
+ Native ORC Vectorized 7436 / 7460 2.1 472.8 1.1X
+ Native ORC Vectorized (Pushdown) 344 / 351 45.7 21.9 22.7X
+
+
+ Select 0 int row
+ (7864320 < value < 7864320): Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
+ ------------------------------------------------------------------------------------------------
+ Parquet Vectorized 7792 / 7807 2.0 495.4 1.0X
+ Parquet Vectorized (Pushdown) 349 / 353 45.1 22.2 22.3X
+ Native ORC Vectorized 7451 / 7465 2.1 473.7 1.0X
+ Native ORC Vectorized (Pushdown) 365 / 368 43.0 23.2 21.3X
+
+
+ Select 1 int row (value = 7864320): Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
+ ------------------------------------------------------------------------------------------------
+ Parquet Vectorized 7836 / 7872 2.0 498.2 1.0X
+ Parquet Vectorized (Pushdown) 322 / 327 48.8 20.5 24.3X
+ Native ORC Vectorized 7533 / 7540 2.1 478.9 1.0X
+ Native ORC Vectorized (Pushdown) 358 / 363 43.9 22.8 21.9X
+
+
+ Select 1 int row (value <=> 7864320): Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
+ ------------------------------------------------------------------------------------------------
+ Parquet Vectorized 7855 / 7870 2.0 499.4 1.0X
+ Parquet Vectorized (Pushdown) 286 / 297 54.9 18.2 27.4X
+ Native ORC Vectorized 7511 / 7557 2.1 477.5 1.0X
+ Native ORC Vectorized (Pushdown) 358 / 361 43.9 22.8 21.9X
+
+
+ Select 1 int row
+ (7864320 <= value <= 7864320): Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
+ ------------------------------------------------------------------------------------------------
+ Parquet Vectorized 7851 / 7870 2.0 499.2 1.0X
+ Parquet Vectorized (Pushdown) 345 / 347 45.6 21.9 22.8X
+ Native ORC Vectorized 7543 / 7554 2.1 479.6 1.0X
+ Native ORC Vectorized (Pushdown) 364 / 374 43.2 23.1 21.6X
+
+
+ Select 1 int row
+ (7864319 < value < 7864321): Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
+ ------------------------------------------------------------------------------------------------
+ Parquet Vectorized 7837 / 7840 2.0 498.2 1.0X
+ Parquet Vectorized (Pushdown) 338 / 339 46.6 21.5 23.2X
+ Native ORC Vectorized 7524 / 7541 2.1 478.3 1.0X
+ Native ORC Vectorized (Pushdown) 361 / 364 43.6 22.9 21.7X
+
+
+ Select 10% int rows (value < 1572864): Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
+ ------------------------------------------------------------------------------------------------
+ Parquet Vectorized 8864 / 8900 1.8 563.5 1.0X
+ Parquet Vectorized (Pushdown) 2088 / 2095 7.5 132.7 4.2X
+ Native ORC Vectorized 8562 / 8579 1.8 544.3 1.0X
+ Native ORC Vectorized (Pushdown) 2127 / 2131 7.4 135.2 4.2X
+
+
+ Select 50% int rows (value < 7864320): Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
+ ------------------------------------------------------------------------------------------------
+ Parquet Vectorized 12671 / 12684 1.2 805.6 1.0X
+ Parquet Vectorized (Pushdown) 9032 / 9041 1.7 574.2 1.4X
+ Native ORC Vectorized 12388 / 12411 1.3 787.6 1.0X
+ Native ORC Vectorized (Pushdown) 8873 / 8884 1.8 564.1 1.4X
+
+
+ Select 90% int rows (value < 14155776): Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
+ ------------------------------------------------------------------------------------------------
+ Parquet Vectorized 16481 / 16495 1.0 1047.8 1.0X
+ Parquet Vectorized (Pushdown) 15906 / 15919 1.0 1011.3 1.0X
+ Native ORC Vectorized 16224 / 16254 1.0 1031.5 1.0X
+ Native ORC Vectorized (Pushdown) 15632 / 15661 1.0 993.9 1.1X
+
+
+ Select all int rows (value IS NOT NULL): Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
+ ------------------------------------------------------------------------------------------------
+ Parquet Vectorized 17341 / 17354 0.9 1102.5 1.0X
+ Parquet Vectorized (Pushdown) 17463 / 17481 0.9 1110.2 1.0X
+ Native ORC Vectorized 17073 / 17089 0.9 1085.4 1.0X
+ Native ORC Vectorized (Pushdown) 17194 / 17232 0.9 1093.2 1.0X
+
+
+ Select all int rows (value > -1): Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
+ ------------------------------------------------------------------------------------------------
+ Parquet Vectorized 17452 / 17467 0.9 1109.6 1.0X
+ Parquet Vectorized (Pushdown) 17613 / 17630 0.9 1119.8 1.0X
+ Native ORC Vectorized 17259 / 17271 0.9 1097.3 1.0X
+ Native ORC Vectorized (Pushdown) 17385 / 17429 0.9 1105.3 1.0X
+
+
+ Select all int rows (value != -1): Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
+ ------------------------------------------------------------------------------------------------
+ Parquet Vectorized 17363 / 17372 0.9 1103.9 1.0X
+ Parquet Vectorized (Pushdown) 17526 / 17535 0.9 1114.2 1.0X
+ Native ORC Vectorized 17052 / 17089 0.9 1084.2 1.0X
+ Native ORC Vectorized (Pushdown) 17209 / 17229 0.9 1094.1 1.0X
+
+
+ Select 0 distinct string row
+ (value IS NULL): Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
+ ------------------------------------------------------------------------------------------------
+ Parquet Vectorized 7697 / 7751 2.0 489.4 1.0X
+ Parquet Vectorized (Pushdown) 264 / 284 59.5 16.8 29.1X
+ Native ORC Vectorized 6942 / 6970 2.3 441.4 1.1X
+ Native ORC Vectorized (Pushdown) 372 / 381 42.3 23.7 20.7X
+
+
+ Select 0 distinct string row
+ ('100' < value < '100'): Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
+ ------------------------------------------------------------------------------------------------
+ Parquet Vectorized 7983 / 8018 2.0 507.5 1.0X
+ Parquet Vectorized (Pushdown) 334 / 337 47.0 21.3 23.9X
+ Native ORC Vectorized 7307 / 7313 2.2 464.5 1.1X
+ Native ORC Vectorized (Pushdown) 363 / 371 43.3 23.1 22.0X
+
+
+ Select 1 distinct string row
+ (value = '100'): Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
+ ------------------------------------------------------------------------------------------------
+ Parquet Vectorized 7882 / 7915 2.0 501.1 1.0X
+ Parquet Vectorized (Pushdown) 504 / 522 31.2 32.1 15.6X
+ Native ORC Vectorized 7143 / 7155 2.2 454.1 1.1X
+ Native ORC Vectorized (Pushdown) 555 / 573 28.4 35.3 14.2X
+
+
+ Select 1 distinct string row
+ (value <=> '100'): Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
+ ------------------------------------------------------------------------------------------------
+ Parquet Vectorized 7898 / 7912 2.0 502.1 1.0X
+ Parquet Vectorized (Pushdown) 470 / 481 33.5 29.9 16.8X
+ Native ORC Vectorized 7135 / 7149 2.2 453.6 1.1X
+ Native ORC Vectorized (Pushdown) 552 / 557 28.5 35.1 14.3X
+
+
+ Select 1 distinct string row
+ ('100' <= value <= '100'): Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
+ ------------------------------------------------------------------------------------------------
+ Parquet Vectorized 8189 / 8213 1.9 520.7 1.0X
+ Parquet Vectorized (Pushdown) 527 / 534 29.9 33.5 15.5X
+ Native ORC Vectorized 7477 / 7498 2.1 475.3 1.1X
+ Native ORC Vectorized (Pushdown) 558 / 566 28.2 35.5 14.7X
+
+
+ Select all distinct string rows
+ (value IS NOT NULL): Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
+ ------------------------------------------------------------------------------------------------
+ Parquet Vectorized 19462 / 19476 0.8 1237.4 1.0X
+ Parquet Vectorized (Pushdown) 19570 / 19582 0.8 1244.2 1.0X
+ Native ORC Vectorized 18577 / 18604 0.8 1181.1 1.0X
+ Native ORC Vectorized (Pushdown) 18701 / 18742 0.8 1189.0 1.0X
+ */
+ benchmark.run()
+ }
+
+ private def runIntBenchmark(numRows: Int, width: Int, mid: Int): Unit = {
+ Seq("value IS NULL", s"$mid < value AND value < $mid").foreach { whereExpr =>
+ val title = s"Select 0 int row ($whereExpr)".replace("value AND value", "value")
+ filterPushDownBenchmark(numRows, title, whereExpr)
+ }
+
+ Seq(
+ s"value = $mid",
+ s"value <=> $mid",
+ s"$mid <= value AND value <= $mid",
+ s"${mid - 1} < value AND value < ${mid + 1}"
+ ).foreach { whereExpr =>
+ val title = s"Select 1 int row ($whereExpr)".replace("value AND value", "value")
+ filterPushDownBenchmark(numRows, title, whereExpr)
+ }
+
+ val selectExpr = (1 to width).map(i => s"MAX(c$i)").mkString("", ",", ", MAX(value)")
+
+ Seq(10, 50, 90).foreach { percent =>
+ filterPushDownBenchmark(
+ numRows,
+ s"Select $percent% int rows (value < ${numRows * percent / 100})",
+ s"value < ${numRows * percent / 100}",
+ selectExpr
+ )
+ }
+
+ Seq("value IS NOT NULL", "value > -1", "value != -1").foreach { whereExpr =>
+ filterPushDownBenchmark(
+ numRows,
+ s"Select all int rows ($whereExpr)",
+ whereExpr,
+ selectExpr)
+ }
+ }
+
+ private def runStringBenchmark(
+ numRows: Int, width: Int, searchValue: Int, colType: String): Unit = {
+ Seq("value IS NULL", s"'$searchValue' < value AND value < '$searchValue'")
+ .foreach { whereExpr =>
+ val title = s"Select 0 $colType row ($whereExpr)".replace("value AND value", "value")
+ filterPushDownBenchmark(numRows, title, whereExpr)
+ }
+
+ Seq(
+ s"value = '$searchValue'",
+ s"value <=> '$searchValue'",
+ s"'$searchValue' <= value AND value <= '$searchValue'"
+ ).foreach { whereExpr =>
+ val title = s"Select 1 $colType row ($whereExpr)".replace("value AND value", "value")
+ filterPushDownBenchmark(numRows, title, whereExpr)
+ }
+
+ val selectExpr = (1 to width).map(i => s"MAX(c$i)").mkString("", ",", ", MAX(value)")
+
+ Seq("value IS NOT NULL").foreach { whereExpr =>
+ filterPushDownBenchmark(
+ numRows,
+ s"Select all $colType rows ($whereExpr)",
+ whereExpr,
+ selectExpr)
+ }
+ }
+
+ def main(args: Array[String]): Unit = {
+ val numRows = 1024 * 1024 * 15
+ val width = 5
+
+ // Pushdown for many distinct value case
+ withTempPath { dir =>
+ val mid = numRows / 2
+
+ withTempTable("orcTable", "parquetTable") {
+ Seq(true, false).foreach { useStringForValue =>
+ prepareTable(dir, numRows, width, useStringForValue)
+ if (useStringForValue) {
+ runStringBenchmark(numRows, width, mid, "string")
+ } else {
+ runIntBenchmark(numRows, width, mid)
+ }
+ }
+ }
+ }
+
+ // Pushdown for few distinct value case (use dictionary encoding)
--- End diff --
ok
---
[GitHub] spark issue #21288: [SPARK-24206][SQL] Improve FilterPushdownBenchmark bench...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/21288
Merged build finished. Test PASSed.
---
[GitHub] spark issue #21288: [SPARK-24206][SQL] Improve FilterPushdownBenchmark bench...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/21288
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/91210/
Test FAILed.
---
[GitHub] spark issue #21288: [SPARK-24206][SQL] Improve FilterPushdownBenchmark bench...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/21288
Merged build finished. Test PASSed.
---
[GitHub] spark issue #21288: [SPARK-24206][SQL] Improve FilterPushdownBenchmark bench...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/21288
Merged build finished. Test PASSed.
---
[GitHub] spark issue #21288: [SPARK-24206][SQL] Improve FilterPushdownBenchmark bench...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/21288
**[Test build #91821 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91821/testReport)** for PR 21288 at commit [`d3dd504`](https://github.com/apache/spark/commit/d3dd50463c2b91ae8800dbcc811dcc52880a02ca).
---
[GitHub] spark issue #21288: [SPARK-24206][SQL] Improve FilterPushdownBenchmark bench...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/21288
Merged build finished. Test PASSed.
---
[GitHub] spark issue #21288: [SPARK-24206][SQL] Improve FilterPushdownBenchmark bench...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/21288
Merged build finished. Test PASSed.
---
[GitHub] spark issue #21288: [SPARK-24206][SQL] Improve FilterPushdownBenchmark bench...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/21288
**[Test build #91946 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91946/testReport)** for PR 21288 at commit [`4a9cec9`](https://github.com/apache/spark/commit/4a9cec91f9446161d4dde0cac20ccdccb9a112e7).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
---
[GitHub] spark pull request #21288: [SPARK-24206][SQL] Improve FilterPushdownBenchmar...
Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:
https://github.com/apache/spark/pull/21288#discussion_r191610297
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/FilterPushdownBenchmark.scala ---
@@ -131,211 +132,214 @@ object FilterPushdownBenchmark {
}
/*
+ OpenJDK 64-Bit Server VM 1.8.0_171-b10 on Linux 4.14.26-46.32.amzn1.x86_64
Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz
Select 0 string row (value IS NULL): Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------
- Parquet Vectorized 8452 / 8504 1.9 537.3 1.0X
- Parquet Vectorized (Pushdown) 274 / 281 57.3 17.4 30.8X
- Native ORC Vectorized 8167 / 8185 1.9 519.3 1.0X
- Native ORC Vectorized (Pushdown) 365 / 379 43.1 23.2 23.1X
+ Parquet Vectorized 2961 / 3123 5.3 188.3 1.0X
+ Parquet Vectorized (Pushdown) 3057 / 3121 5.1 194.4 1.0X
--- End diff --
I have not tried it yet, but is it related to the recent change we made in the parquet reader?
---
[GitHub] spark issue #21288: [SPARK-24206][SQL] Improve FilterPushdownBenchmark bench...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/21288
**[Test build #90454 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90454/testReport)** for PR 21288 at commit [`8f60902`](https://github.com/apache/spark/commit/8f609023174c9f97bddc46bebe98f4ce3caf08c5).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
---
[GitHub] spark issue #21288: [SPARK-24206][SQL] Improve FilterPushdownBenchmark bench...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/21288
**[Test build #91210 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91210/testReport)** for PR 21288 at commit [`b7859ed`](https://github.com/apache/spark/commit/b7859ed0905ce3e0476e5d327f65798acc7aba8c).
---
[GitHub] spark pull request #21288: [SPARK-24206][SQL] Improve FilterPushdownBenchmar...
Posted by dongjoon-hyun <gi...@git.apache.org>.
Github user dongjoon-hyun commented on a diff in the pull request:
https://github.com/apache/spark/pull/21288#discussion_r195948346
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/FilterPushdownBenchmark.scala ---
@@ -0,0 +1,442 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.benchmark
+
+import java.io.File
+
+import scala.util.{Random, Try}
+
+import org.apache.spark.SparkConf
+import org.apache.spark.sql.{DataFrame, SparkSession}
+import org.apache.spark.sql.functions.monotonically_increasing_id
+import org.apache.spark.sql.internal.SQLConf
+import org.apache.spark.util.{Benchmark, Utils}
+
+
+/**
+ * Benchmark to measure read performance with Filter pushdown.
+ * To run this:
+ * spark-submit --class <this class> <spark sql test jar>
+ */
+object FilterPushdownBenchmark {
+ val conf = new SparkConf()
+ .setAppName("FilterPushdownBenchmark")
+ // Since `spark.master` always exists, overrides this value
+ .set("spark.master", "local[1]")
--- End diff --
What I mean is adding `--master local[1]` at line 34, too.
---
[GitHub] spark issue #21288: [SPARK-24206][SQL] Improve FilterPushdownBenchmark bench...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/21288
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/4037/
---
[GitHub] spark issue #21288: [SPARK-24206][SQL] Improve FilterPushdownBenchmark bench...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/21288
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/90883/
---
[GitHub] spark issue #21288: [SPARK-24206][SQL] Improve FilterPushdownBenchmark bench...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/21288
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/90904/
---
[GitHub] spark issue #21288: [SPARK-24206][SQL] Improve FilterPushdownBenchmark bench...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/21288
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/220/
---
[GitHub] spark issue #21288: [SPARK-24206][SQL] Improve FilterPushdownBenchmark bench...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/21288
Merged build finished. Test PASSed.
---
[GitHub] spark issue #21288: [SPARK-24206][SQL] Improve FilterPushdownBenchmark bench...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/21288
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/122/
---
[GitHub] spark issue #21288: [SPARK-24206][SQL] Improve FilterPushdownBenchmark bench...
Posted by maropu <gi...@git.apache.org>.
Github user maropu commented on the issue:
https://github.com/apache/spark/pull/21288
@dongjoon-hyun I got the same result under the same conditions (enough memory), but with `--driver-memory 3g` (less memory) I got slightly different results:
```
// --driver-memory=3g (default)
OpenJDK 64-Bit Server VM 1.8.0_171-b10 on Linux 4.14.33-51.37.amzn1.x86_64
Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
Select 0 string row ('7864320' < value < '7864320'): Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
Parquet Vectorized                          10084 / 10154          1.6         641.1       1.0X
Parquet Vectorized (Pushdown)                  967 / 1008         16.3          61.5      10.4X
Native ORC Vectorized                       11088 / 11116          1.4         705.0       0.9X
Native ORC Vectorized (Pushdown)               270 /  278         58.2          17.2      37.3X
Select 1 string row (value = '7864320'): Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
Parquet Vectorized                          10032 / 10085          1.6         637.8       1.0X
Parquet Vectorized (Pushdown)                  959 /  998         16.4          61.0      10.5X
Native ORC Vectorized                       11104 / 11128          1.4         706.0       0.9X
Native ORC Vectorized (Pushdown)               259 /  277         60.6          16.5      38.7X
...
// --driver-memory=10g
OpenJDK 64-Bit Server VM 1.8.0_171-b10 on Linux 4.14.33-51.37.amzn1.x86_64
Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
Select 0 string row (value IS NULL):     Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
Parquet Vectorized                            9201 / 9300          1.7         585.0       1.0X
Parquet Vectorized (Pushdown)                   89 /  105        176.3           5.7     103.1X
Native ORC Vectorized                         8886 / 8898          1.8         564.9       1.0X
Native ORC Vectorized (Pushdown)               110 /  128        143.4           7.0      83.9X
Select 0 string row ('7864320' < value < '7864320'): Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
Parquet Vectorized                            9336 / 9357          1.7         593.6       1.0X
Parquet Vectorized (Pushdown)                  927 /  937         17.0          58.9      10.1X
Native ORC Vectorized                         9026 / 9041          1.7         573.9       1.0X
Native ORC Vectorized (Pushdown)               257 /  272         61.1          16.4      36.3X
...
```
Does Parquet have a smaller memory footprint? I'm currently looking into this (I updated the results for the case with enough memory).
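For anyone cross-checking these tables, the derived columns can be roughly recomputed from the "Best" time alone. The sketch below is not the `Benchmark` utility itself. It assumes the table holds 15728640 rows, a value inferred from the "Select 50% rows (id < 7864320)" case rather than stated in this thread, and the names `DerivedColumns`, `Row`, `rateMPerSec`, `perRowNs`, and `relative` are made up for illustration. Because the printed times are already rounded to whole milliseconds, the recomputed values may differ from the tables in the last digit.

```scala
// Sketch only: recompute Rate(M/s), Per Row(ns) and Relative from "Best" times.
object DerivedColumns {
  // Assumption: total row count, inferred from "Select 50% rows (id < 7864320)".
  val numRows: Long = 15728640L

  // One table line: case name and its best wall-clock time in milliseconds.
  final case class Row(name: String, bestMs: Double)

  /** Throughput in millions of rows per second. */
  def rateMPerSec(r: Row): Double = numRows / (r.bestMs * 1000.0)

  /** Average time spent per row, in nanoseconds. */
  def perRowNs(r: Row): Double = r.bestMs * 1e6 / numRows

  /** Speedup relative to the first (baseline) line of the table. */
  def relative(baseline: Row, r: Row): Double = baseline.bestMs / r.bestMs

  def main(args: Array[String]): Unit = {
    // Best times copied from the 3g run, predicate '7864320' < value < '7864320'.
    val rows = Seq(
      Row("Parquet Vectorized", 10084.0),
      Row("Parquet Vectorized (Pushdown)", 967.0),
      Row("Native ORC Vectorized", 11088.0),
      Row("Native ORC Vectorized (Pushdown)", 270.0)
    )
    val baseline = rows.head
    rows.foreach { r =>
      println(f"${r.name}%-34s ${rateMPerSec(r)}%7.1f ${perRowNs(r)}%8.1f ${relative(baseline, r)}%6.1fX")
    }
  }
}
```

Swapping in the best times from the 10g run makes the comparison concrete: the pushdown rows barely move, while both non-pushdown baselines get faster with more memory.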
---
[GitHub] spark issue #21288: [SPARK-24206][SQL] Improve FilterPushdownBenchmark bench...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/21288
**[Test build #90571 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90571/testReport)** for PR 21288 at commit [`4520044`](https://github.com/apache/spark/commit/4520044d3be40ba8bf963a151db2dd9769c0f59a).
---
[GitHub] spark pull request #21288: [SPARK-24206][SQL] Improve FilterPushdownBenchmar...
Posted by dongjoon-hyun <gi...@git.apache.org>.
Github user dongjoon-hyun commented on a diff in the pull request:
https://github.com/apache/spark/pull/21288#discussion_r195305634
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/FilterPushdownBenchmark.scala ---
@@ -131,211 +132,214 @@ object FilterPushdownBenchmark {
}
/*
+ OpenJDK 64-Bit Server VM 1.8.0_171-b10 on Linux 4.14.26-46.32.amzn1.x86_64
Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz
Select 0 string row (value IS NULL): Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------
- Parquet Vectorized 8452 / 8504 1.9 537.3 1.0X
- Parquet Vectorized (Pushdown) 274 / 281 57.3 17.4 30.8X
- Native ORC Vectorized 8167 / 8185 1.9 519.3 1.0X
- Native ORC Vectorized (Pushdown) 365 / 379 43.1 23.2 23.1X
+ Parquet Vectorized 2961 / 3123 5.3 188.3 1.0X
+ Parquet Vectorized (Pushdown) 3057 / 3121 5.1 194.4 1.0X
--- End diff --
Thank you for updating, @maropu .
---
[GitHub] spark issue #21288: [SPARK-24206][SQL] Improve FilterPushdownBenchmark bench...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/21288
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/91211/
---
[GitHub] spark issue #21288: [SPARK-24206][SQL] Improve FilterPushdownBenchmark bench...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/21288
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/91914/
---
[GitHub] spark issue #21288: [SPARK-24206][SQL] Improve FilterPushdownBenchmark bench...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/21288
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/3644/
---
[GitHub] spark pull request #21288: [SPARK-24206][SQL] Improve FilterPushdownBenchmar...
Posted by dongjoon-hyun <gi...@git.apache.org>.
Github user dongjoon-hyun commented on a diff in the pull request:
https://github.com/apache/spark/pull/21288#discussion_r189019277
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/FilterPushdownBenchmark.scala ---
@@ -105,138 +128,306 @@ object FilterPushdownBenchmark {
}
/*
- Java HotSpot(TM) 64-Bit Server VM 1.8.0_152-b16 on Mac OS X 10.13.2
- Intel(R) Core(TM) i7-4770HQ CPU @ 2.20GHz
-
- Select 0 row (id IS NULL): Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
- -----------------------------------------------------------------------------------------------
- Parquet Vectorized 7882 / 7957 2.0 501.1 1.0X
- Parquet Vectorized (Pushdown) 55 / 60 285.2 3.5 142.9X
- Native ORC Vectorized 5592 / 5627 2.8 355.5 1.4X
- Native ORC Vectorized (Pushdown) 66 / 70 237.2 4.2 118.9X
-
- Select 0 row (7864320 < id < 7864320): Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
- -----------------------------------------------------------------------------------------------
- Parquet Vectorized 7884 / 7909 2.0 501.2 1.0X
- Parquet Vectorized (Pushdown) 739 / 752 21.3 47.0 10.7X
- Native ORC Vectorized 5614 / 5646 2.8 356.9 1.4X
- Native ORC Vectorized (Pushdown) 81 / 83 195.2 5.1 97.8X
-
- Select 1 row (id = 7864320): Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
- -----------------------------------------------------------------------------------------------
- Parquet Vectorized 7905 / 8027 2.0 502.6 1.0X
- Parquet Vectorized (Pushdown) 740 / 766 21.2 47.1 10.7X
- Native ORC Vectorized 5684 / 5738 2.8 361.4 1.4X
- Native ORC Vectorized (Pushdown) 78 / 81 202.4 4.9 101.7X
-
- Select 1 row (id <=> 7864320): Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
- -----------------------------------------------------------------------------------------------
- Parquet Vectorized 7928 / 7993 2.0 504.1 1.0X
- Parquet Vectorized (Pushdown) 747 / 772 21.0 47.5 10.6X
- Native ORC Vectorized 5728 / 5753 2.7 364.2 1.4X
- Native ORC Vectorized (Pushdown) 76 / 78 207.9 4.8 104.8X
-
- Select 1 row (7864320 <= id <= 7864320):Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
- -----------------------------------------------------------------------------------------------
- Parquet Vectorized 7939 / 8021 2.0 504.8 1.0X
- Parquet Vectorized (Pushdown) 746 / 770 21.1 47.4 10.6X
- Native ORC Vectorized 5690 / 5734 2.8 361.7 1.4X
- Native ORC Vectorized (Pushdown) 76 / 79 206.7 4.8 104.3X
-
- Select 1 row (7864319 < id < 7864321): Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
- -----------------------------------------------------------------------------------------------
- Parquet Vectorized 7972 / 8019 2.0 506.9 1.0X
- Parquet Vectorized (Pushdown) 742 / 764 21.2 47.2 10.7X
- Native ORC Vectorized 5704 / 5743 2.8 362.6 1.4X
- Native ORC Vectorized (Pushdown) 76 / 78 207.9 4.8 105.4X
-
- Select 10% rows (id < 1572864): Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
- -----------------------------------------------------------------------------------------------
- Parquet Vectorized 8733 / 8808 1.8 555.2 1.0X
- Parquet Vectorized (Pushdown) 2213 / 2267 7.1 140.7 3.9X
- Native ORC Vectorized 6420 / 6463 2.4 408.2 1.4X
- Native ORC Vectorized (Pushdown) 1313 / 1331 12.0 83.5 6.7X
-
- Select 50% rows (id < 7864320): Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
- -----------------------------------------------------------------------------------------------
- Parquet Vectorized 11518 / 11591 1.4 732.3 1.0X
- Parquet Vectorized (Pushdown) 7962 / 7991 2.0 506.2 1.4X
- Native ORC Vectorized 8927 / 8985 1.8 567.6 1.3X
- Native ORC Vectorized (Pushdown) 6102 / 6160 2.6 387.9 1.9X
-
- Select 90% rows (id < 14155776): Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
- -----------------------------------------------------------------------------------------------
- Parquet Vectorized 14255 / 14389 1.1 906.3 1.0X
- Parquet Vectorized (Pushdown) 13564 / 13594 1.2 862.4 1.1X
- Native ORC Vectorized 11442 / 11608 1.4 727.5 1.2X
- Native ORC Vectorized (Pushdown) 10991 / 11029 1.4 698.8 1.3X
-
- Select all rows (id IS NOT NULL): Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
- -----------------------------------------------------------------------------------------------
- Parquet Vectorized 14917 / 14938 1.1 948.4 1.0X
- Parquet Vectorized (Pushdown) 14910 / 14964 1.1 948.0 1.0X
- Native ORC Vectorized 11986 / 12069 1.3 762.0 1.2X
- Native ORC Vectorized (Pushdown) 12037 / 12123 1.3 765.3 1.2X
-
- Select all rows (id > -1): Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
- -----------------------------------------------------------------------------------------------
- Parquet Vectorized 14951 / 14976 1.1 950.6 1.0X
- Parquet Vectorized (Pushdown) 14934 / 15016 1.1 949.5 1.0X
- Native ORC Vectorized 12000 / 12156 1.3 763.0 1.2X
- Native ORC Vectorized (Pushdown) 12079 / 12113 1.3 767.9 1.2X
-
- Select all rows (id != -1): Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
- -----------------------------------------------------------------------------------------------
- Parquet Vectorized 14930 / 14972 1.1 949.3 1.0X
- Parquet Vectorized (Pushdown) 15015 / 15047 1.0 954.6 1.0X
- Native ORC Vectorized 12090 / 12259 1.3 768.7 1.2X
- Native ORC Vectorized (Pushdown) 12021 / 12096 1.3 764.2 1.2X
+ Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz
--- End diff --
Hi, @maropu. Thank you for updating this with the new Parquet 1.10. BTW, could you describe the EC2 environment more clearly in the PR description? I want to reproduce this.
---
[GitHub] spark issue #21288: [SPARK-24206][SQL] Improve FilterPushdownBenchmark bench...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/21288
Merged build finished. Test FAILed.
---
[GitHub] spark issue #21288: [SPARK-24206][SQL] Improve FilterPushdownBenchmark bench...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/21288
Merged build finished. Test PASSed.
---