Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2020/11/20 05:54:17 UTC
[GitHub] [spark] wangyum commented on pull request #29642: [SPARK-32792][SQL] Improve in filter pushdown for ParquetFilters
URL: https://github.com/apache/spark/pull/29642#issuecomment-730869008
It seems that only Parquet lacks good support for `In` predicate pushdown. @MaxGekk What do you think?
This is the benchmark for CSV:
```scala
// Fragment: assumes an enclosing Spark SQL benchmark suite that provides
// spark, withTempPath, withSQLConf, output and noop().
val rowsNum = 100 * 1000
val numIters = 3
val colsNum = 100
val fields = Seq.tabulate(colsNum)(i => StructField(s"col$i", TimestampType))
val schema = StructType(StructField("key", IntegerType) +: fields)

// One integer key column (id % 1000) followed by colsNum timestamp columns.
def columns(): Seq[Column] = {
  val ts = Seq.tabulate(colsNum) { i =>
    lit(Instant.ofEpochSecond(i * 12345678)).as(s"col$i")
  }
  ($"id" % 1000).as("key") +: ts
}

withTempPath { path =>
  // Write the test data once; every benchmark case reads it back.
  spark.range(rowsNum).select(columns(): _*)
    .write.option("header", true)
    .csv(path.getAbsolutePath)

  def readback = {
    spark.read
      .option("header", true)
      .schema(schema)
      .csv(path.getAbsolutePath)
  }

  // Run the filtered read with CSV filter pushdown switched on or off.
  def withFilter(filter: String, configEnabled: Boolean): Unit = {
    withSQLConf(SQLConf.CSV_FILTER_PUSHDOWN_ENABLED.key -> configEnabled.toString()) {
      readback.filter(filter).noop()
    }
  }

  Seq(5, 10, 50, 100, 500).foreach { count =>
    Seq(10, 50).foreach { distribution =>
      val title = s"InSet -> InFilters (values count: $count, distribution: $distribution)"
      val benchmark = new Benchmark(title, rowsNum, output = output)
      Seq(false, true).foreach { pushDownEnabled =>
        val name = s"Native CSV Vectorized ${if (pushDownEnabled) "(Pushdown)" else ""}"
        benchmark.addCase(name, numIters) { _ =>
          // Draw `count` random IN-list values from [0, rowsNum * distribution / 100).
          val filter =
            Range(0, count).map(_ => scala.util.Random.nextInt(rowsNum * distribution / 100))
          val whereExpr = s"key in(${filter.mkString(",")})"
          withFilter(whereExpr, configEnabled = pushDownEnabled)
        }
      }
      benchmark.run()
    }
  }
}
```
Result:
```
================================================================================================
Benchmark to measure CSV read performance
================================================================================================

Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.15.7
Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
InSet -> InFilters (values count: 5, distribution: 10):     Best Time(ms)  Avg Time(ms)  Stdev(ms)  Rate(M/s)  Per Row(ns)  Relative
------------------------------------------------------------------------------------------------------------------------------------
Native CSV Vectorized                                               13082         17077       1674        0.0     130815.6      1.0X
Native CSV Vectorized (Pushdown)                                     1172          1192         35        0.1      11719.5     11.2X

Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.15.7
Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
InSet -> InFilters (values count: 5, distribution: 50):     Best Time(ms)  Avg Time(ms)  Stdev(ms)  Rate(M/s)  Per Row(ns)  Relative
------------------------------------------------------------------------------------------------------------------------------------
Native CSV Vectorized                                               11858         12028        237        0.0     118576.9      1.0X
Native CSV Vectorized (Pushdown)                                     1165          1172          6        0.1      11652.4     10.2X

Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.15.7
Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
InSet -> InFilters (values count: 10, distribution: 10):    Best Time(ms)  Avg Time(ms)  Stdev(ms)  Rate(M/s)  Per Row(ns)  Relative
------------------------------------------------------------------------------------------------------------------------------------
Native CSV Vectorized                                               11883         12180        494        0.0     118834.3      1.0X
Native CSV Vectorized (Pushdown)                                     1142          1156         21        0.1      11418.6     10.4X

Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.15.7
Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
InSet -> InFilters (values count: 10, distribution: 50):    Best Time(ms)  Avg Time(ms)  Stdev(ms)  Rate(M/s)  Per Row(ns)  Relative
------------------------------------------------------------------------------------------------------------------------------------
Native CSV Vectorized                                               11857         11878         19        0.0     118570.4      1.0X
Native CSV Vectorized (Pushdown)                                     1169          1174          7        0.1      11692.9     10.1X

Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.15.7
Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
InSet -> InFilters (values count: 50, distribution: 10):    Best Time(ms)  Avg Time(ms)  Stdev(ms)  Rate(M/s)  Per Row(ns)  Relative
------------------------------------------------------------------------------------------------------------------------------------
Native CSV Vectorized                                               11923         11962         66        0.0     119228.0      1.0X
Native CSV Vectorized (Pushdown)                                     1196          1225         26        0.1      11960.7     10.0X

Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.15.7
Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
InSet -> InFilters (values count: 50, distribution: 50):    Best Time(ms)  Avg Time(ms)  Stdev(ms)  Rate(M/s)  Per Row(ns)  Relative
------------------------------------------------------------------------------------------------------------------------------------
Native CSV Vectorized                                               11910         11917          7        0.0     119095.3      1.0X
Native CSV Vectorized (Pushdown)                                     1191          1194          5        0.1      11908.0     10.0X

Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.15.7
Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
InSet -> InFilters (values count: 100, distribution: 10):   Best Time(ms)  Avg Time(ms)  Stdev(ms)  Rate(M/s)  Per Row(ns)  Relative
------------------------------------------------------------------------------------------------------------------------------------
Native CSV Vectorized                                               11948         12097        201        0.0     119484.5      1.0X
Native CSV Vectorized (Pushdown)                                     1250          1284         32        0.1      12501.4      9.6X

Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.15.7
Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
InSet -> InFilters (values count: 100, distribution: 50):   Best Time(ms)  Avg Time(ms)  Stdev(ms)  Rate(M/s)  Per Row(ns)  Relative
------------------------------------------------------------------------------------------------------------------------------------
Native CSV Vectorized                                               11938         11978         39        0.0     119378.8      1.0X
Native CSV Vectorized (Pushdown)                                     1176          1188         11        0.1      11756.0     10.2X

Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.15.7
Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
InSet -> InFilters (values count: 500, distribution: 10):   Best Time(ms)  Avg Time(ms)  Stdev(ms)  Rate(M/s)  Per Row(ns)  Relative
------------------------------------------------------------------------------------------------------------------------------------
Native CSV Vectorized                                               11954         12051        124        0.0     119542.9      1.0X
Native CSV Vectorized (Pushdown)                                     1762          1833        104        0.1      17620.6      6.8X

Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.15.7
Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
InSet -> InFilters (values count: 500, distribution: 50):   Best Time(ms)  Avg Time(ms)  Stdev(ms)  Rate(M/s)  Per Row(ns)  Relative
------------------------------------------------------------------------------------------------------------------------------------
Native CSV Vectorized                                               11860         12166        484        0.0     118597.8      1.0X
Native CSV Vectorized (Pushdown)                                     1417          1434         15        0.1      14171.7      8.4X
```
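As a rough cross-check of the `Relative` column, the sketch below recomputes the speedup for each case as the baseline best time divided by the pushdown best time, using the rounded `Best Time(ms)` values from the results above. The object and method names are hypothetical, and treating `Relative` as a ratio of best times is an assumption, not something the output states.

```scala
// Hypothetical helper, not part of the benchmark suite.
object PushdownSpeedup {
  // (values count, distribution) -> (best ms without pushdown, best ms with pushdown),
  // copied from the Best Time(ms) column of the results above.
  val bestTimesMs: Seq[((Int, Int), (Double, Double))] = Seq(
    (5, 10) -> (13082.0, 1172.0),
    (5, 50) -> (11858.0, 1165.0),
    (10, 10) -> (11883.0, 1142.0),
    (10, 50) -> (11857.0, 1169.0),
    (50, 10) -> (11923.0, 1196.0),
    (50, 50) -> (11910.0, 1191.0),
    (100, 10) -> (11948.0, 1250.0),
    (100, 50) -> (11938.0, 1176.0),
    (500, 10) -> (11954.0, 1762.0),
    (500, 50) -> (11860.0, 1417.0)
  )

  // Assumption: speedup is the ratio of baseline best time to pushdown best time.
  def speedup(baselineMs: Double, pushdownMs: Double): Double =
    baselineMs / pushdownMs

  def main(args: Array[String]): Unit = {
    bestTimesMs.foreach { case ((count, distribution), (baseline, pushdown)) =>
      val ratio = speedup(baseline, pushdown)
      println(f"count=$count%3d distribution=$distribution%2d speedup=$ratio%.1fX")
    }
  }
}
```

Under that assumption, the ratios stay at roughly 10x up to 100 values and fall below that only for the two 500-value cases.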