You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@spark.apache.org by "Yuming Wang (Jira)" <ji...@apache.org> on 2020/05/19 07:33:00 UTC
[jira] [Commented] (SPARK-25784) Infer filters from constraints
after rewriting predicate subquery
[ https://issues.apache.org/jira/browse/SPARK-25784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17110929#comment-17110929 ]
Yuming Wang commented on SPARK-25784:
-------------------------------------
Teradata support this optimization:
https://docs.teradata.com/reader/Ws7YT1jvRK2vEr1LpVURug/V~FCwD9BL7gY4ac3WwHInw
> Infer filters from constraints after rewriting predicate subquery
> -----------------------------------------------------------------
>
> Key: SPARK-25784
> URL: https://issues.apache.org/jira/browse/SPARK-25784
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 3.1.0
> Reporter: Yuming Wang
> Priority: Major
>
> Benchmark:
> {code:scala}
> withTempView("t1", "t2") {
> withTempDir { dir =>
> spark.range(3000000)
> .selectExpr("cast(null as int) as c1", "if(id % 2 = 0, null, id) as c2", "id as c3")
> .coalesce(1)
> .orderBy("c2")
> .write
> .mode("overwrite")
> .option("parquet.block.size", 10485760)
> .parquet(dir.getCanonicalPath)
> spark.read.parquet(dir.getCanonicalPath).createTempView("t1")
> spark.read.parquet(dir.getCanonicalPath).createTempView("t2")
> Seq("c1", "c2", "c3").foreach { column =>
> val benchmark = new Benchmark(s"join key $column", 10)
> Seq(false, true).foreach { inferFilters =>
> benchmark.addCase(s"Is infer filters $inferFilters", numIters = 5) { _ =>
> withSQLConf(SQLConf.CONSTRAINT_PROPAGATION_ENABLED.key -> inferFilters.toString) {
> sql(s"select t1.* from t1 where t1.$column in (select $column from t2)").count()
> }
> }
> }
> benchmark.run()
> }
> }
> }
> {code}
> Benchmark result:
> {noformat}
> Java HotSpot(TM) 64-Bit Server VM 1.8.0_151-b12 on Mac OS X 10.12.6
> Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz
> join key c1: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
> ------------------------------------------------------------------------------------------------
> Is infer filters false 2005 / 2163 0.0 200481431.0 1.0X
> Is infer filters true 190 / 207 0.0 18962935.7 10.6X
> Java HotSpot(TM) 64-Bit Server VM 1.8.0_151-b12 on Mac OS X 10.12.6
> Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz
> join key c2: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
> ------------------------------------------------------------------------------------------------
> Is infer filters false 2368 / 2498 0.0 236803743.1 1.0X
> Is infer filters true 1234 / 1268 0.0 123443912.3 1.9X
> Java HotSpot(TM) 64-Bit Server VM 1.8.0_151-b12 on Mac OS X 10.12.6
> Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz
> join key c3: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
> ------------------------------------------------------------------------------------------------
> Is infer filters false 2754 / 2907 0.0 275376009.7 1.0X
> Is infer filters true 2237 / 2255 0.0 223739457.8 1.2X
> {noformat}
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org