You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@spark.apache.org by "Yuming Wang (Jira)" <ji...@apache.org> on 2020/05/19 07:33:00 UTC
[jira] [Commented] (SPARK-25784) Infer filters from constraints after rewriting predicate subquery

    [ https://issues.apache.org/jira/browse/SPARK-25784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17110929#comment-17110929 ] 

Yuming Wang commented on SPARK-25784:
-------------------------------------

Teradata support this optimization:
https://docs.teradata.com/reader/Ws7YT1jvRK2vEr1LpVURug/V~FCwD9BL7gY4ac3WwHInw

> Infer filters from constraints after rewriting predicate subquery
> -----------------------------------------------------------------
>
>                 Key: SPARK-25784
>                 URL: https://issues.apache.org/jira/browse/SPARK-25784
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.1.0
>            Reporter: Yuming Wang
>            Priority: Major
>
> Benchmark:
> {code:scala}
> withTempView("t1", "t2") {
>   withTempDir { dir =>
>     spark.range(3000000)
>       .selectExpr("cast(null as int) as c1", "if(id % 2 = 0, null, id) as c2", "id as c3")
>       .coalesce(1)
>       .orderBy("c2")
>       .write
>       .mode("overwrite")
>       .option("parquet.block.size", 10485760)
>       .parquet(dir.getCanonicalPath)
>     spark.read.parquet(dir.getCanonicalPath).createTempView("t1")
>     spark.read.parquet(dir.getCanonicalPath).createTempView("t2")
>     Seq("c1", "c2", "c3").foreach { column =>
>       val benchmark = new Benchmark(s"join key $column", 10)
>       Seq(false, true).foreach { inferFilters =>
>         benchmark.addCase(s"Is infer filters $inferFilters", numIters = 5) { _ =>
>           withSQLConf(SQLConf.CONSTRAINT_PROPAGATION_ENABLED.key -> inferFilters.toString) {
>             sql(s"select t1.* from t1 where t1.$column in (select $column from t2)").count()
>           }
>         }
>       }
>       benchmark.run()
>     }
>   }
> }
> {code}
> Benchmark result:
> {noformat}
> Java HotSpot(TM) 64-Bit Server VM 1.8.0_151-b12 on Mac OS X 10.12.6
> Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz
> join key c1:                             Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
> ------------------------------------------------------------------------------------------------
> Is infer filters false                        2005 / 2163          0.0   200481431.0       1.0X
> Is infer filters true                          190 /  207          0.0    18962935.7      10.6X
> Java HotSpot(TM) 64-Bit Server VM 1.8.0_151-b12 on Mac OS X 10.12.6
> Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz
> join key c2:                             Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
> ------------------------------------------------------------------------------------------------
> Is infer filters false                        2368 / 2498          0.0   236803743.1       1.0X
> Is infer filters true                         1234 / 1268          0.0   123443912.3       1.9X
> Java HotSpot(TM) 64-Bit Server VM 1.8.0_151-b12 on Mac OS X 10.12.6
> Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz
> join key c3:                             Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
> ------------------------------------------------------------------------------------------------
> Is infer filters false                        2754 / 2907          0.0   275376009.7       1.0X
> Is infer filters true                         2237 / 2255          0.0   223739457.8       1.2X
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org