You are viewing a plain text version of this content. The canonical link for it is here.

Posted to issues@spark.apache.org by "Apache Spark (JIRA)" <ji...@apache.org> on 2016/06/24 08:36:16 UTC

[jira] [Commented] (SPARK-16186) Support partition batch pruning with `IN` predicate in InMemoryTableScanExec

    [ https://issues.apache.org/jira/browse/SPARK-16186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15347992#comment-15347992 ] 

Apache Spark commented on SPARK-16186:
--------------------------------------

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/13887

> Support partition batch pruning with `IN` predicate in InMemoryTableScanExec
> ----------------------------------------------------------------------------
>
>                 Key: SPARK-16186
>                 URL: https://issues.apache.org/jira/browse/SPARK-16186
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>            Reporter: Dongjoon Hyun
>
> One of the most frequent usage patterns for Spark SQL is using **cached tables**.
> This issue improves `InMemoryTableScanExec` to handle `IN` predicate efficiently by pruning partition batches.
> Of course, the performance improvement varies over the queries and the datasets. For the following simple query, the query duration in Spark UI goes from 9 seconds to 50~90ms. It's about 100 times faster.
> {code}
> $ bin/spark-shell --driver-memory 6G
> scala> val df = spark.range(2000000000)
> scala> df.createOrReplaceTempView("t")
> scala> spark.catalog.cacheTable("t")
> scala> sql("select id from t where id = 1").collect()    // About 2 mins
> scala> sql("select id from t where id = 1").collect()    // less than 90ms
> scala> sql("select id from t where id in (1,2,3)").collect()  // 9 seconds
> scala> spark.conf.set("spark.sql.inMemoryColumnarStorage.partitionPruningMaxInSize", 10)  // Enable. (Just to show this examples, currently the default value is 10.)
> scala> sql("select id from t where id in (1,2,3)").collect() // less than 90ms
> spark.conf.set("spark.sql.inMemoryColumnarStorage.partitionPruningMaxInSize", 0)  // Disable
> scala> sql("select id from t where id in (1,2,3)").collect() // 9 seconds
> {code}
> This issue has impacts over 35 queries of TPC-DS if the tables are cached.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org