
Posted to reviews@spark.apache.org by dongjoon-hyun <gi...@git.apache.org> on 2016/06/24 08:35:06 UTC

[GitHub] spark pull request #13887: [SPARK-16186][SQL] Support partition batch prunin...

GitHub user dongjoon-hyun opened a pull request:

    https://github.com/apache/spark/pull/13887

    [SPARK-16186][SQL] Support partition batch pruning with `IN` predicate in InMemoryTableScanExec

    ## What changes were proposed in this pull request?
    
    One of the most frequent usage patterns for Spark SQL is using **cached tables**. This PR improves `InMemoryTableScanExec` to handle the `IN` predicate efficiently by pruning partition batches. Of course, the performance improvement varies with the query and the dataset. But for the following simple query, the query duration in the Spark UI drops from 9 seconds to 50~90 ms, which is about 100 times faster.
    ```scala
    $ bin/spark-shell --driver-memory 6G
    scala> val df = spark.range(2000000000)
    scala> df.createOrReplaceTempView("t")
    scala> spark.catalog.cacheTable("t")
    scala> sql("select id from t where id = 1").collect()    // About 2 mins
    scala> sql("select id from t where id = 1").collect()    // less than 90ms
    scala> sql("select id from t where id in (1,2,3)").collect()  // 9 seconds
    scala> spark.conf.set("spark.sql.inMemoryColumnarStorage.partitionPruningMaxInSize", 10)  // Enable. (Set only for this example; the default value is currently 10.)
    scala> sql("select id from t where id in (1,2,3)").collect() // less than 90ms
    scala> spark.conf.set("spark.sql.inMemoryColumnarStorage.partitionPruningMaxInSize", 0)  // Disable
    scala> sql("select id from t where id in (1,2,3)").collect() // 9 seconds
    ```
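
To make the pruning idea concrete, here is a minimal, self-contained sketch. It does not use Spark's classes: `BatchStats`, `mightMatch`, and `prune` are hypothetical names, and the only assumption is that each cached batch records a lower and an upper bound for the column. A batch can be skipped when none of the `IN` values falls inside its bounds.

```scala
object InPruningSketch {
  // Hypothetical stand-in for the per-batch column statistics kept by the cache.
  final case class BatchStats(lowerBound: Long, upperBound: Long)

  // A batch may contain a match only if some IN value lies within its bounds.
  def mightMatch(stats: BatchStats, inList: Seq[Long]): Boolean =
    inList.exists(v => stats.lowerBound <= v && v <= stats.upperBound)

  // Keep only the batches that the statistics cannot rule out.
  def prune(batches: Seq[BatchStats], inList: Seq[Long]): Seq[BatchStats] =
    batches.filter(mightMatch(_, inList))
}
```

With batches covering 0 to 9, 10 to 19, and 20 to 29, an `IN (1, 2, 3)` list leaves only the first batch to scan, which is the effect the timings above rely on.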
    
    This PR affects over 35 TPC-DS queries if the tables are cached.
    
    ## How was this patch tested?
    
    Pass the Jenkins tests (including new testcases).

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/dongjoon-hyun/spark SPARK-16186

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/13887.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #13887
    
----
commit 3b36e9cfb033762205900200a2249b8da3ba11bd
Author: Dongjoon Hyun <do...@apache.org>
Date:   2016-06-24T08:30:36Z

    [SPARK-16186][SQL] Support partition batch pruning with `IN` predicate in InMemoryTableScanExec

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #13887: [SPARK-16186][SQL] Support partition batch pruning with ...

Posted by dongjoon-hyun <gi...@git.apache.org>.

Github user dongjoon-hyun commented on the issue:

    https://github.com/apache/spark/pull/13887
  
    Thank you, @cloud-fan !



[GitHub] spark pull request #13887: [SPARK-16186][SQL] Support partition batch prunin...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13887#discussion_r68380891
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/InMemoryTableScanExec.scala ---
    @@ -79,6 +79,11 @@ private[sql] case class InMemoryTableScanExec(
     
         case IsNull(a: Attribute) => statsFor(a).nullCount > 0
         case IsNotNull(a: Attribute) => statsFor(a).count - statsFor(a).nullCount > 0
    +
    +    case In(a: AttributeReference, list: Seq[Expression])
    +      if list.length <= inMemoryPartitionPruningMaxInSize =>
    +      list.map(l => statsFor(a).lowerBound <= l.asInstanceOf[Literal] &&
    --- End diff --
    
    How about doing this optimization for `InSet` instead? It guarantees that the list contains only literals, and the max length is 10 by default. Then we can avoid adding the new config.



[GitHub] spark issue #13887: [SPARK-16186][SQL] Support partition batch pruning with ...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13887
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/61165/
    Test PASSed.



[GitHub] spark issue #13887: [SPARK-16186][SQL] Support partition batch pruning with ...

Posted by dongjoon-hyun <gi...@git.apache.org>.

Github user dongjoon-hyun commented on the issue:

    https://github.com/apache/spark/pull/13887
  
    Hi, @cloud-fan .
    I updated the PR. IMO,
    - `InSet` is used for large `IN` lists.
    - This PR targets small `IN` lists.



[GitHub] spark issue #13887: [SPARK-16186][SQL] Support partition batch pruning with ...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13887
  
    **[Test build #61165 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61165/consoleFull)** for PR 13887 at commit [`3b36e9c`](https://github.com/apache/spark/commit/3b36e9cfb033762205900200a2249b8da3ba11bd).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.



[GitHub] spark issue #13887: [SPARK-16186][SQL] Support partition batch pruning with ...

Posted by davies <gi...@git.apache.org>.

Github user davies commented on the issue:

    https://github.com/apache/spark/pull/13887
  
    @dongjoon-hyun That's a good point, the current patch is better for performance actually



[GitHub] spark issue #13887: [SPARK-16186][SQL] Support partition batch pruning with ...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13887
  
    Merged build finished. Test PASSed.



[GitHub] spark issue #13887: [SPARK-16186][SQL] Support partition batch pruning with ...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13887
  
    Merged build finished. Test PASSed.



[GitHub] spark issue #13887: [SPARK-16186][SQL] Support partition batch pruning with ...

Posted by davies <gi...@git.apache.org>.

Github user davies commented on the issue:

    https://github.com/apache/spark/pull/13887
  
    Let's go with current patch, I will review it now. Those things could be considered later.



[GitHub] spark issue #13887: [SPARK-16186][SQL] Support partition batch pruning with ...

Posted by dongjoon-hyun <gi...@git.apache.org>.

Github user dongjoon-hyun commented on the issue:

    https://github.com/apache/spark/pull/13887
  
    I'm not sure, but it's just my hope. :)



[GitHub] spark issue #13887: [SPARK-16186][SQL] Support partition batch pruning with ...

Posted by dongjoon-hyun <gi...@git.apache.org>.

Github user dongjoon-hyun commented on the issue:

    https://github.com/apache/spark/pull/13887
  
    Some of the frequent TPC-DS usages were STATE, ZIP, and color strings. The min/max of these values doesn't carry much meaning.



[GitHub] spark issue #13887: [SPARK-16186][SQL] Support partition batch pruning with ...

Posted by dongjoon-hyun <gi...@git.apache.org>.

Github user dongjoon-hyun commented on the issue:

    https://github.com/apache/spark/pull/13887
  
    Thank you for your review and valuable improvement ideas, @davies . Let me rephrase your ideas:
    
    1. For `IN` with a single expression, we should definitely improve the existing `OptimizeIn` optimizer rule (as you mentioned).
    
    2. For **sparse** values and a small `IN`, we can also rewrite `a IN (1,2,3)` as `a = 1 or a = 2 or a = 3`. In terms of generated Java code size, the rewritten version is smaller.
     ```scala
    scala> sql("explain codegen select * from (select explode(array('1','2','3')) a) where a in ('1','2','3')").collect().foreach(println)   // 80 lines
    scala> sql("explain codegen select * from (select explode(array('1','2','3')) a) where a = '1' or a = '2' or a ='3'").collect().foreach(println) // 65 lines
    ```
    3. For **consecutive and discretized** values, e.g. 2001, 2002, 2003, ..., 2004, we can do even better, and more cheaply, by replacing the `IN` with GreaterThanOrEqual/LessThanOrEqual in the logical optimizer layer, maybe also in `OptimizeIn`?
    
    4. For **ordered partitioned data sources** like InMemory/Parquet, 1~3 will effectively improve the performance.
    
    5. For **the other data sources (unordered, partitioned or not)** and a large `IN`, we need to use `InSet` as a fallback operation to prevent any regressions.
    
    Is this everything you advised? If I missed something, please let me know.
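
The rewrite described in item 2 can be sketched on a toy expression tree. This is only an illustration under stated assumptions: `Expr`, `Attr`, `Lit`, `In`, `EqualTo`, `Or`, and `rewrite` are hypothetical stand-ins defined here, not Catalyst's classes, and the `maxSize` threshold is an invented parameter.

```scala
object InToOrSketch {
  // Toy expression tree; these are not Spark's Catalyst classes.
  sealed trait Expr
  final case class Attr(name: String) extends Expr
  final case class Lit(value: Int) extends Expr
  final case class In(attr: Attr, values: Seq[Lit]) extends Expr
  final case class EqualTo(attr: Attr, value: Lit) extends Expr
  final case class Or(left: Expr, right: Expr) extends Expr

  // Rewrite `a IN (v1, v2, ...)` into `a = v1 OR a = v2 OR ...`
  // when the list is non-empty and no longer than maxSize.
  def rewrite(e: Expr, maxSize: Int): Expr = e match {
    case In(a, values) if values.nonEmpty && values.length <= maxSize =>
      values.map(v => EqualTo(a, v): Expr).reduceLeft[Expr](Or(_, _))
    case other => other // leave everything else untouched
  }
}
```

Lists longer than `maxSize` are returned unchanged, which mirrors the fallback described in item 5.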



[GitHub] spark pull request #13887: [SPARK-16186][SQL] Support partition batch prunin...

Posted by davies <gi...@git.apache.org>.

Github user davies commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13887#discussion_r68474760
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/InMemoryTableScanExec.scala ---
    @@ -79,6 +79,11 @@ private[sql] case class InMemoryTableScanExec(
     
         case IsNull(a: Attribute) => statsFor(a).nullCount > 0
         case IsNotNull(a: Attribute) => statsFor(a).count - statsFor(a).nullCount > 0
    +
    +    case In(a: AttributeReference, list: Seq[Expression])
    +      if list.length <= inMemoryPartitionPruningMaxInSize && list.forall(_.isInstanceOf[Literal]) =>
    --- End diff --
    
    Can we not have this config? Another optimizer rule will guarantee that the number of expressions will not be big.
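
The guard in the diff above (a size check plus `list.forall(_.isInstanceOf[Literal])`) can be illustrated with a small standalone sketch. The types below are hypothetical toy classes rather than Catalyst's, and `prunableValues` and `maxInSize` are names invented for this example. The point is that checking every element first makes the later extraction of literal values safe.

```scala
object LiteralGuardSketch {
  // Toy expression tree; these are not Spark's Catalyst classes.
  sealed trait Expr
  final case class Literal(value: Long) extends Expr
  final case class Attribute(name: String) extends Expr
  final case class In(attr: Attribute, list: Seq[Expr]) extends Expr

  // Return the literal values only when the list is small enough and
  // every element is a Literal; otherwise report that pruning does not apply.
  def prunableValues(e: Expr, maxInSize: Int): Option[Seq[Long]] = e match {
    case In(_, list) if list.length <= maxInSize && list.forall(_.isInstanceOf[Literal]) =>
      Some(list.collect { case Literal(v) => v })
    case _ => None
  }
}
```

Without the `forall` check, a list such as `IN (1, other_column)` would reach code that assumes literals only.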



[GitHub] spark pull request #13887: [SPARK-16186][SQL] Support partition batch prunin...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13887#discussion_r68383690
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/InMemoryTableScanExec.scala ---
    @@ -79,6 +79,11 @@ private[sql] case class InMemoryTableScanExec(
     
         case IsNull(a: Attribute) => statsFor(a).nullCount > 0
         case IsNotNull(a: Attribute) => statsFor(a).count - statsFor(a).nullCount > 0
    +
    +    case In(a: AttributeReference, list: Seq[Expression])
    +      if list.length <= inMemoryPartitionPruningMaxInSize =>
    +      list.map(l => statsFor(a).lowerBound <= l.asInstanceOf[Literal] &&
    --- End diff --
    
    oh sorry I read the code wrong, yea the config is different.



[GitHub] spark issue #13887: [SPARK-16186][SQL] Support partition batch pruning with ...

Posted by davies <gi...@git.apache.org>.

Github user davies commented on the issue:

    https://github.com/apache/spark/pull/13887
  
    @dongjoon-hyun Yes, 2) should check the constraints to make it idempotent 



[GitHub] spark issue #13887: [SPARK-16186][SQL] Support partition batch pruning with ...

Posted by davies <gi...@git.apache.org>.

Github user davies commented on the issue:

    https://github.com/apache/spark/pull/13887
  
    Reverted, will merge this again once it passes Jenkins.



[GitHub] spark issue #13887: [SPARK-16186][SQL] Support partition batch pruning with ...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/13887
  
    Seems like a reasonable optimization to me, cc @liancheng 



[GitHub] spark issue #13887: [SPARK-16186][SQL] Support partition batch pruning with ...

Posted by davies <gi...@git.apache.org>.

Github user davies commented on the issue:

    https://github.com/apache/spark/pull/13887
  
    For any `IN` that has more than one expression, we could add GreaterThanOrEqual/LessThanOrEqual predicates alongside it (not replace the `IN`).
    
    For 2, it's not that obvious yet, we can do that later.



[GitHub] spark issue #13887: [SPARK-16186][SQL] Support partition batch pruning with ...

Posted by dongjoon-hyun <gi...@git.apache.org>.

Github user dongjoon-hyun commented on the issue:

    https://github.com/apache/spark/pull/13887
  
    Okay. Let's summarize before updating the PR.
    1. In general, `a IN (expression)` will become `a = expression`. The `OptimizeIn` optimizer rule will take care of this.
    2. In general, `a IN (2001, 2002, ..., 2009)` will become `2001 <= a AND a <= 2009 AND a IN (2001, 2002, ..., 2009)`. The `OptimizeIn` optimizer rule will take care of this.
    Am I correct?
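
The second point, keeping the `IN` and adding bounds derived from its minimum and maximum, can be sketched as follows. This is a standalone illustration, not the `OptimizeIn` rule itself: `RangedIn` and `withBounds` are hypothetical names, and the column is assumed to hold plain integers.

```scala
object InRangeSketch {
  // The original IN list plus the range bounds derived from it.
  final case class RangedIn(lower: Int, upper: Int, values: Seq[Int]) {
    // Equivalent to: lower <= a AND a <= upper AND a IN (values)
    def matches(a: Int): Boolean =
      lower <= a && a <= upper && values.contains(a)
  }

  // Derive the bounds from the smallest and largest list values.
  // An empty list has no bounds, so nothing is derived for it.
  def withBounds(values: Seq[Int]): Option[RangedIn] =
    if (values.isEmpty) None
    else Some(RangedIn(values.min, values.max, values))
}
```

The added bounds never change the result, because every value in the list already lies between its own minimum and maximum. They only give a statistics-based filter something cheap to test.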



[GitHub] spark issue #13887: [SPARK-16186][SQL] Support partition batch pruning with ...

Posted by davies <gi...@git.apache.org>.

Github user davies commented on the issue:

    https://github.com/apache/spark/pull/13887
  
    BTW, we could use constraints to implement this.



[GitHub] spark pull request #13887: [SPARK-16186][SQL] Support partition batch prunin...

Posted by dongjoon-hyun <gi...@git.apache.org>.

Github user dongjoon-hyun commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13887#discussion_r68380170
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/InMemoryTableScanExec.scala ---
    @@ -79,6 +79,11 @@ private[sql] case class InMemoryTableScanExec(
     
         case IsNull(a: Attribute) => statsFor(a).nullCount > 0
         case IsNotNull(a: Attribute) => statsFor(a).count - statsFor(a).nullCount > 0
    +
    +    case In(a: AttributeReference, list: Seq[Expression])
    +      if list.length <= inMemoryPartitionPruningMaxInSize =>
    +      list.map(l => statsFor(a).lowerBound <= l.asInstanceOf[Literal] &&
    --- End diff --
    
    Oh, right. I missed that. I'll fix it by adding a check.
    Thank you for review, @cloud-fan !



[GitHub] spark issue #13887: [SPARK-16186][SQL] Support partition batch pruning with ...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13887
  
    **[Test build #61172 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61172/consoleFull)** for PR 13887 at commit [`dc3a848`](https://github.com/apache/spark/commit/dc3a848c4fdd1b5dad485d1e5c8c3e3e836abace).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.



[GitHub] spark issue #13887: [SPARK-16186][SQL] Support partition batch pruning with ...

Posted by dongjoon-hyun <gi...@git.apache.org>.

Github user dongjoon-hyun commented on the issue:

    https://github.com/apache/spark/pull/13887
  
    Oh, you meant adding additional constraints by using **min** and **max**. I see.
    
    By the way, I have one question. If there are many predicates, does Spark evaluate them in a sorted order?
    
    I'm not sure the newly inserted `GreaterThanOrEqual/LessThanOrEqual` will be evaluated first.



[GitHub] spark issue #13887: [SPARK-16186][SQL] Support partition batch pruning with ...

Posted by dongjoon-hyun <gi...@git.apache.org>.

Github user dongjoon-hyun commented on the issue:

    https://github.com/apache/spark/pull/13887
  
    cc @rxin , @davies , @cloud-fan .



[GitHub] spark issue #13887: [SPARK-16186][SQL] Support partition batch pruning with ...

Posted by davies <gi...@git.apache.org>.

Github user davies commented on the issue:

    https://github.com/apache/spark/pull/13887
  
    @dongjoon-hyun There is only a single predicate in Filter; it could be an AND or an OR, which means we could control the order. For this case, I'm not sure the inserted GreaterThanOrEqual/LessThanOrEqual will come before IN.
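
Why the position inside a single AND tree matters can be shown with a toy evaluator. This sketch assumes plain left-to-right, short-circuit evaluation, and all names in it (`Pred`, `Between`, `InList`, `And`, `eval`) are hypothetical. Whether Spark actually places the inserted range check before the `IN` is exactly the open question in the comment above.

```scala
import scala.collection.mutable

object PredicateOrderSketch {
  // Toy predicate tree; these are not Spark's Catalyst classes.
  sealed trait Pred
  final case class Between(lower: Int, upper: Int) extends Pred
  final case class InList(values: Seq[Int]) extends Pred
  final case class And(left: Pred, right: Pred) extends Pred

  // Evaluate left to right with short-circuiting, recording each leaf visited.
  def eval(p: Pred, a: Int, visited: mutable.Buffer[String]): Boolean = p match {
    case Between(lo, hi) =>
      visited += "between"
      lo <= a && a <= hi
    case InList(vs) =>
      visited += "in"
      vs.contains(a)
    case And(l, r) =>
      eval(l, a, visited) && eval(r, a, visited)
  }
}
```

When the range check sits on the left of the `And`, a value outside the range is rejected without touching the `IN` list at all. With the operands swapped, the list is scanned first.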



[GitHub] spark issue #13887: [SPARK-16186][SQL] Support partition batch pruning with ...

Posted by dongjoon-hyun <gi...@git.apache.org>.

Github user dongjoon-hyun commented on the issue:

    https://github.com/apache/spark/pull/13887
  
    Hmm. The general idea is good. But I still think this PR and that idea are complementary to each other.
    
    Sorry, but if possible, can we pursue that general idea in another PR?
    
    It's a little bit beyond the title of this PR, and the touched files will be different.



[GitHub] spark issue #13887: [SPARK-16186][SQL] Support partition batch pruning with ...

Posted by davies <gi...@git.apache.org>.

Github user davies commented on the issue:

    https://github.com/apache/spark/pull/13887
  
    Sorry, the jenkins has not finished ...



[GitHub] spark issue #13887: [SPARK-16186][SQL] Support partition batch pruning with ...

Posted by dongjoon-hyun <gi...@git.apache.org>.

Github user dongjoon-hyun commented on the issue:

    https://github.com/apache/spark/pull/13887
  
    Although I decided to make this PR after observing TPC-DS queries, I will definitely update this PR if there are other useful scenarios.



[GitHub] spark issue #13887: [SPARK-16186][SQL] Support partition batch pruning with ...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13887
  
    Merged build finished. Test PASSed.



[GitHub] spark pull request #13887: [SPARK-16186][SQL] Support partition batch prunin...

Posted by dongjoon-hyun <gi...@git.apache.org>.

Github user dongjoon-hyun commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13887#discussion_r68381115
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/InMemoryTableScanExec.scala ---
    @@ -79,6 +79,11 @@ private[sql] case class InMemoryTableScanExec(
     
         case IsNull(a: Attribute) => statsFor(a).nullCount > 0
         case IsNotNull(a: Attribute) => statsFor(a).count - statsFor(a).nullCount > 0
    +
    +    case In(a: AttributeReference, list: Seq[Expression])
    +      if list.length <= inMemoryPartitionPruningMaxInSize =>
    +      list.map(l => statsFor(a).lowerBound <= l.asInstanceOf[Literal] &&
    --- End diff --
    
    But that configuration is the minimum threshold for InSet, so the meaning is quite different.



[GitHub] spark issue #13887: [SPARK-16186][SQL] Support partition batch pruning with ...

Posted by dongjoon-hyun <gi...@git.apache.org>.

Github user dongjoon-hyun commented on the issue:

    https://github.com/apache/spark/pull/13887
  
    Oh, right. It's pending Jenkins.



[GitHub] spark issue #13887: [SPARK-16186][SQL] Support partition batch pruning with ...

Posted by dongjoon-hyun <gi...@git.apache.org>.

Github user dongjoon-hyun commented on the issue:

    https://github.com/apache/spark/pull/13887
  
    Hi, @davies .
    I removed the option-related stuff from the code, the PR description, and the JIRA description according to your advice.
    Thank you for the review!



[GitHub] spark issue #13887: [SPARK-16186][SQL] Support partition batch pruning with ...

Posted by davies <gi...@git.apache.org>.

Github user davies commented on the issue:

    https://github.com/apache/spark/pull/13887
  
    LGTM, merging this into master.



[GitHub] spark issue #13887: [SPARK-16186][SQL] Support partition batch pruning with ...

Posted by dongjoon-hyun <gi...@git.apache.org>.

Github user dongjoon-hyun commented on the issue:

    https://github.com/apache/spark/pull/13887
  
    Hi, @davies .
    Now, it passed.
    If there is anything for me to do, please let me know.



[GitHub] spark issue #13887: [SPARK-16186][SQL] Support partition batch pruning with ...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13887
  
    **[Test build #61210 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61210/consoleFull)** for PR 13887 at commit [`9d550e3`](https://github.com/apache/spark/commit/9d550e3b7885daacdb01e3e54d01f5157af20791).



[GitHub] spark issue #13887: [SPARK-16186][SQL] Support partition batch pruning with ...

Posted by dongjoon-hyun <gi...@git.apache.org>.

Github user dongjoon-hyun commented on the issue:

    https://github.com/apache/spark/pull/13887
  
    Anyway, thank you in advance, @davies . :)



[GitHub] spark issue #13887: [SPARK-16186][SQL] Support partition batch pruning with ...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13887
  
    **[Test build #61172 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61172/consoleFull)** for PR 13887 at commit [`dc3a848`](https://github.com/apache/spark/commit/dc3a848c4fdd1b5dad485d1e5c8c3e3e836abace).



[GitHub] spark issue #13887: [SPARK-16186][SQL] Support partition batch pruning with ...

Posted by dongjoon-hyun <gi...@git.apache.org>.

Github user dongjoon-hyun commented on the issue:

    https://github.com/apache/spark/pull/13887
  
    Maybe you are confusing this with https://github.com/apache/spark/pull/13900 .
    It passed.



[GitHub] spark pull request #13887: [SPARK-16186][SQL] Support partition batch prunin...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13887#discussion_r68379466
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/InMemoryTableScanExec.scala ---
    @@ -79,6 +79,11 @@ private[sql] case class InMemoryTableScanExec(
     
         case IsNull(a: Attribute) => statsFor(a).nullCount > 0
         case IsNotNull(a: Attribute) => statsFor(a).count - statsFor(a).nullCount > 0
    +
    +    case In(a: AttributeReference, list: Seq[Expression])
    +      if list.length <= inMemoryPartitionPruningMaxInSize =>
    +      list.map(l => statsFor(a).lowerBound <= l.asInstanceOf[Literal] &&
    --- End diff --
    
    Where do we make sure that `l` is always a literal?



[GitHub] spark issue #13887: [SPARK-16186][SQL] Support partition batch pruning with ...

Posted by davies <gi...@git.apache.org>.

Github user davies commented on the issue:

    https://github.com/apache/spark/pull/13887
  
    Merged into master, thanks



[GitHub] spark issue #13887: [SPARK-16186][SQL] Support partition batch pruning with ...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13887
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/61172/
    Test PASSed.



[GitHub] spark issue #13887: [SPARK-16186][SQL] Support partition batch pruning with ...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13887
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/61210/
    Test PASSed.



[GitHub] spark issue #13887: [SPARK-16186][SQL] Support partition batch pruning with ...

Posted by davies <gi...@git.apache.org>.

Github user davies commented on the issue:

    https://github.com/apache/spark/pull/13887
  
    @dongjoon-hyun Thanks for the patch, this optimization sounds reasonable.
    
    I'm wondering whether it is possible to make the optimization for IN/INSET more general. We could have an optimizer rule that inserts a GreaterThanOrEqual and a LessThanOrEqual for IN/INSET (checking the data type to make sure it is orderable), which will be cheaper to evaluate. For IN with a single expression, we could rewrite it as EqualTo. When doing this, we should be careful with null, to respect the semantics of null in IN/INSET. By doing this, all the data sources could benefit from it.
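
The idea in the comment above can be illustrated with a small, Spark-free sketch. It is not the optimizer rule itself: `InRangeRewrite`, `bounds`, `sqlIn`, and `mayMatch` are hypothetical names, `Option[Long]` stands in for a nullable orderable column, and `None` models a SQL NULL. It only shows why a min/max range check is a safe, cheaper pre-filter for `IN`, and where NULL needs care.

```scala
// Hypothetical model, not Spark code: None represents SQL NULL.
object InRangeRewrite {
  // Smallest and largest non-null value of the IN list, if there is one.
  def bounds(inList: Seq[Option[Long]]): Option[(Long, Long)] = {
    val values = inList.flatten
    if (values.isEmpty) None else Some((values.min, values.max))
  }

  // Three-valued IN: Some(true), Some(false), or None for "unknown" (NULL).
  def sqlIn(v: Option[Long], inList: Seq[Option[Long]]): Option[Boolean] =
    v match {
      case None => None // NULL IN (...) is never true
      case Some(x) =>
        if (inList.contains(Some(x))) Some(true)
        else if (inList.contains(None)) None // no match, but a NULL in the list
        else Some(false)
    }

  // Cheap pre-check: a false result means `v IN (...)` cannot be true.
  def mayMatch(v: Option[Long], inList: Seq[Option[Long]]): Boolean =
    v match {
      case None => false
      case Some(x) =>
        bounds(inList).exists { case (lo, hi) => lo <= x && x <= hi }
    }
}
```

Because a filter keeps a row only when the predicate is true, adding the range check as an extra conjunct never drops a matching row: whenever `sqlIn` returns `Some(true)`, `mayMatch` is also true. The reverse does not hold, so the original `IN` still has to be evaluated afterwards.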



[GitHub] spark issue #13887: [SPARK-16186][SQL] Support partition batch pruning with ...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13887
  
    **[Test build #61165 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61165/consoleFull)** for PR 13887 at commit [`3b36e9c`](https://github.com/apache/spark/commit/3b36e9cfb033762205900200a2249b8da3ba11bd).



[GitHub] spark pull request #13887: [SPARK-16186][SQL] Support partition batch prunin...

Posted by asfgit <gi...@git.apache.org>.

Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/13887



[GitHub] spark issue #13887: [SPARK-16186][SQL] Support partition batch pruning with ...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13887
  
    **[Test build #61210 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61210/consoleFull)** for PR 13887 at commit [`9d550e3`](https://github.com/apache/spark/commit/9d550e3b7885daacdb01e3e54d01f5157af20791).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.



[GitHub] spark pull request #13887: [SPARK-16186][SQL] Support partition batch prunin...

Posted by dongjoon-hyun <gi...@git.apache.org>.

Github user dongjoon-hyun commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13887#discussion_r68475255
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/InMemoryTableScanExec.scala ---
    @@ -79,6 +79,11 @@ private[sql] case class InMemoryTableScanExec(
     
         case IsNull(a: Attribute) => statsFor(a).nullCount > 0
         case IsNotNull(a: Attribute) => statsFor(a).count - statsFor(a).nullCount > 0
    +
    +    case In(a: AttributeReference, list: Seq[Expression])
    +      if list.length <= inMemoryPartitionPruningMaxInSize && list.forall(_.isInstanceOf[Literal]) =>
    --- End diff --
    
    Sure! I'll remove the option-related stuff.
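
For readers following the diff, the guarded pattern quoted above can be modelled without Spark roughly as follows. This is a simplified sketch, not the code in `InMemoryTableScanExec`: `Expr`, `Lit`, `Col`, `BatchStats`, and `mayContain` are hypothetical stand-ins, and the upper-bound half of the comparison is an assumption, since the quoted diff is cut off after the lower-bound check.

```scala
// Hypothetical stand-ins for Catalyst expressions and per-batch column statistics.
object BatchPruningSketch {
  sealed trait Expr
  final case class Lit(value: Long) extends Expr
  final case class Col(name: String) extends Expr

  final case class BatchStats(lowerBound: Long, upperBound: Long)

  // True when the batch might hold a matching row and therefore must be scanned.
  def mayContain(stats: BatchStats, list: Seq[Expr]): Boolean =
    if (list.forall(_.isInstanceOf[Lit])) {
      // Every element is a literal, so each one can be compared with the bounds.
      list.exists {
        case Lit(v) => stats.lowerBound <= v && v <= stats.upperBound
        case _      => false // not reachable after the forall check
      }
    } else {
      true // a non-literal element cannot be compared with the stats: do not prune
    }
}
```

The `forall` check plays the role of the `list.forall(_.isInstanceOf[Literal])` guard in the diff: when any element of the list is not a literal, the sketch falls back to scanning the batch instead of casting blindly.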

