Posted to reviews@spark.apache.org by maropu <gi...@git.apache.org> on 2017/08/08 11:26:54 UTC

[GitHub] spark pull request #18882: [SPARK-21652][SQL] Filter out meaningless constra...

GitHub user maropu opened a pull request:

    https://github.com/apache/spark/pull/18882

    [SPARK-21652][SQL] Filter out meaningless constraints inferred in inferAdditionalConstraints

    ## What changes were proposed in this pull request?
    This PR adds code to filter out meaningless constraints inferred in `inferAdditionalConstraints` (e.g., given the constraints `a = 1`, `b = 1`, `a = c`, and `b = c`, we infer `a = b`, which is trivially true). These constraints can cause unnecessary `Optimizer` overhead. For example:
    ```
    scala> Seq((1, 2)).toDF("col1", "col2").write.saveAsTable("t1")
    scala> Seq(1, 2).toDF("col").write.saveAsTable("t2")
    scala> spark.sql("SELECT * FROM t1, t2 WHERE t1.col1 = 1 AND 1 = t1.col2 AND t1.col1 = t2.col AND t1.col2 = t2.col").explain(true)
    ```
    
    In this query, `InferFiltersFromConstraints` infers a new constraint '(col2#33 = col1#32)' that is appended to the join condition. `PushPredicateThroughJoin` then pushes it down, `ConstantPropagation` replaces '(col2#33 = col1#32)' with '1 = 1' based on other propagated constraints, `ConstantFolding` replaces '1 = 1' with 'true', and `BooleanSimplification` finally removes this predicate. However, `InferFiltersFromConstraints` infers '(col2#33 = col1#32)' again on the next iteration, and the process continues until the iteration limit is reached.
    See below for more details:
    
    ```
    === Applying Rule org.apache.spark.sql.catalyst.optimizer.InferFiltersFromConstraints ===
    !Join Inner, ((col1#32 = col#34) && (col2#33 = col#34))                                       Join Inner, ((col2#33 = col1#32) && ((col1#32 = col#34) && (col2#33 = col#34)))
     :- Filter ((isnotnull(col1#32) && isnotnull(col2#33)) && ((col1#32 = 1) && (1 = col2#33)))   :- Filter ((isnotnull(col1#32) && isnotnull(col2#33)) && ((col1#32 = 1) && (1 = col2#33)))
     :  +- Relation[col1#32,col2#33] parquet                                                      :  +- Relation[col1#32,col2#33] parquet
     +- Filter ((1 = col#34) && isnotnull(col#34))                                                +- Filter ((1 = col#34) && isnotnull(col#34))
        +- Relation[col#34] parquet                                                                  +- Relation[col#34] parquet
                    
    
    === Applying Rule org.apache.spark.sql.catalyst.optimizer.PushPredicateThroughJoin ===
    !Join Inner, ((col2#33 = col1#32) && ((col1#32 = col#34) && (col2#33 = col#34)))              Join Inner, ((col1#32 = col#34) && (col2#33 = col#34))
    !:- Filter ((isnotnull(col1#32) && isnotnull(col2#33)) && ((col1#32 = 1) && (1 = col2#33)))   :- Filter (col2#33 = col1#32)
    !:  +- Relation[col1#32,col2#33] parquet                                                      :  +- Filter ((isnotnull(col1#32) && isnotnull(col2#33)) && ((col1#32 = 1) && (1 = col2#33)))
    !+- Filter ((1 = col#34) && isnotnull(col#34))                                                :     +- Relation[col1#32,col2#33] parquet
    !   +- Relation[col#34] parquet                                                               +- Filter ((1 = col#34) && isnotnull(col#34))
    !                                                                                                +- Relation[col#34] parquet
                    
    
    === Applying Rule org.apache.spark.sql.catalyst.optimizer.CombineFilters ===
     Join Inner, ((col1#32 = col#34) && (col2#33 = col#34))                                          Join Inner, ((col1#32 = col#34) && (col2#33 = col#34))
    !:- Filter (col2#33 = col1#32)                                                                   :- Filter (((isnotnull(col1#32) && isnotnull(col2#33)) && ((col1#32 = 1) && (1 = col2#33))) && (col2#33 = col1#32))
    !:  +- Filter ((isnotnull(col1#32) && isnotnull(col2#33)) && ((col1#32 = 1) && (1 = col2#33)))   :  +- Relation[col1#32,col2#33] parquet
    !:     +- Relation[col1#32,col2#33] parquet                                                      +- Filter ((1 = col#34) && isnotnull(col#34))
    !+- Filter ((1 = col#34) && isnotnull(col#34))                                                      +- Relation[col#34] parquet
    !   +- Relation[col#34] parquet                                                                  
                    
    
    === Applying Rule org.apache.spark.sql.catalyst.optimizer.ConstantPropagation ===
     Join Inner, ((col1#32 = col#34) && (col2#33 = col#34))                                                                Join Inner, ((col1#32 = col#34) && (col2#33 = col#34))
    !:- Filter (((isnotnull(col1#32) && isnotnull(col2#33)) && ((col1#32 = 1) && (1 = col2#33))) && (col2#33 = col1#32))   :- Filter (((isnotnull(col1#32) && isnotnull(col2#33)) && ((col1#32 = 1) && (1 = col2#33))) && (1 = 1))
     :  +- Relation[col1#32,col2#33] parquet                                                                               :  +- Relation[col1#32,col2#33] parquet
     +- Filter ((1 = col#34) && isnotnull(col#34))                                                                         +- Filter ((1 = col#34) && isnotnull(col#34))
        +- Relation[col#34] parquet                                                                                           +- Relation[col#34] parquet
                    
    
    === Applying Rule org.apache.spark.sql.catalyst.optimizer.ConstantFolding ===
     Join Inner, ((col1#32 = col#34) && (col2#33 = col#34))                                                    Join Inner, ((col1#32 = col#34) && (col2#33 = col#34))
    !:- Filter (((isnotnull(col1#32) && isnotnull(col2#33)) && ((col1#32 = 1) && (1 = col2#33))) && (1 = 1))   :- Filter (((isnotnull(col1#32) && isnotnull(col2#33)) && ((col1#32 = 1) && (1 = col2#33))) && true)
     :  +- Relation[col1#32,col2#33] parquet                                                                   :  +- Relation[col1#32,col2#33] parquet
     +- Filter ((1 = col#34) && isnotnull(col#34))                                                             +- Filter ((1 = col#34) && isnotnull(col#34))
        +- Relation[col#34] parquet                                                                               +- Relation[col#34] parquet
                    
    
    === Applying Rule org.apache.spark.sql.catalyst.optimizer.BooleanSimplification ===
     Join Inner, ((col1#32 = col#34) && (col2#33 = col#34))                                                 Join Inner, ((col1#32 = col#34) && (col2#33 = col#34))
    !:- Filter (((isnotnull(col1#32) && isnotnull(col2#33)) && ((col1#32 = 1) && (1 = col2#33))) && true)   :- Filter ((isnotnull(col1#32) && isnotnull(col2#33)) && ((col1#32 = 1) && (1 = col2#33)))
     :  +- Relation[col1#32,col2#33] parquet                                                                :  +- Relation[col1#32,col2#33] parquet
     +- Filter ((1 = col#34) && isnotnull(col#34))                                                          +- Filter ((1 = col#34) && isnotnull(col#34))
        +- Relation[col#34] parquet                                                                            +- Relation[col#34] parquet
                    
    
    === Applying Rule org.apache.spark.sql.catalyst.optimizer.InferFiltersFromConstraints ===
    !Join Inner, ((col1#32 = col#34) && (col2#33 = col#34))                                       Join Inner, ((col2#33 = col1#32) && ((col1#32 = col#34) && (col2#33 = col#34)))
     :- Filter ((isnotnull(col1#32) && isnotnull(col2#33)) && ((col1#32 = 1) && (1 = col2#33)))   :- Filter ((isnotnull(col1#32) && isnotnull(col2#33)) && ((col1#32 = 1) && (1 = col2#33)))
     :  +- Relation[col1#32,col2#33] parquet                                                      :  +- Relation[col1#32,col2#33] parquet
     +- Filter ((1 = col#34) && isnotnull(col#34))                                                +- Filter ((1 = col#34) && isnotnull(col#34))
        +- Relation[col#34] parquet  
    ```
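
    As a rough illustration of the filtering idea (a minimal sketch, not the code in this patch and not Catalyst's API), the example constraints `a = 1`, `b = 1`, `a = c`, and `b = c` can be modeled with a small hypothetical ADT. All names here (`TrivialConstraintSketch`, `Col`, `Lit`, `Eq`, `inferAdditional`, `isTriviallyTrue`, `inferNonTrivial`) are assumptions made up for this sketch:

    ```scala
    // Hypothetical, self-contained model; none of these names are Catalyst classes.
    object TrivialConstraintSketch {
      sealed trait Term
      final case class Col(name: String) extends Term
      final case class Lit(value: Int) extends Term

      // An equality constraint between two terms, e.g. Eq(Col("a"), Lit(1)).
      final case class Eq(left: Term, right: Term)

      // Order the two sides so that `x = y` and `y = x` compare equal.
      // Columns sort before literals because "Col(...)" < "Lit(...)".
      def normalize(e: Eq): Eq =
        if (e.left.toString <= e.right.toString) e else Eq(e.right, e.left)

      // One round of transitive inference: from `x = t` and `y = t`, derive `x = y`.
      def inferAdditional(constraints: Set[Eq]): Set[Eq] = {
        val known = constraints.map(normalize)
        // For every term, collect the terms it is directly equated with.
        val equalTo: Map[Term, Set[Term]] = known.toSeq
          .flatMap(e => Seq(e.left -> e.right, e.right -> e.left))
          .groupBy(_._1)
          .map { case (term, pairs) => term -> pairs.map(_._2).toSet }
        val inferred: Set[Eq] = equalTo.values.flatMap { peers =>
          for {
            x <- peers
            y <- peers
            if x != y && (x.isInstanceOf[Col] || y.isInstanceOf[Col])
          } yield normalize(Eq(x, y))
        }.toSet
        inferred -- known
      }

      // A constraint is trivially true if both sides become identical once every
      // column that `known` already pins to a literal is replaced by that literal.
      def isTriviallyTrue(e: Eq, known: Set[Eq]): Boolean = {
        val constants: Map[Term, Term] = known.map(normalize).collect {
          case Eq(c: Col, l: Lit) => (c: Term) -> (l: Term)
        }.toMap
        def subst(t: Term): Term = constants.getOrElse(t, t)
        subst(e.left) == subst(e.right)
      }

      // Inferred constraints with the trivially true ones filtered out.
      def inferNonTrivial(constraints: Set[Eq]): Set[Eq] =
        inferAdditional(constraints).filterNot(isTriviallyTrue(_, constraints))
    }
    ```

    In this model, `inferAdditional` derives both `a = b` and `c = 1` from the four example constraints, while `inferNonTrivial` keeps only `c = 1`, because `a = b` reduces to `1 = 1` once `a` and `b` are replaced by their known constant.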
    
    ## How was this patch tested?
    Added tests in `InferFiltersFromConstraintsSuite`.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/maropu/spark SPARK-21652

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/18882.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #18882
    
----
commit d253e40788b9e3408c106eff0ba84ae97d715cbb
Author: Takeshi Yamamuro <ya...@apache.org>
Date:   2017-08-08T11:08:38Z

    Should not infer the constraints that are trivially true

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #18882: [SPARK-21652][SQL] Filter out meaningless constraints in...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18882
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/80389/
    Test FAILed.


[GitHub] spark issue #18882: [SPARK-21652][SQL] Filter out meaningless constraints in...

Posted by gatorsmile <gi...@git.apache.org>.

Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/18882
  
    Thanks for working on it, but the inferred one is not useless. The removal has to be cost based.


[GitHub] spark issue #18882: [SPARK-21652][SQL] Filter out meaningless constraints in...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18882
  
    **[Test build #80389 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80389/testReport)** for PR 18882 at commit [`d253e40`](https://github.com/apache/spark/commit/d253e40788b9e3408c106eff0ba84ae97d715cbb).
     * This patch **fails to build**.
     * This patch merges cleanly.
     * This patch adds no public classes.


[GitHub] spark issue #18882: [SPARK-21652][SQL] Filter out meaningless constraints in...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18882
  
    Merged build finished. Test PASSed.


[GitHub] spark issue #18882: [SPARK-21652][SQL] Filter out meaningless constraints in...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18882
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/80390/
    Test PASSed.


[GitHub] spark issue #18882: [SPARK-21652][SQL] Filter out meaningless constraints in...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18882
  
    **[Test build #80390 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80390/testReport)** for PR 18882 at commit [`dc2112f`](https://github.com/apache/spark/commit/dc2112f6a26125cd6a67eac79cef91751ac639f8).


[GitHub] spark issue #18882: [SPARK-21652][SQL] Filter out meaningless constraints in...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18882
  
    **[Test build #80390 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80390/testReport)** for PR 18882 at commit [`dc2112f`](https://github.com/apache/spark/commit/dc2112f6a26125cd6a67eac79cef91751ac639f8).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


[GitHub] spark issue #18882: [SPARK-21652][SQL] Filter out meaningless constraints in...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18882
  
    Merged build finished. Test FAILed.


[GitHub] spark issue #18882: [SPARK-21652][SQL] Filter out meaningless constraints in...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18882
  
    **[Test build #80389 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80389/testReport)** for PR 18882 at commit [`d253e40`](https://github.com/apache/spark/commit/d253e40788b9e3408c106eff0ba84ae97d715cbb).


[GitHub] spark pull request #18882: [SPARK-21652][SQL] Filter out meaningless constra...

Posted by maropu <gi...@git.apache.org>.

Github user maropu closed the pull request at:

    https://github.com/apache/spark/pull/18882


[GitHub] spark issue #18882: [SPARK-21652][SQL] Filter out meaningless constraints in...

Posted by maropu <gi...@git.apache.org>.

Github user maropu commented on the issue:

    https://github.com/apache/spark/pull/18882
  
    Is there any ongoing work on cost-based inference? Anyway, thanks! I'll close this for now.

