You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by viirya <gi...@git.apache.org> on 2017/02/20 09:02:25 UTC

[GitHub] spark pull request #16998: [SPARK-19665][SQL] Improve constraint propagation

GitHub user viirya opened a pull request:

    https://github.com/apache/spark/pull/16998

    [SPARK-19665][SQL] Improve constraint propagation

    ## What changes were proposed in this pull request?
    
    If there are aliased expression in the projection, we propagate constraints by completely expanding the original constraints with aliases.
    
    This expanding costs much computation time when the number of aliases increases.
    
    Fully expanding all constraints at all the time makes iterative ML algorithms where a ML pipeline with many stages runs very slow. See #16785.
    
    Another issue is we actually don't need the additional constraints at most of time. For example, if there is a constraint "a > b", and "a" is aliased to "c" and "d". When we use this constraint in filtering, we don't need all constraints "a > b", "c > b", "d > b". We only need "a > b" because if it is false, it is guaranteed that all other constraints are false too.
    
    ### Benchmark
    
    Run the following codes locally.
    
        import org.apache.spark.ml.{Pipeline, PipelineStage}
        import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer, VectorAssembler}
    
        val df = (1 to 40).foldLeft(Seq((1, "foo"), (2, "bar"), (3, "baz")).toDF("id", "x0"))((df, i) => df.withColumn(s"x$i", $"x0"))
    
        val indexers = df.columns.tail.map(c => new StringIndexer()
          .setInputCol(c)
          .setOutputCol(s"${c}_indexed")
          .setHandleInvalid("skip"))
    
        val encoders = indexers.map(indexer => new OneHotEncoder()
          .setInputCol(indexer.getOutputCol)
          .setOutputCol(s"${indexer.getOutputCol}_encoded")
          .setDropLast(true))
    
        val stages: Array[PipelineStage] = indexers ++ encoders
        val pipeline = new Pipeline().setStages(stages)
    
        val startTime = System.nanoTime
        pipeline.fit(df).transform(df).show
        val runningTime = System.nanoTime - startTime
    
    Before this patch: 1786001 ms
    After this patch: 49972 ms
    
    ## How was this patch tested?
    
    (Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)
    (If this patch involves UI changes, please attach a screenshot; otherwise, remove this)
    
    Please review http://spark.apache.org/contributing.html before opening a pull request.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/viirya/spark-1 improve-constraints-generation-2

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/16998.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #16998
    
----
commit b4e514ade7ea478055db448bbf66f7a88caf3a86
Author: Liang-Chi Hsieh <vi...@gmail.com>
Date:   2017-02-03T07:08:47Z

    Improve the code to generate constraints.

commit 8c98a5c3ab1477408988c8cb682733e65dd554fc
Author: Liang-Chi Hsieh <vi...@gmail.com>
Date:   2017-02-09T04:07:09Z

    Use parallel collection to improve the function.

commit 1b9c5616633909ad4bdac4cff1534f0bee548a7a
Author: Liang-Chi Hsieh <vi...@gmail.com>
Date:   2017-02-17T03:18:21Z

    Merge remote-tracking branch 'upstream/master' into improve-constraints-generation

commit 278c31cf8aa27c71e0f5178bebcb426ec5fba6ce
Author: Liang-Chi Hsieh <vi...@gmail.com>
Date:   2017-02-17T03:23:39Z

    Revert parallel collection approach. Reduce aliased constraints.

commit 24fb723207d80a7c6068fd113430488b89ed9d0b
Author: Liang-Chi Hsieh <vi...@gmail.com>
Date:   2017-02-20T08:32:53Z

    Improve constraint propagation.

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16998: [SPARK-19665][SQL][WIP] Improve constraint propagation

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16998
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16998: [SPARK-19665][SQL][WIP] Improve constraint propagation

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/16998
  
    > I do not get your points. What does this mean? Constraint propagation is a bottom up mechanism for inferring the constraints. Can you elaborate your idea in the more formal way.
    
    We fully expand the constraints with aliased attributes now. For example, if there is a constraint "a > b", and current query plan aliases "a" to "c" and "d". The final constraints of this plan is "a > b", "c > b", "d > b".
    
    The values of those constraints are all the same, either all true or all false. So in case of inferring filters from the constraints, we only need "a > b", other aliased constraints "c > b", "d > b" are not necessary.
    
    > I did not read the code. Just wondering if we could miss the chance of plan optimization after this PR? What is the negative impact, if exists?
    
    The only one optimization I think would be affected is `PruneFilters`. `PruneFilters` will prune a condition if its child's constraints already contain the condition. Using above example to elaborate, if there is a `Filter` above the query plan and its condition is "c > b". As we only have "a > b" in the query plan's constraint, we can't prune the condition and the `Filter`.
    
    However, this is not a big impact and it can be easily solved. We can use a simple method to inquire if a given condition like "c > b" is contained in the fully expanded constraints of a query plan, without really fully expanding the constraints.
    



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16998: [SPARK-19665][SQL][WIP] Improve constraint propagation

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16998
  
    **[Test build #73163 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73163/testReport)** for PR 16998 at commit [`d691c66`](https://github.com/apache/spark/commit/d691c66dd0092ad99751964a6b079193706b953c).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16998: [SPARK-19665][SQL][WIP] Improve constraint propagation

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/16998
  
    cc @cloud-fan @hvanhovell @sameeragarwal


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16998: [SPARK-19665][SQL][WIP] Improve constraint propagation

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16998
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/73174/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16998: [SPARK-19665][SQL][WIP] Improve constraint propagation

Posted by hvanhovell <gi...@git.apache.org>.
Github user hvanhovell commented on the issue:

    https://github.com/apache/spark/pull/16998
  
    @viirya does this PR supersede #16785? I do like the non-parallel approach. I will try to take a more in-depth look at the end of the week (beginning of the next sprint).  


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #16998: [SPARK-19665][SQL] Improve constraint propagation

Posted by viirya <gi...@git.apache.org>.
Github user viirya closed the pull request at:

    https://github.com/apache/spark/pull/16998


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16998: [SPARK-19665][SQL] Improve constraint propagation

Posted by chriso <gi...@git.apache.org>.
Github user chriso commented on the issue:

    https://github.com/apache/spark/pull/16998
  
    I just run into the same issue with `spark-2.1.0`
    
    Here's a minimal test case:
    
    ```scala
    val max = 12 // try increasing this
    
    val filter = (for (i <- 0 to max) yield col("value") <=> i) reduce (_ || _)
    val projections = for (i <- 0 to max) yield (col("value") <=> i).as(s"value_$i")
    
    val otherFilter = lit(true) // this can be anything
    
    val df = Seq.empty[Int].toDF
    val result = df.filter(otherFilter)
      .select(col("value") +: projections: _*)
      .filter(filter).filter(otherFilter)
    
    result.explain
    ```
    
    The final `explain` call hangs [here](https://github.com/apache/spark/blob/db0ddce523bb823cba996e92ef36ceca31492d2c/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlan.scala#L325-L328).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16998: [SPARK-19665][SQL] Improve constraint propagation

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/16998
  
    > Not really. Constraint propagation will still be enabled by default in Spark. The flag would just be a hammer to quickly get around issues like this and SPARK-17733.
    
    Yeah, of course. I meant that when you disable the flag, you would enjoy the optimization relying on constraint propagation.
    
    I will create another PR for this option.
    
    > I'll take a closer look at this patch but given that this PR is primarily introducing a data structure that keeps track of aliased constraints, is there a fundamental reason for changing the underlying behavior (and restricting the optimization space)? Can there be a simpler alternative where we can still keep the old semantics?
    
    I don't find an alternative fixing to keep the old semantics and not change the propagation structure, and also can largely improve performance at the same time.
    
    #16785 keeps the old semantics and not change the propagation structure, but it just can cut half of the running time regarding the benchmark.
    
    Adding the flag is one simpler option.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16998: [SPARK-19665][SQL][WIP] Improve constraint propagation

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16998
  
    **[Test build #73174 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73174/testReport)** for PR 16998 at commit [`6cb896f`](https://github.com/apache/spark/commit/6cb896ff062c0a0c46f6f6ac4b88fad165eeaac0).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16998: [SPARK-19665][SQL][WIP] Improve constraint propagation

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16998
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/73316/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16998: [SPARK-19665][SQL][WIP] Improve constraint propagation

Posted by sameeragarwal <gi...@git.apache.org>.
Github user sameeragarwal commented on the issue:

    https://github.com/apache/spark/pull/16998
  
    By the way, as an aside we should probably allow constraint inference/propagation to be turned off via a conf flag to provide a quick work around against these kind of problems.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16998: [SPARK-19665][SQL][WIP] Improve constraint propagation

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/16998
  
    @hvanhovell Yes. #16785 only does a limited improvement. Both #16785 and this are non-parallel approach.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16998: [SPARK-19665][SQL][WIP] Improve constraint propagation

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16998
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16998: [SPARK-19665][SQL][WIP] Improve constraint propagation

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16998
  
    **[Test build #73159 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73159/testReport)** for PR 16998 at commit [`917de74`](https://github.com/apache/spark/commit/917de74db0066f015ac814125f5cb2d85b7a5b85).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16998: [SPARK-19665][SQL][WIP] Improve constraint propagation

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16998
  
    **[Test build #73174 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73174/testReport)** for PR 16998 at commit [`6cb896f`](https://github.com/apache/spark/commit/6cb896ff062c0a0c46f6f6ac4b88fad165eeaac0).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16998: [SPARK-19665][SQL][WIP] Improve constraint propagation

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16998
  
    **[Test build #73316 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73316/testReport)** for PR 16998 at commit [`5be21b3`](https://github.com/apache/spark/commit/5be21b32d5b4e3e36e50317a385a554206967668).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16998: [SPARK-19665][SQL] Improve constraint propagation

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/16998
  
    @hvanhovell Do you have any thoughts on this already? Please let me know. Thanks!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16998: [SPARK-19665][SQL][WIP] Improve constraint propagation

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16998
  
    **[Test build #73316 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73316/testReport)** for PR 16998 at commit [`5be21b3`](https://github.com/apache/spark/commit/5be21b32d5b4e3e36e50317a385a554206967668).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16998: [SPARK-19665][SQL][WIP] Improve constraint propagation

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16998
  
    **[Test build #73158 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73158/testReport)** for PR 16998 at commit [`24fb723`](https://github.com/apache/spark/commit/24fb723207d80a7c6068fd113430488b89ed9d0b).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16998: [SPARK-19665][SQL] Improve constraint propagation

Posted by sameeragarwal <gi...@git.apache.org>.
Github user sameeragarwal commented on the issue:

    https://github.com/apache/spark/pull/16998
  
    > As we use constraints in optimization, if we turn off constraint inference/propagation, wouldn't it miss optimization chance for query plans?
    
    Not really. Constraint propagation will still be enabled by default in Spark. The flag would just be a hammer to quickly get around issues like this and SPARK-17733.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16998: [SPARK-19665][SQL][WIP] Improve constraint propagation

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16998
  
    **[Test build #73159 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73159/testReport)** for PR 16998 at commit [`917de74`](https://github.com/apache/spark/commit/917de74db0066f015ac814125f5cb2d85b7a5b85).
     * This patch **fails to build**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16998: [SPARK-19665][SQL][WIP] Improve constraint propagation

Posted by sameeragarwal <gi...@git.apache.org>.
Github user sameeragarwal commented on the issue:

    https://github.com/apache/spark/pull/16998
  
    @viirya please correct me if I'm wrong but scanning through this patch, it appears that the underlying problem is that duplicating and tracking aliased constraints using a `Set` tends to blow up quickly (causing regressions) and this patch is proposing an alternate data structure (`aliasedExpressionsInConstraints`) to keep track of aliases? For e.g., in your example where `a > b`, and `a` is aliased to `c` and `d`, we currently track constraints as `Set(a > b, c > b, d > b)` whereas you'd like it to be tracked as `Set(a > b)` and `Map(a, Set(c, d))`? Is that correct?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16998: [SPARK-19665][SQL][WIP] Improve constraint propagation

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16998
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16998: [SPARK-19665][SQL][WIP] Improve constraint propagation

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/16998
  
    > Another issue is we actually don't need the additional constraints at most of time. For example, if there is a constraint "a > b", and "a" is aliased to "c" and "d". When we use this constraint in filtering, we don't need all constraints "a > b", "c > b", "d > b". We only need "a > b" because if it is false, it is guaranteed that all other constraints are false too.
    
    I do not get your points. What does this mean? Constraint propagation is a bottom up mechanism for inferring the constraints. Can you elaborate your idea in the more formal way.
    
    I did not read the code. Just wondering if we could miss the chance of plan optimization after this PR? What is the negative impact, if exists?
    
    



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16998: [SPARK-19665][SQL][WIP] Improve constraint propagation

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/16998
  
    @sameeragarwal That's correct.
    
    > By the way, as an aside we should probably allow constraint inference/propagation to be turned off via a conf flag to provide a quick work around against these kind of problems.
    
    As we use constraints in optimization, if we turn off constraint inference/propagation, wouldn't it miss optimization chance for query plans?



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16998: [SPARK-19665][SQL] Improve constraint propagation

Posted by sameeragarwal <gi...@git.apache.org>.
Github user sameeragarwal commented on the issue:

    https://github.com/apache/spark/pull/16998
  
    @viirya I'll take a closer look at this patch but given that this PR is primarily introducing a data structure that keeps track of aliased constraints, is there a fundamental reason for changing the underlying behavior (and restricting the optimization space)? Can there be a simpler alternative where we can still keep the old semantics?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16998: [SPARK-19665][SQL][WIP] Improve constraint propagation

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16998
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/73159/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16998: [SPARK-19665][SQL][WIP] Improve constraint propagation

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16998
  
    **[Test build #73158 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73158/testReport)** for PR 16998 at commit [`24fb723`](https://github.com/apache/spark/commit/24fb723207d80a7c6068fd113430488b89ed9d0b).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16998: [SPARK-19665][SQL][WIP] Improve constraint propagation

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16998
  
    **[Test build #73163 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73163/testReport)** for PR 16998 at commit [`d691c66`](https://github.com/apache/spark/commit/d691c66dd0092ad99751964a6b079193706b953c).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16998: [SPARK-19665][SQL][WIP] Improve constraint propagation

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16998
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16998: [SPARK-19665][SQL][WIP] Improve constraint propagation

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16998
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/73163/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16998: [SPARK-19665][SQL] Improve constraint propagation

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/16998
  
    Once we've added the flag, this issue is not urgent for now. I close first.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16998: [SPARK-19665][SQL] Improve constraint propagation

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/16998
  
    @hvanhovell Do you have time to review this?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org