You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by viirya <gi...@git.apache.org> on 2017/02/03 07:27:55 UTC

[GitHub] spark pull request #16785: [SPARK-19443][SQL] The function to generate const...

GitHub user viirya opened a pull request:

    https://github.com/apache/spark/pull/16785

    [SPARK-19443][SQL] The function to generate constraints takes too long when the query plan grows continuously

    ## What changes were proposed in this pull request?
    
    This issue is originally reported and discussed at http://apache-spark-developers-list.1001551.n3.nabble.com/SQL-ML-Pipeline-performance-regression-between-1-6-and-2-x-tc20803.html#a20821
    
    When run a ML `Pipeline` with many stages, during the iterative updating to `Dataset` , it is observed the it takes longer time to finish the fit and transform as the query plan grows continuously.
    
    The example code show as the following in benchmark.
    
    Specially, the time spent on preparing optimized plan in current branch (74294 ms) is much higher than 1.6 (292 ms). Actually, the time is spent mostly on generating query plan's constraints during few optimization rules.
    
    `getAliasedConstraints` is found to be a function costing most of the running time.
    
    This patch tries to rewrite `getAliasedConstraints`. After this patch, the time to preparing optimized plan is reduced significantly from 74294 ms to 2573 ms.
    
    ### Benchmark
    
    Run the following codes locally.
    
        import org.apache.spark.ml.{Pipeline, PipelineStage}
        import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer, VectorAssembler}
    
        val df = (1 to 40).foldLeft(Seq((1, "foo"), (2, "bar"), (3, "baz")).toDF("id", "x0"))((df, i) => df.withColumn(s"x$i", $"x0"))
    
        val indexers = df.columns.tail.map(c => new StringIndexer()
          .setInputCol(c)
          .setOutputCol(s"${c}_indexed")
          .setHandleInvalid("skip"))
    
        val encoders = indexers.map(indexer => new OneHotEncoder()
          .setInputCol(indexer.getOutputCol)
          .setOutputCol(s"${indexer.getOutputCol}_encoded")
          .setDropLast(true))
    
        val stages: Array[PipelineStage] = indexers ++ encoders
        val pipeline = new Pipeline().setStages(stages)
    
        pipeline.fit(df).transform(df).show
    
    
    ## How was this patch tested?
    
    Jenkins tests.
    
    Please review http://spark.apache.org/contributing.html before opening a pull request.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/viirya/spark-1 improve-constraints-generation

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/16785.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #16785
    
----
commit b4e514ade7ea478055db448bbf66f7a88caf3a86
Author: Liang-Chi Hsieh <vi...@gmail.com>
Date:   2017-02-03T07:08:47Z

    Improve the code to generate constraints.

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16785: [SPARK-19443][SQL] The function to generate constraints ...

Posted by hvanhovell <gi...@git.apache.org>.
Github user hvanhovell commented on the issue:

    https://github.com/apache/spark/pull/16785
  
    cc @sameeragarwal 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16785: [SPARK-19443][SQL] The function to generate constraints ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16785
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16785: [SPARK-19443][SQL] The function to generate constraints ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16785
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/73034/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16785: [SPARK-19443][SQL][WIP] The function to generate constra...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16785
  
    **[Test build #72625 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72625/testReport)** for PR 16785 at commit [`8c98a5c`](https://github.com/apache/spark/commit/8c98a5c3ab1477408988c8cb682733e65dd554fc).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16785: [SPARK-19443][SQL] The function to generate constraints ...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/16785
  
    @hvanhovell Yeah, I think so. As in previous comment, I don't find a way now to improve `getAliasedConstraints` significantly by re-writing its logic without parallel collection.
    
    We may consider #16775 which is an another solution to fix this issue by checkpointing datasets for pipelines of long stages.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16785: [SPARK-19443][SQL] The function to generate constraints ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16785
  
    **[Test build #73035 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73035/testReport)** for PR 16785 at commit [`278c31c`](https://github.com/apache/spark/commit/278c31cf8aa27c71e0f5178bebcb426ec5fba6ce).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #16785: [SPARK-19443][SQL] The function to generate const...

Posted by mallman <gi...@git.apache.org>.
Github user mallman commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16785#discussion_r100364260
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlan.scala ---
    @@ -314,19 +322,29 @@ abstract class UnaryNode extends LogicalPlan {
        * expressions with the corresponding alias
        */
       protected def getAliasedConstraints(projectList: Seq[NamedExpression]): Set[Expression] = {
    -    var allConstraints = child.constraints.asInstanceOf[Set[Expression]]
    -    projectList.foreach {
    -      case a @ Alias(e, _) =>
    -        // For every alias in `projectList`, replace the reference in constraints by its attribute.
    -        allConstraints ++= allConstraints.map(_ transform {
    -          case expr: Expression if expr.semanticEquals(e) =>
    -            a.toAttribute
    -        })
    -        allConstraints += EqualNullSafe(e, a.toAttribute)
    -      case _ => // Don't change.
    -    }
    -
    -    allConstraints -- child.constraints
    +    val relativeReferences = AttributeSet(projectList.collect {
    +      case a: Alias => a
    +    }.flatMap(_.references))
    +    val parAllConstraints = child.constraints.asInstanceOf[Set[Expression]].filter { constraint =>
    +      constraint.references.intersect(relativeReferences).nonEmpty
    +    }.par
    +    parAllConstraints.tasksupport = UnaryNode.taskSupport
    --- End diff --
    
    Why are we using a custom task support instead of the default (which uses the global fork-join executor)?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #16785: [SPARK-19443][SQL] The function to generate const...

Posted by viirya <gi...@git.apache.org>.
Github user viirya closed the pull request at:

    https://github.com/apache/spark/pull/16785


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16785: [SPARK-19443][SQL] The function to generate constraints ...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/16785
  
    > this looks like a very big hammer to solve this problem. Can't we try a different approach?
    I think we should try to avoid optimizing already optimized code snippets, you might be able to do this using some kind of a fence. It would even be better if we would have a recursive node.
    
    @cloud-fan @hvanhovell Ok. I've figured out to add a check to reduce the candidates of aliased constraints. It can achieve same speed-up (cut of half running time in benchmark) without parallel collection hammer.
    
    Can you have time to look at it? Thanks.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16785: [SPARK-19443][SQL][WIP] The function to generate constra...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/16785
  
    I don't find a way to improve `getAliasedConstraints` significantly by re-writing its logic. The current way to improve its performance is using parallel collection to do the transformation in parallel. It can cut the running time by half (see benchmark in the pr description), but the running time (13.5 secs) is still too long compared with 1.6.
    
    We may consider #16775 which is an another solution to fix this issue by checkpointing datasets for pipelines of long stages, or both of them.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16785: [SPARK-19443][SQL] The function to generate constraints ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16785
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/72303/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16785: [SPARK-19443][SQL] The function to generate constraints ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16785
  
    **[Test build #72305 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72305/testReport)** for PR 16785 at commit [`b4e514a`](https://github.com/apache/spark/commit/b4e514ade7ea478055db448bbf66f7a88caf3a86).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16785: [SPARK-19443][SQL] The function to generate constraints ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16785
  
    **[Test build #72305 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72305/testReport)** for PR 16785 at commit [`b4e514a`](https://github.com/apache/spark/commit/b4e514ade7ea478055db448bbf66f7a88caf3a86).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16785: [SPARK-19443][SQL][WIP] The function to generate constra...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16785
  
    **[Test build #72625 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72625/testReport)** for PR 16785 at commit [`8c98a5c`](https://github.com/apache/spark/commit/8c98a5c3ab1477408988c8cb682733e65dd554fc).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16785: [SPARK-19443][SQL] The function to generate constraints ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16785
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/72305/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #16785: [SPARK-19443][SQL] The function to generate const...

Posted by sameeragarwal <gi...@git.apache.org>.
Github user sameeragarwal commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16785#discussion_r102361204
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlan.scala ---
    @@ -314,7 +314,17 @@ abstract class UnaryNode extends LogicalPlan {
        * expressions with the corresponding alias
        */
       protected def getAliasedConstraints(projectList: Seq[NamedExpression]): Set[Expression] = {
    -    var allConstraints = child.constraints.asInstanceOf[Set[Expression]]
    +    val relativeReferences = AttributeSet(projectList.collect {
    +      case a: Alias => a
    +    }.flatMap(_.references)) ++ outputSet
    +
    +    // We only care about the constraints which refer to attributes in output and aliases.
    +    // For example, for a constraint 'a > b', if 'a' is aliased to 'c', we need to get aliased
    +    // constraint 'c > b' only if 'b' is in output.
    +    var allConstraints = child.constraints.filter { constraint =>
    +      constraint.references.subsetOf(relativeReferences)
    --- End diff --
    
    Hi @viirya, I understand that https://github.com/apache/spark/pull/16998 probably supersedes this, but just out of curiosity, did you see a lot of benefit from pruning these attributes here given that we already prune them later in `QueryPlan`?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16785: [SPARK-19443][SQL] The function to generate constraints ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16785
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16785: [SPARK-19443][SQL] The function to generate constraints ...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/16785
  
    can we consider this in a higher-level of view instead of focusing on the method `getAliasedConstraints`? Maybe there is a way to do constraint propagation faster.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16785: [SPARK-19443][SQL] The function to generate constraints ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16785
  
    **[Test build #73035 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73035/testReport)** for PR 16785 at commit [`278c31c`](https://github.com/apache/spark/commit/278c31cf8aa27c71e0f5178bebcb426ec5fba6ce).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16785: [SPARK-19443][SQL] The function to generate constraints ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16785
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/72625/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16785: [SPARK-19443][SQL] The function to generate constraints ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16785
  
    **[Test build #72303 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72303/testReport)** for PR 16785 at commit [`b4e514a`](https://github.com/apache/spark/commit/b4e514ade7ea478055db448bbf66f7a88caf3a86).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16785: [SPARK-19443][SQL] The function to generate constraints ...

Posted by hvanhovell <gi...@git.apache.org>.
Github user hvanhovell commented on the issue:

    https://github.com/apache/spark/pull/16785
  
    @viirya this looks like a very big hammer to solve this problem. Can't we try a different approach?
    
    I think we should try to avoid optimizing already optimized code snippets, you might be able to do this using some kind of a fence. It would even be better if we would have a recursive node.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16785: [SPARK-19443][SQL] The function to generate constraints ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16785
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16785: [SPARK-19443][SQL] The function to generate constraints ...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/16785
  
    retest this please.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16785: [SPARK-19443][SQL] The function to generate constraints ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16785
  
    **[Test build #73034 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73034/testReport)** for PR 16785 at commit [`4ba93fe`](https://github.com/apache/spark/commit/4ba93fecfcbdeebecc9526a90b5800c98b3f35a7).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16785: [SPARK-19443][SQL] The function to generate constraints ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16785
  
    **[Test build #73034 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73034/testReport)** for PR 16785 at commit [`4ba93fe`](https://github.com/apache/spark/commit/4ba93fecfcbdeebecc9526a90b5800c98b3f35a7).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16785: [SPARK-19443][SQL][WIP] The function to generate constra...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/16785
  
    The rewritten logic is not correct. I am working to improve this with other approach.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16785: [SPARK-19443][SQL] The function to generate constraints ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16785
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16785: [SPARK-19443][SQL][WIP] The function to generate constra...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/16785
  
    since this change is related to SQL, cc @cloud-fan @hvanhovell 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16785: [SPARK-19443][SQL] The function to generate constraints ...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/16785
  
    @cloud-fan yeah, i agreed with you and @hvanhovell.
    
    For too slow constraint propagation, except for attacking `getAliasedConstraints` like this change, maybe we can have other way to improve the process doing constraint propagation.
    
    If we can't, for such long lineages, I think we should use checkpointing to fix it like #16775.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16785: [SPARK-19443][SQL] The function to generate constraints ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16785
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/73035/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16785: [SPARK-19443][SQL] The function to generate constraints ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16785
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #16785: [SPARK-19443][SQL] The function to generate const...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16785#discussion_r102364604
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlan.scala ---
    @@ -314,7 +314,17 @@ abstract class UnaryNode extends LogicalPlan {
        * expressions with the corresponding alias
        */
       protected def getAliasedConstraints(projectList: Seq[NamedExpression]): Set[Expression] = {
    -    var allConstraints = child.constraints.asInstanceOf[Set[Expression]]
    +    val relativeReferences = AttributeSet(projectList.collect {
    +      case a: Alias => a
    +    }.flatMap(_.references)) ++ outputSet
    +
    +    // We only care about the constraints which refer to attributes in output and aliases.
    +    // For example, for a constraint 'a > b', if 'a' is aliased to 'c', we need to get aliased
    +    // constraint 'c > b' only if 'b' is in output.
    +    var allConstraints = child.constraints.filter { constraint =>
    +      constraint.references.subsetOf(relativeReferences)
    --- End diff --
    
    Yes. You can see the benchmark in the pr description. With pruning these attributes, the running time is cut half.
    
    If I understand your comment correctly, pruning them later in `QueryPlan` means we prune constraints which don't refer attributes in `outputSet`.
    
    But the pruning here is happened before the pruning you pointed out, we need to reduce the constraints taken for transforming aliasing attributes to lower the computation cost.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #16785: [SPARK-19443][SQL] The function to generate const...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16785#discussion_r100484586
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlan.scala ---
    @@ -314,19 +322,29 @@ abstract class UnaryNode extends LogicalPlan {
        * expressions with the corresponding alias
        */
       protected def getAliasedConstraints(projectList: Seq[NamedExpression]): Set[Expression] = {
    -    var allConstraints = child.constraints.asInstanceOf[Set[Expression]]
    -    projectList.foreach {
    -      case a @ Alias(e, _) =>
    -        // For every alias in `projectList`, replace the reference in constraints by its attribute.
    -        allConstraints ++= allConstraints.map(_ transform {
    -          case expr: Expression if expr.semanticEquals(e) =>
    -            a.toAttribute
    -        })
    -        allConstraints += EqualNullSafe(e, a.toAttribute)
    -      case _ => // Don't change.
    -    }
    -
    -    allConstraints -- child.constraints
    +    val relativeReferences = AttributeSet(projectList.collect {
    +      case a: Alias => a
    +    }.flatMap(_.references))
    +    val parAllConstraints = child.constraints.asInstanceOf[Set[Expression]].filter { constraint =>
    +      constraint.references.intersect(relativeReferences).nonEmpty
    +    }.par
    +    parAllConstraints.tasksupport = UnaryNode.taskSupport
    --- End diff --
    
    Do they have the same parallelism level? BTW, I saw the parallel collection used in other places in Spark all take custom task support.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org