You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by liancheng <gi...@git.apache.org> on 2015/12/23 10:32:42 UTC

[GitHub] spark pull request: [SPARK-6624][WIP] Another alternative version ...

GitHub user liancheng opened a pull request:

    https://github.com/apache/spark/pull/10444

    [SPARK-6624][WIP] Another alternative version of CNF normalization

    This PR is a draft version of another alternative of CNF normalization based on [comment][1] in PR #8200. This PR doesn't include test cases, and is only for further discussion.
    
    In this version, CNF normalization is implemented as a separate function `Predicate.toCNF`, which accepts an optional expansion threshold to prevent exponential explosion. The motivation behind this design is that, CNF normalization itself can be useful in other use cases (e.g., eliminating common predicate factors). It would be convenient if we can call it from anywhere without involving the optimizer.
    
    Another consideration is that, if no expansion threshold is provided, `toCNF` should always return a predicate that is really in CNF. That's why a new `RuleExecutor` strategy `FixPoint.Unlimited` is added.
    
    [1]: https://github.com/apache/spark/pull/8200/files#r48328448

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/liancheng/spark cnf-draft

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/10444.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #10444
    
----
commit 0fb4beb99dc80e55cbd5ecba8cf0a87fc21b9b86
Author: Cheng Lian <li...@databricks.com>
Date:   2015-12-23T09:17:16Z

    Draft version of CNF normalization

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-6624][WIP] Draft of another alternative...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10444#issuecomment-166867696
  
    **[Test build #48236 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48236/consoleFull)** for PR 10444 at commit [`031278f`](https://github.com/apache/spark/commit/031278f9b8e8177640b3b6728a058eb153b9a2b4).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-6624][WIP] Draft of another alternative...

Posted by liancheng <gi...@git.apache.org>.
Github user liancheng commented on the pull request:

    https://github.com/apache/spark/pull/10444#issuecomment-166852843
  
    cc @nongli @rxin @marmbrus @yjshen


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-6624][WIP] Draft of another alternative...

Posted by liancheng <gi...@git.apache.org>.
Github user liancheng commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10444#discussion_r48332590
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/predicates.scala ---
    @@ -47,6 +48,34 @@ trait Predicate extends Expression {
       override def dataType: DataType = BooleanType
     }
     
    +object Predicate extends PredicateHelper {
    +  def toCNF(predicate: Expression, maybeThreshold: Option[Double] = None): Expression = {
    +    val cnf = new CNFExecutor(predicate).execute(predicate)
    +    val threshold = maybeThreshold.map(predicate.size * _).getOrElse(Double.MaxValue)
    +    if (cnf.size > threshold) predicate else cnf
    --- End diff --
    
    When the threshold is exceeded, the original predicate rather than the intermediate converted predicate is returned. This because the intermediate result may not be in CNF, thus:
    
    1. It doesn't bring much benefit for filter push-down, and
    2. It's much larger than the original predicate and brings extra evaluation cost.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-6624][WIP] Draft of another alternative...

Posted by nongli <gi...@git.apache.org>.
Github user nongli commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10444#discussion_r48503899
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/predicates.scala ---
    @@ -47,6 +48,34 @@ trait Predicate extends Expression {
       override def dataType: DataType = BooleanType
     }
     
    +object Predicate extends PredicateHelper {
    +  def toCNF(predicate: Expression, maybeThreshold: Option[Double] = None): Expression = {
    +    val cnf = new CNFExecutor(predicate).execute(predicate)
    +    val threshold = maybeThreshold.map(predicate.size * _).getOrElse(Double.MaxValue)
    +    if (cnf.size > threshold) predicate else cnf
    --- End diff --
    
    I disagree with 1. I don't see why it matters if it is all CNF or none. I think the heuristic we want is something like "maximize the number of simple predicates that are in CNF form". Simple here means contains just 1 attribute or binary predicate between two. These are candidates for benefiting from further optimization. 
    
    We could try cost basing this or just stopping the expansion after some amount.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-6624][WIP] Draft of another alternative...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10444#issuecomment-166850088
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/48235/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-6624][WIP] Another alternative version ...

Posted by liancheng <gi...@git.apache.org>.
Github user liancheng commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10444#discussion_r48332063
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/predicates.scala ---
    @@ -47,6 +48,34 @@ trait Predicate extends Expression {
       override def dataType: DataType = BooleanType
     }
     
    +object Predicate extends PredicateHelper {
    +  def toCNF(predicate: Expression, maybeThreshold: Option[Double] = None): Expression = {
    +    val cnf = new CNFExecutor(predicate).execute(predicate)
    +    val threshold = maybeThreshold.map(predicate.size * _).getOrElse(Double.MaxValue)
    +    if (cnf.size > threshold) predicate else cnf
    +  }
    +
    +  private class CNFNormalization(input: Expression)
    +    extends Rule[Expression] {
    +
    +    override def apply(tree: Expression): Expression = {
    +      import org.apache.spark.sql.catalyst.dsl.expressions._
    +
    +      tree transformDown {
    +        case Not(Not(e)) => e
    +        case Not(a And b) => !a || !b
    +        case Not(a Or b) => !a && !b
    +        case a Or (b And c) => (a || b) && (a && c)
    +        case (a And b) Or c => (a || c) && (b || c)
    +      }
    +    }
    +  }
    +
    +  private class CNFExecutor(input: Expression) extends RuleExecutor[Expression] {
    +    override protected val batches: Seq[Batch] =
    +      Batch("CNFNormalization", FixedPoint.Unlimited, new CNFNormalization(input)) :: Nil
    --- End diff --
    
    `FixedPoint.Unlimited` is used here to guarantee that we can really reach CNF when no expansion threshold is provided. This should be safe since all the rules defined here converge.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-6624][WIP] Draft of another alternative...

Posted by liancheng <gi...@git.apache.org>.
Github user liancheng closed the pull request at:

    https://github.com/apache/spark/pull/10444


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-6624][WIP] Draft of another alternative...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10444#issuecomment-166891798
  
    **[Test build #48238 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48238/consoleFull)** for PR 10444 at commit [`6e2c052`](https://github.com/apache/spark/commit/6e2c0524723b6689e17550dad2b515a6e7602cb4).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-6624][WIP] Draft of another alternative...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10444#issuecomment-166891883
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/48238/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-6624][WIP] Draft of another alternative...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10444#issuecomment-166891882
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-6624][WIP] Draft of another alternative...

Posted by liancheng <gi...@git.apache.org>.
Github user liancheng commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10444#discussion_r48516709
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/predicates.scala ---
    @@ -47,6 +48,34 @@ trait Predicate extends Expression {
       override def dataType: DataType = BooleanType
     }
     
    +object Predicate extends PredicateHelper {
    +  def toCNF(predicate: Expression, maybeThreshold: Option[Double] = None): Expression = {
    +    val cnf = new CNFExecutor(predicate).execute(predicate)
    +    val threshold = maybeThreshold.map(predicate.size * _).getOrElse(Double.MaxValue)
    +    if (cnf.size > threshold) predicate else cnf
    --- End diff --
    
    Maximizing the number of simple predicates sounds reasonable. We may do the conversion in a depth-first manner, i.e. always convert the left branch of an `And` and then its right branch, until either no more predicates can be converted or we reach the size limit. In this way the intermediate result is still useful.
    
    BTW, searched for CNF conversion in Hive and found [HIVE-9166][1], which also tries to put an upper limit for ORC SARG CNF conversion. @nongli Any clues about how Impala does this?
    
    [1]: https://issues.apache.org/jira/browse/HIVE-9166


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-6624][WIP] Draft of another alternative...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10444#issuecomment-166868043
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-6624][WIP] Draft of another alternative...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10444#issuecomment-166850081
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-6624][WIP] Draft of another alternative...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10444#issuecomment-166870843
  
    **[Test build #48238 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48238/consoleFull)** for PR 10444 at commit [`6e2c052`](https://github.com/apache/spark/commit/6e2c0524723b6689e17550dad2b515a6e7602cb4).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-6624][WIP] Draft of another alternative...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10444#issuecomment-166868044
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/48236/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-6624][WIP] Another alternative version ...

Posted by liancheng <gi...@git.apache.org>.
Github user liancheng commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10444#discussion_r48331968
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala ---
    @@ -583,6 +583,12 @@ object BooleanSimplification extends Rule[LogicalPlan] with PredicateHelper {
       }
     }
     
    +object CNFNormalization extends Rule[LogicalPlan] {
    +  override def apply(plan: LogicalPlan): LogicalPlan = plan transform {
    +    case f @ Filter(condition, _) => f.copy(condition = Predicate.toCNF(condition, Some(10)))
    --- End diff --
    
    Apparently, the expansion threshold can be made a configuration option.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-6624][WIP] Draft of another alternative...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10444#issuecomment-166850935
  
    **[Test build #48236 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48236/consoleFull)** for PR 10444 at commit [`031278f`](https://github.com/apache/spark/commit/031278f9b8e8177640b3b6728a058eb153b9a2b4).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org