You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by yhuai <gi...@git.apache.org> on 2015/06/02 05:56:41 UTC

[GitHub] spark pull request: [SPARK-8023] [SQL] Add a deterministic method ...

GitHub user yhuai opened a pull request:

    https://github.com/apache/spark/pull/6570

    [SPARK-8023] [SQL] Add a deterministic method to Expression.

    https://issues.apache.org/jira/browse/SPARK-8023

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/yhuai/spark SPARK-8023

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/6570.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #6570
    
----
commit f9d6a7387fa130b21e8e4f7dc489e3afd1534554
Author: Yin Huai <yh...@databricks.com>
Date:   2015-06-02T03:50:45Z

    Add a deterministic method to Expression.

commit e38f264db24405706c1f70d6278b1e73d13f9555
Author: Yin Huai <yh...@databricks.com>
Date:   2015-06-02T03:56:20Z

    Comment.

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-8023] [SQL] Add a deterministic method ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/6570#issuecomment-107830906
  
      [Test build #33955 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/33955/consoleFull) for   PR 6570 at commit [`da56200`](https://github.com/apache/spark/commit/da56200bc077d6564de53a8959c99768d0aa790c).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-8023] [SQL] Add a deterministic method ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/6570#issuecomment-107801246
  
    Merged build started.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-8023] [SQL] Add a deterministic method ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/6570#issuecomment-107819631
  
      [Test build #33953 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/33953/consoleFull) for   PR 6570 at commit [`e38f264`](https://github.com/apache/spark/commit/e38f264db24405706c1f70d6278b1e73d13f9555).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-8023] [SQL] Add a deterministic method ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/6570#issuecomment-107794383
  
      [Test build #33953 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/33953/consoleFull) for   PR 6570 at commit [`e38f264`](https://github.com/apache/spark/commit/e38f264db24405706c1f70d6278b1e73d13f9555).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-8023] [SQL] Add a deterministic method ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/6570#issuecomment-107819661
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-8023] [SQL] Add a deterministic method ...

Posted by marmbrus <gi...@git.apache.org>.
Github user marmbrus commented on a diff in the pull request:

    https://github.com/apache/spark/pull/6570#discussion_r31491864
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Expression.scala ---
    @@ -37,6 +37,13 @@ abstract class Expression extends TreeNode[Expression] {
        *  - A [[Cast]] or [[UnaryMinus]] is foldable if its child is foldable
        */
       def foldable: Boolean = false
    +
    +  /**
    +   * Returns true when an expressions always return the same result for a specific set of
    +   * input values.
    +   */
    +  // TODO: Need to well define what are explicit input values and implicit input values.
    +  def deterministic: Boolean = true
    --- End diff --
    
    Shouldn't this only be true if all the children are `deterministic` (if there are any).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-8023] [SQL] Add a deterministic method ...

Posted by yhuai <gi...@git.apache.org>.
Github user yhuai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/6570#discussion_r31491973
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala ---
    @@ -179,8 +179,17 @@ object ColumnPruning extends Rule[LogicalPlan] {
      * expressions into one single expression.
      */
     object ProjectCollapsing extends Rule[LogicalPlan] {
    +
    +  /** Returns true if any expression in projectList is non-deterministic. */
    +  private def hasNondeterministic(projectList: Seq[NamedExpression]): Boolean = {
    +    projectList.exists(expr => expr.find(!_.deterministic).isDefined)
    --- End diff --
    
    This will match rand() + 1. `find` will traverse the expression tree in pre-order.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-8023] [SQL] Add a deterministic method ...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/6570#discussion_r31492021
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Expression.scala ---
    @@ -37,6 +37,13 @@ abstract class Expression extends TreeNode[Expression] {
        *  - A [[Cast]] or [[UnaryMinus]] is foldable if its child is foldable
        */
       def foldable: Boolean = false
    +
    +  /**
    +   * Returns true when an expressions always return the same result for a specific set of
    +   * input values.
    +   */
    +  // TODO: Need to well define what are explicit input values and implicit input values.
    +  def deterministic: Boolean = true
    --- End diff --
    
    I think it depends on how you interpret this. The definition here is whether this expression itself is deterministic or not (i.e. assuming fixed input, is the output deterministic?)
    



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-8023] [SQL] Add a deterministic method ...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/6570#discussion_r31492468
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/ColumnExpressionSuite.scala ---
    @@ -453,6 +454,44 @@ class ColumnExpressionSuite extends QueryTest {
           assert(row.getDouble(1) <= 1.0)
           assert(row.getDouble(1) >= 0.0)
         }
    +
    +    def checkNumProjects(df: DataFrame, expectedNumProjects: Int): Unit = {
    +      val projects = df.queryExecution.executedPlan.collect {
    +        case project: Project => project
    +      }
    +      assert(projects.size === expectedNumProjects)
    +    }
    +
    +    // We first create a plan with two Projects.
    +    // Project [rand + 1 AS rand1, rand - 1 AS rand2]
    +    //   Project [key, Rand 5 AS rand]
    +    //     LogicalRDD [key, value]
    +    // Because Rand function is not deterministic, the column rand is not deterministic.
    +    // So, in the optimizer, we will not collapse Project [rand + 1 AS rand1, rand - 1 AS rand2]
    +    // and Project [key, Rand 5 AS rand]. The final plan still has two Projects.
    +    val dfWithTwoProjects =
    +      testData
    +        .select('key, rand(5L).as("rand"))
    --- End diff --
    
    let's make the base 
    
    (rand(5L) + 1)
    
    to make sure we can catch nested nondeterministic things.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-8023] [SQL] Add a deterministic method ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/6570#issuecomment-107794328
  
    Merged build started.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-8023] [SQL] Add a deterministic method ...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/6570#discussion_r31491846
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/ColumnExpressionSuite.scala ---
    @@ -453,6 +454,44 @@ class ColumnExpressionSuite extends QueryTest {
           assert(row.getDouble(1) <= 1.0)
           assert(row.getDouble(1) >= 0.0)
         }
    +
    +    def checkNumProjects(df: DataFrame, expectedNumProjects: Int): Unit = {
    +      val projects = df.queryExecution.executedPlan.collect {
    +        case project: Project => project
    +      }
    +      assert(projects.size === expectedNumProjects)
    +    }
    +
    +    // We first create a plan with two Projects.
    +    // Project [rand + 1 AS rand1, rand - 1 AS rand2]
    +    //   Project [key, Rand 5 AS rand]
    +    //     LogicalRDD [key, value]
    +    // Because Rand function is not deterministic, the column rand is not deterministic.
    +    // So, in the optimizer, we will not collapse Project [rand + 1 AS rand1, rand - 1 AS rand2]
    +    // and Project [key, Rand 5 AS rand]. The final plan still has two Projects.
    +    val dfWithTwoProjects =
    +      testData
    +        .select('key, rand(5L).as("rand"))
    --- End diff --
    
    please use $"..." to be consistent with rest of this file


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-8023] [SQL] Add a deterministic method ...

Posted by yhuai <gi...@git.apache.org>.
Github user yhuai closed the pull request at:

    https://github.com/apache/spark/pull/6570


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-8023] [SQL] Add a deterministic method ...

Posted by marmbrus <gi...@git.apache.org>.
Github user marmbrus commented on a diff in the pull request:

    https://github.com/apache/spark/pull/6570#discussion_r31494578
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Expression.scala ---
    @@ -37,6 +37,13 @@ abstract class Expression extends TreeNode[Expression] {
        *  - A [[Cast]] or [[UnaryMinus]] is foldable if its child is foldable
        */
       def foldable: Boolean = false
    +
    +  /**
    +   * Returns true when an expressions always return the same result for a specific set of
    +   * input values.
    +   */
    +  // TODO: Need to well define what are explicit input values and implicit input values.
    +  def deterministic: Boolean = true
    --- End diff --
    
    Oh I understand the TODO now... It seems better be holistic about expressing an expression's determinism. Otherwise, you leave the burden of tree traversal to the user, which could be more error prone.  But I agree that either could work given proper documentation.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-8023] [SQL] Add a deterministic method ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/6570#issuecomment-107801198
  
     Merged build triggered.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-8023] [SQL] Add a deterministic method ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/6570#issuecomment-107830931
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-8023] [SQL] Add a deterministic method ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/6570#issuecomment-107794318
  
     Merged build triggered.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-8023] [SQL] Add a deterministic method ...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/6570#discussion_r31491862
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala ---
    @@ -179,8 +179,17 @@ object ColumnPruning extends Rule[LogicalPlan] {
      * expressions into one single expression.
      */
     object ProjectCollapsing extends Rule[LogicalPlan] {
    --- End diff --
    
    ideally we should have a unit test on the optimizer rule itself also


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-8023] [SQL] Add a deterministic method ...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/6570#discussion_r31491837
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala ---
    @@ -179,8 +179,17 @@ object ColumnPruning extends Rule[LogicalPlan] {
      * expressions into one single expression.
      */
     object ProjectCollapsing extends Rule[LogicalPlan] {
    +
    +  /** Returns true if any expression in projectList is non-deterministic. */
    +  private def hasNondeterministic(projectList: Seq[NamedExpression]): Boolean = {
    +    projectList.exists(expr => expr.find(!_.deterministic).isDefined)
    --- End diff --
    
    this won't match on rand() + 1 will this?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-8023] [SQL] Add a deterministic method ...

Posted by yhuai <gi...@git.apache.org>.
Github user yhuai commented on the pull request:

    https://github.com/apache/spark/pull/6570#issuecomment-107794381
  
    @marmbrus @rxin @liancheng 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-8023] [SQL] Add a deterministic method ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/6570#issuecomment-107801325
  
      [Test build #33955 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/33955/consoleFull) for   PR 6570 at commit [`da56200`](https://github.com/apache/spark/commit/da56200bc077d6564de53a8959c99768d0aa790c).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org