You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by dilipbiswal <gi...@git.apache.org> on 2017/03/17 06:58:49 UTC

[GitHub] spark pull request #17330: [SPARK-19993][SQL] Caching logical plans containi...

GitHub user dilipbiswal opened a pull request:

    https://github.com/apache/spark/pull/17330

    [SPARK-19993][SQL] Caching logical plans containing subquery expressions does not work.

    ## What changes were proposed in this pull request?
    The sameResult() method does not work when the logical plan contains subquery expressions.
    
    Before the fix
    ```SQL
    scala> val ds = spark.sql("select * from s1 where s1.c1 in (select s2.c1 from s2 where s1.c1 = s2.c1)")
    ds: org.apache.spark.sql.DataFrame = [c1: int]
    
    scala> ds.cache
    res13: ds.type = [c1: int]
    
    scala> spark.sql("select * from s1 where s1.c1 in (select s2.c1 from s2 where s1.c1 = s2.c1)").explain(true)
    == Analyzed Logical Plan ==
    c1: int
    Project [c1#86]
    +- Filter c1#86 IN (list#78 [c1#86])
       :  +- Project [c1#87]
       :     +- Filter (outer(c1#86) = c1#87)
       :        +- SubqueryAlias s2
       :           +- Relation[c1#87] parquet
       +- SubqueryAlias s1
          +- Relation[c1#86] parquet
    
    == Optimized Logical Plan ==
    Join LeftSemi, ((c1#86 = c1#87) && (c1#86 = c1#87))
    :- Relation[c1#86] parquet
    +- Relation[c1#87] parquet
    ```
    Plan after fix
    ```SQL
    == Analyzed Logical Plan ==
    c1: int
    Project [c1#22]
    +- Filter c1#22 IN (list#14 [c1#22])
       :  +- Project [c1#23]
       :     +- Filter (outer(c1#22) = c1#23)
       :        +- SubqueryAlias s2
       :           +- Relation[c1#23] parquet
       +- SubqueryAlias s1
          +- Relation[c1#22] parquet
    
    == Optimized Logical Plan ==
    InMemoryRelation [c1#22], true, 10000, StorageLevel(disk, memory, deserialized, 1 replicas)
       +- *BroadcastHashJoin [c1#1, c1#1], [c1#2, c1#2], LeftSemi, BuildRight
          :- *FileScan parquet default.s1[c1#1] Batched: true, Format: Parquet, Location: InMemoryFileIndex[file:/Users/dbiswal/mygit/apache/spark/bin/spark-warehouse/s1], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<c1:int>
          +- BroadcastExchange HashedRelationBroadcastMode(List((shiftleft(cast(input[0, int, true] as bigint), 32) | (cast(input[0, int, true] as bigint) & 4294967295))))
             +- *FileScan parquet default.s2[c1#2] Batched: true, Format: Parquet, Location: InMemoryFileIndex[file:/Users/dbiswal/mygit/apache/spark/bin/spark-warehouse/s2], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<c1:int>
    ```
    ## How was this patch tested?
    New tests are added to CachedTableSuite.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/dilipbiswal/spark subquery_cache_final

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/17330.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #17330
    
----
commit dfa76dafdfaabee7240adbaaacb101e484418013
Author: Dilip Biswal <db...@us.ibm.com>
Date:   2017-03-03T10:12:56Z

    [SPARK-19993] Caching logical plans containing subquery expressions does not work

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #17330: [SPARK-19993][SQL] Caching logical plans containi...

Posted by dilipbiswal <gi...@git.apache.org>.
Github user dilipbiswal commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17330#discussion_r107077129
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/subquery.scala ---
    @@ -61,6 +63,37 @@ abstract class SubqueryExpression(
       }
     }
     
    +/**
    + * This expression is used to represent any form of subquery expression namely
    + * ListQuery, Exists and ScalarSubquery. This is only used to make sure the
    + * expression equality works properly when LogicalPlan.sameResult is called
    + * on plans containing SubqueryExpression(s). This is only a transient expression
    + * that only lives in the scope of sameResult function call. In other words, analyzer,
    + * optimizer or planner never sees this expression type during transformation of
    + * plans.
    + */
    +case class CanonicalizedSubqueryExpr(expr: SubqueryExpression)
    +  extends UnaryExpression with Unevaluable {
    +  override def dataType: DataType = expr.dataType
    +  override def nullable: Boolean = expr.nullable
    +  override def child: Expression = expr
    +  override def toString: String = s"CanonicalizedSubqueryExpr(${expr.toString})"
    +
    +  // Hashcode is generated conservatively for now i.e it does not include the
    +  // sub query plan. Doing so causes issue when we canonicalize expressions to
    +  // re-order them based on hashcode.
    +  // TODO : improve the hashcode generation by considering the plan info.
    +  override def hashCode(): Int = {
    +    val h = Objects.hashCode(expr.children)
    --- End diff --
    
    @cloud-fan Hi Wenchen, let me take an example and perhaps that will help us.
    
    Lets assume there is a Filter operator that has 3 subquery expressions like following.
    ``` scala
    Filter Scalar-subquery(.., splan1,..) || Exists(.., splan2, ..) || In(.., ListQuery(.., splan3, ..)
    ```
    
    1. During sameResult on this plan, we perform this [logic](https://github.com/dilipbiswal/spark/blob/0c6424574282db0e865ef4ad14c5a9179fcf27ef/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/QueryPlan.scala#L367-L375)
    2. We clean the args from both source and target plans.
       2.1 We change the subquery expression to CanonicalizedSubqueryExpression so that we can do a customized equals that basically does a semantic equals of two subquery expressions. 
       2.2 Additionally we change the subquery plan (splan1, splan2, splan3) to change the outer attribute references to BindReference. Since these are outer references the inner plan is not aware of these attributes and hence we fix them as part of canonicalize. After this, when the equals method is called on CanonicalizedSubqueryExpression [here](https://github.com/dilipbiswal/spark/blob/0c6424574282db0e865ef4ad14c5a9179fcf27ef/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/QueryPlan.scala#L372) and the sameResult is called on splan1, splan2, splan3 it works 
     fine as the outer references have been normalized to BindReference. The local attributes referenced in the plan is handled in sameResult itself.
       2.3 As part of cleanArgs, after cleaning the expressions we call canonicalized [here](https://github.com/dilipbiswal/spark/blob/0c6424574282db0e865ef4ad14c5a9179fcf27ef/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/QueryPlan.scala#L402). So here we try to re-order 3 CanonicalizedSubqueryExpressions on the basis of their hashCode. When i had the hashCode call expr.semanticHash(), i observed that the ordering of the expressions between source and target is not consistent. I debugged to find that expr.semanticHash() considers the subquery plans as they are the class arguments and since plans have arbitary expression ids (we have only normalized the outer references.. not all of them at this point). Thus, in the current implementation i am excluding the subplans while computing the hashcode and have marked it as a todo.
    
    Please let me know your thoughts and let me know if i have misunderstood any thing.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #17330: [SPARK-19993][SQL] Caching logical plans containi...

Posted by dilipbiswal <gi...@git.apache.org>.
Github user dilipbiswal commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17330#discussion_r106795772
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/subquery.scala ---
    @@ -83,6 +116,19 @@ object SubqueryExpression {
           case _ => false
         }.isDefined
       }
    +
    +  /**
    +   * Clean the outer references by normalizing them to BindReference in the same way
    +   * we clean up the arguments during LogicalPlan.sameResult. This enables to compare two
    --- End diff --
    
    @gatorsmile OK.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17330: [SPARK-19993][SQL][WIP] Caching logical plans containing...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17330
  
    **[Test build #74775 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74775/testReport)** for PR 17330 at commit [`2b9d717`](https://github.com/apache/spark/commit/2b9d717be385b8aab2f720047cc248ed2fac6bb8).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #17330: [SPARK-19993][SQL] Caching logical plans containi...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17330#discussion_r110671091
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/subquery.scala ---
    @@ -236,6 +244,12 @@ case class ScalarSubquery(
       override def nullable: Boolean = true
       override def withNewPlan(plan: LogicalPlan): ScalarSubquery = copy(plan = plan)
       override def toString: String = s"scalar-subquery#${exprId.id} $conditionString"
    +  override lazy val canonicalized: Expression = {
    --- End diff --
    
    oh sorry it's expression


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #17330: [SPARK-19993][SQL][WIP] Caching logical plans con...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17330#discussion_r106758290
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/subquery.scala ---
    @@ -61,6 +63,36 @@ abstract class SubqueryExpression(
       }
     }
     
    +/**
    + * This expression is used to represent any form of subquery expression namely
    + * ListQuery, Exists and ScalarSubquery. This is only used to make sure the
    + * expression equality works properly when LogicalPlan.sameResult is called
    + * on plans containing SubqueryExpression(s). This is only a transient expression
    + * that only lives in the scope of sameResult function call. In other words, analyzer,
    + * optimizer or planner never sees this expression type during transformation of
    + * plans.
    + */
    +case class CanonicalizedSubqueryExpr(expr: SubqueryExpression)
    +  extends LeafExpression with Unevaluable {
    +  override def dataType: DataType = expr.dataType
    +  override def nullable: Boolean = expr.nullable
    +  override def toString: String = s"CanonicalizedSubqueryExpr(${expr.toString})"
    +
    +  // Hashcode is generated conservatively for now i.e it does not include the
    +  // sub query plan. Doing so causes issue when we canonicalize expressions to
    +  // re-order them based on hashcode.
    +  // TODO : improve the hashcode generation by considering the plan info.
    +  override def hashCode(): Int = {
    +    val state = Seq(children, this.getClass.getName)
    +    state.map(Objects.hashCode).foldLeft(0)((a, b) => 31 * a + b)
    --- End diff --
    
    can we write this imperatively


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17330: [SPARK-19993][SQL] Caching logical plans containing subq...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17330
  
    **[Test build #75710 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75710/testReport)** for PR 17330 at commit [`362d62f`](https://github.com/apache/spark/commit/362d62ff393954d37d76ac55636d50ee0b4ffcb5).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #17330: [SPARK-19993][SQL] Caching logical plans containi...

Posted by dilipbiswal <gi...@git.apache.org>.
Github user dilipbiswal commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17330#discussion_r107332590
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/subquery.scala ---
    @@ -61,6 +63,37 @@ abstract class SubqueryExpression(
       }
     }
     
    +/**
    + * This expression is used to represent any form of subquery expression namely
    + * ListQuery, Exists and ScalarSubquery. This is only used to make sure the
    + * expression equality works properly when LogicalPlan.sameResult is called
    + * on plans containing SubqueryExpression(s). This is only a transient expression
    + * that only lives in the scope of sameResult function call. In other words, analyzer,
    + * optimizer or planner never sees this expression type during transformation of
    + * plans.
    + */
    +case class CanonicalizedSubqueryExpr(expr: SubqueryExpression)
    --- End diff --
    
    @viirya Thank you very much. Let me try this.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17330: [SPARK-19993][SQL] Caching logical plans containing subq...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17330
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #17330: [SPARK-19993][SQL] Caching logical plans containi...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17330#discussion_r107395654
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/subquery.scala ---
    @@ -83,6 +116,20 @@ object SubqueryExpression {
           case _ => false
         }.isDefined
       }
    +
    +  /**
    +   * Clean the outer references by normalizing them to BindReference in the same way
    +   * we clean up the arguments during LogicalPlan.sameResult. This enables to compare two
    +   * plans which has subquery expressions. This method returns a [[CanonicalizedSubqueryExpr]]
    +   * which wraps the underlying [[SubqueryExpression]].
    +   */
    +  def canonicalize(e: SubqueryExpression, attrs: AttributeSeq): CanonicalizedSubqueryExpr = {
    +    // Normalize the outer references in the subquery plan.
    +    val subPlan = e.plan.transformAllExpressions {
    +      case OuterReference(r) => BindReferences.bindReference(r, attrs, allowFailures = true)
    --- End diff --
    
    here shall we bind reference for other `Attribute` in sub query?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #17330: [SPARK-19993][SQL] Caching logical plans containi...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17330#discussion_r106795228
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/subquery.scala ---
    @@ -83,6 +116,19 @@ object SubqueryExpression {
           case _ => false
         }.isDefined
       }
    +
    +  /**
    +   * Clean the outer references by normalizing them to BindReference in the same way
    +   * we clean up the arguments during LogicalPlan.sameResult. This enables to compare two
    --- End diff --
    
    Also replace `SubqueryExpression ` by `CanonicalizedSubqueryExpr`


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #17330: [SPARK-19993][SQL] Caching logical plans containi...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17330#discussion_r107614291
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/subquery.scala ---
    @@ -61,6 +63,37 @@ abstract class SubqueryExpression(
       }
     }
     
    +/**
    + * This expression is used to represent any form of subquery expression namely
    + * ListQuery, Exists and ScalarSubquery. This is only used to make sure the
    + * expression equality works properly when LogicalPlan.sameResult is called
    + * on plans containing SubqueryExpression(s). This is only a transient expression
    + * that only lives in the scope of sameResult function call. In other words, analyzer,
    + * optimizer or planner never sees this expression type during transformation of
    + * plans.
    + */
    +case class CanonicalizedSubqueryExpr(expr: SubqueryExpression)
    --- End diff --
    
    Sounds reasonable. The benefit of this expression approach is expression canonicalization.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17330: [SPARK-19993][SQL] Caching logical plans containing subq...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17330
  
    **[Test build #75654 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75654/testReport)** for PR 17330 at commit [`22db44a`](https://github.com/apache/spark/commit/22db44addbe1892b52353fa22bd9b65952e8bdf5).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17330: [SPARK-19993][SQL] Caching logical plans containing subq...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17330
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/74729/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #17330: [SPARK-19993][SQL] Caching logical plans containi...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17330#discussion_r110671334
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/subquery.scala ---
    @@ -59,6 +58,13 @@ abstract class SubqueryExpression(
             children.zip(p.children).forall(p => p._1.semanticEquals(p._2))
         case _ => false
       }
    +  def canonicalize(attrs: AttributeSeq): SubqueryExpression = {
    +    // Normalize the outer references in the subquery plan.
    +    val subPlan = plan.transformAllExpressions {
    --- End diff --
    
    `canonicalizedPlan`


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #17330: [SPARK-19993][SQL] Caching logical plans containi...

Posted by dilipbiswal <gi...@git.apache.org>.
Github user dilipbiswal commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17330#discussion_r110590906
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/QueryPlan.scala ---
    @@ -401,8 +401,9 @@ abstract class QueryPlan[PlanType <: QueryPlan[PlanType]] extends TreeNode[PlanT
        * do not use `BindReferences` here as the plan may take the expression as a parameter with type
        * `Attribute`, and replace it with `BoundReference` will cause error.
        */
    -  protected def normalizeExprId[T <: Expression](e: T, input: AttributeSeq = allAttributes): T = {
    +  def normalizeExprId[T <: Expression](e: T, input: AttributeSeq = allAttributes): T = {
    --- End diff --
    
    @cloud-fan Ok. Sounds good wenchen.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #17330: [SPARK-19993][SQL] Caching logical plans containi...

Posted by dilipbiswal <gi...@git.apache.org>.
Github user dilipbiswal commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17330#discussion_r106840346
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/subquery.scala ---
    @@ -61,6 +63,37 @@ abstract class SubqueryExpression(
       }
     }
     
    +/**
    + * This expression is used to represent any form of subquery expression namely
    + * ListQuery, Exists and ScalarSubquery. This is only used to make sure the
    + * expression equality works properly when LogicalPlan.sameResult is called
    + * on plans containing SubqueryExpression(s). This is only a transient expression
    + * that only lives in the scope of sameResult function call. In other words, analyzer,
    + * optimizer or planner never sees this expression type during transformation of
    + * plans.
    + */
    +case class CanonicalizedSubqueryExpr(expr: SubqueryExpression)
    +  extends UnaryExpression with Unevaluable {
    +  override def dataType: DataType = expr.dataType
    +  override def nullable: Boolean = expr.nullable
    +  override def child: Expression = expr
    +  override def toString: String = s"CanonicalizedSubqueryExpr(${expr.toString})"
    +
    +  // Hashcode is generated conservatively for now i.e it does not include the
    +  // sub query plan. Doing so causes issue when we canonicalize expressions to
    +  // re-order them based on hashcode.
    +  // TODO : improve the hashcode generation by considering the plan info.
    +  override def hashCode(): Int = {
    +    val h = Objects.hashCode(expr.children)
    --- End diff --
    
    @cloud-fan Hi wenchen, we are unable to use expr.semanticHash here. I had actually tried to do that. So when we canonicalize the expressions we are re-ordering them based on their hash code and that creates problem. I have a test case that has three subquery expressions (ScalarSubquery, ListQuery and Exists) and while re-ordering them the order become unpredictable. So in here when i generate the hashcode, i am generating them conservatively. 
    Let me know what you think.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #17330: [SPARK-19993][SQL] Caching logical plans containi...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17330#discussion_r110671813
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/subquery.scala ---
    @@ -59,6 +58,13 @@ abstract class SubqueryExpression(
             children.zip(p.children).forall(p => p._1.semanticEquals(p._2))
         case _ => false
       }
    +  def canonicalize(attrs: AttributeSeq): SubqueryExpression = {
    +    // Normalize the outer references in the subquery plan.
    +    val subPlan = plan.transformAllExpressions {
    +      case OuterReference(r) => QueryPlan.normalizeExprId(r, attrs)
    --- End diff --
    
    The `OuterReference` will all be removed, is it expected?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17330: [SPARK-19993][SQL][WIP] Caching logical plans containing...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17330
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/74772/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17330: [SPARK-19993][SQL] Caching logical plans containing subq...

Posted by dilipbiswal <gi...@git.apache.org>.
Github user dilipbiswal commented on the issue:

    https://github.com/apache/spark/pull/17330
  
    cc @hvanhovell @gatorsmile 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #17330: [SPARK-19993][SQL] Caching logical plans containi...

Posted by dilipbiswal <gi...@git.apache.org>.
Github user dilipbiswal commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17330#discussion_r107494062
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/subquery.scala ---
    @@ -83,6 +116,20 @@ object SubqueryExpression {
           case _ => false
         }.isDefined
       }
    +
    +  /**
    +   * Clean the outer references by normalizing them to BindReference in the same way
    +   * we clean up the arguments during LogicalPlan.sameResult. This enables to compare two
    +   * plans which has subquery expressions. This method returns a [[CanonicalizedSubqueryExpr]]
    +   * which wraps the underlying [[SubqueryExpression]].
    +   */
    +  def canonicalize(e: SubqueryExpression, attrs: AttributeSeq): CanonicalizedSubqueryExpr = {
    +    // Normalize the outer references in the subquery plan.
    +    val subPlan = e.plan.transformAllExpressions {
    +      case OuterReference(r) => BindReferences.bindReference(r, attrs, allowFailures = true)
    --- End diff --
    
    @cloud-fan The local attributes would be bound when we call sameResult on the subquery plan ? We certainly can bind them here and if you don't mind, i wanted to explore this in a follow up in an effort to start considering the sub-plan in the hashCode computation (marked as a todo now in comment) as i wanted more testing on that. Please let me know what you think.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #17330: [SPARK-19993][SQL] Caching logical plans containi...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17330#discussion_r110885627
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/CachedTableSuite.scala ---
    @@ -670,4 +677,139 @@ class CachedTableSuite extends QueryTest with SQLTestUtils with SharedSQLContext
           assert(spark.read.parquet(path).filter($"id" > 4).count() == 15)
         }
       }
    +
    +  test("SPARK-19993 simple subquery caching") {
    +    withTempView("t1", "t2") {
    +      Seq(1).toDF("c1").createOrReplaceTempView("t1")
    +      Seq(1).toDF("c1").createOrReplaceTempView("t2")
    +
    +      sql(
    +        """
    +          |SELECT * FROM t1
    +          |WHERE
    +          |NOT EXISTS (SELECT * FROM t1)
    +        """.stripMargin).cache()
    +
    +      val cachedDs =
    +        sql(
    +          """
    +            |SELECT * FROM t1
    +            |WHERE
    +            |NOT EXISTS (SELECT * FROM t1)
    +          """.stripMargin)
    +      assert(getNumInMemoryRelations(cachedDs) == 1)
    +
    +      // Additional predicate in the subquery plan should cause a cache miss
    +      val cachedMissDs =
    +      sql(
    +        """
    +          |SELECT * FROM t1
    +          |WHERE
    +          |NOT EXISTS (SELECT * FROM t1 where c1 = 0)
    +        """.stripMargin)
    +      assert(getNumInMemoryRelations(cachedMissDs) == 0)
    +    }
    +  }
    +
    +  test("SPARK-19993 subquery caching with correlated predicates") {
    +    withTempView("t1", "t2") {
    +      Seq(1).toDF("c1").createOrReplaceTempView("t1")
    +      Seq(1).toDF("c1").createOrReplaceTempView("t2")
    +
    +      // Simple correlated predicate in subquery
    +      sql(
    +        """
    +          |SELECT * FROM t1
    +          |WHERE
    +          |t1.c1 in (SELECT t2.c1 FROM t2 where t1.c1 = t2.c1)
    +        """.stripMargin).cache()
    +
    +      val cachedDs =
    +        sql(
    +          """
    +            |SELECT * FROM t1
    +            |WHERE
    +            |t1.c1 in (SELECT t2.c1 FROM t2 where t1.c1 = t2.c1)
    +          """.stripMargin)
    +      assert(getNumInMemoryRelations(cachedDs) == 1)
    +    }
    +  }
    +
    +  test("SPARK-19993 subquery with cached underlying relation") {
    +    withTempView("t1", "t2") {
    +      Seq(1).toDF("c1").createOrReplaceTempView("t1")
    +      Seq(1).toDF("c1").createOrReplaceTempView("t2")
    --- End diff --
    
    where is `t2` used?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #17330: [SPARK-19993][SQL] Caching logical plans containi...

Posted by dilipbiswal <gi...@git.apache.org>.
Github user dilipbiswal commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17330#discussion_r107343122
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/subquery.scala ---
    @@ -61,6 +63,37 @@ abstract class SubqueryExpression(
       }
     }
     
    +/**
    + * This expression is used to represent any form of subquery expression namely
    + * ListQuery, Exists and ScalarSubquery. This is only used to make sure the
    + * expression equality works properly when LogicalPlan.sameResult is called
    + * on plans containing SubqueryExpression(s). This is only a transient expression
    + * that only lives in the scope of sameResult function call. In other words, analyzer,
    + * optimizer or planner never sees this expression type during transformation of
    + * plans.
    + */
    +case class CanonicalizedSubqueryExpr(expr: SubqueryExpression)
    --- End diff --
    
    @viirya Thank you for your suggestion. I thought about this and went through my notes i had prepared on this while debugging this. The reason i had opted for comparing subquery expressions as opposed to just the plans is i wanted take advantage of expr.canonicalized which re-orders expression nicely to maximize cache hit. 
    Example - plan1 - Filter In || Exists || Scalar
                      plan2- Filter Scalar | In | Exists
    When we compare the above two plans .. all other things being equal should cause a cache hit. I added a test case now to make sure. One other aspect is in the future the subquery expression may evolve to hold more attributes and not considering them didn't feel safe. The other thing is i suspect we may still have to deal with the outer references in the same way i am handling now. Please let me know your thoughts.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #17330: [SPARK-19993][SQL] Caching logical plans containi...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17330#discussion_r106795294
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/CachedTableSuite.scala ---
    @@ -655,6 +663,148 @@ class CachedTableSuite extends QueryTest with SQLTestUtils with SharedSQLContext
         }
       }
     
    +  test("SPARK-19993 subquery caching") {
    +    withTempView("t1", "t2") {
    +      Seq(1).toDF("c1").createOrReplaceTempView("t1")
    +      Seq(1).toDF("c1").createOrReplaceTempView("t2")
    +
    +      val ds1 =
    +        sql(
    +          """
    +            |SELECT * FROM t1
    +            |WHERE
    +            |NOT EXISTS (SELECT * FROM t1)
    +          """.stripMargin)
    +      assert(getNumInMemoryRelations(ds1) == 0)
    +
    +      ds1.cache()
    +
    +      val cachedDs =
    +        sql(
    +          """
    +            |SELECT * FROM t1
    +            |WHERE
    +            |NOT EXISTS (SELECT * FROM t1)
    +          """.stripMargin)
    +      assert(getNumInMemoryRelations(cachedDs) == 1)
    +
    +      // Additional predicate in the subquery plan should cause a cache miss
    +      val cachedMissDs =
    +        sql(
    +          """
    +            |SELECT * FROM t1
    +            |WHERE
    +            |NOT EXISTS (SELECT * FROM t1 where c1 = 0)
    +          """.stripMargin)
    +      assert(getNumInMemoryRelations(cachedMissDs) == 0)
    +
    +      // Simple correlated predicate in subquery
    +      val ds2 =
    +        sql(
    +          """
    +            |SELECT * FROM t1
    +            |WHERE
    +            |t1.c1 in (SELECT t2.c1 FROM t2 where t1.c1 = t2.c1)
    +          """.stripMargin)
    +      assert(getNumInMemoryRelations(ds2) == 0)
    +
    +      ds2.cache()
    +
    +      val cachedDs2 =
    +        sql(
    +          """
    +            |SELECT * FROM t1
    +            |WHERE
    +            |t1.c1 in (SELECT t2.c1 FROM t2 where t1.c1 = t2.c1)
    +          """.stripMargin)
    +
    +      assert(getNumInMemoryRelations(cachedDs2) == 1)
    +
    +      spark.catalog.cacheTable("t1")
    +      ds1.unpersist()
    --- End diff --
    
    How about splitting the test cases to multiple individual ones? The cache will be cleaned for each test case. This can avoid any extra checking, like `assert(getNumInMemoryRelations(cachedMissDs) == 0)` or `ds1.unpersist()`
    ```Scala
      override def afterEach(): Unit = {
        try {
          spark.catalog.clearCache()
        } finally {
          super.afterEach()
        }
      }
    ```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #17330: [SPARK-19993][SQL] Caching logical plans containi...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17330#discussion_r106866491
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/subquery.scala ---
    @@ -61,6 +63,37 @@ abstract class SubqueryExpression(
       }
     }
     
    +/**
    + * This expression is used to represent any form of subquery expression namely
    + * ListQuery, Exists and ScalarSubquery. This is only used to make sure the
    + * expression equality works properly when LogicalPlan.sameResult is called
    + * on plans containing SubqueryExpression(s). This is only a transient expression
    + * that only lives in the scope of sameResult function call. In other words, analyzer,
    + * optimizer or planner never sees this expression type during transformation of
    + * plans.
    + */
    +case class CanonicalizedSubqueryExpr(expr: SubqueryExpression)
    +  extends UnaryExpression with Unevaluable {
    +  override def dataType: DataType = expr.dataType
    +  override def nullable: Boolean = expr.nullable
    +  override def child: Expression = expr
    +  override def toString: String = s"CanonicalizedSubqueryExpr(${expr.toString})"
    +
    +  // Hashcode is generated conservatively for now i.e it does not include the
    +  // sub query plan. Doing so causes issue when we canonicalize expressions to
    +  // re-order them based on hashcode.
    +  // TODO : improve the hashcode generation by considering the plan info.
    +  override def hashCode(): Int = {
    +    val h = Objects.hashCode(expr.children)
    --- End diff --
    
    why the reordering is unpredictable?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17330: [SPARK-19993][SQL] Caching logical plans containing subq...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17330
  
    **[Test build #75647 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75647/testReport)** for PR 17330 at commit [`7346dca`](https://github.com/apache/spark/commit/7346dca9e21347df888d7f7bdc7753bb12670e00).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17330: [SPARK-19993][SQL] Caching logical plans containing subq...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17330
  
    **[Test build #74729 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74729/testReport)** for PR 17330 at commit [`dfa76da`](https://github.com/apache/spark/commit/dfa76dafdfaabee7240adbaaacb101e484418013).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `case class CanonicalizedSubqueryExpr(expr: SubqueryExpression)`


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #17330: [SPARK-19993][SQL] Caching logical plans containi...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17330#discussion_r106795081
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/subquery.scala ---
    @@ -83,6 +116,19 @@ object SubqueryExpression {
           case _ => false
         }.isDefined
       }
    +
    +  /**
    +   * Clean the outer references by normalizing them to BindReference in the same way
    +   * we clean up the arguments during LogicalPlan.sameResult. This enables to compare two
    --- End diff --
    
    Nit: `QueryPlan.sameResult`


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17330: [SPARK-19993][SQL] Caching logical plans containing subq...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17330
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/75664/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17330: [SPARK-19993][SQL] Caching logical plans containing subq...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17330
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17330: [SPARK-19993][SQL] Caching logical plans containing subq...

Posted by dilipbiswal <gi...@git.apache.org>.
Github user dilipbiswal commented on the issue:

    https://github.com/apache/spark/pull/17330
  
    @cloud-fan Sure Wenchen. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17330: [SPARK-19993][SQL][WIP] Caching logical plans containing...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17330
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17330: [SPARK-19993][SQL] Caching logical plans containing subq...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17330
  
    **[Test build #74813 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74813/testReport)** for PR 17330 at commit [`0c64245`](https://github.com/apache/spark/commit/0c6424574282db0e865ef4ad14c5a9179fcf27ef).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17330: [SPARK-19993][SQL] Caching logical plans containing subq...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17330
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #17330: [SPARK-19993][SQL] Caching logical plans containi...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17330#discussion_r110885997
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/CachedTableSuite.scala ---
    @@ -76,6 +76,13 @@ class CachedTableSuite extends QueryTest with SQLTestUtils with SharedSQLContext
         sum
       }
     
    +  private def getNumInMemoryTableScanExecs(plan: SparkPlan): Int = {
    --- End diff --
    
    we need a better name, this actually get in-memory table recursively, which is different from `getNumInMemoryRelations`


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17330: [SPARK-19993][SQL] Caching logical plans containing subq...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17330
  
    **[Test build #75654 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75654/testReport)** for PR 17330 at commit [`22db44a`](https://github.com/apache/spark/commit/22db44addbe1892b52353fa22bd9b65952e8bdf5).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #17330: [SPARK-19993][SQL] Caching logical plans containi...

Posted by dilipbiswal <gi...@git.apache.org>.
Github user dilipbiswal commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17330#discussion_r110977330
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/CachedTableSuite.scala ---
    @@ -670,4 +677,139 @@ class CachedTableSuite extends QueryTest with SQLTestUtils with SharedSQLContext
           assert(spark.read.parquet(path).filter($"id" > 4).count() == 15)
         }
       }
    +
    +  test("SPARK-19993 simple subquery caching") {
    +    withTempView("t1", "t2") {
    +      Seq(1).toDF("c1").createOrReplaceTempView("t1")
    +      Seq(1).toDF("c1").createOrReplaceTempView("t2")
    +
    +      sql(
    +        """
    +          |SELECT * FROM t1
    +          |WHERE
    +          |NOT EXISTS (SELECT * FROM t1)
    +        """.stripMargin).cache()
    +
    +      val cachedDs =
    +        sql(
    +          """
    +            |SELECT * FROM t1
    +            |WHERE
    +            |NOT EXISTS (SELECT * FROM t1)
    +          """.stripMargin)
    +      assert(getNumInMemoryRelations(cachedDs) == 1)
    +
    +      // Additional predicate in the subquery plan should cause a cache miss
    +      val cachedMissDs =
    +      sql(
    +        """
    +          |SELECT * FROM t1
    +          |WHERE
    +          |NOT EXISTS (SELECT * FROM t1 where c1 = 0)
    +        """.stripMargin)
    +      assert(getNumInMemoryRelations(cachedMissDs) == 0)
    +    }
    +  }
    +
    +  test("SPARK-19993 subquery caching with correlated predicates") {
    +    withTempView("t1", "t2") {
    +      Seq(1).toDF("c1").createOrReplaceTempView("t1")
    +      Seq(1).toDF("c1").createOrReplaceTempView("t2")
    +
    +      // Simple correlated predicate in subquery
    +      sql(
    +        """
    +          |SELECT * FROM t1
    +          |WHERE
    +          |t1.c1 in (SELECT t2.c1 FROM t2 where t1.c1 = t2.c1)
    +        """.stripMargin).cache()
    +
    +      val cachedDs =
    +        sql(
    +          """
    +            |SELECT * FROM t1
    +            |WHERE
    +            |t1.c1 in (SELECT t2.c1 FROM t2 where t1.c1 = t2.c1)
    +          """.stripMargin)
    +      assert(getNumInMemoryRelations(cachedDs) == 1)
    +    }
    +  }
    +
    +  test("SPARK-19993 subquery with cached underlying relation") {
    +    withTempView("t1", "t2") {
    +      Seq(1).toDF("c1").createOrReplaceTempView("t1")
    +      Seq(1).toDF("c1").createOrReplaceTempView("t2")
    --- End diff --
    
    @cloud-fan sorry... actually i had some of these tests combined and when i split, i forgot to remove this. Will fix it.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17330: [SPARK-19993][SQL] Caching logical plans containing subq...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17330
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/75654/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #17330: [SPARK-19993][SQL] Caching logical plans containi...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17330#discussion_r110587691
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/QueryPlan.scala ---
    @@ -401,8 +401,9 @@ abstract class QueryPlan[PlanType <: QueryPlan[PlanType]] extends TreeNode[PlanT
        * do not use `BindReferences` here as the plan may take the expression as a parameter with type
        * `Attribute`, and replace it with `BoundReference` will cause error.
        */
    -  protected def normalizeExprId[T <: Expression](e: T, input: AttributeSeq = allAttributes): T = {
    +  def normalizeExprId[T <: Expression](e: T, input: AttributeSeq = allAttributes): T = {
    --- End diff --
    
    oh, seems we can move this method to `object QueryPlan`?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17330: [SPARK-19993][SQL] Caching logical plans containing subq...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17330
  
    **[Test build #75664 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75664/testReport)** for PR 17330 at commit [`f1e63c8`](https://github.com/apache/spark/commit/f1e63c8bc2a4689c15e8c11ddf5d2c7864cf96a6).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #17330: [SPARK-19993][SQL] Caching logical plans containi...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17330#discussion_r110587292
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/subquery.scala ---
    @@ -59,6 +59,14 @@ abstract class SubqueryExpression(
             children.zip(p.children).forall(p => p._1.semanticEquals(p._2))
         case _ => false
       }
    +
    +  def canonicalize(attrs: AttributeSeq): SubqueryExpression = {
    --- End diff --
    
    shall we move this method to `PlanExpression`?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17330: [SPARK-19993][SQL] Caching logical plans containing subq...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/17330
  
    LGTM except some minor comments about test


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17330: [SPARK-19993][SQL][WIP] Caching logical plans containing...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17330
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/74775/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #17330: [SPARK-19993][SQL] Caching logical plans containi...

Posted by dilipbiswal <gi...@git.apache.org>.
Github user dilipbiswal commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17330#discussion_r107713751
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/subquery.scala ---
    @@ -83,6 +116,20 @@ object SubqueryExpression {
           case _ => false
         }.isDefined
       }
    +
    +  /**
    +   * Clean the outer references by normalizing them to BindReference in the same way
    +   * we clean up the arguments during LogicalPlan.sameResult. This enables to compare two
    +   * plans which has subquery expressions. This method returns a [[CanonicalizedSubqueryExpr]]
    +   * which wraps the underlying [[SubqueryExpression]].
    +   */
    +  def canonicalize(e: SubqueryExpression, attrs: AttributeSeq): CanonicalizedSubqueryExpr = {
    +    // Normalize the outer references in the subquery plan.
    +    val subPlan = e.plan.transformAllExpressions {
    +      case OuterReference(r) => BindReferences.bindReference(r, attrs, allowFailures = true)
    --- End diff --
    
    @cloud-fan Hi Wenchen, If we are able to normalize all the expression ids, then i think we can solve the nondeterministic problem. I had actually tried this for the new test cases i have added. But i didn't go with this as i felt i may have to test this more.  In my testing, i only normalized the AttributeReference .. but we can have generated expression ids for other expressions, right , Aliases, SubqueryExpression etc. We also need to normalize these as well, no ? 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #17330: [SPARK-19993][SQL] Caching logical plans containi...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17330#discussion_r106795011
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/subquery.scala ---
    @@ -83,6 +116,19 @@ object SubqueryExpression {
           case _ => false
         }.isDefined
       }
    +
    +  /**
    +   * Clean the outer references by normalizing them to BindReference in the same way
    +   * we clean up the arguments during LogicalPlan.sameResult. This enables to compare two
    +   * plans which has subquery expressions.
    +   */
    +  def canonicalize(e: SubqueryExpression, attrs: AttributeSeq): CanonicalizedSubqueryExpr = {
    +    // Normalize the outer references in the subquery plan.
    +    val subPlan = e.plan.transformAllExpressions {
    +      case o @ OuterReference(e) => BindReferences.bindReference(e, attrs, allowFailures = true)
    --- End diff --
    
    Nit: `case OuterReference(r) => BindReferences.bindReference(r, attrs, allowFailures = true)`



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17330: [SPARK-19993][SQL] Caching logical plans containing subq...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17330
  
    **[Test build #75664 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75664/testReport)** for PR 17330 at commit [`f1e63c8`](https://github.com/apache/spark/commit/f1e63c8bc2a4689c15e8c11ddf5d2c7864cf96a6).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #17330: [SPARK-19993][SQL] Caching logical plans containi...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17330#discussion_r106795140
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/subquery.scala ---
    @@ -61,6 +63,37 @@ abstract class SubqueryExpression(
       }
     }
     
    +/**
    + * This expression is used to represent any form of subquery expression namely
    + * ListQuery, Exists and ScalarSubquery. This is only used to make sure the
    + * expression equality works properly when LogicalPlan.sameResult is called
    + * on plans containing SubqueryExpression(s). This is only a transient expression
    + * that only lives in the scope of sameResult function call. In other words, analyzer,
    + * optimizer or planner never sees this expression type during transformation of
    + * plans.
    + */
    +case class CanonicalizedSubqueryExpr(expr: SubqueryExpression)
    +  extends UnaryExpression with Unevaluable {
    +  override def dataType: DataType = expr.dataType
    +  override def nullable: Boolean = expr.nullable
    +  override def child: Expression = expr
    +  override def toString: String = s"CanonicalizedSubqueryExpr(${expr.toString})"
    +
    +  // Hashcode is generated conservatively for now i.e it does not include the
    +  // sub query plan. Doing so causes issue when we canonicalize expressions to
    +  // re-order them based on hashcode.
    +  // TODO : improve the hashcode generation by considering the plan info.
    +  override def hashCode(): Int = {
    +    val h = Objects.hashCode(expr.children)
    +    h * 31 + Objects.hashCode(this.getClass.getName)
    +  }
    +
    +  override def equals(o: Any): Boolean = o match {
    +    case n: CanonicalizedSubqueryExpr => expr.semanticEquals(n.expr)
    +    case other => false
    --- End diff --
    
    ```Scala
        case CanonicalizedSubqueryExpr(e) => expr.semanticEquals(e)
        case _ => false
    ```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #17330: [SPARK-19993][SQL] Caching logical plans containi...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17330#discussion_r107331403
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/subquery.scala ---
    @@ -61,6 +63,37 @@ abstract class SubqueryExpression(
       }
     }
     
    +/**
    + * This expression is used to represent any form of subquery expression namely
    + * ListQuery, Exists and ScalarSubquery. This is only used to make sure the
    + * expression equality works properly when LogicalPlan.sameResult is called
    + * on plans containing SubqueryExpression(s). This is only a transient expression
    + * that only lives in the scope of sameResult function call. In other words, analyzer,
    + * optimizer or planner never sees this expression type during transformation of
    + * plans.
    + */
    +case class CanonicalizedSubqueryExpr(expr: SubqueryExpression)
    --- End diff --
    
    Do we really need this new abstraction?
    
    Actually I don't think we should compare `SubqueryExpression`s and wonder how to canonicalize them.
    
    Basically, in `sameResult`, by adding new condition `(left.subqueries, right.subqueries).zipped.forall(_ sameResult _)`, we can achieve the same goal and all added tests are passed.
    
    
    
    
    



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #17330: [SPARK-19993][SQL][WIP] Caching logical plans con...

Posted by dilipbiswal <gi...@git.apache.org>.
Github user dilipbiswal commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17330#discussion_r106761010
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/subquery.scala ---
    @@ -61,6 +63,36 @@ abstract class SubqueryExpression(
       }
     }
     
    +/**
    + * This expression is used to represent any form of subquery expression namely
    + * ListQuery, Exists and ScalarSubquery. This is only used to make sure the
    + * expression equality works properly when LogicalPlan.sameResult is called
    + * on plans containing SubqueryExpression(s). This is only a transient expression
    + * that only lives in the scope of sameResult function call. In other words, analyzer,
    + * optimizer or planner never sees this expression type during transformation of
    + * plans.
    + */
    +case class CanonicalizedSubqueryExpr(expr: SubqueryExpression)
    +  extends LeafExpression with Unevaluable {
    +  override def dataType: DataType = expr.dataType
    +  override def nullable: Boolean = expr.nullable
    +  override def toString: String = s"CanonicalizedSubqueryExpr(${expr.toString})"
    +
    +  // Hashcode is generated conservatively for now i.e it does not include the
    +  // sub query plan. Doing so causes issue when we canonicalize expressions to
    +  // re-order them based on hashcode.
    +  // TODO : improve the hashcode generation by considering the plan info.
    +  override def hashCode(): Int = {
    +    val state = Seq(children, this.getClass.getName)
    +    state.map(Objects.hashCode).foldLeft(0)((a, b) => 31 * a + b)
    --- End diff --
    
    @rxin Sure Reynold.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #17330: [SPARK-19993][SQL] Caching logical plans containi...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17330#discussion_r110885501
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/CachedTableSuite.scala ---
    @@ -670,4 +677,139 @@ class CachedTableSuite extends QueryTest with SQLTestUtils with SharedSQLContext
           assert(spark.read.parquet(path).filter($"id" > 4).count() == 15)
         }
       }
    +
    +  test("SPARK-19993 simple subquery caching") {
    +    withTempView("t1", "t2") {
    +      Seq(1).toDF("c1").createOrReplaceTempView("t1")
    +      Seq(1).toDF("c1").createOrReplaceTempView("t2")
    --- End diff --
    
    where is `t2` used?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17330: [SPARK-19993][SQL] Caching logical plans containing subq...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/17330
  
    Hi @dilipbiswal can you update now?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #17330: [SPARK-19993][SQL] Caching logical plans containi...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17330#discussion_r107070667
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/subquery.scala ---
    @@ -61,6 +63,37 @@ abstract class SubqueryExpression(
       }
     }
     
    +/**
    + * This expression is used to represent any form of subquery expression namely
    + * ListQuery, Exists and ScalarSubquery. This is only used to make sure the
    + * expression equality works properly when LogicalPlan.sameResult is called
    + * on plans containing SubqueryExpression(s). This is only a transient expression
    + * that only lives in the scope of sameResult function call. In other words, analyzer,
    + * optimizer or planner never sees this expression type during transformation of
    + * plans.
    + */
    +case class CanonicalizedSubqueryExpr(expr: SubqueryExpression)
    +  extends UnaryExpression with Unevaluable {
    +  override def dataType: DataType = expr.dataType
    +  override def nullable: Boolean = expr.nullable
    +  override def child: Expression = expr
    +  override def toString: String = s"CanonicalizedSubqueryExpr(${expr.toString})"
    +
    +  // Hashcode is generated conservatively for now i.e it does not include the
    +  // sub query plan. Doing so causes issue when we canonicalize expressions to
    +  // re-order them based on hashcode.
    +  // TODO : improve the hashcode generation by considering the plan info.
    +  override def hashCode(): Int = {
    +    val h = Objects.hashCode(expr.children)
    --- End diff --
    
    after [canonicalize](https://github.com/apache/spark/pull/17330/files#diff-203ac90583cebe29a92c1d812c07f102R392), the ordering should be deterministic, right?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #17330: [SPARK-19993][SQL] Caching logical plans containi...

Posted by dilipbiswal <gi...@git.apache.org>.
Github user dilipbiswal commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17330#discussion_r107869502
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/subquery.scala ---
    @@ -83,6 +116,20 @@ object SubqueryExpression {
           case _ => false
         }.isDefined
       }
    +
    +  /**
    +   * Clean the outer references by normalizing them to BindReference in the same way
    +   * we clean up the arguments during LogicalPlan.sameResult. This enables to compare two
    +   * plans which has subquery expressions. This method returns a [[CanonicalizedSubqueryExpr]]
    +   * which wraps the underlying [[SubqueryExpression]].
    +   */
    +  def canonicalize(e: SubqueryExpression, attrs: AttributeSeq): CanonicalizedSubqueryExpr = {
    +    // Normalize the outer references in the subquery plan.
    +    val subPlan = e.plan.transformAllExpressions {
    +      case OuterReference(r) => BindReferences.bindReference(r, attrs, allowFailures = true)
    --- End diff --
    
    @cloud-fan Hi Wenchen, I tried to normalize all the expressions in the plan. But still the hashcodes are not stable. On two instances of the same plan i get two different hashcode. Please advice.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #17330: [SPARK-19993][SQL] Caching logical plans containi...

Posted by dilipbiswal <gi...@git.apache.org>.
Github user dilipbiswal commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17330#discussion_r110590426
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/subquery.scala ---
    @@ -236,6 +244,12 @@ case class ScalarSubquery(
       override def nullable: Boolean = true
       override def withNewPlan(plan: LogicalPlan): ScalarSubquery = copy(plan = plan)
       override def toString: String = s"scalar-subquery#${exprId.id} $conditionString"
    +  override lazy val canonicalized: Expression = {
    --- End diff --
    
    @cloud-fan preCanonicalize is a method in QueryPlan ? Can we override it here ? 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17330: [SPARK-19993][SQL] Caching logical plans containing subq...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17330
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #17330: [SPARK-19993][SQL] Caching logical plans containi...

Posted by dilipbiswal <gi...@git.apache.org>.
Github user dilipbiswal commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17330#discussion_r110693397
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/subquery.scala ---
    @@ -59,6 +58,13 @@ abstract class SubqueryExpression(
             children.zip(p.children).forall(p => p._1.semanticEquals(p._2))
         case _ => false
       }
    +  def canonicalize(attrs: AttributeSeq): SubqueryExpression = {
    +    // Normalize the outer references in the subquery plan.
    +    val subPlan = plan.transformAllExpressions {
    +      case OuterReference(r) => QueryPlan.normalizeExprId(r, attrs)
    --- End diff --
    
    @cloud-fan Actually you r right. Preserving the OuterReference would be good.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17330: [SPARK-19993][SQL] Caching logical plans containing subq...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17330
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/74813/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17330: [SPARK-19993][SQL] Caching logical plans containing subq...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/17330
  
    thanks, merging to master!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17330: [SPARK-19993][SQL] Caching logical plans containing subq...

Posted by dilipbiswal <gi...@git.apache.org>.
Github user dilipbiswal commented on the issue:

    https://github.com/apache/spark/pull/17330
  
    @cloud-fan Thanks a lot!!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #17330: [SPARK-19993][SQL] Caching logical plans containi...

Posted by dilipbiswal <gi...@git.apache.org>.
Github user dilipbiswal commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17330#discussion_r106884365
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/subquery.scala ---
    @@ -61,6 +63,37 @@ abstract class SubqueryExpression(
       }
     }
     
    +/**
    + * This expression is used to represent any form of subquery expression namely
    + * ListQuery, Exists and ScalarSubquery. This is only used to make sure the
    + * expression equality works properly when LogicalPlan.sameResult is called
    + * on plans containing SubqueryExpression(s). This is only a transient expression
    + * that only lives in the scope of sameResult function call. In other words, analyzer,
    + * optimizer or planner never sees this expression type during transformation of
    + * plans.
    + */
    +case class CanonicalizedSubqueryExpr(expr: SubqueryExpression)
    +  extends UnaryExpression with Unevaluable {
    +  override def dataType: DataType = expr.dataType
    +  override def nullable: Boolean = expr.nullable
    +  override def child: Expression = expr
    +  override def toString: String = s"CanonicalizedSubqueryExpr(${expr.toString})"
    +
    +  // Hashcode is generated conservatively for now i.e it does not include the
    +  // sub query plan. Doing so causes issue when we canonicalize expressions to
    +  // re-order them based on hashcode.
    +  // TODO : improve the hashcode generation by considering the plan info.
    +  override def hashCode(): Int = {
    +    val h = Objects.hashCode(expr.children)
    --- End diff --
    
    @cloud-fan In my understanding, its because a SubqueryExpression contains a LogicalPlan and the logical plan has all sorts of expression ids that attributes to this unpredictability.  In order to remove this, we may have to implement a canonicalize() on SubqueryExpression that also canonicalizes all the expressions on the subquery plan in a recursive fashion. I had marked it as a TODO in my comment. Please let me know if my understanding is wrong.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17330: [SPARK-19993][SQL] Caching logical plans containing subq...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17330
  
    **[Test build #74729 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74729/testReport)** for PR 17330 at commit [`dfa76da`](https://github.com/apache/spark/commit/dfa76dafdfaabee7240adbaaacb101e484418013).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #17330: [SPARK-19993][SQL] Caching logical plans containi...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17330#discussion_r110588886
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/subquery.scala ---
    @@ -236,6 +244,12 @@ case class ScalarSubquery(
       override def nullable: Boolean = true
       override def withNewPlan(plan: LogicalPlan): ScalarSubquery = copy(plan = plan)
       override def toString: String = s"scalar-subquery#${exprId.id} $conditionString"
    +  override lazy val canonicalized: Expression = {
    --- End diff --
    
    I think we can just override `preCanonicalize` to turn `exprId` to 0


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #17330: [SPARK-19993][SQL] Caching logical plans containi...

Posted by dilipbiswal <gi...@git.apache.org>.
Github user dilipbiswal commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17330#discussion_r106795750
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/subquery.scala ---
    @@ -83,6 +116,19 @@ object SubqueryExpression {
           case _ => false
         }.isDefined
       }
    +
    +  /**
    +   * Clean the outer references by normalizing them to BindReference in the same way
    +   * we clean up the arguments during LogicalPlan.sameResult. This enables to compare two
    +   * plans which has subquery expressions.
    +   */
    +  def canonicalize(e: SubqueryExpression, attrs: AttributeSeq): CanonicalizedSubqueryExpr = {
    +    // Normalize the outer references in the subquery plan.
    +    val subPlan = e.plan.transformAllExpressions {
    +      case o @ OuterReference(e) => BindReferences.bindReference(e, attrs, allowFailures = true)
    --- End diff --
    
    @gatorsmile Will change. Thanks.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #17330: [SPARK-19993][SQL] Caching logical plans containi...

Posted by dilipbiswal <gi...@git.apache.org>.
Github user dilipbiswal commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17330#discussion_r110976069
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/CachedTableSuite.scala ---
    @@ -76,6 +76,13 @@ class CachedTableSuite extends QueryTest with SQLTestUtils with SharedSQLContext
         sum
       }
     
    +  private def getNumInMemoryTableScanExecs(plan: SparkPlan): Int = {
    --- End diff --
    
    @cloud-fan So we are operating at the physical plan level in this method where as the other method getNumInMemoryRelations operates at a logical plan level. And in here we are simply counting the the InMemoryTableScanExec nodes in the plan. I have changed the function name to getNumInMemoryTablesRecursively. Does it look ok to you ? 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #17330: [SPARK-19993][SQL] Caching logical plans containi...

Posted by dilipbiswal <gi...@git.apache.org>.
Github user dilipbiswal commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17330#discussion_r110977254
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/CachedTableSuite.scala ---
    @@ -670,4 +677,139 @@ class CachedTableSuite extends QueryTest with SQLTestUtils with SharedSQLContext
           assert(spark.read.parquet(path).filter($"id" > 4).count() == 15)
         }
       }
    +
    +  test("SPARK-19993 simple subquery caching") {
    +    withTempView("t1", "t2") {
    +      Seq(1).toDF("c1").createOrReplaceTempView("t1")
    +      Seq(1).toDF("c1").createOrReplaceTempView("t2")
    --- End diff --
    
    @cloud-fan sorry... actually i had some of these tests combined and when i split, i forgot to remove this. Will fix it.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17330: [SPARK-19993][SQL] Caching logical plans containing subq...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17330
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/75647/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17330: [SPARK-19993][SQL] Caching logical plans containing subq...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/17330
  
    Generally, it looks good to me. cc @hvanhovell @rxin @cloud-fan 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17330: [SPARK-19993][SQL] Caching logical plans containing subq...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17330
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17330: [SPARK-19993][SQL] Caching logical plans containing subq...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17330
  
    **[Test build #75647 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75647/testReport)** for PR 17330 at commit [`7346dca`](https://github.com/apache/spark/commit/7346dca9e21347df888d7f7bdc7753bb12670e00).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #17330: [SPARK-19993][SQL] Caching logical plans containi...

Posted by dilipbiswal <gi...@git.apache.org>.
Github user dilipbiswal commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17330#discussion_r106795788
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/CachedTableSuite.scala ---
    @@ -655,6 +663,148 @@ class CachedTableSuite extends QueryTest with SQLTestUtils with SharedSQLContext
         }
       }
     
    +  test("SPARK-19993 subquery caching") {
    +    withTempView("t1", "t2") {
    +      Seq(1).toDF("c1").createOrReplaceTempView("t1")
    +      Seq(1).toDF("c1").createOrReplaceTempView("t2")
    +
    +      val ds1 =
    +        sql(
    +          """
    +            |SELECT * FROM t1
    +            |WHERE
    +            |NOT EXISTS (SELECT * FROM t1)
    +          """.stripMargin)
    +      assert(getNumInMemoryRelations(ds1) == 0)
    +
    +      ds1.cache()
    +
    +      val cachedDs =
    +        sql(
    +          """
    +            |SELECT * FROM t1
    +            |WHERE
    +            |NOT EXISTS (SELECT * FROM t1)
    +          """.stripMargin)
    +      assert(getNumInMemoryRelations(cachedDs) == 1)
    +
    +      // Additional predicate in the subquery plan should cause a cache miss
    +      val cachedMissDs =
    +        sql(
    +          """
    +            |SELECT * FROM t1
    +            |WHERE
    +            |NOT EXISTS (SELECT * FROM t1 where c1 = 0)
    +          """.stripMargin)
    +      assert(getNumInMemoryRelations(cachedMissDs) == 0)
    +
    +      // Simple correlated predicate in subquery
    +      val ds2 =
    +        sql(
    +          """
    +            |SELECT * FROM t1
    +            |WHERE
    +            |t1.c1 in (SELECT t2.c1 FROM t2 where t1.c1 = t2.c1)
    +          """.stripMargin)
    +      assert(getNumInMemoryRelations(ds2) == 0)
    +
    +      ds2.cache()
    +
    +      val cachedDs2 =
    +        sql(
    +          """
    +            |SELECT * FROM t1
    +            |WHERE
    +            |t1.c1 in (SELECT t2.c1 FROM t2 where t1.c1 = t2.c1)
    +          """.stripMargin)
    +
    +      assert(getNumInMemoryRelations(cachedDs2) == 1)
    +
    +      spark.catalog.cacheTable("t1")
    +      ds1.unpersist()
    --- End diff --
    
    @gatorsmile Sure. will do.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #17330: [SPARK-19993][SQL] Caching logical plans containi...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/17330


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17330: [SPARK-19993][SQL] Caching logical plans containing subq...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17330
  
    **[Test build #75710 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75710/testReport)** for PR 17330 at commit [`362d62f`](https://github.com/apache/spark/commit/362d62ff393954d37d76ac55636d50ee0b4ffcb5).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #17330: [SPARK-19993][SQL] Caching logical plans containi...

Posted by dilipbiswal <gi...@git.apache.org>.
Github user dilipbiswal commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17330#discussion_r106795763
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/subquery.scala ---
    @@ -61,6 +63,37 @@ abstract class SubqueryExpression(
       }
     }
     
    +/**
    + * This expression is used to represent any form of subquery expression namely
    + * ListQuery, Exists and ScalarSubquery. This is only used to make sure the
    + * expression equality works properly when LogicalPlan.sameResult is called
    + * on plans containing SubqueryExpression(s). This is only a transient expression
    + * that only lives in the scope of sameResult function call. In other words, analyzer,
    + * optimizer or planner never sees this expression type during transformation of
    + * plans.
    + */
    +case class CanonicalizedSubqueryExpr(expr: SubqueryExpression)
    +  extends UnaryExpression with Unevaluable {
    +  override def dataType: DataType = expr.dataType
    +  override def nullable: Boolean = expr.nullable
    +  override def child: Expression = expr
    +  override def toString: String = s"CanonicalizedSubqueryExpr(${expr.toString})"
    +
    +  // Hashcode is generated conservatively for now i.e it does not include the
    +  // sub query plan. Doing so causes issue when we canonicalize expressions to
    +  // re-order them based on hashcode.
    +  // TODO : improve the hashcode generation by considering the plan info.
    +  override def hashCode(): Int = {
    +    val h = Objects.hashCode(expr.children)
    +    h * 31 + Objects.hashCode(this.getClass.getName)
    +  }
    +
    +  override def equals(o: Any): Boolean = o match {
    +    case n: CanonicalizedSubqueryExpr => expr.semanticEquals(n.expr)
    +    case other => false
    --- End diff --
    
    @gatorsmile Will change. Thanks.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17330: [SPARK-19993][SQL][WIP] Caching logical plans containing...

Posted by dilipbiswal <gi...@git.apache.org>.
Github user dilipbiswal commented on the issue:

    https://github.com/apache/spark/pull/17330
  
    retest this please


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #17330: [SPARK-19993][SQL] Caching logical plans containi...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17330#discussion_r106838131
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/subquery.scala ---
    @@ -61,6 +63,37 @@ abstract class SubqueryExpression(
       }
     }
     
    +/**
    + * This expression is used to represent any form of subquery expression namely
    + * ListQuery, Exists and ScalarSubquery. This is only used to make sure the
    + * expression equality works properly when LogicalPlan.sameResult is called
    + * on plans containing SubqueryExpression(s). This is only a transient expression
    + * that only lives in the scope of sameResult function call. In other words, analyzer,
    + * optimizer or planner never sees this expression type during transformation of
    + * plans.
    + */
    +case class CanonicalizedSubqueryExpr(expr: SubqueryExpression)
    +  extends UnaryExpression with Unevaluable {
    +  override def dataType: DataType = expr.dataType
    +  override def nullable: Boolean = expr.nullable
    +  override def child: Expression = expr
    +  override def toString: String = s"CanonicalizedSubqueryExpr(${expr.toString})"
    +
    +  // Hashcode is generated conservatively for now i.e it does not include the
    +  // sub query plan. Doing so causes issue when we canonicalize expressions to
    +  // re-order them based on hashcode.
    +  // TODO : improve the hashcode generation by considering the plan info.
    +  override def hashCode(): Int = {
    +    val h = Objects.hashCode(expr.children)
    --- End diff --
    
    can we use `expr.semanticHash` here?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17330: [SPARK-19993][SQL] Caching logical plans containing subq...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17330
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #17330: [SPARK-19993][SQL] Caching logical plans containi...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17330#discussion_r107659893
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/subquery.scala ---
    @@ -83,6 +116,20 @@ object SubqueryExpression {
           case _ => false
         }.isDefined
       }
    +
    +  /**
    +   * Clean the outer references by normalizing them to BindReference in the same way
    +   * we clean up the arguments during LogicalPlan.sameResult. This enables to compare two
    +   * plans which has subquery expressions. This method returns a [[CanonicalizedSubqueryExpr]]
    +   * which wraps the underlying [[SubqueryExpression]].
    +   */
    +  def canonicalize(e: SubqueryExpression, attrs: AttributeSeq): CanonicalizedSubqueryExpr = {
    +    // Normalize the outer references in the subquery plan.
    +    val subPlan = e.plan.transformAllExpressions {
    +      case OuterReference(r) => BindReferences.bindReference(r, attrs, allowFailures = true)
    --- End diff --
    
    if we bind all reference here, can we solve the problem that reordering is nondeterministic?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17330: [SPARK-19993][SQL][WIP] Caching logical plans containing...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17330
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17330: [SPARK-19993][SQL][WIP] Caching logical plans containing...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17330
  
    **[Test build #74772 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74772/testReport)** for PR 17330 at commit [`2b9d717`](https://github.com/apache/spark/commit/2b9d717be385b8aab2f720047cc248ed2fac6bb8).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17330: [SPARK-19993][SQL] Caching logical plans containing subq...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17330
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/75710/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org