Posted to reviews@spark.apache.org by chenghao-intel <gi...@git.apache.org> on 2015/10/10 03:11:34 UTC

[GitHub] spark pull request: [SPARK-4226][SQL]Add subquery (not) in/exists ...

GitHub user chenghao-intel opened a pull request:

    https://github.com/apache/spark/pull/9055

    [SPARK-4226][SQL]Add subquery (not) in/exists support

    Some known features are not supported yet and will be added later.
    
    We don't support an outer UDAF used in a correlated subquery combined with an
    outer HAVING clause, because that requires an implicit change to the outer query's projection.
    ```sql
    select b.key, min(b.value)
    from src b
    group by b.key
    having exists (
        select a.key
        from src a
        where a.value > 'val_9' and a.value = min(b.value) -- min(b.value) implicitly requires the outer query to add another field to its projection.
    )
    ```
    
    We don't support multiple references to the outer query that appear in both the projection and the filter clause of the subquery.
    ```sql
    select key, value
    from src b
    where value in
        (select s1.key + b.key
         from src s1
         where s1.key > '9' and s1.value = b.value) -- both b.key and b.value appear in the subquery, in the projection and the filter clause respectively.
    ```

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/chenghao-intel/spark anti_join

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/9055.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #9055
    
----
commit e3aa2553cc3eeb78f8bd15a5f97ccd97032bf954
Author: Cheng Hao <ha...@intel.com>
Date:   2015-10-10T00:45:39Z

    add subquery (not) in/exists support

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-4226][SQL]Add subquery (not) in/exists ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9055#issuecomment-148563285
  
    Merged build finished. Test FAILed.




[GitHub] spark pull request: [SPARK-4226][SQL]Add subquery (not) in/exists ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9055#issuecomment-147020852
  
    Merged build finished. Test FAILed.




[GitHub] spark pull request: [SPARK-4226][SQL]Add subquery (not) in/exists ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9055#issuecomment-147030426
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/43508/
    Test FAILed.




[GitHub] spark pull request: [SPARK-4226][SQL]Add subquery (not) in/exists ...

Posted by chenghao-intel <gi...@git.apache.org>.
Github user chenghao-intel commented on a diff in the pull request:

    https://github.com/apache/spark/pull/9055#discussion_r42593468
  
    --- Diff: sql/hive/compatibility/src/test/scala/org/apache/spark/sql/hive/execution/HiveCompatibilitySuite.scala ---
    @@ -265,6 +265,32 @@ class HiveCompatibilitySuite extends HiveQueryFileTest with BeforeAndAfter {
         // the answer is sensitive for jdk version
         "udf_java_method",
     
    +    // TODO Hive window function cannot be serialized in generating the golden file
    +    "subquery_in",
    +    // As we don't support the outer UDAF function used in the correlated query, combined with
    +    // outer having clause: e.g.:
    +    //    select b.key, min(b.value)
    +    //    from src b
    +    //    group by b.key
    +    //    having exists ( select a.key
    +    //    from src a
    +    //      where a.value > 'val_9' and a.value = min(b.value)
    +    //    )
    +    // It throws exception like
    +    // "cannot resolve 'b.value' given input columns key, _c1, key, value;"
    +    // As the outer aggregation doesn't output the field 'value, we need rule
    +    // for further resovling the having expressions.
    +    "subquery_notin_having",
    +    "subquery_exists_having",
    --- End diff --
    
    Actually, I was planning to support HAVING in follow-up PRs; it requires more code changes in the analyzer.




[GitHub] spark pull request: [SPARK-4226][SQL]Add subquery (not) in/exists ...

Posted by chenghao-intel <gi...@git.apache.org>.
Github user chenghao-intel commented on the pull request:

    https://github.com/apache/spark/pull/9055#issuecomment-147266579
  
    retest this please




[GitHub] spark pull request: [SPARK-4226][SQL]Add subquery (not) in/exists ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9055#issuecomment-148570292
  
    Merged build started.




[GitHub] spark pull request: [SPARK-4226][SQL]Add subquery (not) in/exists ...

Posted by yhuai <gi...@git.apache.org>.
Github user yhuai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/9055#discussion_r42581795
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ---
    @@ -270,6 +273,146 @@ class Analyzer(
       }
     
       /**
    +   * Rewrite the [[Exists]] [[In]] with left semi join or anti join.
    +   */
    +  object RewriteFilterSubQuery extends Rule[LogicalPlan] with PredicateHelper {
    +    def unapply(condition: Expression): Option[(Expression, Seq[Expression])] = {
    +      if (condition.resolved == false) {
    +        return None
    +      }
    +
    +      val conjuctions = splitConjunctivePredicates(condition).map(_ transformDown {
    +          // Remove the Cast expression for SubQueryExpression.
    +          case Cast(f: SubQueryExpression, BooleanType) => f
    +        }
    +      )
    +
    +      val (subqueries, others) = conjuctions.partition(c => c.isInstanceOf[SubQueryExpression])
    +      if (subqueries.isEmpty) {
    +        None
    +      } else if (subqueries.length > 1) {
    +        throw new AnalysisException(
    +          s"Only 1 SubQuery expression is supported, but we got $subqueries")
    +      } else {
    +        val subQueryExpr = subqueries(0).asInstanceOf[SubQueryExpression]
    +        // try to resolve the subquery
    +
    +        val subquery = Analyzer.this.execute(subQueryExpr.subquery) match {
    +          case Distinct(child) => child // Distinct is useless for semi join, ignore it.
    +          case other => other
    +        }
    +        Some((subQueryExpr.withNewSubQuery(subquery), others))
    +      }
    +    }
    +
    +    def apply(plan: LogicalPlan): LogicalPlan = plan transformUp {
    +      case f if f.childrenResolved == false => f
    +
    +      case f @ Filter(RewriteFilterSubQuery(subquery, others), left) =>
    +        subquery match {
    +          case Exists(Project(_, Filter(condition, right)), positive) =>
    +            checkAnalysis(right)
    +            if (condition.resolved) {
    +              // Apparently, it should be not resolved here, since EXIST should be correlated.
    +              throw new AnalysisException(
    +                s"Exist clause should be correlated, but we got $condition")
    +            }
    +            Join(others.reduceOption(And).map(Filter(_, left)).getOrElse(left), right,
    +              if (positive) LeftSemi else LeftAnti,
    +              Some(ResolveReferences.tryResolveAttributes(condition, right)))
    +
    +          case Exists(right, positive) =>
    +            throw new AnalysisException(s"Exist clause should be correlated, but we got $right")
    +
    +          case InSubquery(key, Project(projectList, Filter(condition, right)), positive) =>
    +            checkAnalysis(right)
    --- End diff --
    
    I think we need a comment here to explain why this check is needed.




[GitHub] spark pull request: [SPARK-4226][SQL]Add subquery (not) in/exists ...

Posted by roland-mendix <gi...@git.apache.org>.
Github user roland-mendix commented on the pull request:

    https://github.com/apache/spark/pull/9055#issuecomment-165722584
  
    We've added our own In/Exists support, plus subqueries in Select, to a partial fork of Spark SQL Catalyst (which we use in transformations from our own query language to SQL for relational databases). But In, Exists and Select projections are Expressions that then contain LogicalPlans (Subquery/Select with nested LogicalPlans, which in turn may contain nested Expressions). This makes whole-tree transformations rather convoluted, since we have to deal with 'pivot points' between these two types of TreeNode, where a recursive transformation can only be done on one specific type. Why was the choice made in Catalyst to make LogicalPlan/QueryPlan and Expression separate subclasses of TreeNode, instead of, e.g., also making QueryPlan inherit from Expression? The code also contains duplicated functionality, such as LeafNode/LeafExpression, UnaryNode/UnaryExpression and BinaryNode/BinaryExpression. Much obliged.




[GitHub] spark pull request: [SPARK-4226][SQL]Add subquery (not) in/exists ...

Posted by chenghao-intel <gi...@git.apache.org>.
Github user chenghao-intel commented on the pull request:

    https://github.com/apache/spark/pull/9055#issuecomment-147558163
  
    cc @rxin as well. This is required by many of our customers, and most of the code change is unit tests, so it should not be hard to follow.




[GitHub] spark pull request: [SPARK-4226][SQL]Add subquery (not) in/exists ...

Posted by chenghao-intel <gi...@git.apache.org>.
Github user chenghao-intel commented on a diff in the pull request:

    https://github.com/apache/spark/pull/9055#discussion_r42583933
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ---
    @@ -270,6 +273,146 @@ class Analyzer(
       }
     
       /**
    +   * Rewrite the [[Exists]] [[In]] with left semi join or anti join.
    +   */
    +  object RewriteFilterSubQuery extends Rule[LogicalPlan] with PredicateHelper {
    +    def unapply(condition: Expression): Option[(Expression, Seq[Expression])] = {
    +      if (condition.resolved == false) {
    +        return None
    +      }
    +
    +      val conjuctions = splitConjunctivePredicates(condition).map(_ transformDown {
    +          // Remove the Cast expression for SubQueryExpression.
    +          case Cast(f: SubQueryExpression, BooleanType) => f
    +        }
    +      )
    +
    +      val (subqueries, others) = conjuctions.partition(c => c.isInstanceOf[SubQueryExpression])
    +      if (subqueries.isEmpty) {
    +        None
    +      } else if (subqueries.length > 1) {
    +        throw new AnalysisException(
    +          s"Only 1 SubQuery expression is supported, but we got $subqueries")
    +      } else {
    +        val subQueryExpr = subqueries(0).asInstanceOf[SubQueryExpression]
    +        // try to resolve the subquery
    +
    +        val subquery = Analyzer.this.execute(subQueryExpr.subquery) match {
    +          case Distinct(child) => child // Distinct is useless for semi join, ignore it.
    +          case other => other
    +        }
    +        Some((subQueryExpr.withNewSubQuery(subquery), others))
    +      }
    +    }
    +
    +    def apply(plan: LogicalPlan): LogicalPlan = plan transformUp {
    +      case f if f.childrenResolved == false => f
    +
    +      case f @ Filter(RewriteFilterSubQuery(subquery, others), left) =>
    +        subquery match {
    +          case Exists(Project(_, Filter(condition, right)), positive) =>
    +            checkAnalysis(right)
    +            if (condition.resolved) {
    +              // Apparently, it should be not resolved here, since EXIST should be correlated.
    --- End diff --
    
    Actually, an EXISTS subquery should ONLY be correlated; otherwise it's probably meaningless.
    
    This is what I get from Hive:
    ```
    hive> select value from src a where exists (select key from src);
    FAILED: SemanticException Line 1:54 Invalid SubQuery expression 'key' in definition of SubQuery sq_1 [
    exists (select key from src)
    ] used as sq_1 at Line 1:30: For Exists/Not Exists operator SubQuery must be Correlated.
    
    hive> select value from src a where exists (select key from src where key > 10000);
    FAILED: SemanticException Line 1:54 Invalid SubQuery expression '10000' in definition of SubQuery sq_1 [
    exists (select key from src where key > 10000)
    ] used as sq_1 at Line 1:30: For Exists/Not Exists operator SubQuery must be Correlated.
    
    // and it fails even for
    hive> select value from src a where exists (select key+a.key from src where key > 10000);
    FAILED: SemanticException Line 1:60 Invalid SubQuery expression '10000' in definition of SubQuery sq_1 [
    exists (select key+a.key from src where key > 10000)
    ] used as sq_1 at Line 1:30: For Exists/Not Exists operator SubQuery must be Correlated.
    ```




[GitHub] spark pull request: [SPARK-4226][SQL]Add subquery (not) in/exists ...

Posted by chenghao-intel <gi...@git.apache.org>.
Github user chenghao-intel commented on a diff in the pull request:

    https://github.com/apache/spark/pull/9055#discussion_r42586670
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/joinTypes.scala ---
    @@ -47,4 +48,9 @@ case object RightOuter extends JoinType
     
     case object FullOuter extends JoinType
     
    -case object LeftSemi extends JoinType
    +abstract class LeftSemiJoin extends JoinType
    --- End diff --
    
    Making a common abstract parent class for `LeftSemiJoin` and `LeftAntiJoin` will reduce the code change in `Optimizer` and `Strategies`, but you're right, I didn't give it a name that properly reflects the concept. I'm not sure whether you have a better idea.
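
A minimal sketch of the shared-parent idea described above, using stand-in definitions rather than the actual contents of `joinTypes.scala`. The parent name `LeftExistenceJoin` and the helper `outputsLeftOnly` are assumptions made for illustration only.

```scala
// Hypothetical stand-ins for the join types; not the actual Spark definitions.
sealed trait JoinType
case object Inner extends JoinType
case object FullOuter extends JoinType

// Common parent for joins that only ever emit rows from the left side.
// The name LeftExistenceJoin is an assumption made for this sketch.
sealed abstract class LeftExistenceJoin extends JoinType
case object LeftSemi extends LeftExistenceJoin
case object LeftAnti extends LeftExistenceJoin

object JoinTypeSketch {
  // One type pattern covers both subtypes, so code that only cares that
  // the output is a subset of the left input needs no per-type case.
  def outputsLeftOnly(joinType: JoinType): Boolean = joinType match {
    case _: LeftExistenceJoin => true
    case _ => false
  }
}
```

With such a parent in place, existing matches on the semi join could in principle be widened to the parent type instead of being duplicated for the anti join.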




[GitHub] spark pull request: [SPARK-4226][SQL]Add subquery (not) in/exists ...

Posted by yhuai <gi...@git.apache.org>.
Github user yhuai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/9055#discussion_r42581790
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ---
    @@ -270,6 +273,146 @@ class Analyzer(
       }
     
       /**
    +   * Rewrite the [[Exists]] [[In]] with left semi join or anti join.
    +   */
    +  object RewriteFilterSubQuery extends Rule[LogicalPlan] with PredicateHelper {
    +    def unapply(condition: Expression): Option[(Expression, Seq[Expression])] = {
    +      if (condition.resolved == false) {
    +        return None
    +      }
    +
    +      val conjuctions = splitConjunctivePredicates(condition).map(_ transformDown {
    +          // Remove the Cast expression for SubQueryExpression.
    +          case Cast(f: SubQueryExpression, BooleanType) => f
    +        }
    +      )
    +
    +      val (subqueries, others) = conjuctions.partition(c => c.isInstanceOf[SubQueryExpression])
    +      if (subqueries.isEmpty) {
    +        None
    +      } else if (subqueries.length > 1) {
    +        throw new AnalysisException(
    +          s"Only 1 SubQuery expression is supported, but we got $subqueries")
    +      } else {
    +        val subQueryExpr = subqueries(0).asInstanceOf[SubQueryExpression]
    +        // try to resolve the subquery
    +
    +        val subquery = Analyzer.this.execute(subQueryExpr.subquery) match {
    +          case Distinct(child) => child // Distinct is useless for semi join, ignore it.
    +          case other => other
    +        }
    +        Some((subQueryExpr.withNewSubQuery(subquery), others))
    +      }
    +    }
    +
    +    def apply(plan: LogicalPlan): LogicalPlan = plan transformUp {
    +      case f if f.childrenResolved == false => f
    +
    +      case f @ Filter(RewriteFilterSubQuery(subquery, others), left) =>
    +        subquery match {
    +          case Exists(Project(_, Filter(condition, right)), positive) =>
    +            checkAnalysis(right)
    +            if (condition.resolved) {
    +              // Apparently, it should be not resolved here, since EXIST should be correlated.
    +              throw new AnalysisException(
    +                s"Exist clause should be correlated, but we got $condition")
    +            }
    +            Join(others.reduceOption(And).map(Filter(_, left)).getOrElse(left), right,
    +              if (positive) LeftSemi else LeftAnti,
    +              Some(ResolveReferences.tryResolveAttributes(condition, right)))
    --- End diff --
    
    Can we split this statement into multiple ones for better readability?
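
One way the `Join(...)` construction could be split into named steps, sketched with toy stand-in types rather than the real Catalyst classes. `Plan`, `Relation`, `Filter`, `Join` and `rewriteExists` below are all hypothetical, and predicates are plain strings to keep the example self-contained.

```scala
// Toy plan model; hypothetical stand-ins, not Catalyst classes.
sealed trait Plan
case class Relation(name: String) extends Plan
case class Filter(condition: String, child: Plan) extends Plan
case class Join(left: Plan, right: Plan, joinType: String, condition: Option[String]) extends Plan

object ReadabilitySketch {
  def rewriteExists(
      others: Seq[String],   // predicates of the outer filter other than the subquery
      left: Plan,            // outer query input
      right: Plan,           // subquery input
      positive: Boolean,     // true for EXISTS, false for NOT EXISTS
      joinCondition: String): Plan = {
    // Step 1: re-apply the remaining outer predicates, if there are any.
    val filteredLeft = others
      .reduceOption((a, b) => s"($a AND $b)")
      .map(Filter(_, left))
      .getOrElse(left)
    // Step 2: pick the join type from the polarity of the predicate.
    val joinType = if (positive) "LeftSemi" else "LeftAnti"
    // Step 3: assemble the join from the named parts.
    Join(filteredLeft, right, joinType, Some(joinCondition))
  }
}
```

Each intermediate value gets a name that states its role, which also gives a natural place for the explanatory comments requested elsewhere in this review.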




[GitHub] spark pull request: [SPARK-4226][SQL]Add subquery (not) in/exists ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/9055#issuecomment-147074055
  
      [Test build #43528 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/43528/console) for   PR 9055 at commit [`ab22171`](https://github.com/apache/spark/commit/ab22171a437497eaa26f55ca661cfc3dcb478b71).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `trait SubQueryExpression extends Unevaluable `
      * `case class Exists(subquery: LogicalPlan, positive: Boolean)`
      * `case class InSubquery(child: Expression, subquery: LogicalPlan, positive: Boolean)`





[GitHub] spark pull request: [SPARK-4226][SQL]Add subquery (not) in/exists ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9055#issuecomment-148326620
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/43782/
    Test FAILed.




[GitHub] spark pull request: [SPARK-4226][SQL]Add subquery (not) in/exists ...

Posted by yhuai <gi...@git.apache.org>.
Github user yhuai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/9055#discussion_r42581800
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ---
    @@ -270,6 +273,146 @@ class Analyzer(
       }
     
       /**
    +   * Rewrite the [[Exists]] [[In]] with left semi join or anti join.
    +   */
    +  object RewriteFilterSubQuery extends Rule[LogicalPlan] with PredicateHelper {
    +    def unapply(condition: Expression): Option[(Expression, Seq[Expression])] = {
    +      if (condition.resolved == false) {
    +        return None
    +      }
    +
    +      val conjuctions = splitConjunctivePredicates(condition).map(_ transformDown {
    +          // Remove the Cast expression for SubQueryExpression.
    +          case Cast(f: SubQueryExpression, BooleanType) => f
    +        }
    +      )
    +
    +      val (subqueries, others) = conjuctions.partition(c => c.isInstanceOf[SubQueryExpression])
    +      if (subqueries.isEmpty) {
    +        None
    +      } else if (subqueries.length > 1) {
    +        throw new AnalysisException(
    +          s"Only 1 SubQuery expression is supported, but we got $subqueries")
    +      } else {
    +        val subQueryExpr = subqueries(0).asInstanceOf[SubQueryExpression]
    +        // try to resolve the subquery
    +
    +        val subquery = Analyzer.this.execute(subQueryExpr.subquery) match {
    +          case Distinct(child) => child // Distinct is useless for semi join, ignore it.
    +          case other => other
    +        }
    +        Some((subQueryExpr.withNewSubQuery(subquery), others))
    +      }
    +    }
    +
    +    def apply(plan: LogicalPlan): LogicalPlan = plan transformUp {
    +      case f if f.childrenResolved == false => f
    +
    +      case f @ Filter(RewriteFilterSubQuery(subquery, others), left) =>
    +        subquery match {
    +          case Exists(Project(_, Filter(condition, right)), positive) =>
    +            checkAnalysis(right)
    +            if (condition.resolved) {
    +              // Apparently, it should be not resolved here, since EXIST should be correlated.
    +              throw new AnalysisException(
    +                s"Exist clause should be correlated, but we got $condition")
    +            }
    +            Join(others.reduceOption(And).map(Filter(_, left)).getOrElse(left), right,
    +              if (positive) LeftSemi else LeftAnti,
    +              Some(ResolveReferences.tryResolveAttributes(condition, right)))
    +
    +          case Exists(right, positive) =>
    +            throw new AnalysisException(s"Exist clause should be correlated, but we got $right")
    +
    +          case InSubquery(key, Project(projectList, Filter(condition, right)), positive) =>
    +            checkAnalysis(right)
    +            if (projectList.length != 1) {
    +              throw new AnalysisException(
    +                s"Expect only 1 projection in In Subquery Expression, but we got $projectList")
    +            } else {
    +              val rightKey = ResolveReferences.tryResolveAttributes(projectList(0), right)
    +
    +              if (!rightKey.resolved) {
    +                throw new AnalysisException(
    +                  s"Outer query expression should be only presented at the filter clause, " +
    +                    s"but we got $rightKey")
    +              }
    +              Join(others.reduceOption(And).map(Filter(_, left)).getOrElse(left), right,
    +                if (positive) LeftSemi else LeftAnti,
    +                Some(
    +                  And(
    +                    ResolveReferences.tryResolveAttributes(condition, right),
    +                    EqualTo(rightKey, key))))
    --- End diff --
    
    It would be better to split it into multiple lines.




[GitHub] spark pull request: [SPARK-4226][SQL]Add subquery (not) in/exists ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/9055#issuecomment-149899904
  
    **[Test build #44064 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44064/consoleFull)** for PR 9055 at commit [`cb69166`](https://github.com/apache/spark/commit/cb69166b1aad6f43a6cb9d400f7b36155d845367).




[GitHub] spark pull request: [SPARK-4226][SQL]Add subquery (not) in/exists ...

Posted by yhuai <gi...@git.apache.org>.
Github user yhuai commented on the pull request:

    https://github.com/apache/spark/pull/9055#issuecomment-149777477
  
    Two general comments. First, we need to add documentation explaining how we rewrite a plan when (1) there is an uncorrelated subquery and (2) there is a correlated subquery. Second, I wonder whether these rewriting rules can be made more concise. For uncorrelated subqueries, the subquery itself should be a resolved logical plan, right? For correlated subqueries, we only need to extract the conditions that refer to columns in the outer query block, right? Do we really need to match all of those specific patterns? Can we have some more general logic?
    
    Actually, does this PR try to support uncorrelated in/not in/exists/not exists subqueries?
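    
    One way to picture the "general logic" asked about here is to split the subquery's conjunctive predicates by whether they reference the outer query block. The following is only a sketch under assumptions: `Predicate` and `split` are hypothetical names, and tracking table aliases is a stand-in for Catalyst's attribute references.
    
    ```scala
    object CorrelationSplitSketch {
      // A predicate is modelled only by the table aliases it references (hypothetical type).
      final case class Predicate(sql: String, aliases: Set[String])
    
      // Conjuncts that mention an outer alias would become the join condition;
      // the rest can stay inside the subquery as an ordinary filter.
      def split(
          conjuncts: Seq[Predicate],
          outerAliases: Set[String]): (Seq[Predicate], Seq[Predicate]) =
        conjuncts.partition(p => p.aliases.exists(outerAliases.contains))
    }
    ```
    
    With outer alias `b`, the conjunct `b.value = a.value` lands in the first group and `a.key > 9` in the second.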




[GitHub] spark pull request: [SPARK-4226][SQL]Add subquery (not) in/exists ...

Posted by yhuai <gi...@git.apache.org>.
Github user yhuai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/9055#discussion_r42581781
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ---
    @@ -270,6 +273,146 @@ class Analyzer(
       }
     
       /**
    +   * Rewrite the [[Exists]] [[In]] with left semi join or anti join.
    +   */
    --- End diff --
    
    For the scaladoc, can we add details on how this rule works?




[GitHub] spark pull request: [SPARK-4226][SQL]Add subquery (not) in/exists ...

Posted by yhuai <gi...@git.apache.org>.
Github user yhuai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/9055#discussion_r42581807
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/unresolved.scala ---
    @@ -263,3 +263,50 @@ case class UnresolvedAlias(child: Expression)
     
       override lazy val resolved = false
     }
    +
    +trait SubQueryExpression extends Unevaluable {
    +  def subquery: LogicalPlan
    +
    +  override def dataType: DataType = BooleanType
    +  override def foldable: Boolean = false
    +  override def nullable: Boolean = false
    +
    +  def withNewSubQuery(newSubquery: LogicalPlan): this.type
    +}
    +
    +/**
    + * Exists subquery expression; used only in filters.
    + */
    +case class Exists(subquery: LogicalPlan, positive: Boolean)
    +  extends LeafExpression with SubQueryExpression {
    +  override def withNewSubQuery(newSubquery: LogicalPlan): this.type = {
    +    this.copy(subquery = newSubquery).asInstanceOf[this.type]
    +  }
    +
    +  override lazy val resolved = true
    +
    +  override def toString: String = if (positive) {
    +    s"Exists(${subquery.asCode})"
    +  } else {
    +    s"NotExists(${subquery.asCode})"
    +  }
    +}
    +
    +/**
    + * In subquery expression; used only in filters.
    + */
    +case class InSubquery(child: Expression, subquery: LogicalPlan, positive: Boolean)
    --- End diff --
    
    I think we need to explain the meaning of `child` here.




[GitHub] spark pull request: [SPARK-4226][SQL]Add subquery (not) in/exists ...

Posted by chenghao-intel <gi...@git.apache.org>.
Github user chenghao-intel commented on a diff in the pull request:

    https://github.com/apache/spark/pull/9055#discussion_r42620776
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/joinTypes.scala ---
    @@ -47,4 +48,9 @@ case object RightOuter extends JoinType
     
     case object FullOuter extends JoinType
     
    -case object LeftSemi extends JoinType
    +abstract class LeftSemiJoin extends JoinType
    --- End diff --
    
    Well, actually, according to https://en.wikipedia.org/wiki/Relational_algebra, an `anti-join` is also sometimes called an `anti-semijoin`. That is why I made `LeftSemiJoin` an abstract class with two objects, `LeftSemi` and `LeftAnti`.
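    
    To make the naming concrete, here is a minimal, self-contained sketch of that hierarchy together with the semantics of the two join types on plain Scala collections. `Row`, `semiJoin`, and the collection-based evaluation are illustrative assumptions, not code from this PR.
    
    ```scala
    object SemiJoinSketch {
      // Hypothetical stand-ins for the join types in the diff; only the shape is mirrored.
      sealed abstract class JoinType
      sealed abstract class LeftSemiJoin extends JoinType
      case object LeftSemi extends LeftSemiJoin // keep left rows that have a match
      case object LeftAnti extends LeftSemiJoin // keep left rows that have no match
    
      final case class Row(key: Int, value: String)
    
      // Evaluate a semi or anti join; the output contains only left-side rows.
      def semiJoin(left: Seq[Row], right: Seq[Row], joinType: LeftSemiJoin)(
          cond: (Row, Row) => Boolean): Seq[Row] = {
        val keepMatched = joinType match {
          case LeftSemi => true
          case LeftAnti => false
        }
        left.filter(l => right.exists(r => cond(l, r)) == keepMatched)
      }
    }
    ```
    
    The two objects differ only in whether a matching right-side row keeps or drops the left-side row, which is the reason for grouping them under one parent in this sketch.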




[GitHub] spark pull request: [SPARK-4226][SQL]Add subquery (not) in/exists ...

Posted by chenghao-intel <gi...@git.apache.org>.
Github user chenghao-intel commented on a diff in the pull request:

    https://github.com/apache/spark/pull/9055#discussion_r42584123
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ---
    @@ -270,6 +273,146 @@ class Analyzer(
       }
     
       /**
    +   * Rewrite the [[Exists]] [[In]] with left semi join or anti join.
    +   */
    +  object RewriteFilterSubQuery extends Rule[LogicalPlan] with PredicateHelper {
    +    def unapply(condition: Expression): Option[(Expression, Seq[Expression])] = {
    +      if (condition.resolved == false) {
    +        return None
    +      }
    +
    +      val conjuctions = splitConjunctivePredicates(condition).map(_ transformDown {
    +          // Remove the Cast expression for SubQueryExpression.
    +          case Cast(f: SubQueryExpression, BooleanType) => f
    +        }
    +      )
    +
    +      val (subqueries, others) = conjuctions.partition(c => c.isInstanceOf[SubQueryExpression])
    +      if (subqueries.isEmpty) {
    +        None
    +      } else if (subqueries.length > 1) {
    +        throw new AnalysisException(
    +          s"Only 1 SubQuery expression is supported, but we got $subqueries")
    +      } else {
    +        val subQueryExpr = subqueries(0).asInstanceOf[SubQueryExpression]
    +        // try to resolve the subquery
    +
    +        val subquery = Analyzer.this.execute(subQueryExpr.subquery) match {
    +          case Distinct(child) => child // Distinct is useless for semi join, ignore it.
    +          case other => other
    +        }
    +        Some((subQueryExpr.withNewSubQuery(subquery), others))
    +      }
    +    }
    +
    +    def apply(plan: LogicalPlan): LogicalPlan = plan transformUp {
    +      case f if f.childrenResolved == false => f
    +
    +      case f @ Filter(RewriteFilterSubQuery(subquery, others), left) =>
    +        subquery match {
    +          case Exists(Project(_, Filter(condition, right)), positive) =>
    --- End diff --
    
    This is the only case we support right now for `(NOT) EXISTS`. It is the correlated case, where the correlated reference appears in the `Filter` operator:
    `SELECT value FROM src a WHERE EXISTS (SELECT key FROM src b WHERE a.key=b.key AND a.key> 100)`.
    
    These are the unrelated cases:
    `SELECT value FROM src a WHERE EXISTS (SELECT key FROM src b)`. // without Filter
    `SELECT value FROM src a WHERE EXISTS (SELECT key FROM src b WHERE a.key> 100)`. // the Filter condition is resolved, so this is definitely an unrelated reference.
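    
    The supported pattern can be sketched on a toy plan representation. Everything below is hypothetical and far simpler than Catalyst (`Plan`, `Cond`, `Pred`, string join types and conditions); it only illustrates how a `Filter` over an `Exists` whose subquery is `Project(Filter(...))` could turn into a semi or anti join, and it assumes the whole inner condition becomes the join condition.
    
    ```scala
    object ExistsRewriteSketch {
      // Toy logical plan and condition types (assumptions, not Catalyst classes).
      sealed trait Plan
      final case class Relation(name: String) extends Plan
      final case class Project(columns: Seq[String], child: Plan) extends Plan
      final case class Filter(condition: Cond, child: Plan) extends Plan
      final case class Join(left: Plan, right: Plan, joinType: String, on: String) extends Plan
    
      sealed trait Cond
      final case class Pred(sql: String) extends Cond
      final case class Exists(subquery: Plan, positive: Boolean) extends Cond
    
      // Rewrite Filter(Exists(Project(Filter(cond, right)))) into a semi or anti join.
      // The projection is dropped because such a join outputs only left-side columns.
      // Any other shape is returned unchanged.
      def rewrite(plan: Plan): Plan = plan match {
        case Filter(Exists(Project(_, Filter(Pred(cond), right)), positive), left) =>
          Join(left, right, if (positive) "LeftSemi" else "LeftAnti", cond)
        case other => other
      }
    }
    ```
    
    In this sketch a subquery without a `Filter` does not match the pattern and is left untouched.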




[GitHub] spark pull request: [SPARK-4226][SQL]Add subquery (not) in/exists ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9055#issuecomment-147019708
  
    Merged build started.




[GitHub] spark pull request: [SPARK-4226][SQL]Add subquery (not) in/exists ...

Posted by chenghao-intel <gi...@git.apache.org>.
Github user chenghao-intel commented on a diff in the pull request:

    https://github.com/apache/spark/pull/9055#discussion_r42617393
  
    --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveSemiJoinSuite.scala ---
    @@ -0,0 +1,450 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.sql.hive.execution
    +
    +import org.apache.spark.sql.{SQLConf, AnalysisException}
    +import org.scalatest.BeforeAndAfter
    +
    +import org.apache.spark.sql.hive.test.TestHive
    +
    +/**
    + * A test suite about the IN /NOT IN /EXISTS / NOT EXISTS subquery.
    + */
    +abstract class HiveSemiJoinSuite extends HiveComparisonTest with BeforeAndAfter {
    +  import org.apache.spark.sql.hive.test.TestHive.implicits._
    +  import org.apache.spark.sql.hive.test.TestHive._
    +
    +  private val confSortMerge = TestHive.getConf(SQLConf.SORTMERGE_JOIN)
    +  private val confBroadcastJoin = TestHive.getConf(SQLConf.AUTO_BROADCASTJOIN_THRESHOLD)
    +  private val confCodegen = TestHive.getConf(SQLConf.CODEGEN_ENABLED)
    +  private val confTungsten = TestHive.getConf(SQLConf.TUNGSTEN_ENABLED)
    +
    +  def enableSortMerge(enable: Boolean): Unit = {
    +    TestHive.setConf(SQLConf.SORTMERGE_JOIN, enable)
    +  }
    +
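    +  def enableBroadcastJoin(enable: Boolean): Unit = {
    +    // A threshold of -1 disables automatic broadcast joins;
    +    // Int.MaxValue effectively removes the size limit.
    +    if (enable) {
    +      TestHive.setConf(SQLConf.AUTO_BROADCASTJOIN_THRESHOLD, Int.MaxValue)
    +    } else {
    +      TestHive.setConf(SQLConf.AUTO_BROADCASTJOIN_THRESHOLD, -1)
    +    }
    +  }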
    +
    +  def enableCodeGen(enable: Boolean): Unit = {
    +    TestHive.setConf(SQLConf.CODEGEN_ENABLED, enable)
    +  }
    +
    +  def enableTungsten(enable: Boolean): Unit = {
    +    TestHive.setConf(SQLConf.TUNGSTEN_ENABLED, enable)
    +  }
    +
    +  override def beforeAll() {
    +    // override this method to update the configuration
    +    TestHive.cacheTables = true
    +  }
    +
    +  override def afterAll() {
    +    // restore the configuration
    +    TestHive.setConf(SQLConf.SORTMERGE_JOIN, confSortMerge)
    +    TestHive.setConf(SQLConf.AUTO_BROADCASTJOIN_THRESHOLD, confBroadcastJoin)
    +    TestHive.setConf(SQLConf.CODEGEN_ENABLED, confCodegen)
    +    TestHive.setConf(SQLConf.TUNGSTEN_ENABLED, confTungsten)
    +    TestHive.cacheTables = false
    +  }
    +
    +  ignore("reference the expression `min(b.value)` that required implicit change the outer query") {
    +    sql("""select b.key, min(b.value)
    +      |from src b
    +      |group by b.key
    +      |having exists ( select a.key
    +      |from src a
    +      |where a.value > 'val_9' and a.value = min(b.value))""".stripMargin)
    +  }
    +
    +  ignore("multiple reference the outer query variables") {
    +    sql("""select key, value, count(*)
    +      |from src b
    +      |group by key, value
    +      |having count(*) in (
    +      |  select count(*)
    +      |  from src s1
    +      |  where s1.key > '9' and s1.value = b.value
    +      |  group by s1.key)""".stripMargin)
    +  }
    +
    +  // IN Subquery Unit tests
    +  createQueryTest("(unrelated)WHERE clause with IN #1",
    +    """select *
    +    |from src
    +    |where key in (select key from src)
    +    |order by key, value LIMIT 5""".stripMargin)
    +
    +  createQueryTest("(unrelated)WHERE clause with NOT IN #1",
    +    """select *
    +    |from src
    +    |where key not in (select key from src)
    +    |order by key, value LIMIT 5""".stripMargin)
    +
    +  createQueryTest("(unrelated)WHERE clause with IN #2",
    +    """select *
    +    |from src
    +    |where src.key in (select t.key from src t)
    +    |order by key, value LIMIT 5""".stripMargin)
    +
    +  createQueryTest("(unrelated)WHERE clause with NOT IN #2",
    +    """select *
    +    |from src
    +    |where src.key not in (select t.key % 193 from src t)
    +    |order by key, value LIMIT 5""".stripMargin)
    +
    +  createQueryTest(
    +    "(unrelated)WHERE clause with IN #3",
    +    """select *
    +      |from src
    +      |where src.key in (select key from src s1 where s1.key > 9)
    +      |order by key, value LIMIT 5""".stripMargin)
    +
    +  createQueryTest(
    +    "(unrelated)WHERE clause with NOT IN #3",
    +    """select *
    +      |from src
    +      |where src.key not in (select key from src s1 where s1.key > 9)
    +      |order by key, value LIMIT 5""".stripMargin)
    +
    +  createQueryTest(
    +    "(unrelated)WHERE clause with IN #4",
    +    """select *
    +      |from src
    +      |where src.key in (select max(s1.key) from src s1 group by s1.value)
    +      |order by key, value LIMIT 5""".stripMargin)
    +
    +  createQueryTest(
    +    "(unrelated)WHERE clause with NOT IN #4",
    +    """select *
    +      |from src
    +      |where src.key not in (select max(s1.key) % 31 from src s1 group by s1.value)
    +      |order by key, value LIMIT 5""".stripMargin)
    +
    +  createQueryTest(
    +    "(unrelated)WHERE clause with IN #5",
    +    """select *
    +      |from src
    +      |where src.key in
    +      |(select max(s1.key) from src s1 group by s1.value having max(s1.key) > 3)
    +      |order by key, value LIMIT 5""".stripMargin)
    +
    +  createQueryTest(
    +    "(unrelated)WHERE clause with NOT IN #5",
    +    """select *
    +      |from src
    +      |where src.key not in
    +      |(select max(s1.key) from src s1 group by s1.value having max(s1.key) > 3)
    +      |order by key, value LIMIT 5""".stripMargin)
    +
    +  createQueryTest(
    +    "(unrelated)WHERE clause with IN #6",
    +    """select *
    +      |from src
    +      |where src.key in
    +      |(select max(s1.key) from src s1 group by s1.value having max(s1.key) > 3)
    +      |      and src.key > 10
    +      |order by key, value LIMIT 5""".stripMargin)
    +
    +  createQueryTest(
    +    "(unrelated)WHERE clause with NOT IN #6",
    +    """select *
    +      |from src
    +      |where src.key not in
    +      |(select max(s1.key) % 31 from src s1 group by s1.value having max(s1.key) > 3)
    +      |      and src.key > 10
    +      |order by key, value LIMIT 5""".stripMargin)
    +
    +  createQueryTest(
    +    "(unrelated)WHERE clause with IN #7",
    +    """select *
    +      |  from src b
    +      |where b.key in
    +      |(select count(*)
    +      |  from src a
    +      |  where a.key > 100
    +      |) and b.key < 200
    +      |order by key, value
    +      |LIMIT 5""".stripMargin)
    +
    +  createQueryTest(
    +    "(unrelated)WHERE clause with NOT IN #7",
    +    """select *
    +      |  from src b
    +      |where b.key not in
    +      |(select count(*)
    +      |  from src a
    +      |  where a.key > 100
    +      |) and b.key < 200
    +      |order by key, value
    +      |LIMIT 5""".stripMargin)
    +
    +  createQueryTest(
    +    "(correlated)WHERE clause with IN #1",
    +    """select *
    +      |from src b
    +      |where b.key in
    +      |        (select a.key
    +      |         from src a
    +      |         where b.value = a.value and a.key > 9
    +      |        )
    +      |order by key, value
    +      |LIMIT 5""".stripMargin)
    +
    +  createQueryTest(
    +    "(correlated)WHERE clause with NOT IN #1",
    +    """select *
    +    |from src b
    +    |where b.key not in
    +    |        (select a.key
    +    |         from src a
    +    |         where b.value = a.value and a.key > 9
    +    |        )
    +    |order by key, value
    +    |LIMIT 5"""
    +      .stripMargin)
    +
    +  createQueryTest(
    +    "(correlated)WHERE clause with IN #2",
    +    """select *
    +    |from src b
    +    |where b.key in
    +    |        (select a.key
    +    |         from src a
    +    |         where b.value = a.value and a.key > 9
    +    |        )
    +    |and b.key > 15
    +    |order by key, value
    +    |LIMIT 5""".stripMargin)
    +
    +  createQueryTest(
    +    "(correlated)WHERE clause with NOT IN #2",
    +    """select *
    +    |from src b
    +    |where b.key not in
    +    |        (select a.key
    +    |         from src a
    +    |         where b.value = a.value and a.key > 9
    +    |        )
    +    |and b.key > 15
    +    |order by key, value
    +    |LIMIT 5""".
    +      stripMargin)
    +
    +  createQueryTest(
    +    "(correlated)WHERE clause with EXISTS #1",
    +    """select *
    +    |from src b
    +    |where EXISTS
    +    |        (select a.key
    +    |         from src a
    +    |         where b.value = a.value and a.key > 9
    +    |        )
    +    |order by key, value
    +    |LIMIT 5""".
    +      stripMargin)
    +
    +  createQueryTest(
    +    "(correlated)WHERE clause with NOT EXISTS #1",
    +    """select *
    +    |from src b
    +    |where NOT EXISTS
    +    |        (select a.key
    +    |         from src a
    +    |         where b.value = a.value and a.key > 9
    +    |        )
    +    |order by key, value
    +    |LIMIT 5""".stripMargin)
    +
    +  createQueryTest(
    +    "(correlated)WHERE clause with EXISTS #2",
    +    """select *
    +    |from src b
    +    |where EXISTS
    +    |        (select a.key
    +    |         from src a
    +    |         where b.value = a.value and a.key > 9
    +    |        )
    +    |and b.key > 15
    +    |order by key, value
    +    |LIMIT 5""".
    +      stripMargin)
    +
    +  createQueryTest(
    +    "(correlated)WHERE clause with NOT EXISTS #2",
    +    """select *
    +    |from src b
    +    |where NOT EXISTS
    +    |        (select a.key % 291
    +    |         from src a
    +    |         where b.value = a.value and a.key > 9
    +    |        )
    +    |and b.key > 15
    +    |order by key, value
    +    |LIMIT 5""".
    +      stripMargin)
    +}
    +
    +class SemiJoinHashJoin extends HiveSemiJoinSuite {
    +  override def beforeAll(): Unit = {
    +    super.beforeAll()
    +    enableSortMerge(false)
    +    enableTungsten(false)
    +    enableCodeGen(false)
    +  }
    +}
    +
    +class SemiJoinHashJoinTungsten extends HiveSemiJoinSuite {
    +  override def beforeAll(): Unit = {
    +    super.beforeAll()
    +    enableSortMerge(false)
    +    enableTungsten(true)
    +    enableCodeGen(true)
    +  }
    +}
    +
    +class SemiJoinSortMerge extends HiveSemiJoinSuite {
    +  override def beforeAll(): Unit = {
    +    super.beforeAll()
    +    enableSortMerge(true)
    +    enableTungsten(false)
    +    enableCodeGen(false)
    +  }
    +}
    +
    +class SemiJoinSortMergeTungsten extends HiveSemiJoinSuite {
    --- End diff --
    
    Oh, I am not so sure either, but more unit tests are harmless, just in case people add sort merge semi join support in the future.




[GitHub] spark pull request: [SPARK-4226][SQL]Add subquery (not) in/exists ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9055#issuecomment-148592465
  
    Merged build finished. Test PASSed.




[GitHub] spark pull request: [SPARK-4226][SQL]Add subquery (not) in/exists ...

Posted by chenghao-intel <gi...@git.apache.org>.
Github user chenghao-intel commented on a diff in the pull request:

    https://github.com/apache/spark/pull/9055#discussion_r42583623
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ---
    @@ -270,6 +273,146 @@ class Analyzer(
       }
     
       /**
    +   * Rewrite the [[Exists]] [[In]] with left semi join or anti join.
    +   */
    +  object RewriteFilterSubQuery extends Rule[LogicalPlan] with PredicateHelper {
    +    def unapply(condition: Expression): Option[(Expression, Seq[Expression])] = {
    +      if (condition.resolved == false) {
    +        return None
    +      }
    +
    +      val conjuctions = splitConjunctivePredicates(condition).map(_ transformDown {
    +          // Remove the Cast expression for SubQueryExpression.
    +          case Cast(f: SubQueryExpression, BooleanType) => f
    +        }
    +      )
    +
    +      val (subqueries, others) = conjuctions.partition(c => c.isInstanceOf[SubQueryExpression])
    +      if (subqueries.isEmpty) {
    +        None
    +      } else if (subqueries.length > 1) {
    +        throw new AnalysisException(
    +          s"Only 1 SubQuery expression is supported, but we got $subqueries")
    +      } else {
    +        val subQueryExpr = subqueries(0).asInstanceOf[SubQueryExpression]
    +        // try to resolve the subquery
    +
    +        val subquery = Analyzer.this.execute(subQueryExpr.subquery) match {
    +          case Distinct(child) => child // Distinct is useless for semi join, ignore it.
    --- End diff --
    
    For example, `select value from src where key in (select distinct key from src where key > 10)`. Anyway, I will add more doc.
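    
    A small sketch of why `DISTINCT` can be ignored here, using plain collections rather than Spark (the names `inSubquery`, `outer`, and `subquery` are made up for illustration): a semi join only asks whether a matching key exists, so duplicates in the subquery output cannot change the result.
    
    ```scala
    object DistinctSemiJoinSketch {
      // IN-subquery as a semi join: keep each outer key that appears in the subquery output.
      def inSubquery(outer: Seq[Int], subquery: Seq[Int]): Seq[Int] = {
        val keys = subquery.toSet // a membership test ignores duplicates
        outer.filter(keys.contains)
      }
    }
    ```
    
    Calling `inSubquery(outer, sub)` and `inSubquery(outer, sub.distinct)` gives the same rows for any inputs.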




[GitHub] spark pull request: [SPARK-4226][SQL]Add subquery (not) in/exists ...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the pull request:

    https://github.com/apache/spark/pull/9055#issuecomment-153920042
  
    @jameszhouyi 
    Agreed. This is an important feature for any SQL engine, and we are also waiting for it. So far, using joins is the workaround.




[GitHub] spark pull request: [SPARK-4226][SQL]Add subquery (not) in/exists ...

Posted by chenghao-intel <gi...@git.apache.org>.
Github user chenghao-intel commented on the pull request:

    https://github.com/apache/spark/pull/9055#issuecomment-168104208
  
    ok, closing it now




[GitHub] spark pull request: [SPARK-4226][SQL]Add subquery (not) in/exists ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9055#issuecomment-147066963
  
     Merged build triggered.




[GitHub] spark pull request: [SPARK-4226][SQL]Add subquery (not) in/exists ...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the pull request:

    https://github.com/apache/spark/pull/9055#issuecomment-153857289
  
    @jameszhouyi 
    We hit the same issue. For now, we work around it by using joins.
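    
    As an illustration of such a join workaround (a sketch on plain collections, not Spark code; `Row` and `inViaJoin` are hypothetical names): an inner join against the de-duplicated subquery keys returns the same rows as `IN`.
    
    ```scala
    object JoinWorkaroundSketch {
      final case class Row(key: Int, value: String)
    
      // Emulates: SELECT a.* FROM a JOIN (SELECT DISTINCT key FROM b) t ON a.key = t.key
      def inViaJoin(a: Seq[Row], bKeys: Seq[Int]): Seq[Row] =
        for {
          row <- a
          k <- bKeys.distinct // without DISTINCT, duplicate keys would duplicate rows
          if row.key == k
        } yield row
    }
    ```
    
    Unlike a semi join, the inner join needs the explicit de-duplication to avoid repeating outer rows.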




[GitHub] spark pull request: [SPARK-4226][SQL]Add subquery (not) in/exists ...

Posted by jameszhouyi <gi...@git.apache.org>.
Github user jameszhouyi commented on the pull request:

    https://github.com/apache/spark/pull/9055#issuecomment-153918005
  
    Thank you @gatorsmile for your suggestion. 
    I think this feature (the "IN" subquery) is necessary for Spark SQL as a SQL-on-Hadoop engine.




[GitHub] spark pull request: [SPARK-4226][SQL]Add subquery (not) in/exists ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9055#issuecomment-148592469
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/43826/




[GitHub] spark pull request: [SPARK-4226][SQL]Add subquery (not) in/exists ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/9055#issuecomment-148326579
  
      [Test build #43782 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/43782/console) for   PR 9055 at commit [`7511f47`](https://github.com/apache/spark/commit/7511f47089ed58f913a81df2113cbe300903be63).
     * This patch **fails MiMa tests**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `trait SubQueryExpression extends Unevaluable `
      * `case class Exists(subquery: LogicalPlan, positive: Boolean)`
      * `case class InSubquery(child: Expression, subquery: LogicalPlan, positive: Boolean)`





[GitHub] spark pull request: [SPARK-4226][SQL]Add subquery (not) in/exists ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9055#issuecomment-148326617
  
    Merged build finished. Test FAILed.




[GitHub] spark pull request: [SPARK-4226][SQL]Add subquery (not) in/exists ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/9055#issuecomment-148592066
  
      [Test build #43826 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/43826/console) for   PR 9055 at commit [`7511f47`](https://github.com/apache/spark/commit/7511f47089ed58f913a81df2113cbe300903be63).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `trait SubQueryExpression extends Unevaluable `
      * `case class Exists(subquery: LogicalPlan, positive: Boolean)`
      * `case class InSubquery(child: Expression, subquery: LogicalPlan, positive: Boolean)`





[GitHub] spark pull request: [SPARK-4226][SQL]Add subquery (not) in/exists ...

Posted by chenghao-intel <gi...@git.apache.org>.
Github user chenghao-intel commented on a diff in the pull request:

    https://github.com/apache/spark/pull/9055#discussion_r42593628
  
    --- Diff: sql/hive/compatibility/src/test/scala/org/apache/spark/sql/hive/execution/HiveCompatibilitySuite.scala ---
    @@ -265,6 +265,32 @@ class HiveCompatibilitySuite extends HiveQueryFileTest with BeforeAndAfter {
         // the answer is sensitive for jdk version
         "udf_java_method",
     
    +    // TODO Hive window function cannot be serialized in generating the golden file
    +    "subquery_in",
    +    // As we don't support the outer UDAF function used in the correlated query, combined with
    +    // outer having clause: e.g.:
    +    //    select b.key, min(b.value)
    +    //    from src b
    +    //    group by b.key
    +    //    having exists ( select a.key
    +    //    from src a
    +    //      where a.value > 'val_9' and a.value = min(b.value)
    +    //    )
    +    // It throws exception like
    +    // "cannot resolve 'b.value' given input columns key, _c1, key, value;"
    +    // As the outer aggregation doesn't output the field 'value, we need rule
    +    // for further resovling the having expressions.
    +    "subquery_notin_having",
    +    "subquery_exists_having",
    --- End diff --
    
    I mean the case where the attribute referenced in the aggregation function is not explicitly provided by the outer query.




[GitHub] spark pull request: [SPARK-4226][SQL]Add subquery (not) in/exists ...

Posted by yhuai <gi...@git.apache.org>.
Github user yhuai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/9055#discussion_r42581809
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/joinTypes.scala ---
    @@ -47,4 +48,9 @@ case object RightOuter extends JoinType
     
     case object FullOuter extends JoinType
     
    -case object LeftSemi extends JoinType
    +abstract class LeftSemiJoin extends JoinType
    --- End diff --
    
    Why have an abstract class here? Also, `LeftSemiJoin` does not look like a good name, since `LeftAnti` extends it and `LeftAnti` is a different concept.
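    
    One possible shape for the hierarchy, sketched here as an assumption rather than as what the patch does: keep `LeftSemi` and `LeftAnti` as sibling `JoinType`s and share logic through a separate extractor. The names `LeftOnlyOutput` and `outputsLeftOnly` are hypothetical, and this is a standalone model, not the Spark source.
    
    ```scala
    // Hypothetical, self-contained model of the join type hierarchy (not the actual Spark source).
    sealed abstract class JoinType
    case object Inner extends JoinType
    case object LeftOuter extends JoinType
    case object RightOuter extends JoinType
    case object FullOuter extends JoinType
    case object LeftSemi extends JoinType
    case object LeftAnti extends JoinType

    // Hypothetical extractor: matches the two join types that only output left-side rows.
    object LeftOnlyOutput {
      def unapply(joinType: JoinType): Option[JoinType] = joinType match {
        case LeftSemi | LeftAnti => Some(joinType)
        case _ => None
      }
    }

    object JoinTypeSketch {
      // Shared logic can pattern match on the extractor instead of a common superclass.
      def outputsLeftOnly(joinType: JoinType): Boolean = joinType match {
        case LeftOnlyOutput(_) => true
        case _ => false
      }
    }
    ```
    
    With this layout neither join type has to extend the other, which avoids the naming problem raised above.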




[GitHub] spark pull request: [SPARK-4226][SQL]Add subquery (not) in/exists ...

Posted by chenghao-intel <gi...@git.apache.org>.
Github user chenghao-intel commented on the pull request:

    https://github.com/apache/spark/pull/9055#issuecomment-153921494
  
    Unfortunately, this will probably miss Spark 1.6, since 1.6 is almost at code freeze. @rxin @yhuai




[GitHub] spark pull request: [SPARK-4226][SQL]Add subquery (not) in/exists ...

Posted by chenghao-intel <gi...@git.apache.org>.
Github user chenghao-intel commented on the pull request:

    https://github.com/apache/spark/pull/9055#issuecomment-149908301
  
    BTW: IN / NOT IN definitely support uncorrelated subqueries, but EXISTS / NOT EXISTS do not in this case, which is the same behavior as Hive.
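    
    For reference, the semi/anti join semantics that the rewrite targets can be sketched with plain Scala collections. This is only an illustration under the assumption that no NULLs are involved (SQL `NOT IN` treats NULLs differently from a plain anti join); `Row`, `leftSemi`, and `leftAnti` are hypothetical names, not Spark APIs.
    
    ```scala
    object SubqueryJoinSketch {
      // Hypothetical row type standing in for the `src` table (key, value).
      final case class Row(key: Int, value: String)

      // Left semi join: keep each left row that has at least one match on the right.
      // A left row is emitted at most once, even if several right rows match.
      def leftSemi(left: Seq[Row], right: Seq[Row])(cond: (Row, Row) => Boolean): Seq[Row] =
        left.filter(l => right.exists(r => cond(l, r)))

      // Left anti join: keep each left row that has no match on the right.
      def leftAnti(left: Seq[Row], right: Seq[Row])(cond: (Row, Row) => Boolean): Seq[Row] =
        left.filterNot(l => right.exists(r => cond(l, r)))
    }
    ```
    
    Under these assumptions, `key IN (subquery)` corresponds to `leftSemi` with an equality condition on `key`, and the negated form corresponds to `leftAnti`.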




[GitHub] spark pull request: [SPARK-4226][SQL]Add subquery (not) in/exists ...

Posted by yhuai <gi...@git.apache.org>.
Github user yhuai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/9055#discussion_r42581787
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ---
    @@ -270,6 +273,146 @@ class Analyzer(
       }
     
       /**
    +   * Rewrite the [[Exists]] [[In]] with left semi join or anti join.
    +   */
    +  object RewriteFilterSubQuery extends Rule[LogicalPlan] with PredicateHelper {
    +    def unapply(condition: Expression): Option[(Expression, Seq[Expression])] = {
    +      if (condition.resolved == false) {
    +        return None
    +      }
    +
    +      val conjuctions = splitConjunctivePredicates(condition).map(_ transformDown {
    +          // Remove the Cast expression for SubQueryExpression.
    +          case Cast(f: SubQueryExpression, BooleanType) => f
    +        }
    +      )
    +
    +      val (subqueries, others) = conjuctions.partition(c => c.isInstanceOf[SubQueryExpression])
    +      if (subqueries.isEmpty) {
    +        None
    +      } else if (subqueries.length > 1) {
    +        throw new AnalysisException(
    +          s"Only 1 SubQuery expression is supported, but we got $subqueries")
    +      } else {
    +        val subQueryExpr = subqueries(0).asInstanceOf[SubQueryExpression]
    +        // try to resolve the subquery
    +
    +        val subquery = Analyzer.this.execute(subQueryExpr.subquery) match {
    +          case Distinct(child) => child // Distinct is useless for semi join, ignore it.
    +          case other => other
    +        }
    +        Some((subQueryExpr.withNewSubQuery(subquery), others))
    +      }
    +    }
    +
    +    def apply(plan: LogicalPlan): LogicalPlan = plan transformUp {
    --- End diff --
    
    Can you add a comment to explain why we need to explicitly use `transformUp`?
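    
    As background for the question, the difference between bottom-up and top-down rewriting can be shown on a toy tree. This is a minimal sketch, not Catalyst's `TreeNode` implementation; `Node`, `transformUp`, `transformDown`, and `visitOrder` here are simplified stand-ins defined only for illustration.
    
    ```scala
    object TransformOrderSketch {
      // Hypothetical minimal tree node; real Catalyst plans carry much more state.
      final case class Node(name: String, children: Seq[Node] = Nil)

      // Bottom-up: rewrite the children first, then apply the rule to the rebuilt parent.
      def transformUp(node: Node)(rule: PartialFunction[Node, Node]): Node = {
        val rebuilt = node.copy(children = node.children.map(transformUp(_)(rule)))
        rule.applyOrElse(rebuilt, identity[Node])
      }

      // Top-down: apply the rule to the parent first, then recurse into the result's children.
      def transformDown(node: Node)(rule: PartialFunction[Node, Node]): Node = {
        val rewritten = rule.applyOrElse(node, identity[Node])
        rewritten.copy(children = rewritten.children.map(transformDown(_)(rule)))
      }

      // Records the order in which an identity rule sees the nodes.
      def visitOrder(root: Node, bottomUp: Boolean): List[String] = {
        val seen = scala.collection.mutable.ListBuffer.empty[String]
        val rule: PartialFunction[Node, Node] = { case n => seen += n.name; n }
        if (bottomUp) transformUp(root)(rule) else transformDown(root)(rule)
        seen.toList
      }
    }
    ```
    
    With a bottom-up traversal, a rule applied to a parent node sees children that have already been rewritten, which matters when a rule depends on its children being in a particular state.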




[GitHub] spark pull request: [SPARK-4226][SQL]Add subquery (not) in/exists ...

Posted by scwf <gi...@git.apache.org>.
Github user scwf commented on the pull request:

    https://github.com/apache/spark/pull/9055#issuecomment-147272550
  
    What's the difference from #4812?




[GitHub] spark pull request: [SPARK-4226][SQL]Add subquery (not) in/exists ...

Posted by yhuai <gi...@git.apache.org>.
Github user yhuai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/9055#discussion_r42581793
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ---
    @@ -270,6 +273,146 @@ class Analyzer(
       }
     
       /**
    +   * Rewrite the [[Exists]] [[In]] with left semi join or anti join.
    +   */
    +  object RewriteFilterSubQuery extends Rule[LogicalPlan] with PredicateHelper {
    +    def unapply(condition: Expression): Option[(Expression, Seq[Expression])] = {
    +      if (condition.resolved == false) {
    +        return None
    +      }
    +
    +      val conjuctions = splitConjunctivePredicates(condition).map(_ transformDown {
    +          // Remove the Cast expression for SubQueryExpression.
    +          case Cast(f: SubQueryExpression, BooleanType) => f
    +        }
    +      )
    +
    +      val (subqueries, others) = conjuctions.partition(c => c.isInstanceOf[SubQueryExpression])
    +      if (subqueries.isEmpty) {
    +        None
    +      } else if (subqueries.length > 1) {
    +        throw new AnalysisException(
    +          s"Only 1 SubQuery expression is supported, but we got $subqueries")
    +      } else {
    +        val subQueryExpr = subqueries(0).asInstanceOf[SubQueryExpression]
    +        // try to resolve the subquery
    +
    +        val subquery = Analyzer.this.execute(subQueryExpr.subquery) match {
    +          case Distinct(child) => child // Distinct is useless for semi join, ignore it.
    +          case other => other
    +        }
    +        Some((subQueryExpr.withNewSubQuery(subquery), others))
    +      }
    +    }
    +
    +    def apply(plan: LogicalPlan): LogicalPlan = plan transformUp {
    +      case f if f.childrenResolved == false => f
    +
    +      case f @ Filter(RewriteFilterSubQuery(subquery, others), left) =>
    +        subquery match {
    +          case Exists(Project(_, Filter(condition, right)), positive) =>
    +            checkAnalysis(right)
    +            if (condition.resolved) {
    +              // Apparently, it should be not resolved here, since EXIST should be correlated.
    +              throw new AnalysisException(
    +                s"Exist clause should be correlated, but we got $condition")
    +            }
    +            Join(others.reduceOption(And).map(Filter(_, left)).getOrElse(left), right,
    +              if (positive) LeftSemi else LeftAnti,
    +              Some(ResolveReferences.tryResolveAttributes(condition, right)))
    +
    +          case Exists(right, positive) =>
    +            throw new AnalysisException(s"Exist clause should be correlated, but we got $right")
    +
    +          case InSubquery(key, Project(projectList, Filter(condition, right)), positive) =>
    --- End diff --
    
    Is this pattern too restrictive?




[GitHub] spark pull request: [SPARK-4226][SQL]Add subquery (not) in/exists ...

Posted by chenghao-intel <gi...@git.apache.org>.
Github user chenghao-intel commented on a diff in the pull request:

    https://github.com/apache/spark/pull/9055#discussion_r42583439
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ---
    @@ -270,6 +273,146 @@ class Analyzer(
       }
     
       /**
    +   * Rewrite the [[Exists]] [[In]] with left semi join or anti join.
    +   */
    --- End diff --
    
    Yes, I will add a more detailed description for this rule.




[GitHub] spark pull request: [SPARK-4226][SQL]Add subquery (not) in/exists ...

Posted by marmbrus <gi...@git.apache.org>.
Github user marmbrus commented on the pull request:

    https://github.com/apache/spark/pull/9055#issuecomment-154154884
  
    Yeah, sorry.  It is too late for a patch this large.




[GitHub] spark pull request: [SPARK-4226][SQL]Add subquery (not) in/exists ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/9055#issuecomment-147019821
  
      [Test build #43506 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/43506/consoleFull) for   PR 9055 at commit [`e3aa255`](https://github.com/apache/spark/commit/e3aa2553cc3eeb78f8bd15a5f97ccd97032bf954).




[GitHub] spark pull request: [SPARK-4226][SQL]Add subquery (not) in/exists ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9055#issuecomment-148320902
  
    Merged build started.




[GitHub] spark pull request: [SPARK-4226][SQL]Add subquery (not) in/exists ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9055#issuecomment-147030424
  
    Merged build finished. Test FAILed.




[GitHub] spark pull request: [SPARK-4226][SQL]Add subquery (not) in/exists ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9055#issuecomment-149941345
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44064/
    Test PASSed.




[GitHub] spark pull request: [SPARK-4226][SQL]Add subquery (not) in/exists ...

Posted by chenghao-intel <gi...@git.apache.org>.
Github user chenghao-intel commented on a diff in the pull request:

    https://github.com/apache/spark/pull/9055#discussion_r42586165
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ---
    @@ -408,6 +552,25 @@ class Analyzer(
             }
         }
     
    +    // Try to resolve the attributes from the given logical plan
    +    def tryResolveAttributes(expr: Expression, q: LogicalPlan): Expression = {
    +      checkAnalysis(q)
    +      val projection = Project(q.output, q)
    +
    +      logTrace(s"Attempting to resolve ${expr.simpleString}")
    +      expr transformUp  {
    +        case u @ UnresolvedAlias(expr) => expr
    +        case u @ UnresolvedAttribute(nameParts) =>
    +          // Leave unchanged if resolution fails.  Hopefully will be resolved next round.
    +          val result =
    +            withPosition(u) { projection.resolveChildren(nameParts, resolver).getOrElse(u) }
    +          logDebug(s"Resolving $u to $result")
    +          result
    +        case UnresolvedExtractValue(child, fieldExpr) if child.resolved =>
    +          ExtractValue(child, fieldExpr, resolver)
    +      }
    +    }
    +
    --- End diff --
    
    I copied this rule from `ResolveReferences` with slight changes. Maybe we should keep it alongside `ResolveReferences`; otherwise people will probably forget to update it when the attribute resolution logic changes.
    
    I was planning to move this code into `LogicalPlan`, but I need to think more about how to share it with the `ResolveReferences` rule. How about leaving that for a later improvement?
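    
    The idea behind `tryResolveAttributes` (resolve what can be resolved against a plan's output and leave the rest untouched) can be sketched on a toy expression tree. This is an illustration under stated assumptions, not the Catalyst code: `Expr`, `Unresolved`, `Resolved`, `EqualTo`, and `tryResolve` are hypothetical names, and ambiguity handling is reduced to "leave unresolved".
    
    ```scala
    object ResolveSketch {
      // Hypothetical miniature expression tree.
      sealed trait Expr
      final case class Unresolved(name: String) extends Expr
      final case class Resolved(name: String, exprId: Int) extends Expr
      final case class EqualTo(left: Expr, right: Expr) extends Expr

      // A resolver decides whether two names refer to the same attribute.
      type Resolver = (String, String) => Boolean
      val caseInsensitive: Resolver = (a, b) => a.equalsIgnoreCase(b)

      // Replace an unresolved name only when exactly one output attribute matches;
      // anything else is left unchanged so a later pass can try again.
      def tryResolve(expr: Expr, output: Seq[Resolved], resolver: Resolver): Expr = expr match {
        case u @ Unresolved(name) =>
          output.filter(a => resolver(a.name, name)) match {
            case Seq(single) => single
            case _ => u
          }
        case EqualTo(l, r) =>
          EqualTo(tryResolve(l, output, resolver), tryResolve(r, output, resolver))
        case other => other
      }
    }
    ```
    
    In a correlated condition, the attributes that remain unresolved after this step are the ones that would have to come from the outer query.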




[GitHub] spark pull request: [SPARK-4226][SQL]Add subquery (not) in/exists ...

Posted by yhuai <gi...@git.apache.org>.
Github user yhuai commented on the pull request:

    https://github.com/apache/spark/pull/9055#issuecomment-168101565
  
    @chenghao-intel How about we close this PR for now?




[GitHub] spark pull request: [SPARK-4226][SQL]Add subquery (not) in/exists ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9055#issuecomment-149898000
  
    Merged build started.




[GitHub] spark pull request: [SPARK-4226][SQL]Add subquery (not) in/exists ...

Posted by yhuai <gi...@git.apache.org>.
Github user yhuai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/9055#discussion_r42581817
  
    --- Diff: sql/hive/compatibility/src/test/scala/org/apache/spark/sql/hive/execution/HiveCompatibilitySuite.scala ---
    @@ -265,6 +265,32 @@ class HiveCompatibilitySuite extends HiveQueryFileTest with BeforeAndAfter {
         // the answer is sensitive for jdk version
         "udf_java_method",
     
    +    // TODO Hive window function cannot be serialized in generating the golden file
    +    "subquery_in",
    +    // As we don't support the outer UDAF function used in the correlated query, combined with
    +    // outer having clause: e.g.:
    +    //    select b.key, min(b.value)
    +    //    from src b
    +    //    group by b.key
    +    //    having exists ( select a.key
    +    //    from src a
    +    //      where a.value > 'val_9' and a.value = min(b.value)
    +    //    )
    +    // It throws exception like
    +    // "cannot resolve 'b.value' given input columns key, _c1, key, value;"
    +    // As the outer aggregation doesn't output the field 'value, we need rule
    +    // for further resovling the having expressions.
    +    "subquery_notin_having",
    +    "subquery_exists_having",
    --- End diff --
    
    Does this mean that this PR needs more work?




[GitHub] spark pull request: [SPARK-4226][SQL]Add subquery (not) in/exists ...

Posted by yhuai <gi...@git.apache.org>.
Github user yhuai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/9055#discussion_r42581788
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ---
    @@ -270,6 +273,146 @@ class Analyzer(
       }
     
       /**
    +   * Rewrite the [[Exists]] [[In]] with left semi join or anti join.
    +   */
    +  object RewriteFilterSubQuery extends Rule[LogicalPlan] with PredicateHelper {
    +    def unapply(condition: Expression): Option[(Expression, Seq[Expression])] = {
    +      if (condition.resolved == false) {
    +        return None
    +      }
    +
    +      val conjuctions = splitConjunctivePredicates(condition).map(_ transformDown {
    +          // Remove the Cast expression for SubQueryExpression.
    +          case Cast(f: SubQueryExpression, BooleanType) => f
    +        }
    +      )
    +
    +      val (subqueries, others) = conjuctions.partition(c => c.isInstanceOf[SubQueryExpression])
    +      if (subqueries.isEmpty) {
    +        None
    +      } else if (subqueries.length > 1) {
    +        throw new AnalysisException(
    +          s"Only 1 SubQuery expression is supported, but we got $subqueries")
    +      } else {
    +        val subQueryExpr = subqueries(0).asInstanceOf[SubQueryExpression]
    +        // try to resolve the subquery
    +
    +        val subquery = Analyzer.this.execute(subQueryExpr.subquery) match {
    +          case Distinct(child) => child // Distinct is useless for semi join, ignore it.
    +          case other => other
    +        }
    +        Some((subQueryExpr.withNewSubQuery(subquery), others))
    +      }
    +    }
    +
    +    def apply(plan: LogicalPlan): LogicalPlan = plan transformUp {
    +      case f if f.childrenResolved == false => f
    +
    +      case f @ Filter(RewriteFilterSubQuery(subquery, others), left) =>
    +        subquery match {
    +          case Exists(Project(_, Filter(condition, right)), positive) =>
    --- End diff --
    
    Is the pattern `Project(_, Filter(condition, right))` too restrictive?
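    
    To make the concern concrete, here is a toy model of the shape check. It is a sketch with hypothetical plan classes (`Relation`, `Filter`, `Project`, `Limit`) and a hypothetical helper `matchesRewriteShape`, not the Catalyst operators themselves; it only shows which subquery shapes a `Project(_, Filter(_, _))` pattern accepts.
    
    ```scala
    object PatternShapeSketch {
      // Hypothetical miniature logical plan.
      sealed trait Plan
      final case class Relation(name: String) extends Plan
      final case class Filter(condition: String, child: Plan) extends Plan
      final case class Project(columns: Seq[String], child: Plan) extends Plan
      final case class Limit(n: Int, child: Plan) extends Plan

      // Mirrors the structure of the pattern in the diff: a Project directly on top of a Filter.
      def matchesRewriteShape(subquery: Plan): Boolean = subquery match {
        case Project(_, Filter(_, _)) => true
        case _ => false
      }
    }
    ```
    
    In this model, a subquery without a WHERE clause, or one with any extra operator above the `Project`, falls through to the default case.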




[GitHub] spark pull request: [SPARK-4226][SQL]Add subquery (not) in/exists ...

Posted by yhuai <gi...@git.apache.org>.
Github user yhuai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/9055#discussion_r42581805
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/unresolved.scala ---
    @@ -263,3 +263,50 @@ case class UnresolvedAlias(child: Expression)
     
       override lazy val resolved = false
     }
    +
    +trait SubQueryExpression extends Unevaluable {
    +  def subquery: LogicalPlan
    +
    +  override def dataType: DataType = BooleanType
    +  override def foldable: Boolean = false
    +  override def nullable: Boolean = false
    +
    +  def withNewSubQuery(newSubquery: LogicalPlan): this.type
    +}
    +
    +/**
    + * Exist subquery expression, only used in filter only
    + */
    +case class Exists(subquery: LogicalPlan, positive: Boolean)
    +  extends LeafExpression with SubQueryExpression {
    --- End diff --
    
    format




[GitHub] spark pull request: [SPARK-4226][SQL]Add subquery (not) in/exists ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9055#issuecomment-148562591
  
    Merged build finished. Test FAILed.




[GitHub] spark pull request: [SPARK-4226][SQL]Add subquery (not) in/exists ...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the pull request:

    https://github.com/apache/spark/pull/9055#issuecomment-168152168
  
    Found a related Hive JIRA about supporting the left anti join: https://issues.apache.org/jira/browse/HIVE-12519
    
    However, their proposed solution has a hole. Anyway, if we can support the anti join at run time, it will be much more efficient.





[GitHub] spark pull request: [SPARK-4226][SQL]Add subquery (not) in/exists ...

Posted by yhuai <gi...@git.apache.org>.
Github user yhuai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/9055#discussion_r42581802
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ---
    @@ -408,6 +552,25 @@ class Analyzer(
             }
         }
     
    +    // Try to resolve the attributes from the given logical plan
    +    def tryResolveAttributes(expr: Expression, q: LogicalPlan): Expression = {
    +      checkAnalysis(q)
    +      val projection = Project(q.output, q)
    +
    +      logTrace(s"Attempting to resolve ${expr.simpleString}")
    +      expr transformUp  {
    +        case u @ UnresolvedAlias(expr) => expr
    +        case u @ UnresolvedAttribute(nameParts) =>
    +          // Leave unchanged if resolution fails.  Hopefully will be resolved next round.
    +          val result =
    +            withPosition(u) { projection.resolveChildren(nameParts, resolver).getOrElse(u) }
    +          logDebug(s"Resolving $u to $result")
    +          result
    +        case UnresolvedExtractValue(child, fieldExpr) if child.resolved =>
    +          ExtractValue(child, fieldExpr, resolver)
    +      }
    +    }
    +
    --- End diff --
    
    Should we just use the resolve method of a logical plan?




[GitHub] spark pull request: [SPARK-4226][SQL]Add subquery (not) in/exists ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9055#issuecomment-148563286
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/43822/
    Test FAILed.




[GitHub] spark pull request: [SPARK-4226][SQL]Add subquery (not) in/exists ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9055#issuecomment-148560924
  
    Merged build started.




[GitHub] spark pull request: [SPARK-4226][SQL]Add subquery (not) in/exists ...

Posted by yhuai <gi...@git.apache.org>.
Github user yhuai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/9055#discussion_r42581797
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ---
    @@ -270,6 +273,146 @@ class Analyzer(
       }
     
       /**
    +   * Rewrite the [[Exists]] [[In]] with left semi join or anti join.
    +   */
    +  object RewriteFilterSubQuery extends Rule[LogicalPlan] with PredicateHelper {
    +    def unapply(condition: Expression): Option[(Expression, Seq[Expression])] = {
    +      if (condition.resolved == false) {
    +        return None
    +      }
    +
    +      val conjuctions = splitConjunctivePredicates(condition).map(_ transformDown {
    +          // Remove the Cast expression for SubQueryExpression.
    +          case Cast(f: SubQueryExpression, BooleanType) => f
    +        }
    +      )
    +
    +      val (subqueries, others) = conjuctions.partition(c => c.isInstanceOf[SubQueryExpression])
    +      if (subqueries.isEmpty) {
    +        None
    +      } else if (subqueries.length > 1) {
    +        throw new AnalysisException(
    +          s"Only 1 SubQuery expression is supported, but we got $subqueries")
    +      } else {
    +        val subQueryExpr = subqueries(0).asInstanceOf[SubQueryExpression]
    +        // try to resolve the subquery
    +
    +        val subquery = Analyzer.this.execute(subQueryExpr.subquery) match {
    +          case Distinct(child) => child // Distinct is useless for semi join, ignore it.
    +          case other => other
    +        }
    +        Some((subQueryExpr.withNewSubQuery(subquery), others))
    +      }
    +    }
    +
    +    def apply(plan: LogicalPlan): LogicalPlan = plan transformUp {
    +      case f if f.childrenResolved == false => f
    +
    +      case f @ Filter(RewriteFilterSubQuery(subquery, others), left) =>
    +        subquery match {
    +          case Exists(Project(_, Filter(condition, right)), positive) =>
    +            checkAnalysis(right)
    +            if (condition.resolved) {
    +              // Apparently, it should be not resolved here, since EXIST should be correlated.
    +              throw new AnalysisException(
    +                s"Exist clause should be correlated, but we got $condition")
    +            }
    +            Join(others.reduceOption(And).map(Filter(_, left)).getOrElse(left), right,
    +              if (positive) LeftSemi else LeftAnti,
    +              Some(ResolveReferences.tryResolveAttributes(condition, right)))
    +
    +          case Exists(right, positive) =>
    +            throw new AnalysisException(s"Exist clause should be correlated, but we got $right")
    +
    +          case InSubquery(key, Project(projectList, Filter(condition, right)), positive) =>
    +            checkAnalysis(right)
    +            if (projectList.length != 1) {
    +              throw new AnalysisException(
    +                s"Expect only 1 projection in In Subquery Expression, but we got $projectList")
    --- End diff --
    
    Looks like we need to say that the subquery should only generate a single column. Also, we need to mention the number of columns that are generated by the subquery.
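    
    One possible shape for such a message, as a sketch only. `checkSingleColumn` is a hypothetical helper that takes column names as strings; it is not part of the PR or of Catalyst.

    ```scala
    object SubqueryColumnCheck {
      // Hypothetical helper: returns an error message when an IN subquery does
      // not produce exactly one column, and None when the column count is fine.
      def checkSingleColumn(projectList: Seq[String]): Option[String] =
        if (projectList.length == 1) None
        else Some(
          "IN subquery must return exactly 1 column, but it returns " +
            s"${projectList.length}: ${projectList.mkString(", ")}")
    }
    ```

    The message states both the expectation and the actual column count, so the user can see what to change in the subquery's select list.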




[GitHub] spark pull request: [SPARK-4226][SQL]Add subquery (not) in/exists ...

Posted by chenghao-intel <gi...@git.apache.org>.
Github user chenghao-intel commented on a diff in the pull request:

    https://github.com/apache/spark/pull/9055#discussion_r42583413
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ---
    @@ -69,6 +71,7 @@ class Analyzer(
           WindowsSubstitution ::
           Nil : _*),
         Batch("Resolution", fixedPoint,
    +      RewriteFilterSubQuery ::
    --- End diff --
    
    I didn't test placing it last, but that should not be a problem. Is there any concern about this?
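    
    A toy illustration of why position inside a fixed-point batch may not change the final plan when a rule simply waits for its input to be resolved. `Plan`, `Rule`, `execute` and the two rules are hypothetical stand-ins, not Catalyst's `RuleExecutor`.

    ```scala
    object FixedPointSketch {
      type Plan = List[String]
      type Rule = Plan => Plan

      // Apply every rule in order, then repeat until the plan stops changing
      // or the iteration budget runs out.
      @annotation.tailrec
      def execute(plan: Plan, rules: Seq[Rule], maxIterations: Int = 100): Plan = {
        val next = rules.foldLeft(plan)((p, rule) => rule(p))
        if (next == plan || maxIterations <= 1) next
        else execute(next, rules, maxIterations - 1)
      }

      // `rewrite` only fires on nodes that `resolve` has already handled.
      val resolve: Rule = _.map(s => if (s == "unresolved") "resolved" else s)
      val rewrite: Rule = _.map(s => if (s == "resolved") "rewritten" else s)
    }
    ```

    In this sketch both orderings converge to the same plan; putting `rewrite` first only costs one extra pass. Whether that holds for the real rule depends on how it interacts with the other resolution rules, which the sketch does not model.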




[GitHub] spark pull request: [SPARK-4226][SQL]Add subquery (not) in/exists ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9055#issuecomment-147074068
  
    Merged build finished. Test FAILed.




[GitHub] spark pull request: [SPARK-4226][SQL]Add subquery (not) in/exists ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/9055#issuecomment-147267270
  
      [Test build #43552 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/43552/consoleFull) for   PR 9055 at commit [`ab22171`](https://github.com/apache/spark/commit/ab22171a437497eaa26f55ca661cfc3dcb478b71).




[GitHub] spark pull request: [SPARK-4226][SQL]Add subquery (not) in/exists ...

Posted by chenghao-intel <gi...@git.apache.org>.
Github user chenghao-intel commented on a diff in the pull request:

    https://github.com/apache/spark/pull/9055#discussion_r42583551
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ---
    @@ -270,6 +273,146 @@ class Analyzer(
       }
     
       /**
    +   * Rewrite the [[Exists]] [[In]] with left semi join or anti join.
    +   */
    +  object RewriteFilterSubQuery extends Rule[LogicalPlan] with PredicateHelper {
    +    def unapply(condition: Expression): Option[(Expression, Seq[Expression])] = {
    +      if (condition.resolved == false) {
    +        return None
    +      }
    +
    +      val conjuctions = splitConjunctivePredicates(condition).map(_ transformDown {
    +          // Remove the Cast expression for SubQueryExpression.
    +          case Cast(f: SubQueryExpression, BooleanType) => f
    +        }
    +      )
    +
    +      val (subqueries, others) = conjuctions.partition(c => c.isInstanceOf[SubQueryExpression])
    +      if (subqueries.isEmpty) {
    +        None
    +      } else if (subqueries.length > 1) {
    +        throw new AnalysisException(
    +          s"Only 1 SubQuery expression is supported, but we got $subqueries")
    --- End diff --
    
    Ideally, we should support multiple in/exists subqueries, but I don't want to make this PR huge. I am planning to do that in follow-up PRs.
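    
    A self-contained sketch of the restriction described above: flatten the filter condition into conjuncts, separate the subquery conjuncts from the rest, and reject more than one. The `Expr` hierarchy and `extract` are hypothetical simplifications, not the Catalyst classes from the diff.

    ```scala
    object ConjunctSplitSketch {
      sealed trait Expr
      final case class And(left: Expr, right: Expr) extends Expr
      final case class Pred(sql: String) extends Expr      // ordinary predicate
      final case class SubQuery(sql: String) extends Expr  // IN / EXISTS placeholder

      // Flatten nested ANDs into a flat list of conjuncts.
      def splitConjuncts(e: Expr): Seq[Expr] = e match {
        case And(l, r) => splitConjuncts(l) ++ splitConjuncts(r)
        case other => Seq(other)
      }

      // Right(None): no subquery conjunct. Right(Some(...)): exactly one,
      // returned with the remaining conjuncts. Left(msg): more than one.
      def extract(e: Expr): Either[String, Option[(SubQuery, Seq[Expr])]] = {
        val (subs, others) = splitConjuncts(e).partition(_.isInstanceOf[SubQuery])
        subs match {
          case Seq() => Right(None)
          case Seq(s: SubQuery) => Right(Some((s, others)))
          case many =>
            Left(s"Only 1 subquery expression is supported, but got ${many.length}")
        }
      }
    }
    ```

    Note that only top-level conjuncts are inspected, so a subquery nested under OR or NOT would not be picked up by this simplified version.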




[GitHub] spark pull request: [SPARK-4226][SQL]Add subquery (not) in/exists ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9055#issuecomment-147019704
  
     Merged build triggered.




[GitHub] spark pull request: [SPARK-4226][SQL]Add subquery (not) in/exists ...

Posted by maver1ck <gi...@git.apache.org>.
Github user maver1ck commented on the pull request:

    https://github.com/apache/spark/pull/9055#issuecomment-164912029
  
    So what's next?




[GitHub] spark pull request: [SPARK-4226][SQL]Add subquery (not) in/exists ...

Posted by chenghao-intel <gi...@git.apache.org>.
Github user chenghao-intel commented on the pull request:

    https://github.com/apache/spark/pull/9055#issuecomment-148560878
  
    Seems not related.




[GitHub] spark pull request: [SPARK-4226][SQL]Add subquery (not) in/exists ...

Posted by yhuai <gi...@git.apache.org>.
Github user yhuai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/9055#discussion_r42581792
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ---
    @@ -270,6 +273,146 @@ class Analyzer(
       }
     
       /**
    +   * Rewrite the [[Exists]] [[In]] with left semi join or anti join.
    +   */
    +  object RewriteFilterSubQuery extends Rule[LogicalPlan] with PredicateHelper {
    +    def unapply(condition: Expression): Option[(Expression, Seq[Expression])] = {
    +      if (condition.resolved == false) {
    +        return None
    +      }
    +
    +      val conjuctions = splitConjunctivePredicates(condition).map(_ transformDown {
    +          // Remove the Cast expression for SubQueryExpression.
    +          case Cast(f: SubQueryExpression, BooleanType) => f
    +        }
    +      )
    +
    +      val (subqueries, others) = conjuctions.partition(c => c.isInstanceOf[SubQueryExpression])
    +      if (subqueries.isEmpty) {
    +        None
    +      } else if (subqueries.length > 1) {
    +        throw new AnalysisException(
    +          s"Only 1 SubQuery expression is supported, but we got $subqueries")
    +      } else {
    +        val subQueryExpr = subqueries(0).asInstanceOf[SubQueryExpression]
    +        // try to resolve the subquery
    +
    +        val subquery = Analyzer.this.execute(subQueryExpr.subquery) match {
    +          case Distinct(child) => child // Distinct is useless for semi join, ignore it.
    +          case other => other
    +        }
    +        Some((subQueryExpr.withNewSubQuery(subquery), others))
    +      }
    +    }
    +
    +    def apply(plan: LogicalPlan): LogicalPlan = plan transformUp {
    +      case f if f.childrenResolved == false => f
    +
    +      case f @ Filter(RewriteFilterSubQuery(subquery, others), left) =>
    +        subquery match {
    +          case Exists(Project(_, Filter(condition, right)), positive) =>
    +            checkAnalysis(right)
    +            if (condition.resolved) {
    +              // Apparently, it should be not resolved here, since EXIST should be correlated.
    +              throw new AnalysisException(
    +                s"Exist clause should be correlated, but we got $condition")
    +            }
    +            Join(others.reduceOption(And).map(Filter(_, left)).getOrElse(left), right,
    +              if (positive) LeftSemi else LeftAnti,
    +              Some(ResolveReferences.tryResolveAttributes(condition, right)))
    +
    +          case Exists(right, positive) =>
    +            throw new AnalysisException(s"Exist clause should be correlated, but we got $right")
    --- End diff --
    
    It is not super clear what this error message means. What should users do? When will we hit this case?
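    
    For context on when the uncorrelated case arises, a small sketch over plain Scala collections. `Row`, `correlatedExists` and `uncorrelatedExists` are hypothetical names; this models SQL semantics loosely and is not the analyzer code.

    ```scala
    object ExistsSketch {
      final case class Row(key: Int, value: String)

      // Correlated EXISTS: the inner condition refers to the outer row,
      // so the answer can differ per outer row (this maps to a semi join).
      def correlatedExists(outer: Seq[Row], inner: Seq[Row])(
          cond: (Row, Row) => Boolean): Seq[Row] =
        outer.filter(o => inner.exists(i => cond(o, i)))

      // Uncorrelated EXISTS: the inner condition never looks at the outer row,
      // so the predicate is constant and the result is all rows or no rows.
      def uncorrelatedExists(outer: Seq[Row], inner: Seq[Row])(
          cond: Row => Boolean): Seq[Row] =
        if (inner.exists(cond)) outer else Seq.empty
    }
    ```

    A query such as `where exists (select 1 from src a where a.value > 'val_9')`, with no reference to the outer table, would fall into the second case, so an error message could say that and suggest adding a condition that references the outer query.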




[GitHub] spark pull request: [SPARK-4226][SQL]Add subquery (not) in/exists ...

Posted by chenghao-intel <gi...@git.apache.org>.
Github user chenghao-intel commented on a diff in the pull request:

    https://github.com/apache/spark/pull/9055#discussion_r42625267
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/joins/BroadcastLeftSemiJoinHash.scala ---
    @@ -70,7 +77,10 @@ case class BroadcastLeftSemiJoinHash(
                   InternalAccumulator.PEAK_EXECUTION_MEMORY).add(unsafe.getUnsafeSize)
               case _ =>
             }
    -        hashSemiJoin(streamIter, numLeftRows, hashedRelation, numOutputRows)
    +        sj match {
    +          case LeftSemi => hashSemiJoin(streamIter, numLeftRows, hashedRelation, numOutputRows)
    +          case LeftAnti => hashAntiJoin(streamIter, numLeftRows, hashedRelation, numOutputRows)
    --- End diff --
    
    OK, but let's leave it for now until we feel it's ready to be merged; otherwise it will probably be conflict-prone.




[GitHub] spark pull request: [SPARK-4226][SQL]Add subquery (not) in/exists ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9055#issuecomment-148562593
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/43819/
    Test FAILed.




[GitHub] spark pull request: [SPARK-4226][SQL]Add subquery (not) in/exists ...

Posted by chenghao-intel <gi...@git.apache.org>.
Github user chenghao-intel commented on a diff in the pull request:

    https://github.com/apache/spark/pull/9055#discussion_r42583339
  
    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveQl.scala ---
    @@ -1485,14 +1490,39 @@ https://cwiki.apache.org/confluence/display/Hive/Enhanced+Aggregation%2C+Cube%2C
       val BETWEEN = "(?i)BETWEEN".r
       val WHEN = "(?i)WHEN".r
       val CASE = "(?i)CASE".r
    -
    -  protected def nodeToExpr(node: Node): Expression = node match {
    +  val EXISTS = "(?i)EXISTS".r
    +
    +  protected def nodeToExpr(node: Node, context: Context): Expression = node match {
    --- End diff --
    
    We don't use the `context` directly in this PR. However, `def nodeToPlan(..)` needs the `context`, and this implementation adds 2 extra expressions that take a `LogicalPlan` as a parameter, which means `nodeToExpr` will call `nodeToPlan()` and has to pass the `context` down. Otherwise I would have to pass `null` to `nodeToPlan()`, which is probably even more confusing and error-prone.
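    
    A stripped-down sketch of that call chain. Everything here (`Context`, the `Node`, `Plan` and `Expr` types, and both function bodies) is a hypothetical stand-in for the HiveQl parser types; it only shows why the expression converter needs the parameter once expressions can contain plans.

    ```scala
    object ContextThreadingSketch {
      // Hypothetical parsing context: table name -> column names.
      final case class Context(tables: Map[String, Seq[String]])

      sealed trait Node
      final case class TableRef(name: String) extends Node
      final case class ExistsNode(query: Node) extends Node
      final case class Literal(v: String) extends Node

      sealed trait Plan
      final case class Relation(name: String, columns: Seq[String]) extends Plan

      sealed trait Expr
      final case class Exists(subquery: Plan) extends Expr
      final case class Const(v: String) extends Expr

      // The plan converter is the one that actually reads the context.
      def nodeToPlan(node: Node, context: Context): Plan = node match {
        case TableRef(name) => Relation(name, context.tables.getOrElse(name, Nil))
        case other => throw new IllegalArgumentException(s"Not a plan node: $other")
      }

      // The expression converter only forwards the context, because an
      // EXISTS expression wraps a plan that must be converted as well.
      def nodeToExpr(node: Node, context: Context): Expr = node match {
        case ExistsNode(query) => Exists(nodeToPlan(query, context))
        case Literal(v) => Const(v)
        case other => throw new IllegalArgumentException(s"Not an expression node: $other")
      }
    }
    ```

    Threading the parameter through keeps `nodeToPlan` free of a `null` special case, at the cost of a wider `nodeToExpr` signature.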




[GitHub] spark pull request: [SPARK-4226][SQL]Add subquery (not) in/exists ...

Posted by yhuai <gi...@git.apache.org>.
Github user yhuai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/9055#discussion_r42581799
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ---
    @@ -270,6 +273,146 @@ class Analyzer(
       }
     
       /**
    +   * Rewrite the [[Exists]] [[In]] with left semi join or anti join.
    +   */
    +  object RewriteFilterSubQuery extends Rule[LogicalPlan] with PredicateHelper {
    +    def unapply(condition: Expression): Option[(Expression, Seq[Expression])] = {
    +      if (condition.resolved == false) {
    +        return None
    +      }
    +
    +      val conjuctions = splitConjunctivePredicates(condition).map(_ transformDown {
    +          // Remove the Cast expression for SubQueryExpression.
    +          case Cast(f: SubQueryExpression, BooleanType) => f
    +        }
    +      )
    +
    +      val (subqueries, others) = conjuctions.partition(c => c.isInstanceOf[SubQueryExpression])
    +      if (subqueries.isEmpty) {
    +        None
    +      } else if (subqueries.length > 1) {
    +        throw new AnalysisException(
    +          s"Only 1 SubQuery expression is supported, but we got $subqueries")
    +      } else {
    +        val subQueryExpr = subqueries(0).asInstanceOf[SubQueryExpression]
    +        // try to resolve the subquery
    +
    +        val subquery = Analyzer.this.execute(subQueryExpr.subquery) match {
    +          case Distinct(child) => child // Distinct is useless for semi join, ignore it.
    +          case other => other
    +        }
    +        Some((subQueryExpr.withNewSubQuery(subquery), others))
    +      }
    +    }
    +
    +    def apply(plan: LogicalPlan): LogicalPlan = plan transformUp {
    +      case f if f.childrenResolved == false => f
    +
    +      case f @ Filter(RewriteFilterSubQuery(subquery, others), left) =>
    +        subquery match {
    +          case Exists(Project(_, Filter(condition, right)), positive) =>
    +            checkAnalysis(right)
    +            if (condition.resolved) {
    +              // Apparently, it should be not resolved here, since EXIST should be correlated.
    +              throw new AnalysisException(
    +                s"Exist clause should be correlated, but we got $condition")
    +            }
    +            Join(others.reduceOption(And).map(Filter(_, left)).getOrElse(left), right,
    +              if (positive) LeftSemi else LeftAnti,
    +              Some(ResolveReferences.tryResolveAttributes(condition, right)))
    +
    +          case Exists(right, positive) =>
    +            throw new AnalysisException(s"Exist clause should be correlated, but we got $right")
    +
    +          case InSubquery(key, Project(projectList, Filter(condition, right)), positive) =>
    +            checkAnalysis(right)
    +            if (projectList.length != 1) {
    +              throw new AnalysisException(
    +                s"Expect only 1 projection in In Subquery Expression, but we got $projectList")
    +            } else {
    +              val rightKey = ResolveReferences.tryResolveAttributes(projectList(0), right)
    --- End diff --
    
    Why do we need to manually resolve attributes here?




[GitHub] spark pull request: [SPARK-4226][SQL]Add subquery (not) in/exists ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/9055#issuecomment-148571269
  
      [Test build #43826 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/43826/consoleFull) for   PR 9055 at commit [`7511f47`](https://github.com/apache/spark/commit/7511f47089ed58f913a81df2113cbe300903be63).




[GitHub] spark pull request: [SPARK-4226][SQL]Add subquery (not) in/exists ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9055#issuecomment-148561446
  
     Merged build triggered.




[GitHub] spark pull request: [SPARK-4226][SQL]Add subquery (not) in/exists ...

Posted by chenghao-intel <gi...@git.apache.org>.
Github user chenghao-intel commented on a diff in the pull request:

    https://github.com/apache/spark/pull/9055#discussion_r42583480
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ---
    @@ -270,6 +273,146 @@ class Analyzer(
       }
     
       /**
    +   * Rewrite the [[Exists]] [[In]] with left semi join or anti join.
    +   */
    +  object RewriteFilterSubQuery extends Rule[LogicalPlan] with PredicateHelper {
    +    def unapply(condition: Expression): Option[(Expression, Seq[Expression])] = {
    +      if (condition.resolved == false) {
    +        return None
    +      }
    +
    +      val conjuctions = splitConjunctivePredicates(condition).map(_ transformDown {
    +          // Remove the Cast expression for SubQueryExpression.
    +          case Cast(f: SubQueryExpression, BooleanType) => f
    +        }
    +      )
    +
    +      val (subqueries, others) = conjuctions.partition(c => c.isInstanceOf[SubQueryExpression])
    --- End diff --
    
    This is not supported. As the code right below shows, it throws an exception if this happens.




[GitHub] spark pull request: [SPARK-4226][SQL]Add subquery (not) in/exists ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/9055#issuecomment-147020843
  
      [Test build #43506 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/43506/console) for   PR 9055 at commit [`e3aa255`](https://github.com/apache/spark/commit/e3aa2553cc3eeb78f8bd15a5f97ccd97032bf954).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `trait SubQueryExpression extends Unevaluable `
      * `case class Exists(subquery: LogicalPlan, positive: Boolean)`
      * `case class InSubquery(child: Expression, subquery: LogicalPlan, positive: Boolean)`





[GitHub] spark pull request: [SPARK-4226][SQL]Add subquery (not) in/exists ...

Posted by yhuai <gi...@git.apache.org>.
Github user yhuai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/9055#discussion_r42581780
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ---
    @@ -69,6 +71,7 @@ class Analyzer(
           WindowsSubstitution ::
           Nil : _*),
         Batch("Resolution", fixedPoint,
    +      RewriteFilterSubQuery ::
    --- End diff --
    
    This rule does not need to go first, right?




[GitHub] spark pull request: [SPARK-4226][SQL]Add subquery (not) in/exists ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9055#issuecomment-148560908
  
     Merged build triggered.




[GitHub] spark pull request: [SPARK-4226][SQL]Add subquery (not) in/exists ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/9055#issuecomment-147030367
  
      [Test build #43508 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/43508/console) for   PR 9055 at commit [`b382bc9`](https://github.com/apache/spark/commit/b382bc96f4b5f301df714dc888b9cce5f0f201d6).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `trait SubQueryExpression extends Unevaluable `
      * `case class Exists(subquery: LogicalPlan, positive: Boolean)`
      * `case class InSubquery(child: Expression, subquery: LogicalPlan, positive: Boolean)`





[GitHub] spark pull request: [SPARK-4226][SQL]Add subquery (not) in/exists ...

Posted by chenghao-intel <gi...@git.apache.org>.
Github user chenghao-intel commented on the pull request:

    https://github.com/apache/spark/pull/9055#issuecomment-147263349
  
    The failure seems unrelated to this change.
    retest this please




[GitHub] spark pull request: [SPARK-4226][SQL]Add subquery (not) in/exists ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9055#issuecomment-147278139
  
    Merged build finished. Test PASSed.




[GitHub] spark pull request: [SPARK-4226][SQL]Add subquery (not) in/exists ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9055#issuecomment-147021470
  
    Merged build started.




[GitHub] spark pull request: [SPARK-4226][SQL]Add subquery (not) in/exists ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9055#issuecomment-147074069
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/43528/




[GitHub] spark pull request: [SPARK-4226][SQL]Add subquery (not) in/exists ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/9055#issuecomment-147021836
  
      [Test build #43508 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/43508/consoleFull) for   PR 9055 at commit [`b382bc9`](https://github.com/apache/spark/commit/b382bc96f4b5f301df714dc888b9cce5f0f201d6).




[GitHub] spark pull request: [SPARK-4226][SQL]Add subquery (not) in/exists ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/9055#issuecomment-149941094
  
    **[Test build #44064 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44064/consoleFull)** for PR 9055 at commit [`cb69166`](https://github.com/apache/spark/commit/cb69166b1aad6f43a6cb9d400f7b36155d845367).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `trait SubQueryExpression extends Unevaluable `
      * `case class Exists(subquery: LogicalPlan, positive: Boolean)`
      * `case class InSubquery(child: Expression, subquery: LogicalPlan, positive: Boolean)`




[GitHub] spark pull request: [SPARK-4226][SQL]Add subquery (not) in/exists ...

Posted by yhuai <gi...@git.apache.org>.
Github user yhuai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/9055#discussion_r42581783
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ---
    @@ -270,6 +273,146 @@ class Analyzer(
       }
     
       /**
    +   * Rewrite the [[Exists]] [[In]] with left semi join or anti join.
    +   */
    +  object RewriteFilterSubQuery extends Rule[LogicalPlan] with PredicateHelper {
    +    def unapply(condition: Expression): Option[(Expression, Seq[Expression])] = {
    +      if (condition.resolved == false) {
    +        return None
    +      }
    +
    +      val conjuctions = splitConjunctivePredicates(condition).map(_ transformDown {
    +          // Remove the Cast expression for SubQueryExpression.
    +          case Cast(f: SubQueryExpression, BooleanType) => f
    +        }
    +      )
    +
    +      val (subqueries, others) = conjuctions.partition(c => c.isInstanceOf[SubQueryExpression])
    +      if (subqueries.isEmpty) {
    +        None
    +      } else if (subqueries.length > 1) {
    +        throw new AnalysisException(
    +          s"Only 1 SubQuery expression is supported, but we got $subqueries")
    --- End diff --
    
    The error message is not very user-friendly. I think we need to be more specific about why the query is not supported.
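
One way to make the message more specific is to state the limitation, suggest a rewrite, and list the offending predicates. The following is only a minimal sketch: `SubqueryErrors` and `tooManySubqueries` are hypothetical names that do not appear in the patch, and the predicates are passed as plain strings rather than Catalyst expressions.

```scala
// Hypothetical helper, not part of the patch under review.
object SubqueryErrors {
  // Builds a multi-line message for the case where a single filter
  // condition contains more than one IN/EXISTS subquery predicate.
  def tooManySubqueries(predicates: Seq[String]): String = {
    // Number each offending predicate so the user can locate it.
    val listed = predicates.zipWithIndex.map { case (p, i) => s"  ${i + 1}. $p" }
    (Seq(
      s"Found ${predicates.length} IN/EXISTS subquery predicates in a single filter condition, " +
        "but only one is supported for now.",
      "Split the condition so that each filter has at most one subquery predicate:"
    ) ++ listed).mkString("\n")
  }
}
```

The resulting string could then be passed to the existing `AnalysisException` in place of the interpolated `$subqueries`.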




[GitHub] spark pull request: [SPARK-4226][SQL]Add subquery (not) in/exists ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9055#issuecomment-147020853
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/43506/




[GitHub] spark pull request: [SPARK-4226][SQL]Add subquery (not) in/exists ...

Posted by yhuai <gi...@git.apache.org>.
Github user yhuai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/9055#discussion_r42581789
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ---
    @@ -270,6 +273,146 @@ class Analyzer(
       }
     
       /**
    +   * Rewrite the [[Exists]] [[In]] with left semi join or anti join.
    +   */
    +  object RewriteFilterSubQuery extends Rule[LogicalPlan] with PredicateHelper {
    +    def unapply(condition: Expression): Option[(Expression, Seq[Expression])] = {
    +      if (condition.resolved == false) {
    +        return None
    +      }
    +
    +      val conjuctions = splitConjunctivePredicates(condition).map(_ transformDown {
    +          // Remove the Cast expression for SubQueryExpression.
    +          case Cast(f: SubQueryExpression, BooleanType) => f
    +        }
    +      )
    +
    +      val (subqueries, others) = conjuctions.partition(c => c.isInstanceOf[SubQueryExpression])
    +      if (subqueries.isEmpty) {
    +        None
    +      } else if (subqueries.length > 1) {
    +        throw new AnalysisException(
    +          s"Only 1 SubQuery expression is supported, but we got $subqueries")
    +      } else {
    +        val subQueryExpr = subqueries(0).asInstanceOf[SubQueryExpression]
    +        // try to resolve the subquery
    +
    +        val subquery = Analyzer.this.execute(subQueryExpr.subquery) match {
    +          case Distinct(child) => child // Distinct is useless for semi join, ignore it.
    +          case other => other
    +        }
    +        Some((subQueryExpr.withNewSubQuery(subquery), others))
    +      }
    +    }
    +
    +    def apply(plan: LogicalPlan): LogicalPlan = plan transformUp {
    +      case f if f.childrenResolved == false => f
    +
    +      case f @ Filter(RewriteFilterSubQuery(subquery, others), left) =>
    +        subquery match {
    +          case Exists(Project(_, Filter(condition, right)), positive) =>
    +            checkAnalysis(right)
    +            if (condition.resolved) {
    +              // Apparently, it should be not resolved here, since EXIST should be correlated.
    --- End diff --
    
    Why? An EXISTS subquery can be uncorrelated, right?
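
For reference, an uncorrelated EXISTS does not depend on the outer row, so it evaluates to the same boolean for every outer row: either all outer rows pass or none do. A minimal sketch of that semantics on plain Scala collections follows; `UncorrelatedExists` and `filterExists` are hypothetical names used only for illustration, not Catalyst API.

```scala
// Hypothetical model of
//   SELECT * FROM outer WHERE EXISTS (SELECT ... FROM inner WHERE p)
// where the predicate p references no column of the outer query.
object UncorrelatedExists {
  def filterExists[A, B](outer: Seq[A], inner: Seq[B])(p: B => Boolean): Seq[A] = {
    // The subquery result is independent of the outer row: evaluate it once.
    val subqueryNonEmpty = inner.exists(p)
    if (subqueryNonEmpty) outer else Seq.empty
  }
}
```

A rewrite rule that assumes the subquery condition is always unresolved (that is, always correlated) would not cover this case.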




[GitHub] spark pull request: [SPARK-4226][SQL]Add subquery (not) in/exists ...

Posted by yhuai <gi...@git.apache.org>.
Github user yhuai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/9055#discussion_r42581813
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/joins/BroadcastLeftSemiJoinHash.scala ---
    @@ -70,7 +77,10 @@ case class BroadcastLeftSemiJoinHash(
                   InternalAccumulator.PEAK_EXECUTION_MEMORY).add(unsafe.getUnsafeSize)
               case _ =>
             }
    -        hashSemiJoin(streamIter, numLeftRows, hashedRelation, numOutputRows)
    +        sj match {
    +          case LeftSemi => hashSemiJoin(streamIter, numLeftRows, hashedRelation, numOutputRows)
    +          case LeftAnti => hashAntiJoin(streamIter, numLeftRows, hashedRelation, numOutputRows)
    --- End diff --
    
    I think we need to rename the operator if we want to introduce the anti join logic here.
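
The dispatch in the diff above can be illustrated on plain collections: a left semi join keeps the left rows whose key has a match on the right, and a left anti join keeps the rows without one. This is a minimal sketch under stated assumptions: `SemiAntiJoinSketch`, `JoinKind` and `join` are hypothetical names, the right side is reduced to a set of keys, and SQL null semantics (which matter for NOT IN) are not modeled.

```scala
// Hypothetical sketch, not the operator from the patch.
object SemiAntiJoinSketch {
  sealed trait JoinKind
  case object LeftSemi extends JoinKind
  case object LeftAnti extends JoinKind

  // `rightKeys` plays the role of the hashed (broadcast) relation.
  def join[K, L](kind: JoinKind, left: Seq[L], rightKeys: Set[K])(key: L => K): Seq[L] =
    kind match {
      // Semi join: emit left rows that have at least one match.
      case LeftSemi => left.filter(row => rightKeys.contains(key(row)))
      // Anti join: emit left rows that have no match.
      case LeftAnti => left.filterNot(row => rightKeys.contains(key(row)))
    }
}
```

Because one operator now serves both join types, a name that mentions only "LeftSemi" no longer describes what it does.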




[GitHub] spark pull request: [SPARK-4226][SQL]Add subquery (not) in/exists ...

Posted by yhuai <gi...@git.apache.org>.
Github user yhuai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/9055#discussion_r42581785
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ---
    @@ -270,6 +273,146 @@ class Analyzer(
       }
     
       /**
    +   * Rewrite the [[Exists]] [[In]] with left semi join or anti join.
    +   */
    +  object RewriteFilterSubQuery extends Rule[LogicalPlan] with PredicateHelper {
    +    def unapply(condition: Expression): Option[(Expression, Seq[Expression])] = {
    +      if (condition.resolved == false) {
    +        return None
    +      }
    +
    +      val conjuctions = splitConjunctivePredicates(condition).map(_ transformDown {
    +          // Remove the Cast expression for SubQueryExpression.
    +          case Cast(f: SubQueryExpression, BooleanType) => f
    +        }
    +      )
    +
    +      val (subqueries, others) = conjuctions.partition(c => c.isInstanceOf[SubQueryExpression])
    +      if (subqueries.isEmpty) {
    +        None
    +      } else if (subqueries.length > 1) {
    +        throw new AnalysisException(
    +          s"Only 1 SubQuery expression is supported, but we got $subqueries")
    +      } else {
    +        val subQueryExpr = subqueries(0).asInstanceOf[SubQueryExpression]
    +        // try to resolve the subquery
    +
    +        val subquery = Analyzer.this.execute(subQueryExpr.subquery) match {
    +          case Distinct(child) => child // Distinct is useless for semi join, ignore it.
    --- End diff --
    
    Can you add a comment to explain when we will get a `Distinct`?
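
As background for the code comment in the diff: a left semi join only asks whether at least one matching right row exists, so duplicates on the right side cannot change its result. The sketch below demonstrates this on plain Scala collections; `DistinctUnderSemiJoin` and `leftSemi` are hypothetical names, and the sketch does not show where Catalyst actually produces a `Distinct` node.

```scala
// Hypothetical illustration, not Catalyst code.
object DistinctUnderSemiJoin {
  // Left semi join on equality: keep each left row that appears on the right.
  def leftSemi[A](left: Seq[A], right: Seq[A]): Seq[A] =
    left.filter(l => right.contains(l))

  // Same join, but with the right side deduplicated first.
  def leftSemiOnDistinct[A](left: Seq[A], right: Seq[A]): Seq[A] =
    leftSemi(left, right.distinct)
}
```

Both functions return the same rows for any input, which is why the rule can drop the `Distinct` before building the join.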




[GitHub] spark pull request: [SPARK-4226][SQL]Add subquery (not) in/exists ...

Posted by chenghao-intel <gi...@git.apache.org>.
Github user chenghao-intel commented on the pull request:

    https://github.com/apache/spark/pull/9055#issuecomment-147019968
  
    cc @marmbrus @yhuai @ravipesala
    This implementation is inspired by #3249 and its follow-up #4812, and uses `SubQueryExpression`.
    
    Since the anti join is another kind of `SEMI JOIN`, I added it back here out of performance concerns when transforming "NOT EXISTS / NOT IN" subqueries.




[GitHub] spark pull request: [SPARK-4226][SQL]Add subquery (not) in/exists ...

Posted by yhuai <gi...@git.apache.org>.
Github user yhuai commented on the pull request:

    https://github.com/apache/spark/pull/9055#issuecomment-168101552
  
    I had an offline discussion with @chenghao-intel. We will split this PR into smaller PRs. The first will cover the backend operators. Then we will add the parser and analyzer rules.




[GitHub] spark pull request: [SPARK-4226][SQL]Add subquery (not) in/exists ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9055#issuecomment-147066967
  
    Merged build started.




[GitHub] spark pull request: [SPARK-4226][SQL]Add subquery (not) in/exists ...

Posted by chenghao-intel <gi...@git.apache.org>.
Github user chenghao-intel commented on the pull request:

    https://github.com/apache/spark/pull/9055#issuecomment-148570097
  
    retest this please




[GitHub] spark pull request: [SPARK-4226][SQL]Add subquery (not) in/exists ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9055#issuecomment-147266665
  
    Merged build started.




[GitHub] spark pull request: [SPARK-4226][SQL]Add subquery (not) in/exists ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/9055#issuecomment-147067234
  
      [Test build #43528 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/43528/consoleFull) for   PR 9055 at commit [`ab22171`](https://github.com/apache/spark/commit/ab22171a437497eaa26f55ca661cfc3dcb478b71).




[GitHub] spark pull request: [SPARK-4226][SQL]Add subquery (not) in/exists ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9055#issuecomment-147266659
  
     Merged build triggered.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-4226][SQL]Add subquery (not) in/exists ...

Posted by jameszhouyi <gi...@git.apache.org>.
Github user jameszhouyi commented on the pull request:

    https://github.com/apache/spark/pull/9055#issuecomment-152941016
  
    Hi @yhuai ,
    This missing feature ("IN" subquery) in Spark SQL is blocking our real-world use case. Could you please help review this PR? We strongly hope this feature can be merged into Spark 1.6.0 (I saw that the Hive implementation supports it). Thanks in advance!




[GitHub] spark pull request: [SPARK-4226][SQL]Add subquery (not) in/exists ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9055#issuecomment-149897980
  
     Merged build triggered.




[GitHub] spark pull request: [SPARK-4226][SQL]Add subquery (not) in/exists ...

Posted by chenghao-intel <gi...@git.apache.org>.
Github user chenghao-intel commented on the pull request:

    https://github.com/apache/spark/pull/9055#issuecomment-149907407
  
    Thank you @yhuai for reviewing this.
    I've added some more docs for this PR; hopefully it makes more sense now.
    
    First, I agree with you that we should have general logic to partially resolve the correlated condition within the subquery, but it's probably not that easy, particularly because we need to give a more concise error message to the end user. My suggestion is to leave it as a future improvement; we will probably have a better idea of how to simplify it once enough features are supported by the follow-up PRs (see my TODO in the description). For now, the limited patterns actually work for most cases.
    
    Second, I totally agree with the join type comments (LeftSemiJoin <-> LeftSemi <-> LeftAnti). My motivation for making a parent class for LeftSemi / LeftAnti is to reduce the code changes in `Optimizer` and `SparkStrategies`; maybe I should rename the parent class to `LeftSemiOrAntiJoin`. The same goes for the operators' names, since they no longer support only `LeftSemiXXX` but also `LeftAntiXXX`.
    
    Still, I hope we can merge this PR for the 1.6 release, as almost a year has passed since the previous PRs, #3249 and #4812, were created. I will keep updating the code once we have general agreement on the implementation.
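
The parent-class idea from the second point can be sketched as a small sealed hierarchy, so that a rule which only cares that the join outputs left-side rows matches the parent once instead of listing both join types. This is an illustrative sketch, not the code in the PR: `JoinTypeSketch`, the `LeftSemiOrAnti` trait and `outputsLeftOnly` are hypothetical names.

```scala
// Hypothetical hierarchy, loosely modeled on the discussion above.
object JoinTypeSketch {
  sealed trait JoinType
  case object Inner extends JoinType

  // Assumed parent for the two join types whose output is the left side only.
  sealed trait LeftSemiOrAnti extends JoinType
  case object LeftSemi extends LeftSemiOrAnti
  case object LeftAnti extends LeftSemiOrAnti

  // A rule that treats both join types alike matches on the parent trait.
  def outputsLeftOnly(jt: JoinType): Boolean = jt match {
    case _: LeftSemiOrAnti => true
    case _ => false
  }
}
```

Rules that do need to distinguish the two cases can still match on `LeftSemi` and `LeftAnti` individually.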




[GitHub] spark pull request: [SPARK-4226][SQL]Add subquery (not) in/exists ...

Posted by chenghao-intel <gi...@git.apache.org>.
Github user chenghao-intel commented on the pull request:

    https://github.com/apache/spark/pull/9055#issuecomment-148560889
  
    retest this please


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-4226][SQL]Add subquery (not) in/exists ...

Posted by chenghao-intel <gi...@git.apache.org>.
Github user chenghao-intel closed the pull request at:

    https://github.com/apache/spark/pull/9055




[GitHub] spark pull request: [SPARK-4226][SQL]Add subquery (not) in/exists ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9055#issuecomment-147278144
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/43552/




[GitHub] spark pull request: [SPARK-4226][SQL]Add subquery (not) in/exists ...

Posted by scwf <gi...@git.apache.org>.
Github user scwf commented on the pull request:

    https://github.com/apache/spark/pull/9055#issuecomment-147273499
  
    OK, does this support multiple EXISTS and IN subqueries in the WHERE clause?




[GitHub] spark pull request: [SPARK-4226][SQL]Add subquery (not) in/exists ...

Posted by yhuai <gi...@git.apache.org>.
Github user yhuai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/9055#discussion_r42581775
  
    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveQl.scala ---
    @@ -1485,14 +1490,39 @@ https://cwiki.apache.org/confluence/display/Hive/Enhanced+Aggregation%2C+Cube%2C
       val BETWEEN = "(?i)BETWEEN".r
       val WHEN = "(?i)WHEN".r
       val CASE = "(?i)CASE".r
    -
    -  protected def nodeToExpr(node: Node): Expression = node match {
    +  val EXISTS = "(?i)EXISTS".r
    +
    +  protected def nodeToExpr(node: Node, context: Context): Expression = node match {
    --- End diff --
    
    Do we need to pass in `context`? We added `context` to the argument list of `nodeToPlan` to support creating views. We are not expecting a subquery expression to be used for creating a view, right?




[GitHub] spark pull request: [SPARK-4226][SQL]Add subquery (not) in/exists ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/9055#issuecomment-147277801
  
      [Test build #43552 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/43552/console) for   PR 9055 at commit [`ab22171`](https://github.com/apache/spark/commit/ab22171a437497eaa26f55ca661cfc3dcb478b71).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `trait SubQueryExpression extends Unevaluable `
      * `case class Exists(subquery: LogicalPlan, positive: Boolean)`
      * `case class InSubquery(child: Expression, subquery: LogicalPlan, positive: Boolean)`





[GitHub] spark pull request: [SPARK-4226][SQL]Add subquery (not) in/exists ...

Posted by chenghao-intel <gi...@git.apache.org>.
Github user chenghao-intel commented on the pull request:

    https://github.com/apache/spark/pull/9055#issuecomment-147273772
  
    No, we don't support that in this PR, but it should be very easy to support once this PR is merged. I can plan the work if you feel it's critical to your customers.




[GitHub] spark pull request: [SPARK-4226][SQL]Add subquery (not) in/exists ...

Posted by chenghao-intel <gi...@git.apache.org>.
Github user chenghao-intel commented on the pull request:

    https://github.com/apache/spark/pull/9055#issuecomment-147272870
  
    This is much simpler than #4812 because it uses `SubQueryExpression`, particularly when processing the
    `key IN (subquery) AND other_condition` case. #4812 doesn't support the `AND other_condition` part.
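    
    As a rough illustration of that case, here is a minimal sketch over plain Scala collections. `Row`, `SemiJoinSketch`, `semiOrAntiJoin` and `inSubqueryAnd` are hypothetical names made up for this example, not Spark APIs, and the sketch ignores the NULL semantics of `NOT IN`. It only shows the shape of the rewrite: the remaining conjunct becomes an ordinary filter on the left side, and the subquery becomes a left semi (or anti) join.
    
    ```scala
    // Toy rows standing in for the `src` table; this is not Spark's API.
    case class Row(key: Int, value: String)

    object SemiJoinSketch {
      // LEFT SEMI JOIN keeps each left row that has at least one match on the right;
      // LEFT ANTI JOIN (anti = true) keeps each left row that has none.
      def semiOrAntiJoin(left: Seq[Row], rightKeys: Set[Int], anti: Boolean): Seq[Row] =
        left.filter(row => rightKeys.contains(row.key) != anti)

      // `key IN (subquery) AND other_condition`: apply the remaining conjunct as a
      // plain filter on the left side, then semi join against the subquery output.
      def inSubqueryAnd(left: Seq[Row], subquery: Seq[Int], other: Row => Boolean): Seq[Row] =
        semiOrAntiJoin(left.filter(other), subquery.toSet, anti = false)
    }
    ```
    
    For example, with rows keyed 5, 12 and 20, a subquery result of 12, 20 and 99, and the extra condition `key > 15`, only the row with key 20 survives.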




[GitHub] spark pull request: [SPARK-4226][SQL]Add subquery (not) in/exists ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9055#issuecomment-148570278
  
     Merged build triggered.




[GitHub] spark pull request: [SPARK-4226][SQL]Add subquery (not) in/exists ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9055#issuecomment-148320889
  
     Merged build triggered.




[GitHub] spark pull request: [SPARK-4226][SQL]Add subquery (not) in/exists ...

Posted by yhuai <gi...@git.apache.org>.
Github user yhuai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/9055#discussion_r42581865
  
    --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveSemiJoinSuite.scala ---
    @@ -0,0 +1,450 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.sql.hive.execution
    +
    +import org.apache.spark.sql.{SQLConf, AnalysisException}
    +import org.scalatest.BeforeAndAfter
    +
    +import org.apache.spark.sql.hive.test.TestHive
    +
    +/**
    + * A test suite about the IN /NOT IN /EXISTS / NOT EXISTS subquery.
    + */
    +abstract class HiveSemiJoinSuite extends HiveComparisonTest with BeforeAndAfter {
    +  import org.apache.spark.sql.hive.test.TestHive.implicits._
    +  import org.apache.spark.sql.hive.test.TestHive._
    +
    +  private val confSortMerge = TestHive.getConf(SQLConf.SORTMERGE_JOIN)
    +  private val confBroadcastJoin = TestHive.getConf(SQLConf.AUTO_BROADCASTJOIN_THRESHOLD)
    +  private val confCodegen = TestHive.getConf(SQLConf.CODEGEN_ENABLED)
    +  private val confTungsten = TestHive.getConf(SQLConf.TUNGSTEN_ENABLED)
    +
    +  def enableSortMerge(enable: Boolean): Unit = {
    +    TestHive.setConf(SQLConf.SORTMERGE_JOIN, enable)
    +  }
    +
    +  def enableBroadcastJoin(enable: Boolean): Unit = {
    +    if (enable) {
    +      TestHive.setConf(SQLConf.AUTO_BROADCASTJOIN_THRESHOLD, -1)
    +    } else {
    +      TestHive.setConf(SQLConf.AUTO_BROADCASTJOIN_THRESHOLD, Int.MaxValue)
    +    }
    +  }
    +
    +  def enableCodeGen(enable: Boolean): Unit = {
    +    TestHive.setConf(SQLConf.CODEGEN_ENABLED, enable)
    +  }
    +
    +  def enableTungsten(enable: Boolean): Unit = {
    +    TestHive.setConf(SQLConf.TUNGSTEN_ENABLED, enable)
    +  }
    +
    +  override def beforeAll() {
    +    // override this method to update the configuration
    +    TestHive.cacheTables = true
    +  }
    +
    +  override def afterAll() {
    +    // restore the configuration
    +    TestHive.setConf(SQLConf.SORTMERGE_JOIN, confSortMerge)
    +    TestHive.setConf(SQLConf.AUTO_BROADCASTJOIN_THRESHOLD, confBroadcastJoin)
    +    TestHive.setConf(SQLConf.CODEGEN_ENABLED, confCodegen)
    +    TestHive.setConf(SQLConf.TUNGSTEN_ENABLED, confTungsten)
    +    TestHive.cacheTables = false
    +  }
    +
    +  ignore("reference the expression `min(b.value)` that required implicit change the outer query") {
    +    sql("""select b.key, min(b.value)
    +      |from src b
    +      |group by b.key
    +      |having exists ( select a.key
    +      |from src a
    +      |where a.value > 'val_9' and a.value = min(b.value))""".stripMargin)
    +  }
    +
    +  ignore("multiple reference the outer query variables") {
    +    sql("""select key, value, count(*)
    +      |from src b
    +      |group by key, value
    +      |having count(*) in (
    +      |  select count(*)
    +      |  from src s1
    +      |  where s1.key > '9' and s1.value = b.value
    +      |  group by s1.key)""".stripMargin)
    +  }
    +
    +  // IN Subquery Unit tests
    +  createQueryTest("(unrelated)WHERE clause with IN #1",
    +    """select *
    +    |from src
    +    |where key in (select key from src)
    +    |order by key, value LIMIT 5""".stripMargin)
    +
    +  createQueryTest("(unrelated)WHERE clause with NOT IN #1",
    +    """select *
    +    |from src
    +    |where key not in (select key from src)
    +    |order by key, value LIMIT 5""".stripMargin)
    +
    +  createQueryTest("(unrelated)WHERE clause with IN #2",
    +    """select *
    +    |from src
    +    |where src.key in (select t.key from src t)
    +    |order by key, value LIMIT 5""".stripMargin)
    +
    +  createQueryTest("(unrelated)WHERE clause with NOT IN #2",
    +    """select *
    +    |from src
    +    |where src.key not in (select t.key % 193 from src t)
    +    |order by key, value LIMIT 5""".stripMargin)
    +
    +  createQueryTest(
    +    "(unrelated)WHERE clause with IN #3",
    +    """select *
    +      |from src
    +      |where src.key in (select key from src s1 where s1.key > 9)
    +      |order by key, value LIMIT 5""".stripMargin)
    +
    +  createQueryTest(
    +    "(unrelated)WHERE clause with NOT IN #3",
    +    """select *
    +      |from src
    +      |where src.key not in (select key from src s1 where s1.key > 9)
    +      |order by key, value LIMIT 5""".stripMargin)
    +
    +  createQueryTest(
    +    "(unrelated)WHERE clause with IN #4",
    +    """select *
    +      |from src
    +      |where src.key in (select max(s1.key) from src s1 group by s1.value)
    +      |order by key, value LIMIT 5""".stripMargin)
    +
    +  createQueryTest(
    +    "(unrelated)WHERE clause with NOT IN #4",
    +    """select *
    +      |from src
    +      |where src.key not in (select max(s1.key) % 31 from src s1 group by s1.value)
    +      |order by key, value LIMIT 5""".stripMargin)
    +
    +  createQueryTest(
    +    "(unrelated)WHERE clause with IN #5",
    +    """select *
    +      |from src
    +      |where src.key in
    +      |(select max(s1.key) from src s1 group by s1.value having max(s1.key) > 3)
    +      |order by key, value LIMIT 5""".stripMargin)
    +
    +  createQueryTest(
    +    "(unrelated)WHERE clause with NOT IN #5",
    +    """select *
    +      |from src
    +      |where src.key not in
    +      |(select max(s1.key) from src s1 group by s1.value having max(s1.key) > 3)
    +      |order by key, value LIMIT 5""".stripMargin)
    +
    +  createQueryTest(
    +    "(unrelated)WHERE clause with IN #6",
    +    """select *
    +      |from src
    +      |where src.key in
    +      |(select max(s1.key) from src s1 group by s1.value having max(s1.key) > 3)
    +      |      and src.key > 10
    +      |order by key, value LIMIT 5""".stripMargin)
    +
    +  createQueryTest(
    +    "(unrelated)WHERE clause with NOT IN #6",
    +    """select *
    +      |from src
    +      |where src.key not in
    +      |(select max(s1.key) % 31 from src s1 group by s1.value having max(s1.key) > 3)
    +      |      and src.key > 10
    +      |order by key, value LIMIT 5""".stripMargin)
    +
    +  createQueryTest(
    +    "(unrelated)WHERE clause with IN #7",
    +    """select *
    +      |  from src b
    +      |where b.key in
    +      |(select count(*)
    +      |  from src a
    +      |  where a.key > 100
    +      |) and b.key < 200
    +      |order by key, value
    +      |LIMIT 5""".stripMargin)
    +
    +  createQueryTest(
    +    "(unrelated)WHERE clause with NOT IN #7",
    +    """select *
    +      |  from src b
    +      |where b.key not in
    +      |(select count(*)
    +      |  from src a
    +      |  where a.key > 100
    +      |) and b.key < 200
    +      |order by key, value
    +      |LIMIT 5""".stripMargin)
    +
    +  createQueryTest(
    +    "(correlated)WHERE clause with IN #1",
    +    """select *
    +      |from src b
    +      |where b.key in
    +      |        (select a.key
    +      |         from src a
    +      |         where b.value = a.value and a.key > 9
    +      |        )
    +      |order by key, value
    +      |LIMIT 5""".stripMargin)
    +
    +  createQueryTest(
    +    "(correlated)WHERE clause with NOT IN #1",
    +    """select *
    +    |from src b
    +    |where b.key not in
    +    |        (select a.key
    +    |         from src a
    +    |         where b.value = a.value and a.key > 9
    +    |        )
    +    |order by key, value
    +    |LIMIT 5"""
    +      .stripMargin)
    +
    +  createQueryTest(
    +    "(correlated)WHERE clause with IN #2",
    +    """select *
    +    |from src b
    +    |where b.key in
    +    |        (select a.key
    +    |         from src a
    +    |         where b.value = a.value and a.key > 9
    +    |        )
    +    |and b.key > 15
    +    |order by key, value
    +    |LIMIT 5""".stripMargin)
    +
    +  createQueryTest(
    +    "(correlated)WHERE clause with NOT IN #2",
    +    """select *
    +    |from src b
    +    |where b.key not in
    +    |        (select a.key
    +    |         from src a
    +    |         where b.value = a.value and a.key > 9
    +    |        )
    +    |and b.key > 15
    +    |order by key, value
    +    |LIMIT 5""".
    +      stripMargin)
    +
    +  createQueryTest(
    +    "(correlated)WHERE clause with EXISTS #1",
    +    """select *
    +    |from src b
    +    |where EXISTS
    +    |        (select a.key
    +    |         from src a
    +    |         where b.value = a.value and a.key > 9
    +    |        )
    +    |order by key, value
    +    |LIMIT 5""".
    +      stripMargin)
    +
    +  createQueryTest(
    +    "(correlated)WHERE clause with NOT EXISTS #1",
    +    """select *
    +    |from src b
    +    |where NOT EXISTS
    +    |        (select a.key
    +    |         from src a
    +    |         where b.value = a.value and a.key > 9
    +    |        )
    +    |order by key, value
    +    |LIMIT 5""".stripMargin)
    +
    +  createQueryTest(
    +    "(correlated)WHERE clause with EXISTS #2",
    +    """select *
    +    |from src b
    +    |where EXISTS
    +    |        (select a.key
    +    |         from src a
    +    |         where b.value = a.value and a.key > 9
    +    |        )
    +    |and b.key > 15
    +    |order by key, value
    +    |LIMIT 5""".
    +      stripMargin)
    +
    +  createQueryTest(
    +    "(correlated)WHERE clause with NOT EXISTS #2",
    +    """select *
    +    |from src b
    +    |where NOT EXISTS
    +    |        (select a.key % 291
    +    |         from src a
    +    |         where b.value = a.value and a.key > 9
    +    |        )
    +    |and b.key > 15
    +    |order by key, value
    +    |LIMIT 5""".
    +      stripMargin)
    +}
    +
    +class SemiJoinHashJoin extends HiveSemiJoinSuite {
    +  override def beforeAll(): Unit = {
    +    super.beforeAll()
    +    enableSortMerge(false)
    +    enableTungsten(false)
    +    enableCodeGen(false)
    +  }
    +}
    +
    +class SemiJoinHashJoinTungsten extends HiveSemiJoinSuite {
    +  override def beforeAll(): Unit = {
    +    super.beforeAll()
    +    enableSortMerge(false)
    +    enableTungsten(true)
    +    enableCodeGen(true)
    +  }
    +}
    +
    +class SemiJoinSortMerge extends HiveSemiJoinSuite {
    +  override def beforeAll(): Unit = {
    +    super.beforeAll()
    +    enableSortMerge(true)
    +    enableTungsten(false)
    +    enableCodeGen(false)
    +  }
    +}
    +
    +class SemiJoinSortMergeTungsten extends HiveSemiJoinSuite {
    --- End diff --
    
    Sorry, I probably missed something, but we do not have a sort merge semi join, right?




[GitHub] spark pull request: [SPARK-4226][SQL]Add subquery (not) in/exists ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9055#issuecomment-148561456
  
    Merged build started.




[GitHub] spark pull request: [SPARK-4226][SQL]Add subquery (not) in/exists ...

Posted by chenghao-intel <gi...@git.apache.org>.
Github user chenghao-intel commented on a diff in the pull request:

    https://github.com/apache/spark/pull/9055#discussion_r42585365
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ---
    @@ -270,6 +273,146 @@ class Analyzer(
       }
     
       /**
    +   * Rewrite the [[Exists]] [[In]] with left semi join or anti join.
    +   */
    +  object RewriteFilterSubQuery extends Rule[LogicalPlan] with PredicateHelper {
    +    def unapply(condition: Expression): Option[(Expression, Seq[Expression])] = {
    +      if (condition.resolved == false) {
    +        return None
    +      }
    +
    +      val conjuctions = splitConjunctivePredicates(condition).map(_ transformDown {
    +          // Remove the Cast expression for SubQueryExpression.
    +          case Cast(f: SubQueryExpression, BooleanType) => f
    +        }
    +      )
    +
    +      val (subqueries, others) = conjuctions.partition(c => c.isInstanceOf[SubQueryExpression])
    +      if (subqueries.isEmpty) {
    +        None
    +      } else if (subqueries.length > 1) {
    +        throw new AnalysisException(
    +          s"Only 1 SubQuery expression is supported, but we got $subqueries")
    +      } else {
    +        val subQueryExpr = subqueries(0).asInstanceOf[SubQueryExpression]
    +        // try to resolve the subquery
    +
    +        val subquery = Analyzer.this.execute(subQueryExpr.subquery) match {
    +          case Distinct(child) => child // Distinct is useless for semi join, ignore it.
    +          case other => other
    +        }
    +        Some((subQueryExpr.withNewSubQuery(subquery), others))
    +      }
    +    }
    +
    +    def apply(plan: LogicalPlan): LogicalPlan = plan transformUp {
    +      case f if f.childrenResolved == false => f
    +
    +      case f @ Filter(RewriteFilterSubQuery(subquery, others), left) =>
    +        subquery match {
    +          case Exists(Project(_, Filter(condition, right)), positive) =>
    +            checkAnalysis(right)
    +            if (condition.resolved) {
    +              // Apparently, it should be not resolved here, since EXIST should be correlated.
    +              throw new AnalysisException(
    +                s"Exist clause should be correlated, but we got $condition")
    +            }
    +            Join(others.reduceOption(And).map(Filter(_, left)).getOrElse(left), right,
    +              if (positive) LeftSemi else LeftAnti,
    +              Some(ResolveReferences.tryResolveAttributes(condition, right)))
    +
    +          case Exists(right, positive) =>
    +            throw new AnalysisException(s"Exist clause should be correlated, but we got $right")
    +
    +          case InSubquery(key, Project(projectList, Filter(condition, right)), positive) =>
    +            checkAnalysis(right)
    +            if (projectList.length != 1) {
    +              throw new AnalysisException(
    +                s"Expect only 1 projection in In Subquery Expression, but we got $projectList")
    +            } else {
    +              val rightKey = ResolveReferences.tryResolveAttributes(projectList(0), right)
    --- End diff --
    
    This is a good question. It is a workaround for ambiguous references in queries such as:
    `SELECT 'value FROM src WHERE 'key IN (SELECT 'key FROM src b WHERE 'key > 100)`
    Translated literally, the SQL becomes:
    
    `SELECT 'value FROM src LEFT SEMI JOIN src b ON 'key = 'key and 'key > 100`, in which every reference to `'key` is ambiguous!
    
    `ResolveReferences.tryResolveAttributes` partially resolves the project list and the filter condition of the subquery, so what we get looks like:
    `SELECT 'value FROM src LEFT SEMI JOIN src b ON 'key = key#123 and key#123 > 100`
    
    The remaining unresolved attributes are left for the other rules.
    
    Another workable solution is to first complete the aliases for the attributes / relations:
    `SELECT 'value FROM src WHERE 'key IN (SELECT 'key FROM src b WHERE 'key > 100)` =>
    `SELECT 'a.value FROM src a WHERE 'a.key IN (SELECT 'b.key FROM src b WHERE 'b.key > 100)` =>
    `SELECT 'a.value FROM src a LEFT SEMI JOIN src b ON 'a.key = 'b.key and 'b.key > 100`
    But this probably requires more code changes, and it will probably confuse people when they check the generated logical plan.
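    
    The partial resolution described above can be sketched with a toy expression tree. Everything here (`Expr`, `Unresolved`, `Attr`, `PartialResolveSketch`, `tryResolve`, `joinCondition`) is a hypothetical stand-in written for illustration; it is not Catalyst code and not the body of `tryResolveAttributes`. The assumption is simply that the subquery's own expressions are resolved against the right side first, while the outer key is left unresolved for later rules.
    
    ```scala
    // Toy expression tree; the names echo Catalyst concepts but this is not Spark's API.
    sealed trait Expr
    case class Unresolved(name: String) extends Expr      // e.g. 'key
    case class Attr(name: String, id: Int) extends Expr   // e.g. key#123
    case class Lit(value: Int) extends Expr
    case class EqualTo(left: Expr, right: Expr) extends Expr
    case class GreaterThan(left: Expr, right: Expr) extends Expr
    case class And(left: Expr, right: Expr) extends Expr

    object PartialResolveSketch {
      // Resolve only the names that `output` provides; leave everything else untouched.
      def tryResolve(e: Expr, output: Seq[Attr]): Expr = e match {
        case Unresolved(n)     => output.find(_.name == n).getOrElse(e)
        case EqualTo(l, r)     => EqualTo(tryResolve(l, output), tryResolve(r, output))
        case GreaterThan(l, r) => GreaterThan(tryResolve(l, output), tryResolve(r, output))
        case And(l, r)         => And(tryResolve(l, output), tryResolve(r, output))
        case other             => other
      }

      // Join condition for `outerKey IN (SELECT projected FROM right WHERE filter)`.
      // Only the subquery's projection and filter are resolved against the right side;
      // the outer key stays unresolved so that it cannot bind to the wrong relation.
      def joinCondition(outerKey: Expr, projected: Expr, filter: Expr,
                        rightOutput: Seq[Attr]): Expr =
        And(EqualTo(outerKey, tryResolve(projected, rightOutput)),
          tryResolve(filter, rightOutput))
    }
    ```
    
    With the right side exposing `key#123`, the condition for the query above comes out as `'key = key#123 and key#123 > 100`, matching the partially resolved form shown earlier.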





[GitHub] spark pull request: [SPARK-4226][SQL]Add subquery (not) in/exists ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9055#issuecomment-149941344
  
    Merged build finished. Test PASSed.




[GitHub] spark pull request: [SPARK-4226][SQL]Add subquery (not) in/exists ...

Posted by yhuai <gi...@git.apache.org>.
Github user yhuai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/9055#discussion_r42581782
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ---
    @@ -270,6 +273,146 @@ class Analyzer(
       }
     
       /**
    +   * Rewrite the [[Exists]] [[In]] with left semi join or anti join.
    +   */
    +  object RewriteFilterSubQuery extends Rule[LogicalPlan] with PredicateHelper {
    +    def unapply(condition: Expression): Option[(Expression, Seq[Expression])] = {
    +      if (condition.resolved == false) {
    +        return None
    +      }
    +
    +      val conjuctions = splitConjunctivePredicates(condition).map(_ transformDown {
    +          // Remove the Cast expression for SubQueryExpression.
    +          case Cast(f: SubQueryExpression, BooleanType) => f
    +        }
    +      )
    +
    +      val (subqueries, others) = conjuctions.partition(c => c.isInstanceOf[SubQueryExpression])
    --- End diff --
    
    What will happen if I have something like `WHERE a IN (...) OR b IN (...)`?
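    
    To make the concern concrete, here is a minimal sketch of the split-and-partition step, assuming the helper splits only on top-level `AND`. The types and names (`Pred`, `InSub`, `Cmp`, `Conj`, `Disj`, `SplitSketch`) are hypothetical and written for this example, and this is a reading of the quoted diff rather than a confirmed answer. Under that assumption, an `OR` of two subquery predicates stays a single conjunct that is not itself a subquery expression, so the partition finds no subqueries.
    
    ```scala
    // Toy predicate tree; hypothetical names, not Catalyst's classes.
    sealed trait Pred
    case class InSub(column: String) extends Pred           // column IN (subquery)
    case class Cmp(text: String) extends Pred               // any ordinary predicate
    case class Conj(left: Pred, right: Pred) extends Pred   // AND
    case class Disj(left: Pred, right: Pred) extends Pred   // OR

    object SplitSketch {
      // Split only on top-level ANDs; an OR is returned as one opaque conjunct.
      def splitConjuncts(p: Pred): Seq[Pred] = p match {
        case Conj(l, r) => splitConjuncts(l) ++ splitConjuncts(r)
        case other      => Seq(other)
      }

      // Mirror the partition step: conjuncts that are themselves subquery
      // predicates on one side, everything else on the other.
      def partition(p: Pred): (Seq[Pred], Seq[Pred]) =
        splitConjuncts(p).partition(_.isInstanceOf[InSub])
    }
    ```
    
    For `a IN (...) AND b > 1` the sketch yields one subquery conjunct and one ordinary conjunct; for `a IN (...) OR b IN (...)` it yields no subquery conjuncts at all, which in the quoted `unapply` corresponds to the `subqueries.isEmpty` branch.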




[GitHub] spark pull request: [SPARK-4226][SQL]Add subquery (not) in/exists ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9055#issuecomment-147021465
  
     Merged build triggered.




[GitHub] spark pull request: [SPARK-4226][SQL]Add subquery (not) in/exists ...

Posted by yhuai <gi...@git.apache.org>.
Github user yhuai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/9055#discussion_r42581804
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/unresolved.scala ---
    @@ -263,3 +263,50 @@ case class UnresolvedAlias(child: Expression)
     
       override lazy val resolved = false
     }
    +
    +trait SubQueryExpression extends Unevaluable {
    +  def subquery: LogicalPlan
    +
    +  override def dataType: DataType = BooleanType
    +  override def foldable: Boolean = false
    +  override def nullable: Boolean = false
    +
    +  def withNewSubQuery(newSubquery: LogicalPlan): this.type
    +}
    +
    +/**
    + * Exist subquery expression, only used in filter only
    + */
    +case class Exists(subquery: LogicalPlan, positive: Boolean)
    --- End diff --
    
    It looks like the name `positive` does not clearly explain what the flag means.




[GitHub] spark pull request: [SPARK-4226][SQL]Add subquery (not) in/exists ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/9055#issuecomment-148322379
  
      [Test build #43782 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/43782/consoleFull) for   PR 9055 at commit [`7511f47`](https://github.com/apache/spark/commit/7511f47089ed58f913a81df2113cbe300903be63).

