You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by gatorsmile <gi...@git.apache.org> on 2016/01/07 06:48:35 UTC

[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...

GitHub user gatorsmile opened a pull request:

    https://github.com/apache/spark/pull/10630

    [SPARK-12656] [SQL] Implement Intersect with Left-semi Join

    Our current Intersect physical operator simply delegates to RDD.intersect. We should remove the Intersect physical operator and simply transform a logical intersect into a semi-join. This way, we can take advantage of all the benefits of join implementations (e.g. managed memory, code generation, broadcast joins).
    
    After a search, I found one of the mainstream RDBMS did the same. In their query explain, Intersect is replaced by Left-semi Join. Left-semi Join could help outer-join elimination in Optimizer, as shown in the PR: https://github.com/apache/spark/pull/10566

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/gatorsmile/spark IntersectBySemiJoin

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/10630.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #10630
    
----
commit 0bd177135bc1c8210fa0ea5a8c7a4c5144d6227e
Author: gatorsmile <ga...@gmail.com>
Date:   2016-01-07T05:40:33Z

    replace Intersect with Left-semi Join

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10630#issuecomment-175469558
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/50176/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...

Posted by yhuai <gi...@git.apache.org>.
Github user yhuai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10630#discussion_r50897086
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala ---
    @@ -125,17 +128,15 @@ object EliminateSerialization extends Rule[LogicalPlan] {
     
     /**
      * Pushes certain operations to both sides of a Union, Intersect or Except operator.
    +=======
    + * Pushes certain operations to both sides of a Union or Except operator.
    +>>>>>>> IntersectBySemiJoinMerged
    --- End diff --
    
    Seems we need to remove this.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on the pull request:

    https://github.com/apache/spark/pull/10630#issuecomment-176630203
  
    This is not that big. Let's just do it together here.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10630#discussion_r49044137
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala ---
    @@ -322,13 +323,32 @@ class DataFrameSuite extends QueryTest with SharedSQLContext {
       }
     
       test("intersect") {
    +    val intersectDF = lowerCaseData.intersect(lowerCaseData)
    +
    +    // Before Optimizer, the operator is Intersect
    --- End diff --
    
    ok, will add a new test suite for it. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10630#discussion_r51057617
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ---
    @@ -388,57 +445,18 @@ class Analyzer(
                 .map(_.asInstanceOf[NamedExpression])
             a.copy(aggregateExpressions = expanded)
     
    -      // Special handling for cases when self-join introduce duplicate expression ids.
    -      case j @ Join(left, right, _, _) if !j.selfJoinResolved =>
    -        val conflictingAttributes = left.outputSet.intersect(right.outputSet)
    -        logDebug(s"Conflicting attributes ${conflictingAttributes.mkString(",")} in $j")
    -
    -        right.collect {
    -          // Handle base relations that might appear more than once.
    -          case oldVersion: MultiInstanceRelation
    -              if oldVersion.outputSet.intersect(conflictingAttributes).nonEmpty =>
    -            val newVersion = oldVersion.newInstance()
    -            (oldVersion, newVersion)
    -
    -          // Handle projects that create conflicting aliases.
    -          case oldVersion @ Project(projectList, _)
    -              if findAliases(projectList).intersect(conflictingAttributes).nonEmpty =>
    -            (oldVersion, oldVersion.copy(projectList = newAliases(projectList)))
    -
    -          case oldVersion @ Aggregate(_, aggregateExpressions, _)
    -              if findAliases(aggregateExpressions).intersect(conflictingAttributes).nonEmpty =>
    -            (oldVersion, oldVersion.copy(aggregateExpressions = newAliases(aggregateExpressions)))
    -
    -          case oldVersion: Generate
    -              if oldVersion.generatedSet.intersect(conflictingAttributes).nonEmpty =>
    -            val newOutput = oldVersion.generatorOutput.map(_.newInstance())
    -            (oldVersion, oldVersion.copy(generatorOutput = newOutput))
    -
    -          case oldVersion @ Window(_, windowExpressions, _, _, child)
    -              if AttributeSet(windowExpressions.map(_.toAttribute)).intersect(conflictingAttributes)
    -                .nonEmpty =>
    -            (oldVersion, oldVersion.copy(windowExpressions = newAliases(windowExpressions)))
    -        }
    -        // Only handle first case, others will be fixed on the next pass.
    -        .headOption match {
    -          case None =>
    -            /*
    -             * No result implies that there is a logical plan node that produces new references
    -             * that this rule cannot handle. When that is the case, there must be another rule
    -             * that resolves these conflicts. Otherwise, the analysis will fail.
    -             */
    -            j
    -          case Some((oldRelation, newRelation)) =>
    -            val attributeRewrites = AttributeMap(oldRelation.output.zip(newRelation.output))
    -            val newRight = right transformUp {
    -              case r if r == oldRelation => newRelation
    -            } transformUp {
    -              case other => other transformExpressions {
    -                case a: Attribute => attributeRewrites.get(a).getOrElse(a)
    -              }
    -            }
    -            j.copy(right = newRight)
    -        }
    +      // To resolve duplicate expression IDs for all the BinaryNode
    +      case b: BinaryNode if !b.duplicateResolved => b match {
    +        case j @ Join(left, right, _, _) =>
    +          j.copy(right = dedupRight(left, right))
    +        case i @ Intersect(left, right) =>
    +          i.copy(right = dedupRight(left, right))
    +        case e @ Except(left, right) =>
    +          e.copy(right = dedupRight(left, right))
    +        case cg: CoGroup =>
    --- End diff --
    
    If we replace intersect with left-semi join during analysis, then we can get de-duplication for free(the rule for join can do the work), will it help here?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10630#issuecomment-176508356
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/50313/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10630#issuecomment-170208866
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/49038/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the pull request:

    https://github.com/apache/spark/pull/10630#issuecomment-170209280
  
    retest this please


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10630#discussion_r49044073
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala ---
    @@ -322,13 +323,32 @@ class DataFrameSuite extends QueryTest with SharedSQLContext {
       }
     
       test("intersect") {
    +    val intersectDF = lowerCaseData.intersect(lowerCaseData)
    +
    +    // Before Optimizer, the operator is Intersect
    +    assert(intersectDF.queryExecution.analyzed.collect {
    +      case j@Intersect(_, _) => j
    +    }.size === 1)
    +
    +    // Before Optimizer, the operator is converted to LeftSemi Join
    +    assert(intersectDF.queryExecution.optimizedPlan.collect {
    +      case j@Join(_, _, LeftSemi, _) => j
    +    }.size === 1)
    +
         checkAnswer(
    -      lowerCaseData.intersect(lowerCaseData),
    +      intersectDF,
           Row(1, "a") ::
           Row(2, "b") ::
           Row(3, "c") ::
           Row(4, "d") :: Nil)
         checkAnswer(lowerCaseData.intersect(upperCaseData), Nil)
    +
    +    checkAnswer(
    --- End diff --
    
    leave a line of comment on what this is testing (e.g. null equality).



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10630#issuecomment-170209903
  
    **[Test build #49044 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/49044/consoleFull)** for PR 10630 at commit [`4372170`](https://github.com/apache/spark/commit/4372170f600eb25996c3aa4f09d569312c263686).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the pull request:

    https://github.com/apache/spark/pull/10630#issuecomment-169901285
  
    After the merge, will submit another PR to support it. Thanks! 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10630#issuecomment-174165441
  
    **[Test build #49936 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/49936/consoleFull)** for PR 10630 at commit [`6a7979d`](https://github.com/apache/spark/commit/6a7979d35d5cf89d9cd14eb6d6a2a4418a44a669).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10630#issuecomment-169936735
  
    **[Test build #49012 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/49012/consoleFull)** for PR 10630 at commit [`a932cdb`](https://github.com/apache/spark/commit/a932cdb05c889ded8de72024b2eef19723bdb0f4).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `class JavaSampleActorReceiver<T> extends JavaActorReceiver `
      * `public class JavaActorWordCount `
      * `abstract class ActorReceiver extends Actor `
      * `abstract class JavaActorReceiver extends UntypedActor `


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the pull request:

    https://github.com/apache/spark/pull/10630#issuecomment-174148184
  
    sure, let me do it tonight. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10630#issuecomment-169926993
  
    **[Test build #49008 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/49008/consoleFull)** for PR 10630 at commit [`27192be`](https://github.com/apache/spark/commit/27192be27708265aba6707d1100eefca537fb2b8).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10630#issuecomment-169582121
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10630#issuecomment-169618123
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10630#discussion_r51091050
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ---
    @@ -388,57 +445,18 @@ class Analyzer(
                 .map(_.asInstanceOf[NamedExpression])
             a.copy(aggregateExpressions = expanded)
     
    -      // Special handling for cases when self-join introduce duplicate expression ids.
    -      case j @ Join(left, right, _, _) if !j.selfJoinResolved =>
    -        val conflictingAttributes = left.outputSet.intersect(right.outputSet)
    -        logDebug(s"Conflicting attributes ${conflictingAttributes.mkString(",")} in $j")
    -
    -        right.collect {
    -          // Handle base relations that might appear more than once.
    -          case oldVersion: MultiInstanceRelation
    -              if oldVersion.outputSet.intersect(conflictingAttributes).nonEmpty =>
    -            val newVersion = oldVersion.newInstance()
    -            (oldVersion, newVersion)
    -
    -          // Handle projects that create conflicting aliases.
    -          case oldVersion @ Project(projectList, _)
    -              if findAliases(projectList).intersect(conflictingAttributes).nonEmpty =>
    -            (oldVersion, oldVersion.copy(projectList = newAliases(projectList)))
    -
    -          case oldVersion @ Aggregate(_, aggregateExpressions, _)
    -              if findAliases(aggregateExpressions).intersect(conflictingAttributes).nonEmpty =>
    -            (oldVersion, oldVersion.copy(aggregateExpressions = newAliases(aggregateExpressions)))
    -
    -          case oldVersion: Generate
    -              if oldVersion.generatedSet.intersect(conflictingAttributes).nonEmpty =>
    -            val newOutput = oldVersion.generatorOutput.map(_.newInstance())
    -            (oldVersion, oldVersion.copy(generatorOutput = newOutput))
    -
    -          case oldVersion @ Window(_, windowExpressions, _, _, child)
    -              if AttributeSet(windowExpressions.map(_.toAttribute)).intersect(conflictingAttributes)
    -                .nonEmpty =>
    -            (oldVersion, oldVersion.copy(windowExpressions = newAliases(windowExpressions)))
    -        }
    -        // Only handle first case, others will be fixed on the next pass.
    -        .headOption match {
    -          case None =>
    -            /*
    -             * No result implies that there is a logical plan node that produces new references
    -             * that this rule cannot handle. When that is the case, there must be another rule
    -             * that resolves these conflicts. Otherwise, the analysis will fail.
    -             */
    -            j
    -          case Some((oldRelation, newRelation)) =>
    -            val attributeRewrites = AttributeMap(oldRelation.output.zip(newRelation.output))
    -            val newRight = right transformUp {
    -              case r if r == oldRelation => newRelation
    -            } transformUp {
    -              case other => other transformExpressions {
    -                case a: Attribute => attributeRewrites.get(a).getOrElse(a)
    -              }
    -            }
    -            j.copy(right = newRight)
    -        }
    +      // To resolve duplicate expression IDs for all the BinaryNode
    +      case b: BinaryNode if !b.duplicateResolved => b match {
    +        case j @ Join(left, right, _, _) =>
    +          j.copy(right = dedupRight(left, right))
    +        case i @ Intersect(left, right) =>
    +          i.copy(right = dedupRight(left, right))
    +        case e @ Except(left, right) =>
    +          e.copy(right = dedupRight(left, right))
    +        case cg: CoGroup =>
    --- End diff --
    
    ah makes sense, let's add some comment to explain this :)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10630#discussion_r51233323
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlan.scala ---
    @@ -315,4 +315,7 @@ abstract class BinaryNode extends LogicalPlan {
       def right: LogicalPlan
     
       override def children: Seq[LogicalPlan] = Seq(left, right)
    +
    +  override lazy val resolved: Boolean =
    +    expressions.forall(_.resolved) && childrenResolved
    --- End diff --
    
    why override this? 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10630#issuecomment-169577467
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/48907/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10630#issuecomment-169757547
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...

Posted by nongli <gi...@git.apache.org>.
Github user nongli commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10630#discussion_r49047907
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala ---
    @@ -329,6 +329,14 @@ class DataFrameSuite extends QueryTest with SharedSQLContext {
           Row(3, "c") ::
           Row(4, "d") :: Nil)
         checkAnswer(lowerCaseData.intersect(upperCaseData), Nil)
    +
    +    // check null equality
    --- End diff --
    
    A good case might be two tables that each contain two int rows [1], [1] and intersect them.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10630#discussion_r49048434
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala ---
    @@ -329,6 +329,14 @@ class DataFrameSuite extends QueryTest with SharedSQLContext {
           Row(3, "c") ::
           Row(4, "d") :: Nil)
         checkAnswer(lowerCaseData.intersect(upperCaseData), Nil)
    +
    +    // check null equality
    --- End diff --
    
    Thank you! Let me add the test case. It passes but we should still add this case! 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10630#discussion_r49044018
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala ---
    @@ -966,6 +952,27 @@ object ReplaceDistinctWithAggregate extends Rule[LogicalPlan] {
     }
     
     /**
    + * Replaces logical [[Intersect]] operator with a left-semi [[Join]] operator.
    + * {{{
    + *   SELECT a1, a2 FROM Tab1 INTERSECT SELECT b1, b2 FROM Tab2
    + *   ==>  SELECT a1, a2 FROM Tab1, Tab2 ON a1<=>b1 AND a2<=>b2
    + * }}}
    + */
    +object ReplaceIntersectWithLeftSemi extends Rule[LogicalPlan] {
    --- End diff --
    
    yeah. Forgot to specify the join type


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10630#discussion_r51074011
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ---
    @@ -388,57 +445,18 @@ class Analyzer(
                 .map(_.asInstanceOf[NamedExpression])
             a.copy(aggregateExpressions = expanded)
     
    -      // Special handling for cases when self-join introduce duplicate expression ids.
    -      case j @ Join(left, right, _, _) if !j.selfJoinResolved =>
    -        val conflictingAttributes = left.outputSet.intersect(right.outputSet)
    -        logDebug(s"Conflicting attributes ${conflictingAttributes.mkString(",")} in $j")
    -
    -        right.collect {
    -          // Handle base relations that might appear more than once.
    -          case oldVersion: MultiInstanceRelation
    -              if oldVersion.outputSet.intersect(conflictingAttributes).nonEmpty =>
    -            val newVersion = oldVersion.newInstance()
    -            (oldVersion, newVersion)
    -
    -          // Handle projects that create conflicting aliases.
    -          case oldVersion @ Project(projectList, _)
    -              if findAliases(projectList).intersect(conflictingAttributes).nonEmpty =>
    -            (oldVersion, oldVersion.copy(projectList = newAliases(projectList)))
    -
    -          case oldVersion @ Aggregate(_, aggregateExpressions, _)
    -              if findAliases(aggregateExpressions).intersect(conflictingAttributes).nonEmpty =>
    -            (oldVersion, oldVersion.copy(aggregateExpressions = newAliases(aggregateExpressions)))
    -
    -          case oldVersion: Generate
    -              if oldVersion.generatedSet.intersect(conflictingAttributes).nonEmpty =>
    -            val newOutput = oldVersion.generatorOutput.map(_.newInstance())
    -            (oldVersion, oldVersion.copy(generatorOutput = newOutput))
    -
    -          case oldVersion @ Window(_, windowExpressions, _, _, child)
    -              if AttributeSet(windowExpressions.map(_.toAttribute)).intersect(conflictingAttributes)
    -                .nonEmpty =>
    -            (oldVersion, oldVersion.copy(windowExpressions = newAliases(windowExpressions)))
    -        }
    -        // Only handle first case, others will be fixed on the next pass.
    -        .headOption match {
    -          case None =>
    -            /*
    -             * No result implies that there is a logical plan node that produces new references
    -             * that this rule cannot handle. When that is the case, there must be another rule
    -             * that resolves these conflicts. Otherwise, the analysis will fail.
    -             */
    -            j
    -          case Some((oldRelation, newRelation)) =>
    -            val attributeRewrites = AttributeMap(oldRelation.output.zip(newRelation.output))
    -            val newRight = right transformUp {
    -              case r if r == oldRelation => newRelation
    -            } transformUp {
    -              case other => other transformExpressions {
    -                case a: Attribute => attributeRewrites.get(a).getOrElse(a)
    -              }
    -            }
    -            j.copy(right = newRight)
    -        }
    +      // To resolve duplicate expression IDs for all the BinaryNode
    +      case b: BinaryNode if !b.duplicateResolved => b match {
    +        case j @ Join(left, right, _, _) =>
    +          j.copy(right = dedupRight(left, right))
    +        case i @ Intersect(left, right) =>
    +          i.copy(right = dedupRight(left, right))
    +        case e @ Except(left, right) =>
    +          e.copy(right = dedupRight(left, right))
    +        case cg: CoGroup =>
    --- End diff --
    
    So far, I am unable to create such test cases. Let me revert it back and enable de-duplication for Join only. Thanks!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10630#issuecomment-170200645
  
    **[Test build #49039 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/49039/consoleFull)** for PR 10630 at commit [`4372170`](https://github.com/apache/spark/commit/4372170f600eb25996c3aa4f09d569312c263686).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on the pull request:

    https://github.com/apache/spark/pull/10630#issuecomment-174144971
  
    @gatorsmile there is a conflict now - mind bringing it up to date?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...

Posted by yhuai <gi...@git.apache.org>.
Github user yhuai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10630#discussion_r50899426
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala ---
    @@ -1055,6 +1045,22 @@ object ReplaceDistinctWithAggregate extends Rule[LogicalPlan] {
     }
     
     /**
    + * Replaces logical [[Intersect]] operator with a left-semi [[Join]] operator.
    + * {{{
    + *   SELECT a1, a2 FROM Tab1 INTERSECT SELECT b1, b2 FROM Tab2
    + *   ==>  SELECT DISTINCT a1, a2 FROM Tab1 LEFT SEMI JOIN Tab2 ON a1<=>b1 AND a2<=>b2
    + * }}}
    + */
    +object ReplaceIntersectWithSemiJoin extends Rule[LogicalPlan] {
    --- End diff --
    
    I think we need to add a comment at here to mention that this rewrite is just for `INTERSECT` (i.e. `INTERSECT DISTINCT`) and is not applicable to `INTERSECT ALL`.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10630#issuecomment-169617317
  
    **[Test build #48916 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48916/consoleFull)** for PR 10630 at commit [`e4c34f0`](https://github.com/apache/spark/commit/e4c34f01acbb626a8084c99a5f0ad26fbfb175e2).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10630#issuecomment-169578764
  
    **[Test build #2342 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/2342/consoleFull)** for PR 10630 at commit [`6742984`](https://github.com/apache/spark/commit/6742984f5ec0d480435988712e161838c3e8d6d5).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...

Posted by yhuai <gi...@git.apache.org>.
Github user yhuai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10630#discussion_r49096615
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala ---
    @@ -966,6 +952,22 @@ object ReplaceDistinctWithAggregate extends Rule[LogicalPlan] {
     }
     
     /**
    + * Replaces logical [[Intersect]] operator with a left-semi [[Join]] operator.
    + * {{{
    + *   SELECT a1, a2 FROM Tab1 INTERSECT SELECT b1, b2 FROM Tab2
    + *   ==>  SELECT a1, a2 FROM Tab1 LEFT SEMI JOIN Tab2 ON a1<=>b1 AND a2<=>b2
    + * }}}
    + */
    +object ReplaceIntersectWithSemiJoin extends Rule[LogicalPlan] {
    +  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
    +    case Intersect(left, right) =>
    +      assert(left.output.size == right.output.size)
    +      val joinCond = left.output.zip(right.output).map { case (l, r) => EqualNullSafe(l, r) }
    +      Join(left, right, LeftSemi, joinCond.reduceLeftOption(And))
    --- End diff --
    
    Does our current intersect do de-dup?
    
    
    
    
    
    
    On Thu, Jan 7, 2016 at 8:50 AM -0800, "Xiao Li" <no...@github.com> wrote:
    
    
    
    
    
    
    
    
    
    
    
    
    In sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala:
    > @@ -966,6 +952,22 @@ object ReplaceDistinctWithAggregate extends Rule[LogicalPlan] {
    >  }
    >  
    >  /**
    > + * Replaces logical [[Intersect]] operator with a left-semi [[Join]] operator.
    > + * {{{
    > + *   SELECT a1, a2 FROM Tab1 INTERSECT SELECT b1, b2 FROM Tab2
    > + *   ==>  SELECT a1, a2 FROM Tab1 LEFT SEMI JOIN Tab2 ON a1<=>b1 AND a2<=>b2
    > + * }}}
    > + */
    > +object ReplaceIntersectWithSemiJoin extends Rule[LogicalPlan] {
    > +  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
    > +    case Intersect(left, right) =>
    > +      assert(left.output.size == right.output.size)
    > +      val joinCond = left.output.zip(right.output).map { case (l, r) => EqualNullSafe(l, r) }
    > +      Join(left, right, LeftSemi, joinCond.reduceLeftOption(And))
    
    
    
    You are right! It should remove duplicates. I just tried it in DB2. 
    
    
    
    —
    Reply to this email directly or view it on GitHub.
    
    
      
      
    
    
    
    
    
    
    
    



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the pull request:

    https://github.com/apache/spark/pull/10630#issuecomment-176628957
  
    LGTM. we can merge it first and @gatorsmile can address remaining comments in a follow-up PR.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10630#discussion_r51233162
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala ---
    @@ -214,12 +214,22 @@ trait CheckAnalysis {
                   s"""Only a single table generating function is allowed in a SELECT clause, found:
                      | ${exprs.map(_.prettyString).mkString(",")}""".stripMargin)
     
    -          // Special handling for cases when self-join introduce duplicate expression ids.
    -          case j @ Join(left, right, _, _) if left.outputSet.intersect(right.outputSet).nonEmpty =>
    -            val conflictingAttributes = left.outputSet.intersect(right.outputSet)
    +          case j: Join if !j.duplicateResolved =>
    +            val conflictingAttributes = j.left.outputSet.intersect(j.right.outputSet)
                 failAnalysis(
                   s"""
    -                 |Failure when resolving conflicting references in Join:
    +                 |Failure when resolving conflicting references
    +                 |in operator ${operator.simpleString}:
    +                 |$plan
    +                 |Conflicting attributes: ${conflictingAttributes.mkString(",")}
    +                 |""".stripMargin)
    +
    +          case i: Intersect if !i.duplicateResolved =>
    +            val conflictingAttributes = i.left.outputSet.intersect(i.right.outputSet)
    +            failAnalysis(
    +              s"""
    +                 |Failure when resolving conflicting references
    +                 |in operator ${operator.simpleString}:
    --- End diff --
    
    same here, we could say `Intersect` directly


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10630#issuecomment-169565371
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the pull request:

    https://github.com/apache/spark/pull/10630#issuecomment-169577796
  
    retest this please.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10630#discussion_r51041095
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ---
    @@ -388,57 +445,18 @@ class Analyzer(
                 .map(_.asInstanceOf[NamedExpression])
             a.copy(aggregateExpressions = expanded)
     
    -      // Special handling for cases when self-join introduce duplicate expression ids.
    -      case j @ Join(left, right, _, _) if !j.selfJoinResolved =>
    -        val conflictingAttributes = left.outputSet.intersect(right.outputSet)
    -        logDebug(s"Conflicting attributes ${conflictingAttributes.mkString(",")} in $j")
    -
    -        right.collect {
    -          // Handle base relations that might appear more than once.
    -          case oldVersion: MultiInstanceRelation
    -              if oldVersion.outputSet.intersect(conflictingAttributes).nonEmpty =>
    -            val newVersion = oldVersion.newInstance()
    -            (oldVersion, newVersion)
    -
    -          // Handle projects that create conflicting aliases.
    -          case oldVersion @ Project(projectList, _)
    -              if findAliases(projectList).intersect(conflictingAttributes).nonEmpty =>
    -            (oldVersion, oldVersion.copy(projectList = newAliases(projectList)))
    -
    -          case oldVersion @ Aggregate(_, aggregateExpressions, _)
    -              if findAliases(aggregateExpressions).intersect(conflictingAttributes).nonEmpty =>
    -            (oldVersion, oldVersion.copy(aggregateExpressions = newAliases(aggregateExpressions)))
    -
    -          case oldVersion: Generate
    -              if oldVersion.generatedSet.intersect(conflictingAttributes).nonEmpty =>
    -            val newOutput = oldVersion.generatorOutput.map(_.newInstance())
    -            (oldVersion, oldVersion.copy(generatorOutput = newOutput))
    -
    -          case oldVersion @ Window(_, windowExpressions, _, _, child)
    -              if AttributeSet(windowExpressions.map(_.toAttribute)).intersect(conflictingAttributes)
    -                .nonEmpty =>
    -            (oldVersion, oldVersion.copy(windowExpressions = newAliases(windowExpressions)))
    -        }
    -        // Only handle first case, others will be fixed on the next pass.
    -        .headOption match {
    -          case None =>
    -            /*
    -             * No result implies that there is a logical plan node that produces new references
    -             * that this rule cannot handle. When that is the case, there must be another rule
    -             * that resolves these conflicts. Otherwise, the analysis will fail.
    -             */
    -            j
    -          case Some((oldRelation, newRelation)) =>
    -            val attributeRewrites = AttributeMap(oldRelation.output.zip(newRelation.output))
    -            val newRight = right transformUp {
    -              case r if r == oldRelation => newRelation
    -            } transformUp {
    -              case other => other transformExpressions {
    -                case a: Attribute => attributeRewrites.get(a).getOrElse(a)
    -              }
    -            }
    -            j.copy(right = newRight)
    -        }
    +      // To resolve duplicate expression IDs for all the BinaryNode
    +      case b: BinaryNode if !b.duplicateResolved => b match {
    +        case j @ Join(left, right, _, _) =>
    +          j.copy(right = dedupRight(left, right))
    +        case i @ Intersect(left, right) =>
    +          i.copy(right = dedupRight(left, right))
    +        case e @ Except(left, right) =>
    +          e.copy(right = dedupRight(left, right))
    +        case cg: CoGroup =>
    --- End diff --
    
    do we have test cases to prove that duplicated expression IDs in set operations will cause problems?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10630#issuecomment-169916600
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the pull request:

    https://github.com/apache/spark/pull/10630#issuecomment-169562691
  
    MS SQL Server did that


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10630#discussion_r49045723
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala ---
    @@ -966,6 +952,27 @@ object ReplaceDistinctWithAggregate extends Rule[LogicalPlan] {
     }
     
     /**
    + * Replaces logical [[Intersect]] operator with a left-semi [[Join]] operator.
    + * {{{
    + *   SELECT a1, a2 FROM Tab1 INTERSECT SELECT b1, b2 FROM Tab2
    + *   ==>  SELECT a1, a2 FROM Tab1, Tab2 ON a1<=>b1 AND a2<=>b2
    + * }}}
    + */
    +object ReplaceIntersectWithLeftSemi extends Rule[LogicalPlan] {
    +  private def buildCond (left: LogicalPlan, right: LogicalPlan): Seq[Expression] = {
    +    require(left.output.length == right.output.length)
    --- End diff --
    
    Yeah, it has a checking logic in analyzer:
    https://github.com/apache/spark/blob/d084a2de3271fd8b0d29ee67e1e214e8b9d96d6d/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala#L186-L190


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10630#issuecomment-175432369
  
    **[Test build #50176 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/50176/consoleFull)** for PR 10630 at commit [`e566d79`](https://github.com/apache/spark/commit/e566d79b12ab1d9ed9e3124fec8290cdba99549b).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10630#discussion_r51233135
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala ---
    @@ -214,12 +214,22 @@ trait CheckAnalysis {
                   s"""Only a single table generating function is allowed in a SELECT clause, found:
                      | ${exprs.map(_.prettyString).mkString(",")}""".stripMargin)
     
    -          // Special handling for cases when self-join introduce duplicate expression ids.
    -          case j @ Join(left, right, _, _) if left.outputSet.intersect(right.outputSet).nonEmpty =>
    -            val conflictingAttributes = left.outputSet.intersect(right.outputSet)
    +          case j: Join if !j.duplicateResolved =>
    +            val conflictingAttributes = j.left.outputSet.intersect(j.right.outputSet)
                 failAnalysis(
                   s"""
    -                 |Failure when resolving conflicting references in Join:
    --- End diff --
    
    now we can keep this message as it only checks join :)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10630#discussion_r51163511
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ---
    @@ -388,57 +445,18 @@ class Analyzer(
                 .map(_.asInstanceOf[NamedExpression])
             a.copy(aggregateExpressions = expanded)
     
    -      // Special handling for cases when self-join introduce duplicate expression ids.
    -      case j @ Join(left, right, _, _) if !j.selfJoinResolved =>
    -        val conflictingAttributes = left.outputSet.intersect(right.outputSet)
    -        logDebug(s"Conflicting attributes ${conflictingAttributes.mkString(",")} in $j")
    -
    -        right.collect {
    -          // Handle base relations that might appear more than once.
    -          case oldVersion: MultiInstanceRelation
    -              if oldVersion.outputSet.intersect(conflictingAttributes).nonEmpty =>
    -            val newVersion = oldVersion.newInstance()
    -            (oldVersion, newVersion)
    -
    -          // Handle projects that create conflicting aliases.
    -          case oldVersion @ Project(projectList, _)
    -              if findAliases(projectList).intersect(conflictingAttributes).nonEmpty =>
    -            (oldVersion, oldVersion.copy(projectList = newAliases(projectList)))
    -
    -          case oldVersion @ Aggregate(_, aggregateExpressions, _)
    -              if findAliases(aggregateExpressions).intersect(conflictingAttributes).nonEmpty =>
    -            (oldVersion, oldVersion.copy(aggregateExpressions = newAliases(aggregateExpressions)))
    -
    -          case oldVersion: Generate
    -              if oldVersion.generatedSet.intersect(conflictingAttributes).nonEmpty =>
    -            val newOutput = oldVersion.generatorOutput.map(_.newInstance())
    -            (oldVersion, oldVersion.copy(generatorOutput = newOutput))
    -
    -          case oldVersion @ Window(_, windowExpressions, _, _, child)
    -              if AttributeSet(windowExpressions.map(_.toAttribute)).intersect(conflictingAttributes)
    -                .nonEmpty =>
    -            (oldVersion, oldVersion.copy(windowExpressions = newAliases(windowExpressions)))
    -        }
    -        // Only handle first case, others will be fixed on the next pass.
    -        .headOption match {
    -          case None =>
    -            /*
    -             * No result implies that there is a logical plan node that produces new references
    -             * that this rule cannot handle. When that is the case, there must be another rule
    -             * that resolves these conflicts. Otherwise, the analysis will fail.
    -             */
    -            j
    -          case Some((oldRelation, newRelation)) =>
    -            val attributeRewrites = AttributeMap(oldRelation.output.zip(newRelation.output))
    -            val newRight = right transformUp {
    -              case r if r == oldRelation => newRelation
    -            } transformUp {
    -              case other => other transformExpressions {
    -                case a: Attribute => attributeRewrites.get(a).getOrElse(a)
    -              }
    -            }
    -            j.copy(right = newRight)
    -        }
    +      // To resolve duplicate expression IDs for all the BinaryNode
    +      case b: BinaryNode if !b.duplicateResolved => b match {
    +        case j @ Join(left, right, _, _) =>
    +          j.copy(right = dedupRight(left, right))
    +        case i @ Intersect(left, right) =>
    +          i.copy(right = dedupRight(left, right))
    +        case e @ Except(left, right) =>
    +          e.copy(right = dedupRight(left, right))
    +        case cg: CoGroup =>
    --- End diff --
    
    Will do. : )


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10630#discussion_r51055023
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala ---
    @@ -111,6 +113,7 @@ object SamplePushDown extends Rule[LogicalPlan] {
     }
     
     /**
    +<<<<<<< HEAD
    --- End diff --
    
    will do. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10630#issuecomment-176036669
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/50257/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...

Posted by yhuai <gi...@git.apache.org>.
Github user yhuai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10630#discussion_r49093850
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala ---
    @@ -329,6 +329,22 @@ class DataFrameSuite extends QueryTest with SharedSQLContext {
           Row(3, "c") ::
           Row(4, "d") :: Nil)
         checkAnswer(lowerCaseData.intersect(upperCaseData), Nil)
    +
    +    // check null equality
    +    checkAnswer(
    +      nullInts.intersect(nullInts),
    +      Row(1) ::
    +      Row(2) ::
    +      Row(3) ::
    +      Row(null) :: Nil)
    +
    +    // check duplicate values
    +    checkAnswer(
    +      allNulls.intersect(allNulls),
    +      Row(null) ::
    +      Row(null) ::
    +      Row(null) ::
    +      Row(null) :: Nil)
    --- End diff --
    
    Is the expected answer right?
    
     Also, can you also add the test mentioned by @nongli (handling null is kind of special. So let's also have a regular case).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10630#issuecomment-169757556
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/48947/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10630#discussion_r49043951
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala ---
    @@ -966,6 +952,27 @@ object ReplaceDistinctWithAggregate extends Rule[LogicalPlan] {
     }
     
     /**
    + * Replaces logical [[Intersect]] operator with a left-semi [[Join]] operator.
    + * {{{
    + *   SELECT a1, a2 FROM Tab1 INTERSECT SELECT b1, b2 FROM Tab2
    + *   ==>  SELECT a1, a2 FROM Tab1, Tab2 ON a1<=>b1 AND a2<=>b2
    + * }}}
    + */
    +object ReplaceIntersectWithLeftSemi extends Rule[LogicalPlan] {
    --- End diff --
    
    LeftSemiJoin


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10630#issuecomment-169577464
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10630#discussion_r51059167
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ---
    @@ -388,57 +445,18 @@ class Analyzer(
                 .map(_.asInstanceOf[NamedExpression])
             a.copy(aggregateExpressions = expanded)
     
    -      // Special handling for cases when self-join introduce duplicate expression ids.
    -      case j @ Join(left, right, _, _) if !j.selfJoinResolved =>
    -        val conflictingAttributes = left.outputSet.intersect(right.outputSet)
    -        logDebug(s"Conflicting attributes ${conflictingAttributes.mkString(",")} in $j")
    -
    -        right.collect {
    -          // Handle base relations that might appear more than once.
    -          case oldVersion: MultiInstanceRelation
    -              if oldVersion.outputSet.intersect(conflictingAttributes).nonEmpty =>
    -            val newVersion = oldVersion.newInstance()
    -            (oldVersion, newVersion)
    -
    -          // Handle projects that create conflicting aliases.
    -          case oldVersion @ Project(projectList, _)
    -              if findAliases(projectList).intersect(conflictingAttributes).nonEmpty =>
    -            (oldVersion, oldVersion.copy(projectList = newAliases(projectList)))
    -
    -          case oldVersion @ Aggregate(_, aggregateExpressions, _)
    -              if findAliases(aggregateExpressions).intersect(conflictingAttributes).nonEmpty =>
    -            (oldVersion, oldVersion.copy(aggregateExpressions = newAliases(aggregateExpressions)))
    -
    -          case oldVersion: Generate
    -              if oldVersion.generatedSet.intersect(conflictingAttributes).nonEmpty =>
    -            val newOutput = oldVersion.generatorOutput.map(_.newInstance())
    -            (oldVersion, oldVersion.copy(generatorOutput = newOutput))
    -
    -          case oldVersion @ Window(_, windowExpressions, _, _, child)
    -              if AttributeSet(windowExpressions.map(_.toAttribute)).intersect(conflictingAttributes)
    -                .nonEmpty =>
    -            (oldVersion, oldVersion.copy(windowExpressions = newAliases(windowExpressions)))
    -        }
    -        // Only handle first case, others will be fixed on the next pass.
    -        .headOption match {
    -          case None =>
    -            /*
    -             * No result implies that there is a logical plan node that produces new references
    -             * that this rule cannot handle. When that is the case, there must be another rule
    -             * that resolves these conflicts. Otherwise, the analysis will fail.
    -             */
    -            j
    -          case Some((oldRelation, newRelation)) =>
    -            val attributeRewrites = AttributeMap(oldRelation.output.zip(newRelation.output))
    -            val newRight = right transformUp {
    -              case r if r == oldRelation => newRelation
    -            } transformUp {
    -              case other => other transformExpressions {
    -                case a: Attribute => attributeRewrites.get(a).getOrElse(a)
    -              }
    -            }
    -            j.copy(right = newRight)
    -        }
    +      // To resolve duplicate expression IDs for all the BinaryNode
    +      case b: BinaryNode if !b.duplicateResolved => b match {
    +        case j @ Join(left, right, _, _) =>
    +          j.copy(right = dedupRight(left, right))
    +        case i @ Intersect(left, right) =>
    +          i.copy(right = dedupRight(left, right))
    +        case e @ Except(left, right) =>
    +          e.copy(right = dedupRight(left, right))
    +        case cg: CoGroup =>
    --- End diff --
    
    In this case, it should work! Let me know if we should deduplicate the expression IDs for the other operators. Thanks!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10630#discussion_r51061765
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ---
    @@ -388,57 +445,18 @@ class Analyzer(
                 .map(_.asInstanceOf[NamedExpression])
             a.copy(aggregateExpressions = expanded)
     
    -      // Special handling for cases when self-join introduce duplicate expression ids.
    -      case j @ Join(left, right, _, _) if !j.selfJoinResolved =>
    -        val conflictingAttributes = left.outputSet.intersect(right.outputSet)
    -        logDebug(s"Conflicting attributes ${conflictingAttributes.mkString(",")} in $j")
    -
    -        right.collect {
    -          // Handle base relations that might appear more than once.
    -          case oldVersion: MultiInstanceRelation
    -              if oldVersion.outputSet.intersect(conflictingAttributes).nonEmpty =>
    -            val newVersion = oldVersion.newInstance()
    -            (oldVersion, newVersion)
    -
    -          // Handle projects that create conflicting aliases.
    -          case oldVersion @ Project(projectList, _)
    -              if findAliases(projectList).intersect(conflictingAttributes).nonEmpty =>
    -            (oldVersion, oldVersion.copy(projectList = newAliases(projectList)))
    -
    -          case oldVersion @ Aggregate(_, aggregateExpressions, _)
    -              if findAliases(aggregateExpressions).intersect(conflictingAttributes).nonEmpty =>
    -            (oldVersion, oldVersion.copy(aggregateExpressions = newAliases(aggregateExpressions)))
    -
    -          case oldVersion: Generate
    -              if oldVersion.generatedSet.intersect(conflictingAttributes).nonEmpty =>
    -            val newOutput = oldVersion.generatorOutput.map(_.newInstance())
    -            (oldVersion, oldVersion.copy(generatorOutput = newOutput))
    -
    -          case oldVersion @ Window(_, windowExpressions, _, _, child)
    -              if AttributeSet(windowExpressions.map(_.toAttribute)).intersect(conflictingAttributes)
    -                .nonEmpty =>
    -            (oldVersion, oldVersion.copy(windowExpressions = newAliases(windowExpressions)))
    -        }
    -        // Only handle first case, others will be fixed on the next pass.
    -        .headOption match {
    -          case None =>
    -            /*
    -             * No result implies that there is a logical plan node that produces new references
    -             * that this rule cannot handle. When that is the case, there must be another rule
    -             * that resolves these conflicts. Otherwise, the analysis will fail.
    -             */
    -            j
    -          case Some((oldRelation, newRelation)) =>
    -            val attributeRewrites = AttributeMap(oldRelation.output.zip(newRelation.output))
    -            val newRight = right transformUp {
    -              case r if r == oldRelation => newRelation
    -            } transformUp {
    -              case other => other transformExpressions {
    -                case a: Attribute => attributeRewrites.get(a).getOrElse(a)
    -              }
    -            }
    -            j.copy(right = newRight)
    -        }
    +      // To resolve duplicate expression IDs for all the BinaryNode
    +      case b: BinaryNode if !b.duplicateResolved => b match {
    +        case j @ Join(left, right, _, _) =>
    +          j.copy(right = dedupRight(left, right))
    +        case i @ Intersect(left, right) =>
    +          i.copy(right = dedupRight(left, right))
    +        case e @ Except(left, right) =>
    +          e.copy(right = dedupRight(left, right))
    +        case cg: CoGroup =>
    --- End diff --
    
    For other operators, can we construct test cases that can make them fail without de-duplication? If we can, then we should create a JIRA and fix it in another PR.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the pull request:

    https://github.com/apache/spark/pull/10630#issuecomment-176653934
  
    Thank you! Just cleaned the codes. : )


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10630#discussion_r51054458
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ---
    @@ -388,57 +445,18 @@ class Analyzer(
                 .map(_.asInstanceOf[NamedExpression])
             a.copy(aggregateExpressions = expanded)
     
    -      // Special handling for cases when self-join introduce duplicate expression ids.
    -      case j @ Join(left, right, _, _) if !j.selfJoinResolved =>
    -        val conflictingAttributes = left.outputSet.intersect(right.outputSet)
    -        logDebug(s"Conflicting attributes ${conflictingAttributes.mkString(",")} in $j")
    -
    -        right.collect {
    -          // Handle base relations that might appear more than once.
    -          case oldVersion: MultiInstanceRelation
    -              if oldVersion.outputSet.intersect(conflictingAttributes).nonEmpty =>
    -            val newVersion = oldVersion.newInstance()
    -            (oldVersion, newVersion)
    -
    -          // Handle projects that create conflicting aliases.
    -          case oldVersion @ Project(projectList, _)
    -              if findAliases(projectList).intersect(conflictingAttributes).nonEmpty =>
    -            (oldVersion, oldVersion.copy(projectList = newAliases(projectList)))
    -
    -          case oldVersion @ Aggregate(_, aggregateExpressions, _)
    -              if findAliases(aggregateExpressions).intersect(conflictingAttributes).nonEmpty =>
    -            (oldVersion, oldVersion.copy(aggregateExpressions = newAliases(aggregateExpressions)))
    -
    -          case oldVersion: Generate
    -              if oldVersion.generatedSet.intersect(conflictingAttributes).nonEmpty =>
    -            val newOutput = oldVersion.generatorOutput.map(_.newInstance())
    -            (oldVersion, oldVersion.copy(generatorOutput = newOutput))
    -
    -          case oldVersion @ Window(_, windowExpressions, _, _, child)
    -              if AttributeSet(windowExpressions.map(_.toAttribute)).intersect(conflictingAttributes)
    -                .nonEmpty =>
    -            (oldVersion, oldVersion.copy(windowExpressions = newAliases(windowExpressions)))
    -        }
    -        // Only handle first case, others will be fixed on the next pass.
    -        .headOption match {
    -          case None =>
    -            /*
    -             * No result implies that there is a logical plan node that produces new references
    -             * that this rule cannot handle. When that is the case, there must be another rule
    -             * that resolves these conflicts. Otherwise, the analysis will fail.
    -             */
    -            j
    -          case Some((oldRelation, newRelation)) =>
    -            val attributeRewrites = AttributeMap(oldRelation.output.zip(newRelation.output))
    -            val newRight = right transformUp {
    -              case r if r == oldRelation => newRelation
    -            } transformUp {
    -              case other => other transformExpressions {
    -                case a: Attribute => attributeRewrites.get(a).getOrElse(a)
    -              }
    -            }
    -            j.copy(right = newRight)
    -        }
    +      // To resolve duplicate expression IDs for all the BinaryNode
    +      case b: BinaryNode if !b.duplicateResolved => b match {
    +        case j @ Join(left, right, _, _) =>
    +          j.copy(right = dedupRight(left, right))
    +        case i @ Intersect(left, right) =>
    +          i.copy(right = dedupRight(left, right))
    +        case e @ Except(left, right) =>
    +          e.copy(right = dedupRight(left, right))
    +        case cg: CoGroup =>
    --- End diff --
    
    After we convert Intersect to Join, the query can return a wrong answer due to the pushdown of Join predicate in Optimizer. That is why I did the change. Based on the comments https://github.com/apache/spark/pull/10630#discussion_r49227379  , I also tried to resolve this issue in the other operators.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10630#issuecomment-170208817
  
    **[Test build #49038 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/49038/consoleFull)** for PR 10630 at commit [`f820c61`](https://github.com/apache/spark/commit/f820c616fe217494ccaed0bf74a0a7410ce503cf).
     * This patch **fails from timeout after a configured wait of \`250m\`**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `case class Intersect(left: LogicalPlan, right: LogicalPlan) extends SetOperation(left, right)`


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10630#issuecomment-169635220
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10630#issuecomment-170220930
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/49044/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10630#issuecomment-170193612
  
    **[Test build #49039 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/49039/consoleFull)** for PR 10630 at commit [`4372170`](https://github.com/apache/spark/commit/4372170f600eb25996c3aa4f09d569312c263686).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10630#issuecomment-176686339
  
    **[Test build #50368 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/50368/consoleFull)** for PR 10630 at commit [`b600089`](https://github.com/apache/spark/commit/b600089d4736e9768a8968fb4dead08d28014ac1).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10630#issuecomment-170200676
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10630#discussion_r49046904
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala ---
    @@ -966,6 +952,22 @@ object ReplaceDistinctWithAggregate extends Rule[LogicalPlan] {
     }
     
     /**
    + * Replaces logical [[Intersect]] operator with a left-semi [[Join]] operator.
    + * {{{
    + *   SELECT a1, a2 FROM Tab1 INTERSECT SELECT b1, b2 FROM Tab2
    + *   ==>  SELECT a1, a2 FROM Tab1 LEFT SEMI JOIN Tab2 ON a1<=>b1 AND a2<=>b2
    + * }}}
    + */
    +object ReplaceIntersectWithSemiJoin extends Rule[LogicalPlan] {
    +  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
    +    case Intersect(left, right) =>
    +      val joinCond = left.output.zip(right.output).map { case (l, r) =>
    --- End diff --
    
    Added. Thanks!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10630#issuecomment-170200677
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/49039/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10630#issuecomment-169920448
  
    **[Test build #49008 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/49008/consoleFull)** for PR 10630 at commit [`27192be`](https://github.com/apache/spark/commit/27192be27708265aba6707d1100eefca537fb2b8).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10630#issuecomment-169564022
  
    **[Test build #48900 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48900/consoleFull)** for PR 10630 at commit [`0bd1771`](https://github.com/apache/spark/commit/0bd177135bc1c8210fa0ea5a8c7a4c5144d6227e).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...

Posted by marmbrus <gi...@git.apache.org>.
Github user marmbrus commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10630#discussion_r49227764
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicOperators.scala ---
    @@ -116,7 +116,18 @@ case class Union(left: LogicalPlan, right: LogicalPlan) extends SetOperation(lef
       }
     }
     
    -case class Intersect(left: LogicalPlan, right: LogicalPlan) extends SetOperation(left, right)
    +case class Intersect(left: LogicalPlan, right: LogicalPlan) extends SetOperation(left, right) {
    +  def duplicateResolved: Boolean = left.outputSet.intersect(right.outputSet).isEmpty
    +
    +  // Intersect is only resolved if they don't introduce ambiguous expression ids,
    +  // since it will be converted to semi Join by optimizer.
    --- End diff --
    
    This check might belong more generally in `childrenResolved` as I mention above.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10630#issuecomment-169913011
  
    **[Test build #49007 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/49007/consoleFull)** for PR 10630 at commit [`27192be`](https://github.com/apache/spark/commit/27192be27708265aba6707d1100eefca537fb2b8).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10630#issuecomment-176507818
  
    **[Test build #50313 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/50313/consoleFull)** for PR 10630 at commit [`e51de8f`](https://github.com/apache/spark/commit/e51de8f98b25281b338ca57f0f13645b626c7673).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `abstract class SetOperation(left: LogicalPlan, right: LogicalPlan) extends BinaryNode`


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10630#issuecomment-169618125
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/48916/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...

Posted by marmbrus <gi...@git.apache.org>.
Github user marmbrus commented on the pull request:

    https://github.com/apache/spark/pull/10630#issuecomment-174630543
  
    I don't think its a problem for there to be conflicting attribute ids for set operations, this is because only on childs attribute references need to be propagated up (unlike with a join).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10630#issuecomment-175469365
  
    **[Test build #50176 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/50176/consoleFull)** for PR 10630 at commit [`e566d79`](https://github.com/apache/spark/commit/e566d79b12ab1d9ed9e3124fec8290cdba99549b).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10630#discussion_r49043974
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala ---
    @@ -966,6 +952,27 @@ object ReplaceDistinctWithAggregate extends Rule[LogicalPlan] {
     }
     
     /**
    + * Replaces logical [[Intersect]] operator with a left-semi [[Join]] operator.
    + * {{{
    + *   SELECT a1, a2 FROM Tab1 INTERSECT SELECT b1, b2 FROM Tab2
    + *   ==>  SELECT a1, a2 FROM Tab1, Tab2 ON a1<=>b1 AND a2<=>b2
    + * }}}
    + */
    +object ReplaceIntersectWithLeftSemi extends Rule[LogicalPlan] {
    +  private def buildCond (left: LogicalPlan, right: LogicalPlan): Seq[Expression] = {
    +    require(left.output.length == right.output.length)
    --- End diff --
    
    just inline this into the rule, since it is used only once?



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10630#discussion_r49046660
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala ---
    @@ -966,6 +952,22 @@ object ReplaceDistinctWithAggregate extends Rule[LogicalPlan] {
     }
     
     /**
    + * Replaces logical [[Intersect]] operator with a left-semi [[Join]] operator.
    + * {{{
    + *   SELECT a1, a2 FROM Tab1 INTERSECT SELECT b1, b2 FROM Tab2
    + *   ==>  SELECT a1, a2 FROM Tab1 LEFT SEMI JOIN Tab2 ON a1<=>b1 AND a2<=>b2
    + * }}}
    + */
    +object ReplaceIntersectWithSemiJoin extends Rule[LogicalPlan] {
    +  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
    +    case Intersect(left, right) =>
    +      val joinCond = left.output.zip(right.output).map { case (l, r) =>
    --- End diff --
    
    one nit - might as well add assert(left.output.size == right.output.size)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the pull request:

    https://github.com/apache/spark/pull/10630#issuecomment-169913342
  
    In this update, the code changes include
      - Added `DISTINCT` on top of `Semi Join` for `Intersect`
      - Added a test case for detecting if the duplicate values are removed after `Intersect`
      - Threw an exception if the `Intersect` is not converted before conversion from logical plan to physical plan
      - Merge two rules `ReplaceIntersectWithSemiJoin` and `ReplaceDistinctWithAggregate` in the same batch `"Replace Operators"` for ensuring `ReplaceIntersectWithSemiJoin` should be called before `ReplaceDistinctWithAggregate` 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10630#discussion_r49043992
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala ---
    @@ -966,6 +952,27 @@ object ReplaceDistinctWithAggregate extends Rule[LogicalPlan] {
     }
     
     /**
    + * Replaces logical [[Intersect]] operator with a left-semi [[Join]] operator.
    + * {{{
    + *   SELECT a1, a2 FROM Tab1 INTERSECT SELECT b1, b2 FROM Tab2
    + *   ==>  SELECT a1, a2 FROM Tab1, Tab2 ON a1<=>b1 AND a2<=>b2
    + * }}}
    + */
    +object ReplaceIntersectWithLeftSemi extends Rule[LogicalPlan] {
    +  private def buildCond (left: LogicalPlan, right: LogicalPlan): Seq[Expression] = {
    +    require(left.output.length == right.output.length)
    --- End diff --
    
    can you look into the analyzer to make sure we have the proper check there for intersect/except?



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...

Posted by nongli <gi...@git.apache.org>.
Github user nongli commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10630#discussion_r49047830
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala ---
    @@ -329,6 +329,14 @@ class DataFrameSuite extends QueryTest with SharedSQLContext {
           Row(3, "c") ::
           Row(4, "d") :: Nil)
         checkAnswer(lowerCaseData.intersect(upperCaseData), Nil)
    +
    +    // check null equality
    --- End diff --
    
    Do we have a test case where there are duplicates on the left side?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10630#issuecomment-170053763
  
    **[Test build #49020 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/49020/consoleFull)** for PR 10630 at commit [`04a26bd`](https://github.com/apache/spark/commit/04a26bd3beaab9d8ea3a4b300eb9f413b5eaa704).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10630#issuecomment-170208865
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the pull request:

    https://github.com/apache/spark/pull/10630#issuecomment-174665556
  
    Yeah, agree! Thank you!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10630#issuecomment-169936843
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the pull request:

    https://github.com/apache/spark/pull/10630#issuecomment-169575751
  
    LGTM, pending tests.
    
    It will be good to have some performance numbers for this change :)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10630#issuecomment-169936845
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/49012/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on the pull request:

    https://github.com/apache/spark/pull/10630#issuecomment-169766656
  
    It's reasonable to support intersect all. We should do that as a separate ticket though.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the pull request:

    https://github.com/apache/spark/pull/10630#issuecomment-169738795
  
    Should we provide users one more Dataframe/SQL API `INTERSECT ALL`, which can be converted to `Left-semi Join`? Thank you! @yhuai @rxin @nongli @cloud-fan @marmbrus 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the pull request:

    https://github.com/apache/spark/pull/10630#issuecomment-176482358
  
    retest this please


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10630#issuecomment-170189725
  
    **[Test build #49038 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/49038/consoleFull)** for PR 10630 at commit [`f820c61`](https://github.com/apache/spark/commit/f820c616fe217494ccaed0bf74a0a7410ce503cf).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10630#issuecomment-176487231
  
    **[Test build #50313 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/50313/consoleFull)** for PR 10630 at commit [`e51de8f`](https://github.com/apache/spark/commit/e51de8f98b25281b338ca57f0f13645b626c7673).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10630#issuecomment-169580162
  
    **[Test build #48912 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48912/consoleFull)** for PR 10630 at commit [`6742984`](https://github.com/apache/spark/commit/6742984f5ec0d480435988712e161838c3e8d6d5).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the pull request:

    https://github.com/apache/spark/pull/10630#issuecomment-169917411
  
    retest this please


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10630#issuecomment-169582123
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/48911/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the pull request:

    https://github.com/apache/spark/pull/10630#issuecomment-170052294
  
    The latest update is for resolving the ambiguous attributes, which could trigger invalid predicate pushdown by the Optimizer rule `PushPredicateThroughJoin`. The solution is the same as what we are doing for self join in analyzer. 
    
    Thank you!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10630#discussion_r51233504
  
    --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/ReplaceOperatorSuite.scala ---
    @@ -0,0 +1,46 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.sql.catalyst.optimizer
    +
    +import org.apache.spark.sql.catalyst.dsl.expressions._
    +import org.apache.spark.sql.catalyst.dsl.plans._
    +import org.apache.spark.sql.catalyst.plans.{LeftSemi, PlanTest}
    +import org.apache.spark.sql.catalyst.plans.logical._
    +import org.apache.spark.sql.catalyst.rules.RuleExecutor
    +
    +class ReplaceOperatorSuite extends PlanTest {
    +
    +  object Optimize extends RuleExecutor[LogicalPlan] {
    +    val batches =
    +      Batch("Intersect", FixedPoint(100),
    +        ReplaceIntersectWithSemiJoin) :: Nil
    +  }
    +
    +  test("replace Intersect with Left-semi Join") {
    --- End diff --
    
    also move the test for `ReplaceDistinctWithAggregate` here?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the pull request:

    https://github.com/apache/spark/pull/10630#issuecomment-176656084
  
    LGTM, pending test


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10630#discussion_r49043993
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala ---
    @@ -966,6 +952,27 @@ object ReplaceDistinctWithAggregate extends Rule[LogicalPlan] {
     }
     
     /**
    + * Replaces logical [[Intersect]] operator with a left-semi [[Join]] operator.
    + * {{{
    + *   SELECT a1, a2 FROM Tab1 INTERSECT SELECT b1, b2 FROM Tab2
    + *   ==>  SELECT a1, a2 FROM Tab1, Tab2 ON a1<=>b1 AND a2<=>b2
    + * }}}
    + */
    +object ReplaceIntersectWithLeftSemi extends Rule[LogicalPlan] {
    +  private def buildCond (left: LogicalPlan, right: LogicalPlan): Seq[Expression] = {
    +    require(left.output.length == right.output.length)
    --- End diff --
    
    Yeah, will do it


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on the pull request:

    https://github.com/apache/spark/pull/10630#issuecomment-169562078
  
    Which mainstream RDBMS is that?



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...

Posted by marmbrus <gi...@git.apache.org>.
Github user marmbrus commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10630#discussion_r49227485
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ---
    @@ -344,6 +344,59 @@ class Analyzer(
           }
         }
     
    +    def buildRightChild4Deduplication (left: LogicalPlan, right: LogicalPlan): LogicalPlan = {
    --- End diff --
    
    extra space.  Also I'm not a huge fan of `4`.  How about `dedupRight` with some scala doc.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10630#issuecomment-169931210
  
    **[Test build #49012 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/49012/consoleFull)** for PR 10630 at commit [`a932cdb`](https://github.com/apache/spark/commit/a932cdb05c889ded8de72024b2eef19723bdb0f4).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...

Posted by marmbrus <gi...@git.apache.org>.
Github user marmbrus commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10630#discussion_r49227379
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ---
    @@ -388,57 +441,14 @@ class Analyzer(
                 .map(_.asInstanceOf[NamedExpression])
             a.copy(aggregateExpressions = expanded)
     
    -      // Special handling for cases when self-join introduce duplicate expression ids.
    +      // Special handling for cases when self-join introduces duplicate expression IDs.
           case j @ Join(left, right, _, _) if !j.selfJoinResolved =>
    -        val conflictingAttributes = left.outputSet.intersect(right.outputSet)
    -        logDebug(s"Conflicting attributes ${conflictingAttributes.mkString(",")} in $j")
    -
    -        right.collect {
    -          // Handle base relations that might appear more than once.
    -          case oldVersion: MultiInstanceRelation
    -              if oldVersion.outputSet.intersect(conflictingAttributes).nonEmpty =>
    -            val newVersion = oldVersion.newInstance()
    -            (oldVersion, newVersion)
    -
    -          // Handle projects that create conflicting aliases.
    -          case oldVersion @ Project(projectList, _)
    -              if findAliases(projectList).intersect(conflictingAttributes).nonEmpty =>
    -            (oldVersion, oldVersion.copy(projectList = newAliases(projectList)))
    -
    -          case oldVersion @ Aggregate(_, aggregateExpressions, _)
    -              if findAliases(aggregateExpressions).intersect(conflictingAttributes).nonEmpty =>
    -            (oldVersion, oldVersion.copy(aggregateExpressions = newAliases(aggregateExpressions)))
    -
    -          case oldVersion: Generate
    -              if oldVersion.generatedSet.intersect(conflictingAttributes).nonEmpty =>
    -            val newOutput = oldVersion.generatorOutput.map(_.newInstance())
    -            (oldVersion, oldVersion.copy(generatorOutput = newOutput))
    -
    -          case oldVersion @ Window(_, windowExpressions, _, _, child)
    -              if AttributeSet(windowExpressions.map(_.toAttribute)).intersect(conflictingAttributes)
    -                .nonEmpty =>
    -            (oldVersion, oldVersion.copy(windowExpressions = newAliases(windowExpressions)))
    -        }
    -        // Only handle first case, others will be fixed on the next pass.
    -        .headOption match {
    -          case None =>
    -            /*
    -             * No result implies that there is a logical plan node that produces new references
    -             * that this rule cannot handle. When that is the case, there must be another rule
    -             * that resolves these conflicts. Otherwise, the analysis will fail.
    -             */
    -            j
    -          case Some((oldRelation, newRelation)) =>
    -            val attributeRewrites = AttributeMap(oldRelation.output.zip(newRelation.output))
    -            val newRight = right transformUp {
    -              case r if r == oldRelation => newRelation
    -            } transformUp {
    -              case other => other transformExpressions {
    -                case a: Attribute => attributeRewrites.get(a).getOrElse(a)
    -              }
    -            }
    -            j.copy(right = newRight)
    -        }
    +        j.copy(right = buildRightChild4Deduplication(left, right))
    +
    +      // Special handling for cases when self-intersect introduces duplicate expression IDs.
    +      // In Optimizer, Intersect will be converted to semi join.
    +      case i @ Intersect(left, right) if !i.duplicateResolved =>
    +        i.copy(right = buildRightChild4Deduplication(left, right))
    --- End diff --
    
    Do we need to do this for all non-leaf/unary nodes?  I think the only reason this wasn't a problem in the past was there were no optimizer rules for those nodes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10630#issuecomment-169591104
  
    **[Test build #48916 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48916/consoleFull)** for PR 10630 at commit [`e4c34f0`](https://github.com/apache/spark/commit/e4c34f01acbb626a8084c99a5f0ad26fbfb175e2).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10630#issuecomment-169916605
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/49007/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on the pull request:

    https://github.com/apache/spark/pull/10630#issuecomment-169577800
  
    LGTM.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10630#discussion_r50901811
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala ---
    @@ -125,17 +128,15 @@ object EliminateSerialization extends Rule[LogicalPlan] {
     
     /**
      * Pushes certain operations to both sides of a Union, Intersect or Except operator.
    +=======
    + * Pushes certain operations to both sides of a Union or Except operator.
    +>>>>>>> IntersectBySemiJoinMerged
    --- End diff --
    
    yeah, sure, will do. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10630#discussion_r49045093
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala ---
    @@ -966,6 +952,27 @@ object ReplaceDistinctWithAggregate extends Rule[LogicalPlan] {
     }
     
     /**
    + * Replaces logical [[Intersect]] operator with a left-semi [[Join]] operator.
    + * {{{
    + *   SELECT a1, a2 FROM Tab1 INTERSECT SELECT b1, b2 FROM Tab2
    + *   ==>  SELECT a1, a2 FROM Tab1, Tab2 ON a1<=>b1 AND a2<=>b2
    + * }}}
    + */
    +object ReplaceIntersectWithLeftSemi extends Rule[LogicalPlan] {
    +  private def buildCond (left: LogicalPlan, right: LogicalPlan): Seq[Expression] = {
    +    require(left.output.length == right.output.length)
    +    left.output.zip(right.output).map { case (l, r) =>
    +      EqualNullSafe(l, r)
    +    }
    +  }
    +
    +  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
    --- End diff --
    
    use transformUp?
    
    cc @yhuai 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10630#issuecomment-176686945
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/50368/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10630#issuecomment-170220857
  
    **[Test build #49044 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/49044/consoleFull)** for PR 10630 at commit [`4372170`](https://github.com/apache/spark/commit/4372170f600eb25996c3aa4f09d569312c263686).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10630#issuecomment-176686941
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10630#discussion_r49239278
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ---
    @@ -388,57 +441,14 @@ class Analyzer(
                 .map(_.asInstanceOf[NamedExpression])
             a.copy(aggregateExpressions = expanded)
     
    -      // Special handling for cases when self-join introduce duplicate expression ids.
    +      // Special handling for cases when self-join introduces duplicate expression IDs.
           case j @ Join(left, right, _, _) if !j.selfJoinResolved =>
    -        val conflictingAttributes = left.outputSet.intersect(right.outputSet)
    -        logDebug(s"Conflicting attributes ${conflictingAttributes.mkString(",")} in $j")
    -
    -        right.collect {
    -          // Handle base relations that might appear more than once.
    -          case oldVersion: MultiInstanceRelation
    -              if oldVersion.outputSet.intersect(conflictingAttributes).nonEmpty =>
    -            val newVersion = oldVersion.newInstance()
    -            (oldVersion, newVersion)
    -
    -          // Handle projects that create conflicting aliases.
    -          case oldVersion @ Project(projectList, _)
    -              if findAliases(projectList).intersect(conflictingAttributes).nonEmpty =>
    -            (oldVersion, oldVersion.copy(projectList = newAliases(projectList)))
    -
    -          case oldVersion @ Aggregate(_, aggregateExpressions, _)
    -              if findAliases(aggregateExpressions).intersect(conflictingAttributes).nonEmpty =>
    -            (oldVersion, oldVersion.copy(aggregateExpressions = newAliases(aggregateExpressions)))
    -
    -          case oldVersion: Generate
    -              if oldVersion.generatedSet.intersect(conflictingAttributes).nonEmpty =>
    -            val newOutput = oldVersion.generatorOutput.map(_.newInstance())
    -            (oldVersion, oldVersion.copy(generatorOutput = newOutput))
    -
    -          case oldVersion @ Window(_, windowExpressions, _, _, child)
    -              if AttributeSet(windowExpressions.map(_.toAttribute)).intersect(conflictingAttributes)
    -                .nonEmpty =>
    -            (oldVersion, oldVersion.copy(windowExpressions = newAliases(windowExpressions)))
    -        }
    -        // Only handle first case, others will be fixed on the next pass.
    -        .headOption match {
    -          case None =>
    -            /*
    -             * No result implies that there is a logical plan node that produces new references
    -             * that this rule cannot handle. When that is the case, there must be another rule
    -             * that resolves these conflicts. Otherwise, the analysis will fail.
    -             */
    -            j
    -          case Some((oldRelation, newRelation)) =>
    -            val attributeRewrites = AttributeMap(oldRelation.output.zip(newRelation.output))
    -            val newRight = right transformUp {
    -              case r if r == oldRelation => newRelation
    -            } transformUp {
    -              case other => other transformExpressions {
    -                case a: Attribute => attributeRewrites.get(a).getOrElse(a)
    -              }
    -            }
    -            j.copy(right = newRight)
    -        }
    +        j.copy(right = buildRightChild4Deduplication(left, right))
    +
    +      // Special handling for cases when self-intersect introduces duplicate expression IDs.
    +      // In Optimizer, Intersect will be converted to semi join.
    +      case i @ Intersect(left, right) if !i.duplicateResolved =>
    +        i.copy(right = buildRightChild4Deduplication(left, right))
    --- End diff --
    
    I see, will do a change to make it more general. Thanks!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/10630


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10630#issuecomment-169577952
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/48909/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...

Posted by yhuai <gi...@git.apache.org>.
Github user yhuai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10630#discussion_r49094519
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala ---
    @@ -966,6 +952,22 @@ object ReplaceDistinctWithAggregate extends Rule[LogicalPlan] {
     }
     
     /**
    + * Replaces logical [[Intersect]] operator with a left-semi [[Join]] operator.
    + * {{{
    + *   SELECT a1, a2 FROM Tab1 INTERSECT SELECT b1, b2 FROM Tab2
    + *   ==>  SELECT a1, a2 FROM Tab1 LEFT SEMI JOIN Tab2 ON a1<=>b1 AND a2<=>b2
    + * }}}
    + */
    +object ReplaceIntersectWithSemiJoin extends Rule[LogicalPlan] {
    +  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
    +    case Intersect(left, right) =>
    +      assert(left.output.size == right.output.size)
    +      val joinCond = left.output.zip(right.output).map { case (l, r) => EqualNullSafe(l, r) }
    +      Join(left, right, LeftSemi, joinCond.reduceLeftOption(And))
    --- End diff --
    
    Looks like intersect also implies de-dup unless `INTERSECT ALL` is used (see http://www.postgresql.org/docs/9.4/static/queries-union.html and https://msdn.microsoft.com/en-us/library/ms188055.aspx). 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10630#discussion_r49094926
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala ---
    @@ -966,6 +952,22 @@ object ReplaceDistinctWithAggregate extends Rule[LogicalPlan] {
     }
     
     /**
    + * Replaces logical [[Intersect]] operator with a left-semi [[Join]] operator.
    + * {{{
    + *   SELECT a1, a2 FROM Tab1 INTERSECT SELECT b1, b2 FROM Tab2
    + *   ==>  SELECT a1, a2 FROM Tab1 LEFT SEMI JOIN Tab2 ON a1<=>b1 AND a2<=>b2
    + * }}}
    + */
    +object ReplaceIntersectWithSemiJoin extends Rule[LogicalPlan] {
    +  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
    +    case Intersect(left, right) =>
    +      assert(left.output.size == right.output.size)
    +      val joinCond = left.output.zip(right.output).map { case (l, r) => EqualNullSafe(l, r) }
    +      Join(left, right, LeftSemi, joinCond.reduceLeftOption(And))
    --- End diff --
    
    You are right! It should remove duplicates. I just tried it in DB2. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10630#discussion_r49046484
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala ---
    @@ -966,6 +952,22 @@ object ReplaceDistinctWithAggregate extends Rule[LogicalPlan] {
     }
     
     /**
    + * Replaces logical [[Intersect]] operator with a left-semi [[Join]] operator.
    + * {{{
    + *   SELECT a1, a2 FROM Tab1 INTERSECT SELECT b1, b2 FROM Tab2
    + *   ==>  SELECT a1, a2 FROM Tab1 LEFT SEMI JOIN Tab2 ON a1<=>b1 AND a2<=>b2
    + * }}}
    + */
    +object ReplaceIntersectWithSemiJoin extends Rule[LogicalPlan] {
    +  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
    +    case Intersect(left, right) =>
    +      val joinCond = left.output.zip(right.output).map { case (l, r) =>
    +        EqualNullSafe(l, r) }
    --- End diff --
    
    nit: can we put it in one line?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10630#discussion_r49098454
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala ---
    @@ -966,6 +952,22 @@ object ReplaceDistinctWithAggregate extends Rule[LogicalPlan] {
     }
     
     /**
    + * Replaces logical [[Intersect]] operator with a left-semi [[Join]] operator.
    + * {{{
    + *   SELECT a1, a2 FROM Tab1 INTERSECT SELECT b1, b2 FROM Tab2
    + *   ==>  SELECT a1, a2 FROM Tab1 LEFT SEMI JOIN Tab2 ON a1<=>b1 AND a2<=>b2
    + * }}}
    + */
    +object ReplaceIntersectWithSemiJoin extends Rule[LogicalPlan] {
    +  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
    +    case Intersect(left, right) =>
    +      assert(left.output.size == right.output.size)
    +      val joinCond = left.output.zip(right.output).map { case (l, r) => EqualNullSafe(l, r) }
    +      Join(left, right, LeftSemi, joinCond.reduceLeftOption(And))
    --- End diff --
    
    I also checked the underlying RDD API: `Intersect`. It removes the duplicates. If we added `Distinct` above `Intersect`, we need to measure the performance difference, I think.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10630#issuecomment-169927019
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10630#issuecomment-174171631
  
    **[Test build #49936 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/49936/consoleFull)** for PR 10630 at commit [`6a7979d`](https://github.com/apache/spark/commit/6a7979d35d5cf89d9cd14eb6d6a2a4418a44a669).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10630#discussion_r49239333
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ---
    @@ -344,6 +344,59 @@ class Analyzer(
           }
         }
     
    +    def buildRightChild4Deduplication (left: LogicalPlan, right: LogicalPlan): LogicalPlan = {
    --- End diff --
    
    Sure. : )


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10630#issuecomment-174171694
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the pull request:

    https://github.com/apache/spark/pull/10630#issuecomment-170013936
  
    Found the root cause of this failure. We also need to change the analyzer to handle duplicate expression ids. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10630#issuecomment-169927020
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/49008/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10630#discussion_r49239389
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicOperators.scala ---
    @@ -116,7 +116,18 @@ case class Union(left: LogicalPlan, right: LogicalPlan) extends SetOperation(lef
       }
     }
     
    -case class Intersect(left: LogicalPlan, right: LogicalPlan) extends SetOperation(left, right)
    +case class Intersect(left: LogicalPlan, right: LogicalPlan) extends SetOperation(left, right) {
    +  def duplicateResolved: Boolean = left.outputSet.intersect(right.outputSet).isEmpty
    +
    +  // Intersect is only resolved if they don't introduce ambiguous expression ids,
    +  // since it will be converted to semi Join by optimizer.
    --- End diff --
    
    Will update the comments in the code. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on the pull request:

    https://github.com/apache/spark/pull/10630#issuecomment-169563386
  
    LGTM. 
    
    cc @cloud-fan to take a look too.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10630#discussion_r49107829
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala ---
    @@ -966,6 +952,22 @@ object ReplaceDistinctWithAggregate extends Rule[LogicalPlan] {
     }
     
     /**
    + * Replaces logical [[Intersect]] operator with a left-semi [[Join]] operator.
    + * {{{
    + *   SELECT a1, a2 FROM Tab1 INTERSECT SELECT b1, b2 FROM Tab2
    + *   ==>  SELECT a1, a2 FROM Tab1 LEFT SEMI JOIN Tab2 ON a1<=>b1 AND a2<=>b2
    + * }}}
    + */
    +object ReplaceIntersectWithSemiJoin extends Rule[LogicalPlan] {
    +  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
    +    case Intersect(left, right) =>
    +      assert(left.output.size == right.output.size)
    +      val joinCond = left.output.zip(right.output).map { case (l, r) => EqualNullSafe(l, r) }
    +      Join(left, right, LeftSemi, joinCond.reduceLeftOption(And))
    --- End diff --
    
    You can just insert a distinct into this.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...

Posted by yhuai <gi...@git.apache.org>.
Github user yhuai commented on the pull request:

    https://github.com/apache/spark/pull/10630#issuecomment-169722176
  
    How about we hold on it a little bit? I am not sure the plan is right.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10630#issuecomment-169916531
  
    **[Test build #49007 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/49007/consoleFull)** for PR 10630 at commit [`27192be`](https://github.com/apache/spark/commit/27192be27708265aba6707d1100eefca537fb2b8).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10630#issuecomment-174171695
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/49936/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10630#issuecomment-176441985
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10630#issuecomment-176508353
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10630#discussion_r51084606
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ---
    @@ -388,57 +445,18 @@ class Analyzer(
                 .map(_.asInstanceOf[NamedExpression])
             a.copy(aggregateExpressions = expanded)
     
    -      // Special handling for cases when self-join introduce duplicate expression ids.
    -      case j @ Join(left, right, _, _) if !j.selfJoinResolved =>
    -        val conflictingAttributes = left.outputSet.intersect(right.outputSet)
    -        logDebug(s"Conflicting attributes ${conflictingAttributes.mkString(",")} in $j")
    -
    -        right.collect {
    -          // Handle base relations that might appear more than once.
    -          case oldVersion: MultiInstanceRelation
    -              if oldVersion.outputSet.intersect(conflictingAttributes).nonEmpty =>
    -            val newVersion = oldVersion.newInstance()
    -            (oldVersion, newVersion)
    -
    -          // Handle projects that create conflicting aliases.
    -          case oldVersion @ Project(projectList, _)
    -              if findAliases(projectList).intersect(conflictingAttributes).nonEmpty =>
    -            (oldVersion, oldVersion.copy(projectList = newAliases(projectList)))
    -
    -          case oldVersion @ Aggregate(_, aggregateExpressions, _)
    -              if findAliases(aggregateExpressions).intersect(conflictingAttributes).nonEmpty =>
    -            (oldVersion, oldVersion.copy(aggregateExpressions = newAliases(aggregateExpressions)))
    -
    -          case oldVersion: Generate
    -              if oldVersion.generatedSet.intersect(conflictingAttributes).nonEmpty =>
    -            val newOutput = oldVersion.generatorOutput.map(_.newInstance())
    -            (oldVersion, oldVersion.copy(generatorOutput = newOutput))
    -
    -          case oldVersion @ Window(_, windowExpressions, _, _, child)
    -              if AttributeSet(windowExpressions.map(_.toAttribute)).intersect(conflictingAttributes)
    -                .nonEmpty =>
    -            (oldVersion, oldVersion.copy(windowExpressions = newAliases(windowExpressions)))
    -        }
    -        // Only handle first case, others will be fixed on the next pass.
    -        .headOption match {
    -          case None =>
    -            /*
    -             * No result implies that there is a logical plan node that produces new references
    -             * that this rule cannot handle. When that is the case, there must be another rule
    -             * that resolves these conflicts. Otherwise, the analysis will fail.
    -             */
    -            j
    -          case Some((oldRelation, newRelation)) =>
    -            val attributeRewrites = AttributeMap(oldRelation.output.zip(newRelation.output))
    -            val newRight = right transformUp {
    -              case r if r == oldRelation => newRelation
    -            } transformUp {
    -              case other => other transformExpressions {
    -                case a: Attribute => attributeRewrites.get(a).getOrElse(a)
    -              }
    -            }
    -            j.copy(right = newRight)
    -        }
    +      // To resolve duplicate expression IDs for all the BinaryNode
    +      case b: BinaryNode if !b.duplicateResolved => b match {
    +        case j @ Join(left, right, _, _) =>
    +          j.copy(right = dedupRight(left, right))
    +        case i @ Intersect(left, right) =>
    +          i.copy(right = dedupRight(left, right))
    +        case e @ Except(left, right) =>
    +          e.copy(right = dedupRight(left, right))
    +        case cg: CoGroup =>
    --- End diff --
    
    @cloud-fan After code changes, I just realized the problem. If we replace intersect with left-semi join during analysis, it does not work. We have to de-duplicate the left and the right children at first; otherwise, the tree will be like
    ```
    == Analyzed Logical Plan ==
    id: int
    Distinct
    +- Join LeftSemi, Some((id#1 <=> id#1))
       :- Sample 0.0, 0.4, false, 1
       :  +- Sort [id#1 ASC], false
       :     +- Project [_1#0 AS id#1]
       :        +- LogicalRDD [_1#0], MapPartitionsRDD[2] at apply at Transformer.scala:22
       +- Sample 0.4, 1.0, false, 1
          +- Sort [id#4 ASC], false
             +- Project [_1#0 AS id#4]
                +- LogicalRDD [_1#0], MapPartitionsRDD[2] at apply at Transformer.scala:22
    ```
    The Join condition is invalid. Should we keep the rule in the optimizer? 
    
    BTW, I will remove the `Except` and `CoGroup`.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10630#issuecomment-169635154
  
    **[Test build #48912 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48912/consoleFull)** for PR 10630 at commit [`6742984`](https://github.com/apache/spark/commit/6742984f5ec0d480435988712e161838c3e8d6d5).
     * This patch **fails from timeout after a configured wait of \`250m\`**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10630#issuecomment-169635223
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/48912/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10630#issuecomment-176441989
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/50298/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10630#issuecomment-176013909
  
    **[Test build #50257 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/50257/consoleFull)** for PR 10630 at commit [`3be78c4`](https://github.com/apache/spark/commit/3be78c4197519edb4659d8064468125b5223dd29).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the pull request:

    https://github.com/apache/spark/pull/10630#issuecomment-170189804
  
    In this update, the code changes include
      - Fixed a bug in `Range` by inheriting the trait `MultiInstanceRelation`.
      - Added a de-duplication resolution for all the binary nodes: `Except`, `Union` and `Co-Group`, besides `Intersect` and `Join`.
      - Added a new function `duplicateResolved` for all the binary nodes.
      - Improved the analysis exception message when failure to resolve conflicting references
      - Resolved all the other comments. 
    
    The analysis procedure is kind of tricky. I am unable to directly include `duplicateResolved` into `childrenResolved`. `resolve` is lazy evaluated. The resolution procedure need to follow the order: children at first, then itself, and then deduplicate the attributes' expression IDs in tis children.  


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10630#discussion_r49159670
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala ---
    @@ -966,6 +952,22 @@ object ReplaceDistinctWithAggregate extends Rule[LogicalPlan] {
     }
     
     /**
    + * Replaces logical [[Intersect]] operator with a left-semi [[Join]] operator.
    + * {{{
    + *   SELECT a1, a2 FROM Tab1 INTERSECT SELECT b1, b2 FROM Tab2
    + *   ==>  SELECT a1, a2 FROM Tab1 LEFT SEMI JOIN Tab2 ON a1<=>b1 AND a2<=>b2
    + * }}}
    + */
    +object ReplaceIntersectWithSemiJoin extends Rule[LogicalPlan] {
    +  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
    +    case Intersect(left, right) =>
    +      assert(left.output.size == right.output.size)
    +      val joinCond = left.output.zip(right.output).map { case (l, r) => EqualNullSafe(l, r) }
    +      Join(left, right, LeftSemi, joinCond.reduceLeftOption(And))
    --- End diff --
    
    Sure. Will do


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the pull request:

    https://github.com/apache/spark/pull/10630#issuecomment-174165179
  
    When resolving the conflicts, I realized the multi-children `Union` might have duplicate `exprId`. So far, I did not add a function to de-duplicate them. This is not a trivial work, if needed. When the children has hundreds of nodes, it is infeasible to use the current per-pair de-duplication. That means, we need to rewrite the whole function `dedup`. 
    
    Let me know if we need to open a separate PR to do it now. So far, unlike `Intersect`, we did not hit any issue even if there exist duplicate `exprId` values. Thanks!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10630#discussion_r49097867
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala ---
    @@ -966,6 +952,22 @@ object ReplaceDistinctWithAggregate extends Rule[LogicalPlan] {
     }
     
     /**
    + * Replaces logical [[Intersect]] operator with a left-semi [[Join]] operator.
    + * {{{
    + *   SELECT a1, a2 FROM Tab1 INTERSECT SELECT b1, b2 FROM Tab2
    + *   ==>  SELECT a1, a2 FROM Tab1 LEFT SEMI JOIN Tab2 ON a1<=>b1 AND a2<=>b2
    + * }}}
    + */
    +object ReplaceIntersectWithSemiJoin extends Rule[LogicalPlan] {
    +  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
    +    case Intersect(left, right) =>
    +      assert(left.output.size == right.output.size)
    +      val joinCond = left.output.zip(right.output).map { case (l, r) => EqualNullSafe(l, r) }
    +      Join(left, right, LeftSemi, joinCond.reduceLeftOption(And))
    --- End diff --
    
    Just ran it. The existing `Intersect` removes the duplicates. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10630#discussion_r49044052
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala ---
    @@ -322,13 +323,32 @@ class DataFrameSuite extends QueryTest with SharedSQLContext {
       }
     
       test("intersect") {
    +    val intersectDF = lowerCaseData.intersect(lowerCaseData)
    +
    +    // Before Optimizer, the operator is Intersect
    --- End diff --
    
    this should go into one of the optimizer unit test suite, not here.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on the pull request:

    https://github.com/apache/spark/pull/10630#issuecomment-176921008
  
    Thanks - I'm going to merge this.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10630#issuecomment-169757219
  
    **[Test build #48947 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48947/consoleFull)** for PR 10630 at commit [`e4c34f0`](https://github.com/apache/spark/commit/e4c34f01acbb626a8084c99a5f0ad26fbfb175e2).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the pull request:

    https://github.com/apache/spark/pull/10630#issuecomment-169561594
  
    @rxin Please review the implementation. Thank you!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10630#issuecomment-169719510
  
    **[Test build #48947 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48947/consoleFull)** for PR 10630 at commit [`e4c34f0`](https://github.com/apache/spark/commit/e4c34f01acbb626a8084c99a5f0ad26fbfb175e2).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10630#issuecomment-170082149
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/49020/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the pull request:

    https://github.com/apache/spark/pull/10630#issuecomment-169927573
  
    Both failure are not related to this PR. They are caused by the test case `randomSplit on reordered partitions` in https://github.com/apache/spark/pull/10626. Thanks!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10630#issuecomment-176036575
  
    **[Test build #50257 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/50257/consoleFull)** for PR 10630 at commit [`3be78c4`](https://github.com/apache/spark/commit/3be78c4197519edb4659d8064468125b5223dd29).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10630#discussion_r51233589
  
    --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/AnalysisErrorSuite.scala ---
    @@ -270,7 +270,8 @@ class AnalysisErrorSuite extends AnalysisTest {
         val error = intercept[AnalysisException] {
           SimpleAnalyzer.checkAnalysis(join)
         }
    -    assert(error.message.contains("Failure when resolving conflicting references in Join"))
    +    assert(error.message.contains("Failure when resolving conflicting references\n" +
    --- End diff --
    
    we can revert this after revert the error message change.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10630#issuecomment-169565368
  
    **[Test build #48900 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48900/consoleFull)** for PR 10630 at commit [`0bd1771`](https://github.com/apache/spark/commit/0bd177135bc1c8210fa0ea5a8c7a4c5144d6227e).
     * This patch **fails Python style tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10630#discussion_r49159667
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala ---
    @@ -329,6 +329,22 @@ class DataFrameSuite extends QueryTest with SharedSQLContext {
           Row(3, "c") ::
           Row(4, "d") :: Nil)
         checkAnswer(lowerCaseData.intersect(upperCaseData), Nil)
    +
    +    // check null equality
    +    checkAnswer(
    +      nullInts.intersect(nullInts),
    +      Row(1) ::
    +      Row(2) ::
    +      Row(3) ::
    +      Row(null) :: Nil)
    +
    +    // check duplicate values
    +    checkAnswer(
    +      allNulls.intersect(allNulls),
    +      Row(null) ::
    +      Row(null) ::
    +      Row(null) ::
    +      Row(null) :: Nil)
    --- End diff --
    
    Let me add it to the bucket. Thanks!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10630#issuecomment-176659772
  
    **[Test build #50368 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/50368/consoleFull)** for PR 10630 at commit [`b600089`](https://github.com/apache/spark/commit/b600089d4736e9768a8968fb4dead08d28014ac1).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10630#discussion_r50901898
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala ---
    @@ -1055,6 +1045,22 @@ object ReplaceDistinctWithAggregate extends Rule[LogicalPlan] {
     }
     
     /**
    + * Replaces logical [[Intersect]] operator with a left-semi [[Join]] operator.
    + * {{{
    + *   SELECT a1, a2 FROM Tab1 INTERSECT SELECT b1, b2 FROM Tab2
    + *   ==>  SELECT DISTINCT a1, a2 FROM Tab1 LEFT SEMI JOIN Tab2 ON a1<=>b1 AND a2<=>b2
    + * }}}
    + */
    +object ReplaceIntersectWithSemiJoin extends Rule[LogicalPlan] {
    --- End diff --
    
    Yeah, will do it. Actually, I will also implement `INTERSECT ALL` after this one. Thanks!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10630#issuecomment-170081871
  
    **[Test build #49020 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/49020/consoleFull)** for PR 10630 at commit [`04a26bd`](https://github.com/apache/spark/commit/04a26bd3beaab9d8ea3a4b300eb9f413b5eaa704).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `case class Intersect(left: LogicalPlan, right: LogicalPlan) extends SetOperation(left, right) `


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10630#issuecomment-170220929
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10630#discussion_r49045115
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala ---
    @@ -966,6 +952,27 @@ object ReplaceDistinctWithAggregate extends Rule[LogicalPlan] {
     }
     
     /**
    + * Replaces logical [[Intersect]] operator with a left-semi [[Join]] operator.
    + * {{{
    + *   SELECT a1, a2 FROM Tab1 INTERSECT SELECT b1, b2 FROM Tab2
    + *   ==>  SELECT a1, a2 FROM Tab1, Tab2 ON a1<=>b1 AND a2<=>b2
    + * }}}
    + */
    +object ReplaceIntersectWithLeftSemi extends Rule[LogicalPlan] {
    +  private def buildCond (left: LogicalPlan, right: LogicalPlan): Seq[Expression] = {
    +    require(left.output.length == right.output.length)
    +    left.output.zip(right.output).map { case (l, r) =>
    +      EqualNullSafe(l, r)
    +    }
    +  }
    +
    +  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
    --- End diff --
    
    actually nvm.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10630#discussion_r51040321
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala ---
    @@ -111,6 +113,7 @@ object SamplePushDown extends Rule[LogicalPlan] {
     }
     
     /**
    +<<<<<<< HEAD
    --- End diff --
    
    remove this


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10630#issuecomment-169577951
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10630#issuecomment-169565373
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/48900/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10630#issuecomment-170082147
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10630#discussion_r49044120
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala ---
    @@ -322,13 +323,32 @@ class DataFrameSuite extends QueryTest with SharedSQLContext {
       }
     
       test("intersect") {
    +    val intersectDF = lowerCaseData.intersect(lowerCaseData)
    +
    +    // Before Optimizer, the operator is Intersect
    +    assert(intersectDF.queryExecution.analyzed.collect {
    +      case j@Intersect(_, _) => j
    +    }.size === 1)
    +
    +    // Before Optimizer, the operator is converted to LeftSemi Join
    +    assert(intersectDF.queryExecution.optimizedPlan.collect {
    +      case j@Join(_, _, LeftSemi, _) => j
    +    }.size === 1)
    +
         checkAnswer(
    -      lowerCaseData.intersect(lowerCaseData),
    +      intersectDF,
           Row(1, "a") ::
           Row(2, "b") ::
           Row(3, "c") ::
           Row(4, "d") :: Nil)
         checkAnswer(lowerCaseData.intersect(upperCaseData), Nil)
    +
    +    checkAnswer(
    --- End diff --
    
    ok, will do it.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the pull request:

    https://github.com/apache/spark/pull/10630#issuecomment-169582114
  
    @cloud-fan Sorry, I am still requesting a resource to do the performance testing. Unfortunately, I am unable to get it very recently. I am also highly interested in the performance number when running a large-scale workload in a distributed environment. Will keep you posted if it is done. Thanks!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the pull request:

    https://github.com/apache/spark/pull/10630#issuecomment-169713016
  
    retest this please


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10630#issuecomment-175469557
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the pull request:

    https://github.com/apache/spark/pull/10630#issuecomment-169757412
  
    Also did a research in the SQL standard. Below is the SQL2003 Syntax
    ```
    <SELECT statement1>
    INTERSECT [ALL | DISTINCT]
    [CORRESPONDING [BY (column1, column2, ...)]] <SELECT statement2>
    INTERSECT [ALL | DISTINCT]
    [CORRESPONDING [BY (column1, column2, ...)]] 
    ...
    ```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10630#issuecomment-176036668
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org