Posted to reviews@spark.apache.org by marmbrus <gi...@git.apache.org> on 2015/02/11 04:32:08 UTC

[GitHub] spark pull request: [SPARK-5454] More robust handling of self join...

GitHub user marmbrus opened a pull request:

    https://github.com/apache/spark/pull/4520

    [SPARK-5454] More robust handling of self joins

    This also fixes a bunch of bad output in test cases.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/marmbrus/spark selfJoin

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/4520.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #4520
    
----
commit 55d64b31bfa882a9bf502d02741a9fbb1d457237
Author: Michael Armbrust <mi...@databricks.com>
Date:   2015-02-11T03:30:49Z

    fix dataframe selfjoins

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5454] More robust handling of self join...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4520#issuecomment-73837250
  
      [Test build #27276 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27276/consoleFull) for   PR 4520 at commit [`49c8e26`](https://github.com/apache/spark/commit/49c8e26868ac2e7c8a1e935f5929924e0cc64a02).
     * This patch merges cleanly.




[GitHub] spark pull request: [SPARK-5454] More robust handling of self join...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4520#issuecomment-73829637
  
      [Test build #27270 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27270/consoleFull) for   PR 4520 at commit [`55d64b3`](https://github.com/apache/spark/commit/55d64b31bfa882a9bf502d02741a9fbb1d457237).
     * This patch **fails Scala style tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request: [SPARK-5454] More robust handling of self join...

Posted by chenghao-intel <gi...@git.apache.org>.
Github user chenghao-intel commented on the pull request:

    https://github.com/apache/spark/pull/4520#issuecomment-73911621
  
    LGTM in general, except for a few minor issues.
    
    My original thought was to add a new `Project` on top of the `MultiInstanceRelation` (if it appears more than once in the query tree), so that we can still keep the same reference to the original instance. That would probably give us an opportunity to optimize the query by computing the `MultiInstanceRelation` only once.
    
    Anyway, let's leave that for a future improvement.




[GitHub] spark pull request: [SPARK-5454] More robust handling of self join...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4520#issuecomment-73937987
  
      [Test build #27296 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27296/consoleFull) for   PR 4520 at commit [`4f4a85c`](https://github.com/apache/spark/commit/4f4a85c6660d351ebecf660cae9a71edcf89d2b5).
     * This patch merges cleanly.




[GitHub] spark pull request: [SPARK-5454] More robust handling of self join...

Posted by chenghao-intel <gi...@git.apache.org>.
Github user chenghao-intel commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4520#discussion_r24507245
  
    --- Diff: sql/core/src/test/resources/log4j.properties ---
    @@ -37,7 +37,10 @@ log4j.appender.FA.Threshold = INFO
     
     # Some packages are noisy for no good reason.
     log4j.additivity.parquet.hadoop.ParquetRecordReader=false
    -log4j.logger.parquet.hadoop.ParquetRecordReader=OFF
    +log4j.logger.parquet.hadoop.ParquetRecordReader=OFF\
    --- End diff --
    
    Is the trailing `\` at the end of the line a typo?
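
Why the trailing `\` matters: in the `java.util.Properties` file format, a backslash at the end of a line is a line continuation, so the next non-blank line is folded into the value. The sketch below is only an illustration under assumptions: `TrailingBackslashSketch` and the following key `other.key` are hypothetical, since the lines after the changed one are not shown in the diff.

```scala
import java.io.StringReader
import java.util.Properties

object TrailingBackslashSketch {
  // Parse properties text with the standard java.util.Properties loader.
  def load(text: String): Properties = {
    val props = new Properties()
    props.load(new StringReader(text))
    props
  }

  def main(args: Array[String]): Unit = {
    // The first line ends with a single backslash, as in the diff.
    // "other.key=INFO" is a hypothetical following line, not taken from the patch.
    val text =
      "log4j.logger.parquet.hadoop.ParquetRecordReader=OFF\\\n" +
      "other.key=INFO\n"
    val props = load(text)

    // The continuation swallows the next line into the logger's value.
    println(props.getProperty("log4j.logger.parquet.hadoop.ParquetRecordReader"))
    // The hypothetical key is never defined on its own.
    println(props.getProperty("other.key"))
  }
}
```

With a non-blank line after the backslash, the logger level would no longer be the plain string `OFF`, which is presumably why the reviewer flagged it.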




[GitHub] spark pull request: [SPARK-5454] More robust handling of self join...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4520#issuecomment-73843431
  
      [Test build #27276 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27276/consoleFull) for   PR 4520 at commit [`49c8e26`](https://github.com/apache/spark/commit/49c8e26868ac2e7c8a1e935f5929924e0cc64a02).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request: [SPARK-5454] More robust handling of self join...

Posted by chenghao-intel <gi...@git.apache.org>.
Github user chenghao-intel commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4520#discussion_r24507150
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ---
    @@ -282,6 +279,29 @@ class Analyzer(catalog: Catalog,
               }
             )
     
    +      // Special handling for cases when self-join introduce duplicate expression ids.
    +      case j @ Join(left, right, _, _) if left.outputSet.intersect(right.outputSet).nonEmpty =>
    +        val conflictingAttributes = left.outputSet.intersect(right.outputSet)
    +
    +        val (oldRelation, newRelation, attributeRewrites) = right.collect {
    +          case oldVersion: MultiInstanceRelation
    +            if oldVersion.outputSet.intersect(conflictingAttributes).nonEmpty=>
    --- End diff --
    
    Add a space before `=>`.




[GitHub] spark pull request: [SPARK-5454] More robust handling of self join...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/4520#issuecomment-73829638
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27270/
    Test FAILed.




[GitHub] spark pull request: [SPARK-5454] More robust handling of self join...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4520#issuecomment-73951576
  
      [Test build #27296 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27296/consoleFull) for   PR 4520 at commit [`4f4a85c`](https://github.com/apache/spark/commit/4f4a85c6660d351ebecf660cae9a71edcf89d2b5).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request: [SPARK-5454] More robust handling of self join...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4520#issuecomment-73830800
  
      [Test build #27271 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27271/consoleFull) for   PR 4520 at commit [`6fc38de`](https://github.com/apache/spark/commit/6fc38dec76044d97ed8621fabbac0f890a987b1c).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request: [SPARK-5454] More robust handling of self join...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4520#issuecomment-73830143
  
      [Test build #27271 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27271/consoleFull) for   PR 4520 at commit [`6fc38de`](https://github.com/apache/spark/commit/6fc38dec76044d97ed8621fabbac0f890a987b1c).
     * This patch merges cleanly.




[GitHub] spark pull request: [SPARK-5454] More robust handling of self join...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4520#issuecomment-73829581
  
      [Test build #27270 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27270/consoleFull) for   PR 4520 at commit [`55d64b3`](https://github.com/apache/spark/commit/55d64b31bfa882a9bf502d02741a9fbb1d457237).
     * This patch merges cleanly.




[GitHub] spark pull request: [SPARK-5454] More robust handling of self join...

Posted by chenghao-intel <gi...@git.apache.org>.
Github user chenghao-intel commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4520#discussion_r24507183
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ---
    @@ -282,6 +279,29 @@ class Analyzer(catalog: Catalog,
               }
             )
     
    +      // Special handling for cases when self-join introduce duplicate expression ids.
    +      case j @ Join(left, right, _, _) if left.outputSet.intersect(right.outputSet).nonEmpty =>
    +        val conflictingAttributes = left.outputSet.intersect(right.outputSet)
    +
    +        val (oldRelation, newRelation, attributeRewrites) = right.collect {
    +          case oldVersion: MultiInstanceRelation
    +            if oldVersion.outputSet.intersect(conflictingAttributes).nonEmpty=>
    +            val newVersion = oldVersion.newInstance()
    +            val newAttributes = AttributeMap(oldVersion.output.zip(newVersion.output))
    +
    --- End diff --
    
    remove a blank line
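
For readers following the diff, the idea visible in the new `Join` case can be sketched outside Spark: when both sides of a join expose the same attribute ids, create a fresh instance of the right-hand relation and record an old-to-new attribute mapping. The toy model below is an illustration under assumptions, not Catalyst code. `Attr`, `Relation`, `Join`, `relation`, and `dedupRight` are hypothetical stand-ins, and it only handles a right side that is itself a relation, whereas the diff searches the right subtree with `right.collect`.

```scala
// Toy model only: these types are hypothetical stand-ins, not Spark's Catalyst classes.
object SelfJoinSketch {
  // An attribute is identified by its id, not by its name.
  final case class Attr(name: String, id: Long)

  private var nextId = 0L
  private def freshId(): Long = { nextId += 1; nextId }

  sealed trait Plan { def output: Seq[Attr] }

  final case class Relation(table: String, output: Seq[Attr]) extends Plan {
    // Same table and column names, but fresh ids (the role newInstance() plays in the diff).
    def newInstance(): Relation = copy(output = output.map(a => a.copy(id = freshId())))
  }

  final case class Join(left: Plan, right: Plan) extends Plan {
    def output: Seq[Attr] = left.output ++ right.output
  }

  // Convenience constructor that assigns a fresh id to every column.
  def relation(table: String, columns: String*): Relation =
    Relation(table, columns.map(c => Attr(c, freshId())))

  // If both sides share attribute ids, re-instance the right side and
  // return the old -> new attribute mapping alongside the rewritten join.
  def dedupRight(join: Join): (Join, Map[Attr, Attr]) = {
    val conflicting = join.left.output.toSet.intersect(join.right.output.toSet)
    join.right match {
      case r: Relation if conflicting.nonEmpty =>
        val fresh = r.newInstance()
        val rewrites = r.output.zip(fresh.output).toMap
        (Join(join.left, fresh), rewrites)
      case _ =>
        (join, Map.empty[Attr, Attr])
    }
  }

  def main(args: Array[String]): Unit = {
    val t = relation("t", "key", "value")
    // Joining t with itself makes left and right share every attribute id.
    val (fixed, rewrites) = dedupRight(Join(t, t))
    println(fixed.output)
    println(rewrites)
  }
}
```

In a real analyzer the mapping would also have to be applied to any expressions above the replaced relation that still reference the old ids; the sketch stops before that step because the quoted diff is cut off there.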




[GitHub] spark pull request: [SPARK-5454] More robust handling of self join...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/4520




[GitHub] spark pull request: [SPARK-5454] More robust handling of self join...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/4520#issuecomment-73830805
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27271/
    Test FAILed.




[GitHub] spark pull request: [SPARK-5454] More robust handling of self join...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/4520#issuecomment-73951591
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27296/
    Test PASSed.




[GitHub] spark pull request: [SPARK-5454] More robust handling of self join...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/4520#issuecomment-73843441
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27276/
    Test PASSed.

