Posted to reviews@spark.apache.org by ericl <gi...@git.apache.org> on 2016/07/30 22:04:20 UTC

[GitHub] spark pull request #14425: [SPARK-16818] Exchange reuse incorrectly reuses s...

GitHub user ericl opened a pull request:

    https://github.com/apache/spark/pull/14425

    [SPARK-16818] Exchange reuse incorrectly reuses scans over different sets of partitions

    ## What changes were proposed in this pull request?
    
    This fixes a bug where the file scan operator does not take partition pruning into account in its implementation of `sameResult()`. As a result, exchange reuse can treat scans over different sets of partitions as equivalent, and queries that self-join the same base file relation may return incorrect results.
    
    The patch here is minimal, but we may want to reconsider using a metadata map to implement `sameResult()` in the future.
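
To make the failure mode concrete, here is a minimal sketch in plain Scala with no Spark dependency. It assumes, for illustration only, that scan equality is decided by comparing a relation name and a metadata map. `ToyFileScan`, `SameResultSketch`, and the key names `Format` and `PartitionFilters` are hypothetical stand-ins, not the classes or keys touched by the patch.

```scala
// Toy model only: ToyFileScan and the metadata keys below are hypothetical
// stand-ins, not Spark's actual classes or keys.
final case class ToyFileScan(relation: String, metadata: Map[String, String]) {
  // Two scans count as interchangeable when relation and metadata both match.
  def sameResult(other: ToyFileScan): Boolean =
    relation == other.relation && metadata == other.metadata
}

object SameResultSketch {
  // Buggy shape: the partition filter is ignored when the metadata is built.
  def scanIgnoringPruning(relation: String, partitionFilter: String): ToyFileScan =
    ToyFileScan(relation, Map("Format" -> "parquet"))

  // Fixed shape: the partition filter is recorded, so it affects equality.
  def scanRecordingPruning(relation: String, partitionFilter: String): ToyFileScan =
    ToyFileScan(relation, Map("Format" -> "parquet", "PartitionFilters" -> partitionFilter))

  def main(args: Array[String]): Unit = {
    val buggy = scanIgnoringPruning("t", "id = 2").sameResult(scanIgnoringPruning("t", "id = 3"))
    val fixed = scanRecordingPruning("t", "id = 2").sameResult(scanRecordingPruning("t", "id = 3"))
    println(s"ignoring pruning: sameResult = $buggy")
    println(s"recording pruning: sameResult = $fixed")
  }
}
```

In this toy model the two differently pruned scans compare as equal until the filter is part of the compared metadata, which is the kind of false match that would let one scan be reused in place of the other.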
    
    cc @rxin 
    
    ## How was this patch tested?
    
    Unit tests.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/ericl/spark spark-16818

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/14425.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #14425
    
----
commit e7e545fd8f7455a653c2bcee4e42a0e5249791f9
Author: Eric Liang <ek...@databricks.com>
Date:   2016-07-30T22:02:48Z

    Sat Jul 30 15:02:48 PDT 2016

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14425: [SPARK-16818] Exchange reuse incorrectly reuses scans ov...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on the issue:

    https://github.com/apache/spark/pull/14425
  
    Merging in master/2.0.





[GitHub] spark issue #14425: [SPARK-16818] Exchange reuse incorrectly reuses scans ov...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on the issue:

    https://github.com/apache/spark/pull/14425
  
    @ericl there is a conflict with branch-2.0. Can you create a pull request for branch-2.0?





[GitHub] spark issue #14425: [SPARK-16818] Exchange reuse incorrectly reuses scans ov...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/14425
  
    **[Test build #63047 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/63047/consoleFull)** for PR 14425 at commit [`a254540`](https://github.com/apache/spark/commit/a2545408d144c5ea87ce5696341fe52cd2d29d2c).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark issue #14425: [SPARK-16818] Exchange reuse incorrectly reuses scans ov...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on the issue:

    https://github.com/apache/spark/pull/14425
  
    LGTM (assuming the test case would fail without the fix)




[GitHub] spark issue #14425: [SPARK-16818] Exchange reuse incorrectly reuses scans ov...

Posted by ericl <gi...@git.apache.org>.
Github user ericl commented on the issue:

    https://github.com/apache/spark/pull/14425
  
    Yep, both fail prior to the fix.
    
    On Sat, Jul 30, 2016, 3:32 PM Reynold Xin <no...@github.com> wrote:
    
    > LGTM (assuming the test case would fail without the fix)





[GitHub] spark issue #14425: [SPARK-16818] Exchange reuse incorrectly reuses scans ov...

Posted by ericl <gi...@git.apache.org>.
Github user ericl commented on the issue:

    https://github.com/apache/spark/pull/14425
  
    Done, see https://github.com/apache/spark/pull/14427




[GitHub] spark pull request #14425: [SPARK-16818] Exchange reuse incorrectly reuses s...

Posted by ericl <gi...@git.apache.org>.
Github user ericl commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14425#discussion_r72895545
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileSourceStrategy.scala ---
    @@ -130,7 +130,9 @@ private[sql] object FileSourceStrategy extends Strategy with Logging {
               createNonBucketedReadRDD(readFile, selectedPartitions, fsRelation)
           }
     
    +      // These metadata values make scan plans uniquely identifiable for equality checking.
           val meta = Map(
    --- End diff --
    
    Agreed




[GitHub] spark pull request #14425: [SPARK-16818] Exchange reuse incorrectly reuses s...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14425#discussion_r72895137
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileSourceStrategy.scala ---
    @@ -130,7 +130,9 @@ private[sql] object FileSourceStrategy extends Strategy with Logging {
               createNonBucketedReadRDD(readFile, selectedPartitions, fsRelation)
           }
     
    +      // These metadata values make scan plans uniquely identifiable for equality checking.
           val meta = Map(
    --- End diff --
    
    Not related to this PR: I think we should remove the concept of metadata entirely from physical plans. It was added as a hack to propagate the following information, which really should just be named fields in those case classes.
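
One possible reading of this suggestion, sketched in plain Scala with no Spark dependency: replace a free-form metadata map with required, named constructor fields, so that case class equality covers every field automatically. `MetadataScan`, `TypedScan`, and their field names are hypothetical and chosen only for illustration; they are not Spark's physical plan nodes.

```scala
// Hypothetical types for illustration; not Spark's actual physical plan nodes.

// Map-based shape: nothing forces a caller to include every relevant key.
final case class MetadataScan(metadata: Map[String, String])

// Named-field shape: every field is required by the constructor, and the
// compiler-generated equality compares all of them.
final case class TypedScan(
    format: String,
    inputPaths: Seq[String],
    partitionFilters: Seq[String])

object NamedFieldsSketch {
  def main(args: Array[String]): Unit = {
    val a = TypedScan("parquet", Seq("/tmp/t"), Seq("id = 2"))
    val b = TypedScan("parquet", Seq("/tmp/t"), Seq("id = 3"))
    println(s"different partition filters equal? ${a == b}")
    println(s"identical fields equal? ${a == a.copy()}")
  }
}
```

With named fields, leaving out the partition filters becomes a compile error rather than a silently missing map entry, which is the trade-off this sketch is meant to show.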





[GitHub] spark issue #14425: [SPARK-16818] Exchange reuse incorrectly reuses scans ov...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/14425
  
    Merged build finished. Test PASSed.




[GitHub] spark issue #14425: [SPARK-16818] Exchange reuse incorrectly reuses scans ov...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/14425
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/63047/




[GitHub] spark pull request #14425: [SPARK-16818] Exchange reuse incorrectly reuses s...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/14425




[GitHub] spark issue #14425: [SPARK-16818] Exchange reuse incorrectly reuses scans ov...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/14425
  
    **[Test build #63047 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/63047/consoleFull)** for PR 14425 at commit [`a254540`](https://github.com/apache/spark/commit/a2545408d144c5ea87ce5696341fe52cd2d29d2c).




[GitHub] spark pull request #14425: [SPARK-16818] Exchange reuse incorrectly reuses s...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14425#discussion_r72895124
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/FileSourceStrategySuite.scala ---
    @@ -408,6 +408,39 @@ class FileSourceStrategySuite extends QueryTest with SharedSQLContext with Predi
         }
       }
     
    +  test("[SPARK-16818] partition pruned file scans implement sameResult correctly") {
    +    withTempPath { path =>
    +      val tempDir = path.getCanonicalPath
    +      spark.range(100)
    +        .selectExpr("id", "id as b")
    +        .write
    +        .partitionBy("id")
    +        .parquet(tempDir)
    +      val df = spark.read.parquet(tempDir)
    +      def getPlan(df: DataFrame): SparkPlan = {
    +        df.queryExecution.executedPlan
    +      }
    +      assert(getPlan(df.where("id = 2")).sameResult(getPlan(df.where("id = 2"))))
    --- End diff --
    
    did you verify this would fail without your patch?


