You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2020/09/14 06:25:18 UTC

[GitHub] [spark] mingjialiu opened a new pull request #29564: [SPARK-32708] Query optimization fails to reuse exchange with DataSourceV2

mingjialiu opened a new pull request #29564:
URL: https://github.com/apache/spark/pull/29564


   <!--
   Thanks for sending a pull request!  Here are some tips for you:
     1. If this is your first time, please read our contributor guidelines: https://spark.apache.org/contributing.html
     2. Ensure you have added or run the appropriate tests for your PR: https://spark.apache.org/developer-tools.html
     3. If the PR is unfinished, add '[WIP]' in your PR title, e.g., '[WIP][SPARK-XXXX] Your PR title ...'.
     4. Be sure to keep the PR description updated to reflect all changes.
     5. Please write your PR title to summarize what this PR proposes.
     6. If possible, provide a concise example to reproduce the issue for a faster review.
     7. If you want to add a new configuration, please read the guideline first for naming configurations in
        'core/src/main/scala/org/apache/spark/internal/config/ConfigEntry.scala'.
   -->
   
   ### What changes were proposed in this pull request?
    
   Override doCanonicalize function of class DataSourceV2ScanExec
   
   <!--
   Please clarify what changes you are proposing. The purpose of this section is to outline the changes and how this PR fixes the issue. 
   If possible, please consider writing useful notes for better and faster reviews in your PR. See the examples below.
     1. If you refactor some codes with changing classes, showing the class hierarchy will help reviewers.
     2. If you fix some SQL features, you can provide some references of other DBMSes.
     3. If there is design documentation, please add the link.
     4. If there is a discussion in the mailing list, please add the link.
   -->
   
   ### Why are the changes needed?
   <!--
   Please clarify why the changes are needed. For instance,
     1. If you propose a new API, clarify the use case for a new API.
     2. If you fix a bug, you can clarify why it is a bug.
   -->
   
   Query plan of DataSourceV2 fails to reuse any exchange. This change can make DataSourceV2's plan more optimized and reuse exchange as DataSourceV1 and parquet do.
   
   Direct reason: equals function of DataSourceV2ScanExec returns 'false' as comparing the same V2 scans(same table, outputs and pushedfilters)
   
   Actual cause : With query plan's default doCanonicalize function, pushedfilters of DataSourceV2ScanExec are not canonicalized correctly. Essentially expressionId of predicate columns are not normalized. 
   
   [Spark 32708](https://issues.apache.org/jira/browse/SPARK-32708#) was not caught by my [tests](https://github.com/apache/spark/blob/5b1b9b39eb612cbf9ec67efd4e364adafcff66c4/sql/core/src/test/scala/org/apache/spark/sql/sources/v2/DataSourceV2Suite.scala#L392) previously added for [SPARK-32609] because the above issue happens only when the same filtered column are of different expression id (eg :  join table t1 with t1)
   
   ### Does this PR introduce _any_ user-facing change?
   <!--
   Note that it means *any* user-facing change including all aspects such as the documentation fix.
   If yes, please clarify the previous behavior and the change this PR proposes - provide the console output, description and/or an example to show the behavior difference if possible.
   If possible, please also clarify if this is a user-facing change compared to the released Spark versions or within the unreleased branches such as master.
   If no, write 'No'.
   -->
   
   no
   
   ### How was this patch tested?
   <!--
   If tests were added, say they were added here. Please make sure to add some test cases that check the changes thoroughly including negative and positive cases if possible.
   If it was tested in a way different from regular unit tests, please clarify how you tested step by step, ideally copy and paste-able, so that other reviewers can test and check, and descendants can verify in the future.
   If tests were not added, please describe why they were not added and/or why it was difficult to add.
   -->
   
   unit test added


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29564: [Spark 32708] Query optimization fails to reuse exchange with DataSourceV2

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29564:
URL: https://github.com/apache/spark/pull/29564#issuecomment-682237643


   Can one of the admins verify this patch?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] mingjialiu closed pull request #29564: [SPARK-32708] Query optimization fails to reuse exchange with DataSourceV2

Posted by GitBox <gi...@apache.org>.
mingjialiu closed pull request #29564:
URL: https://github.com/apache/spark/pull/29564


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29564: [WIP][SPARK-32708] Query optimization fails to reuse exchange with DataSourceV2

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29564:
URL: https://github.com/apache/spark/pull/29564#issuecomment-690036144


   Merged build finished. Test FAILed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] mingjialiu commented on pull request #29564: [SPARK-32708] Query optimization fails to reuse exchange with DataSourceV2

Posted by GitBox <gi...@apache.org>.
mingjialiu commented on pull request #29564:
URL: https://github.com/apache/spark/pull/29564#issuecomment-691190559






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29564: [WIP][SPARK-32708] Query optimization fails to reuse exchange with DataSourceV2

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29564:
URL: https://github.com/apache/spark/pull/29564#issuecomment-690867354






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] mingjialiu commented on pull request #29564: [SPARK-32708] Query optimization fails to reuse exchange with DataSourceV2

Posted by GitBox <gi...@apache.org>.
mingjialiu commented on pull request #29564:
URL: https://github.com/apache/spark/pull/29564#issuecomment-691190559






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] mingjialiu commented on pull request #29564: [WIP][SPARK-32708] Query optimization fails to reuse exchange with DataSourceV2

Posted by GitBox <gi...@apache.org>.
mingjialiu commented on pull request #29564:
URL: https://github.com/apache/spark/pull/29564#issuecomment-690599004


   Hi,
   
   Regarding test coverage,  it's a bit tricky to repro in a unit test.
   Can I get some pointers on populating different expression ids for the same
   column? Or test suggestions?
   
      - CAN'T repro example in unit test:
   
       val df = spark.read.format(classOf[AdvancedDataSourceV2].getName).load()
       val q1 = df.select(($"i" + 1).as("k"), ($"i" - 1).as("j")).filter('i >
   5)
       val q2 = df.select(($"i" + 1).as("k"), ($"i" - 1).as("j")).filter('i >
   5)
       val scans1 = getV2ScanExecs(q1.join(q2, "j"))
       assert(scans1(0).sameResult(scans1(1)))
   
      scans1(0).sameResult(scans1(1)) will always return true even if filtered
   columns are not properly canonicalized (as circled in screenshots)
   
   
   
   [image: other_canonicalized.png]
   
   [image: this_canonicalized.png]
   
   
   
   
      - CAN repro query :  SPARK-32708
      <https://issues.apache.org/jira/browse/SPARK-32708>
   
      The key to repro is to have d_day_name and d_year  assigned different
   expression Ids.
      Relative implementation : preserve old expressionId if column not found
   <https://github.com/apache/spark/blob/branch-2.4/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/QueryPlan.scala#L285>
   
   
   Thank you,
   Mingjia
   
   
   
   
   On Wed, Sep 9, 2020 at 11:05 PM Wenchen Fan <no...@github.com>
   wrote:
   
   > The fix LGTM, can you add a test?
   >
   > —
   > You are receiving this because you were mentioned.
   > Reply to this email directly, view it on GitHub
   > <https://github.com/apache/spark/pull/29564#issuecomment-690008778>, or
   > unsubscribe
   > <https://github.com/notifications/unsubscribe-auth/AEMVJMYZD527O7EY27RUISLSFBUCZANCNFSM4QNQGG5Q>
   > .
   >
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] mingjialiu commented on a change in pull request #29564: [WIP][SPARK-32708] Query optimization fails to reuse exchange with DataSourceV2

Posted by GitBox <gi...@apache.org>.
mingjialiu commented on a change in pull request #29564:
URL: https://github.com/apache/spark/pull/29564#discussion_r486063202



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2ScanExec.scala
##########
@@ -52,6 +53,17 @@ case class DataSourceV2ScanExec(
     case _ => false
   }
 
+  override def doCanonicalize(): DataSourceV2ScanExec = {
+    DataSourceV2ScanExec(
+      output.map(QueryPlan.normalizeExprId(_, output)),
+      source,
+      options,
+      QueryPlan.normalizePredicates(
+        pushedFilters,
+        AttributeSeq(pushedFilters.flatMap(_.references).distinct)),

Review comment:
       Output here contains only projected column due to implementation at : https://sourcegraph.com/github.com/apache/spark@branch-2.4/-/blob/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Strategy.scala#L108




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29564: [WIP][SPARK-32708] Query optimization fails to reuse exchange with DataSourceV2

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29564:
URL: https://github.com/apache/spark/pull/29564#issuecomment-690818602


   **[Test build #128543 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128543/testReport)** for PR 29564 at commit [`8b864e7`](https://github.com/apache/spark/commit/8b864e7921061958c43006335196898e6f3c4be8).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] gengliangwang commented on pull request #29564: [SPARK-32708] Query optimization fails to reuse exchange with DataSourceV2

Posted by GitBox <gi...@apache.org>.
gengliangwang commented on pull request #29564:
URL: https://github.com/apache/spark/pull/29564#issuecomment-691884408


   Thanks, merging to branch 2.4


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] gengliangwang commented on pull request #29564: [WIP][SPARK-32708] Query optimization fails to reuse exchange with DataSourceV2

Posted by GitBox <gi...@apache.org>.
gengliangwang commented on pull request #29564:
URL: https://github.com/apache/spark/pull/29564#issuecomment-689964565


   @mingjialiu the new fix looks more reasonable. Could you add test case for the changes?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29564: [WIP][SPARK-32708] Query optimization fails to reuse exchange with DataSourceV2

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29564:
URL: https://github.com/apache/spark/pull/29564#issuecomment-690867723






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on pull request #29564: [SPARK-32708] Query optimization fails to reuse exchange with DataSourceV2

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on pull request #29564:
URL: https://github.com/apache/spark/pull/29564#issuecomment-696304840






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29564: [Spark 32708] Query optimization fails to reuse exchange with DataSourceV2

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29564:
URL: https://github.com/apache/spark/pull/29564#issuecomment-682243604


   **[Test build #127968 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/127968/testReport)** for PR 29564 at commit [`704efbc`](https://github.com/apache/spark/commit/704efbc559a351ebe1aa025de1f608270e81594a).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on a change in pull request #29564: [WIP][SPARK-32708] Query optimization fails to reuse exchange with DataSourceV2

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #29564:
URL: https://github.com/apache/spark/pull/29564#discussion_r486061776



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2ScanExec.scala
##########
@@ -52,6 +53,17 @@ case class DataSourceV2ScanExec(
     case _ => false
   }
 
+  override def doCanonicalize(): DataSourceV2ScanExec = {
+    DataSourceV2ScanExec(
+      output.map(QueryPlan.normalizeExprId(_, output)),
+      source,
+      options,
+      QueryPlan.normalizePredicates(
+        pushedFilters,
+        AttributeSeq(pushedFilters.flatMap(_.references).distinct)),

Review comment:
       should we use `output`  here?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on pull request #29564: [SPARK-32708] Query optimization fails to reuse exchange with DataSourceV2

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on pull request #29564:
URL: https://github.com/apache/spark/pull/29564#issuecomment-692137840


   @gengliangwang . Could you update `Fix Version` of SPARK-32708?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #29564: [WIP][SPARK-32708] Query optimization fails to reuse exchange with DataSourceV2

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29564:
URL: https://github.com/apache/spark/pull/29564#issuecomment-682243887






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] mingjialiu edited a comment on pull request #29564: [WIP][SPARK-32708] Query optimization fails to reuse exchange with DataSourceV2

Posted by GitBox <gi...@apache.org>.
mingjialiu edited a comment on pull request #29564:
URL: https://github.com/apache/spark/pull/29564#issuecomment-690597364


   > The fix LGTM, can you add a test?
   
   Regarding test coverage,  it's a bit tricky to repro in a unit test.  Can I get some pointers on populating different expression ids for the same column? Or test suggestions?
   
   The key to repro is to have the same column assigned different expression Ids. 
   Relative implementation : [preserve old expressionId if column not found]( https://github.com/apache/spark/blob/branch-2.4/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/QueryPlan.scala#L285)
   Explained details in email. 
   
      
   
   
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29564: [WIP][SPARK-32708] Query optimization fails to reuse exchange with DataSourceV2

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29564:
URL: https://github.com/apache/spark/pull/29564#issuecomment-690873976


   **[Test build #128543 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128543/testReport)** for PR 29564 at commit [`8b864e7`](https://github.com/apache/spark/commit/8b864e7921061958c43006335196898e6f3c4be8).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on pull request #29564: [WIP][SPARK-32708] Query optimization fails to reuse exchange with DataSourceV2

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on pull request #29564:
URL: https://github.com/apache/spark/pull/29564#issuecomment-690008778


   The fix LGTM, can you add a test?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29564: [WIP][SPARK-32708] Query optimization fails to reuse exchange with DataSourceV2

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29564:
URL: https://github.com/apache/spark/pull/29564#issuecomment-682294114


   **[Test build #127968 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/127968/testReport)** for PR 29564 at commit [`704efbc`](https://github.com/apache/spark/commit/704efbc559a351ebe1aa025de1f608270e81594a).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] mingjialiu commented on a change in pull request #29564: [WIP][SPARK-32708] Query optimization fails to reuse exchange with DataSourceV2

Posted by GitBox <gi...@apache.org>.
mingjialiu commented on a change in pull request #29564:
URL: https://github.com/apache/spark/pull/29564#discussion_r486063202



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2ScanExec.scala
##########
@@ -52,6 +53,17 @@ case class DataSourceV2ScanExec(
     case _ => false
   }
 
+  override def doCanonicalize(): DataSourceV2ScanExec = {
+    DataSourceV2ScanExec(
+      output.map(QueryPlan.normalizeExprId(_, output)),
+      source,
+      options,
+      QueryPlan.normalizePredicates(
+        pushedFilters,
+        AttributeSeq(pushedFilters.flatMap(_.references).distinct)),

Review comment:
       Output here doesn't contain all predicate columns due to implementation at : https://github.com/apache/spark/blob/branch-2.4/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Strategy.scala#L108




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] mingjialiu commented on pull request #29564: [WIP][SPARK-32708] Query optimization fails to reuse exchange with DataSourceV2

Posted by GitBox <gi...@apache.org>.
mingjialiu commented on pull request #29564:
URL: https://github.com/apache/spark/pull/29564#issuecomment-690597364


   > The fix LGTM, can you add a test?
   
   Hi, it's a bit tricky to repro in unit test. Can I get some pointers on populating different expression ids for the same column?
   
   - CAN'T repro example in unit test: 
   
       val df = spark.read.format(classOf[AdvancedDataSourceV2].getName).load()
       val q1 = df.select(($"i" + 1).as("k"), ($"i" - 1).as("j")).filter('i > 5)
       val q2 = df.select(($"i" + 1).as("k"), ($"i" - 1).as("j")).filter('i > 5)
       val scans1 = getV2ScanExecs(q1.join(q2, "j"))
       assert(scans1(0).sameResult(scans1(1))) 
   
      scans1(0).sameResult(scans1(1)) will always return true even if filtered columns are not properly canonicalized (as circled in screenshots)
      
   
   
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29564: [WIP][SPARK-32708] Query optimization fails to reuse exchange with DataSourceV2

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29564:
URL: https://github.com/apache/spark/pull/29564#issuecomment-690036153


   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/128481/
   Test FAILed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on a change in pull request #29564: [WIP][SPARK-32708] Query optimization fails to reuse exchange with DataSourceV2

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #29564:
URL: https://github.com/apache/spark/pull/29564#discussion_r486821529



##########
File path: sql/core/src/test/scala/org/apache/spark/sql/sources/v2/DataSourceV2Suite.scala
##########
@@ -393,6 +393,29 @@ class DataSourceV2Suite extends QueryTest with SharedSQLContext {
       }
     }
   }
+
+  test("SPARK-32708: same columns with different ExprIds should be equal after canonicalization ") {

Review comment:
       If we don't have an end-to-end test, how about a low-level UT? Create two `DataSourceV2ScanExec` instances and check `scan1.sameResult(scan2)`.

##########
File path: sql/core/src/test/scala/org/apache/spark/sql/sources/v2/DataSourceV2Suite.scala
##########
@@ -393,6 +393,29 @@ class DataSourceV2Suite extends QueryTest with SharedSQLContext {
       }
     }
   }
+
+  test("SPARK-32708: same columns with different ExprIds should be equal after canonicalization ") {

Review comment:
       If we don't have an end-to-end test, how about a low-level UT? Create two `DataSourceV2ScanExec` instances and check `scan1.sameResult(scan2)`.

##########
File path: sql/core/src/test/scala/org/apache/spark/sql/sources/v2/DataSourceV2Suite.scala
##########
@@ -393,6 +393,29 @@ class DataSourceV2Suite extends QueryTest with SharedSQLContext {
       }
     }
   }
+
+  test("SPARK-32708: same columns with different ExprIds should be equal after canonicalization ") {

Review comment:
       If we don't have an end-to-end test, how about a low-level UT? Create two `DataSourceV2ScanExec` instances and check `scan1.sameResult(scan2)`.

##########
File path: sql/core/src/test/scala/org/apache/spark/sql/sources/v2/DataSourceV2Suite.scala
##########
@@ -393,6 +393,29 @@ class DataSourceV2Suite extends QueryTest with SharedSQLContext {
       }
     }
   }
+
+  test("SPARK-32708: same columns with different ExprIds should be equal after canonicalization ") {

Review comment:
       If we don't have an end-to-end test, how about a low-level UT? Create two `DataSourceV2ScanExec` instances and check `scan1.sameResult(scan2)`.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] gatorsmile commented on pull request #29564: [Spark 32708] Query optimization fails to reuse exchange with DataSourceV2

Posted by GitBox <gi...@apache.org>.
gatorsmile commented on pull request #29564:
URL: https://github.com/apache/spark/pull/29564#issuecomment-682243113


   ok to test


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #29564: [WIP][SPARK-32708] Query optimization fails to reuse exchange with DataSourceV2

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29564:
URL: https://github.com/apache/spark/pull/29564#issuecomment-690036144






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] gengliangwang commented on pull request #29564: [SPARK-32708] Query optimization fails to reuse exchange with DataSourceV2

Posted by GitBox <gi...@apache.org>.
gengliangwang commented on pull request #29564:
URL: https://github.com/apache/spark/pull/29564#issuecomment-696476266


   @dongjoon-hyun Sorry, it's updated now. 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] gengliangwang commented on a change in pull request #29564: [WIP][SPARK-32708] Query optimization fails to reuse exchange with DataSourceV2

Posted by GitBox <gi...@apache.org>.
gengliangwang commented on a change in pull request #29564:
URL: https://github.com/apache/spark/pull/29564#discussion_r486905881



##########
File path: sql/core/src/test/scala/org/apache/spark/sql/sources/v2/DataSourceV2Suite.scala
##########
@@ -393,6 +393,29 @@ class DataSourceV2Suite extends QueryTest with SharedSQLContext {
       }
     }
   }
+
+  test("SPARK-32708: same columns with different ExprIds should be equal after canonicalization ") {

Review comment:
       @cloud-fan I think this test case creates two `DataSourceV2ScanExec` and do the check. It looks ok to me.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29564: [WIP][SPARK-32708] Query optimization fails to reuse exchange with DataSourceV2

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29564:
URL: https://github.com/apache/spark/pull/29564#issuecomment-690809986


   **[Test build #128541 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128541/testReport)** for PR 29564 at commit [`98483c8`](https://github.com/apache/spark/commit/98483c82151634760a82fc860ac6de26d4b024ba).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #29564: [WIP][SPARK-32708] Query optimization fails to reuse exchange with DataSourceV2

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #29564:
URL: https://github.com/apache/spark/pull/29564#issuecomment-690809986






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29564: [WIP][SPARK-32708] Query optimization fails to reuse exchange with DataSourceV2

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29564:
URL: https://github.com/apache/spark/pull/29564#issuecomment-689960789






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29564: [Spark 32708] Query optimization fails to reuse exchange with DataSourceV2

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29564:
URL: https://github.com/apache/spark/pull/29564#issuecomment-682237259


   Can one of the admins verify this patch?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on pull request #29564: [SPARK-32708] Query optimization fails to reuse exchange with DataSourceV2

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on pull request #29564:
URL: https://github.com/apache/spark/pull/29564#issuecomment-692137446


   Thank you, @mingjialiu and all!


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] mingjialiu commented on a change in pull request #29564: [WIP][SPARK-32708] Query optimization fails to reuse exchange with DataSourceV2

Posted by GitBox <gi...@apache.org>.
mingjialiu commented on a change in pull request #29564:
URL: https://github.com/apache/spark/pull/29564#discussion_r486063202



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2ScanExec.scala
##########
@@ -52,6 +53,17 @@ case class DataSourceV2ScanExec(
     case _ => false
   }
 
+  override def doCanonicalize(): DataSourceV2ScanExec = {
+    DataSourceV2ScanExec(
+      output.map(QueryPlan.normalizeExprId(_, output)),
+      source,
+      options,
+      QueryPlan.normalizePredicates(
+        pushedFilters,
+        AttributeSeq(pushedFilters.flatMap(_.references).distinct)),

Review comment:
       Output here contains only projected column due to implementation at : https://github.com/apache/spark/blob/branch-2.4/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Strategy.scala#L108




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] gengliangwang commented on pull request #29564: [SPARK-32708] Query optimization fails to reuse exchange with DataSourceV2

Posted by GitBox <gi...@apache.org>.
gengliangwang commented on pull request #29564:
URL: https://github.com/apache/spark/pull/29564#issuecomment-696476266


   @dongjoon-hyun Sorry, it's updated now. 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] gengliangwang commented on a change in pull request #29564: [WIP][SPARK-32708] Query optimization fails to reuse exchange with DataSourceV2

Posted by GitBox <gi...@apache.org>.
gengliangwang commented on a change in pull request #29564:
URL: https://github.com/apache/spark/pull/29564#discussion_r486905881



##########
File path: sql/core/src/test/scala/org/apache/spark/sql/sources/v2/DataSourceV2Suite.scala
##########
@@ -393,6 +393,29 @@ class DataSourceV2Suite extends QueryTest with SharedSQLContext {
       }
     }
   }
+
+  test("SPARK-32708: same columns with different ExprIds should be equal after canonicalization ") {

Review comment:
       @cloud-fan I think this test case creates two `DataSourceV2ScanExec` and do the check. It looks ok to me.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] mingjialiu commented on pull request #29564: [WIP][SPARK-32708] Query optimization fails to reuse exchange with DataSourceV2

Posted by GitBox <gi...@apache.org>.
mingjialiu commented on pull request #29564:
URL: https://github.com/apache/spark/pull/29564#issuecomment-689963685


   > In Branch 3.0, there is a mixed-in trait `SupportsPushDownFilters` which is introduced by #19136 and #19424 .
   > 
   > However, if we are going to cherry-pick the PRs mentioned above, then there will be reasons to cherry-pick the other data source V2 related PRs into 2.4 as well.
   > @mingjialiu is there a strong reason to use data source v2 on branch 2.4 instead of 3.0? There seems quite some work to sync DS v2 api changes to branch 2.4
   > cc @cloud-fan as well.
   
   My org still relies heavily on 2.4


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] gengliangwang commented on pull request #29564: [SPARK-32708] Query optimization fails to reuse exchange with DataSourceV2

Posted by GitBox <gi...@apache.org>.
gengliangwang commented on pull request #29564:
URL: https://github.com/apache/spark/pull/29564#issuecomment-696476266


   @dongjoon-hyun Sorry, it's updated now. 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29564: [SPARK-32708] Query optimization fails to reuse exchange with DataSourceV2

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29564:
URL: https://github.com/apache/spark/pull/29564#issuecomment-691147266






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] gengliangwang closed pull request #29564: [SPARK-32708] Query optimization fails to reuse exchange with DataSourceV2

Posted by GitBox <gi...@apache.org>.
gengliangwang closed pull request #29564:
URL: https://github.com/apache/spark/pull/29564


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #29564: [WIP][SPARK-32708] Query optimization fails to reuse exchange with DataSourceV2

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29564:
URL: https://github.com/apache/spark/pull/29564#issuecomment-690816948






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #29564: [SPARK-32708] Query optimization fails to reuse exchange with DataSourceV2

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29564:
URL: https://github.com/apache/spark/pull/29564#issuecomment-691147266






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] mingjialiu edited a comment on pull request #29564: [WIP][SPARK-32708] Query optimization fails to reuse exchange with DataSourceV2

Posted by GitBox <gi...@apache.org>.
mingjialiu edited a comment on pull request #29564:
URL: https://github.com/apache/spark/pull/29564#issuecomment-690597364


   > The fix LGTM, can you add a test?
   
   Regarding test coverage,  it's a bit tricky to repro in a unit test.  Can I get some pointers on populating different expression ids for the same column? Or test suggestions?
   
   The key to repro is to have the same column assigned different expression Ids. 
   Relative implementation : [preserve old expressionId if column not found]( https://github.com/apache/spark/blob/branch-2.4/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/QueryPlan.scala#L285)
   More details explained details in email. 
   
      
   
   
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] mingjialiu edited a comment on pull request #29564: [WIP][SPARK-32708] Query optimization fails to reuse exchange with DataSourceV2

Posted by GitBox <gi...@apache.org>.
mingjialiu edited a comment on pull request #29564:
URL: https://github.com/apache/spark/pull/29564#issuecomment-690597364


   > The fix LGTM, can you add a test?
   
   Hi, it's a bit tricky to repro in unit test. Can I get some pointers on populating different expression ids for the same column?
   
   
      
   
   
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #29564: [WIP][SPARK-32708] Query optimization fails to reuse exchange with DataSourceV2

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #29564:
URL: https://github.com/apache/spark/pull/29564#issuecomment-690809986






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29564: [WIP][SPARK-32708] Query optimization fails to reuse exchange with DataSourceV2

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29564:
URL: https://github.com/apache/spark/pull/29564#issuecomment-690816948






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] mingjialiu commented on a change in pull request #29564: [WIP][SPARK-32708] Query optimization fails to reuse exchange with DataSourceV2

Posted by GitBox <gi...@apache.org>.
mingjialiu commented on a change in pull request #29564:
URL: https://github.com/apache/spark/pull/29564#discussion_r482172221



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Relation.scala
##########
@@ -54,7 +55,7 @@ case class DataSourceV2Relation(
     tableIdent.map(_.unquotedString).getOrElse(s"${source.name}:unknown")
   }
 
-  override def pushedFilters: Seq[Expression] = Seq.empty
+  override def pushedFilters: Seq[Filter] = Seq.empty

Review comment:
       More explanation. Why changing from Expression to org.apache.spark.sql.sources.Filter
   
   DataSourceV2ScanExec.pushedFilters are defined as array of Expressions whose equal function has expression_id in scope. So for example, Expression isnotnull(d_day_name#22364) is not considered equal to isnotnull(d_day_name#22420). Therefore, the right thing is to define and compare pushedFilter as Filter class.
   
   At both Spark 3.0 and affected Spark 2.4's tests suite, Filter is the class being used. And the above 4 places seem to be the only places that miss to have pushedFilter as class Filter.
   (Because pushedFilters are defined the right way in the above test suite, Spark 32708 was not caught by my tests previously added for SPARK-32609, another exchange reuse bug.
   
   Usage of Expression was introduced by PR [SPARK-23203][SQL] DataSourceV2: Use immutable logical plan. From the PR's description and original intention, I don't see a necessary reason to maintain Expression.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #29564: [WIP][SPARK-32708] Query optimization fails to reuse exchange with DataSourceV2

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29564:
URL: https://github.com/apache/spark/pull/29564#issuecomment-690816948






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on pull request #29564: [SPARK-32708] Query optimization fails to reuse exchange with DataSourceV2

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on pull request #29564:
URL: https://github.com/apache/spark/pull/29564#issuecomment-696304840


   @gengliangwang . The `Fix Version` seems to be empty still.
   ![Screen Shot 2020-09-21 at 11 53 38 AM](https://user-images.githubusercontent.com/9700541/93808663-24282900-fc01-11ea-8d3b-f786a887b648.png)
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #29564: [Spark 32708] Query optimization fails to reuse exchange with DataSourceV2

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29564:
URL: https://github.com/apache/spark/pull/29564#issuecomment-682237259


   Can one of the admins verify this patch?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] gengliangwang commented on pull request #29564: [SPARK-32708] Query optimization fails to reuse exchange with DataSourceV2

Posted by GitBox <gi...@apache.org>.
gengliangwang commented on pull request #29564:
URL: https://github.com/apache/spark/pull/29564#issuecomment-691201992






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #29564: [WIP][SPARK-32708] Query optimization fails to reuse exchange with DataSourceV2

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29564:
URL: https://github.com/apache/spark/pull/29564#issuecomment-690810383






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #29564: [WIP][SPARK-32708] Query optimization fails to reuse exchange with DataSourceV2

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29564:
URL: https://github.com/apache/spark/pull/29564#issuecomment-690867723






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #29564: [WIP][SPARK-32708] Query optimization fails to reuse exchange with DataSourceV2

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29564:
URL: https://github.com/apache/spark/pull/29564#issuecomment-690874399






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] gengliangwang commented on pull request #29564: [SPARK-32708] Query optimization fails to reuse exchange with DataSourceV2

Posted by GitBox <gi...@apache.org>.
gengliangwang commented on pull request #29564:
URL: https://github.com/apache/spark/pull/29564#issuecomment-691201992


   I think we can rely on the jenkins test result as well


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] gengliangwang commented on pull request #29564: [SPARK-32708] Query optimization fails to reuse exchange with DataSourceV2

Posted by GitBox <gi...@apache.org>.
gengliangwang commented on pull request #29564:
URL: https://github.com/apache/spark/pull/29564#issuecomment-692144693


   @dongjoon-hyun sure.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29564: [WIP][SPARK-32708] Query optimization fails to reuse exchange with DataSourceV2

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29564:
URL: https://github.com/apache/spark/pull/29564#issuecomment-690035166


   **[Test build #128481 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128481/testReport)** for PR 29564 at commit [`a6e4709`](https://github.com/apache/spark/commit/a6e4709fed67416a73f9b8635d8fc7be4e412754).
    * This patch **fails due to an unknown error code, -9**.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] gengliangwang commented on a change in pull request #29564: [WIP][SPARK-32708] Query optimization fails to reuse exchange with DataSourceV2

Posted by GitBox <gi...@apache.org>.
gengliangwang commented on a change in pull request #29564:
URL: https://github.com/apache/spark/pull/29564#discussion_r486905881



##########
File path: sql/core/src/test/scala/org/apache/spark/sql/sources/v2/DataSourceV2Suite.scala
##########
@@ -393,6 +393,29 @@ class DataSourceV2Suite extends QueryTest with SharedSQLContext {
       }
     }
   }
+
+  test("SPARK-32708: same columns with different ExprIds should be equal after canonicalization ") {

Review comment:
       @cloud-fan I think this test case creates two `DataSourceV2ScanExec` and do the check. It looks ok to me.

##########
File path: sql/core/src/test/scala/org/apache/spark/sql/sources/v2/DataSourceV2Suite.scala
##########
@@ -393,6 +393,29 @@ class DataSourceV2Suite extends QueryTest with SharedSQLContext {
       }
     }
   }
+
+  test("SPARK-32708: same columns with different ExprIds should be equal after canonicalization ") {

Review comment:
       @cloud-fan I think this test case creates two `DataSourceV2ScanExec` and do the check. It looks ok to me.

##########
File path: sql/core/src/test/scala/org/apache/spark/sql/sources/v2/DataSourceV2Suite.scala
##########
@@ -393,6 +393,29 @@ class DataSourceV2Suite extends QueryTest with SharedSQLContext {
       }
     }
   }
+
+  test("SPARK-32708: same columns with different ExprIds should be equal after canonicalization ") {

Review comment:
       @cloud-fan I think this test case creates two `DataSourceV2ScanExec` and do the check. It looks ok to me.

##########
File path: sql/core/src/test/scala/org/apache/spark/sql/sources/v2/DataSourceV2Suite.scala
##########
@@ -393,6 +393,29 @@ class DataSourceV2Suite extends QueryTest with SharedSQLContext {
       }
     }
   }
+
+  test("SPARK-32708: same columns with different ExprIds should be equal after canonicalization ") {

Review comment:
       @cloud-fan I think this test case creates two `DataSourceV2ScanExec` and do the check. It looks ok to me.

##########
File path: sql/core/src/test/scala/org/apache/spark/sql/sources/v2/DataSourceV2Suite.scala
##########
@@ -393,6 +393,29 @@ class DataSourceV2Suite extends QueryTest with SharedSQLContext {
       }
     }
   }
+
+  test("SPARK-32708: same columns with different ExprIds should be equal after canonicalization ") {

Review comment:
       @cloud-fan I think this test case creates two `DataSourceV2ScanExec` and do the check. It looks ok to me.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29564: [WIP][SPARK-32708] Query optimization fails to reuse exchange with DataSourceV2

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29564:
URL: https://github.com/apache/spark/pull/29564#issuecomment-690818602






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] mingjialiu commented on pull request #29564: [WIP][SPARK-32708] Query optimization fails to reuse exchange with DataSourceV2

Posted by GitBox <gi...@apache.org>.
mingjialiu commented on pull request #29564:
URL: https://github.com/apache/spark/pull/29564#issuecomment-689963482


   > If exchange reuse is broken, it means plan equality is broken somewhere. I think `Seq[Expression]` is OK as long as we canonicalize it before comparing it. `FileSourceScanExec` also contains `Seq[Expression]` and it's fine.
   > 
   > Can you look into it more and have a more surgical fix?
   
   Updated with a more surgical fix. Please review.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on a change in pull request #29564: [WIP][SPARK-32708] Query optimization fails to reuse exchange with DataSourceV2

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #29564:
URL: https://github.com/apache/spark/pull/29564#discussion_r486821529



##########
File path: sql/core/src/test/scala/org/apache/spark/sql/sources/v2/DataSourceV2Suite.scala
##########
@@ -393,6 +393,29 @@ class DataSourceV2Suite extends QueryTest with SharedSQLContext {
       }
     }
   }
+
+  test("SPARK-32708: same columns with different ExprIds should be equal after canonicalization ") {

Review comment:
       If we don't have an end-to-end test, how about a low-level UT? Create two `DataSourceV2ScanExec` instances and check `scan1.sameResult(scan2)`.

##########
File path: sql/core/src/test/scala/org/apache/spark/sql/sources/v2/DataSourceV2Suite.scala
##########
@@ -393,6 +393,29 @@ class DataSourceV2Suite extends QueryTest with SharedSQLContext {
       }
     }
   }
+
+  test("SPARK-32708: same columns with different ExprIds should be equal after canonicalization ") {

Review comment:
       If we don't have an end-to-end test, how about a low-level UT? Create two `DataSourceV2ScanExec` instances and check `scan1.sameResult(scan2)`.

##########
File path: sql/core/src/test/scala/org/apache/spark/sql/sources/v2/DataSourceV2Suite.scala
##########
@@ -393,6 +393,29 @@ class DataSourceV2Suite extends QueryTest with SharedSQLContext {
       }
     }
   }
+
+  test("SPARK-32708: same columns with different ExprIds should be equal after canonicalization ") {

Review comment:
       If we don't have an end-to-end test, how about a low-level UT? Create two `DataSourceV2ScanExec` instances and check `scan1.sameResult(scan2)`.

##########
File path: sql/core/src/test/scala/org/apache/spark/sql/sources/v2/DataSourceV2Suite.scala
##########
@@ -393,6 +393,29 @@ class DataSourceV2Suite extends QueryTest with SharedSQLContext {
       }
     }
   }
+
+  test("SPARK-32708: same columns with different ExprIds should be equal after canonicalization ") {

Review comment:
       If we don't have an end-to-end test, how about a low-level UT? Create two `DataSourceV2ScanExec` instances and check `scan1.sameResult(scan2)`.

##########
File path: sql/core/src/test/scala/org/apache/spark/sql/sources/v2/DataSourceV2Suite.scala
##########
@@ -393,6 +393,29 @@ class DataSourceV2Suite extends QueryTest with SharedSQLContext {
       }
     }
   }
+
+  test("SPARK-32708: same columns with different ExprIds should be equal after canonicalization ") {

Review comment:
       If we don't have an end-to-end test, how about a low-level UT? Create two `DataSourceV2ScanExec` instances and check `scan1.sameResult(scan2)`.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #29564: [WIP][SPARK-32708] Query optimization fails to reuse exchange with DataSourceV2

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #29564:
URL: https://github.com/apache/spark/pull/29564#issuecomment-682243604


   **[Test build #127968 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/127968/testReport)** for PR 29564 at commit [`704efbc`](https://github.com/apache/spark/commit/704efbc559a351ebe1aa025de1f608270e81594a).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29564: [SPARK-32708] Query optimization fails to reuse exchange with DataSourceV2

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29564:
URL: https://github.com/apache/spark/pull/29564#issuecomment-691146396


   **[Test build #128574 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128574/testReport)** for PR 29564 at commit [`acadafe`](https://github.com/apache/spark/commit/acadafee79d1f3f7c86b182740ad72f4dcfafc5c).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] gengliangwang commented on pull request #29564: [SPARK-32708] Query optimization fails to reuse exchange with DataSourceV2

Posted by GitBox <gi...@apache.org>.
gengliangwang commented on pull request #29564:
URL: https://github.com/apache/spark/pull/29564#issuecomment-691201992






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] gengliangwang commented on pull request #29564: [WIP][SPARK-32708] Query optimization fails to reuse exchange with DataSourceV2

Posted by GitBox <gi...@apache.org>.
gengliangwang commented on pull request #29564:
URL: https://github.com/apache/spark/pull/29564#issuecomment-688222654


   In Branch 3.0, there is a mixed-in trait `SupportsPushDownFilters` which is introduced by https://github.com/apache/spark/pull/19136 and https://github.com/apache/spark/pull/19424 .
   
   However, if we are going to cherry-pick the PRs mentioned above, then there will be reasons to cherry-pick the other data source V2 related PRs into 2.4 as well. 
   @mingjialiu is there a strong reason to use data source v2 on branch 2.4 instead of 3.0? There seems quite some work to sync DS v2 api changes to branch 2.4
   cc @cloud-fan as well.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] gatorsmile commented on a change in pull request #29564: [WIP][SPARK-32708] Query optimization fails to reuse exchange with DataSourceV2

Posted by GitBox <gi...@apache.org>.
gatorsmile commented on a change in pull request #29564:
URL: https://github.com/apache/spark/pull/29564#discussion_r478752285



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Relation.scala
##########
@@ -54,7 +55,7 @@ case class DataSourceV2Relation(
     tableIdent.map(_.unquotedString).getOrElse(s"${source.name}:unknown")
   }
 
-  override def pushedFilters: Seq[Expression] = Seq.empty
+  override def pushedFilters: Seq[Filter] = Seq.empty

Review comment:
       Why we need to change Expression to Filter, which is a public interface?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #29564: [Spark 32708] Query optimization fails to reuse exchange with DataSourceV2

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29564:
URL: https://github.com/apache/spark/pull/29564#issuecomment-682237643


   Can one of the admins verify this patch?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #29564: [WIP][SPARK-32708] Query optimization fails to reuse exchange with DataSourceV2

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #29564:
URL: https://github.com/apache/spark/pull/29564#issuecomment-690818602


   **[Test build #128543 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128543/testReport)** for PR 29564 at commit [`8b864e7`](https://github.com/apache/spark/commit/8b864e7921061958c43006335196898e6f3c4be8).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on pull request #29564: [SPARK-32708] Query optimization fails to reuse exchange with DataSourceV2

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on pull request #29564:
URL: https://github.com/apache/spark/pull/29564#issuecomment-696481548


   Thank you, @gengliangwang !


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] gatorsmile commented on pull request #29564: [Spark 32708] Query optimization fails to reuse exchange with DataSourceV2

Posted by GitBox <gi...@apache.org>.
gatorsmile commented on pull request #29564:
URL: https://github.com/apache/spark/pull/29564#issuecomment-682243583


   2.4? 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] mingjialiu commented on pull request #29564: [SPARK-32708] Query optimization fails to reuse exchange with DataSourceV2

Posted by GitBox <gi...@apache.org>.
mingjialiu commented on pull request #29564:
URL: https://github.com/apache/spark/pull/29564#issuecomment-691843273


   @cloud-fan @gatorsmile  Hi, it looks to me that 1 approval is not enough for merging. Can you please approve this PR if everything looks good to you? 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] mingjialiu commented on pull request #29564: [WIP][SPARK-32708] Query optimization fails to reuse exchange with DataSourceV2

Posted by GitBox <gi...@apache.org>.
mingjialiu commented on pull request #29564:
URL: https://github.com/apache/spark/pull/29564#issuecomment-690811936






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #29564: [WIP][SPARK-32708] Query optimization fails to reuse exchange with DataSourceV2

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29564:
URL: https://github.com/apache/spark/pull/29564#issuecomment-689960789






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] maropu commented on pull request #29564: [WIP][SPARK-32708] Query optimization fails to reuse exchange with DataSourceV2

Posted by GitBox <gi...@apache.org>.
maropu commented on pull request #29564:
URL: https://github.com/apache/spark/pull/29564#issuecomment-682284978


   If the issue does not happen in branch-3.0+, I think we might need to check which commit's resolved it from the commit history first. If it found, we might be able to just cherry-pick it.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] gengliangwang commented on a change in pull request #29564: [WIP][SPARK-32708] Query optimization fails to reuse exchange with DataSourceV2

Posted by GitBox <gi...@apache.org>.
gengliangwang commented on a change in pull request #29564:
URL: https://github.com/apache/spark/pull/29564#discussion_r486905881



##########
File path: sql/core/src/test/scala/org/apache/spark/sql/sources/v2/DataSourceV2Suite.scala
##########
@@ -393,6 +393,29 @@ class DataSourceV2Suite extends QueryTest with SharedSQLContext {
       }
     }
   }
+
+  test("SPARK-32708: same columns with different ExprIds should be equal after canonicalization ") {

Review comment:
       @cloud-fan I think this test case creates two `DataSourceV2ScanExec` and do the check. It looks ok to me.

##########
File path: sql/core/src/test/scala/org/apache/spark/sql/sources/v2/DataSourceV2Suite.scala
##########
@@ -393,6 +393,29 @@ class DataSourceV2Suite extends QueryTest with SharedSQLContext {
       }
     }
   }
+
+  test("SPARK-32708: same columns with different ExprIds should be equal after canonicalization ") {

Review comment:
       @cloud-fan I think this test case creates two `DataSourceV2ScanExec` and do the check. It looks ok to me.

##########
File path: sql/core/src/test/scala/org/apache/spark/sql/sources/v2/DataSourceV2Suite.scala
##########
@@ -393,6 +393,29 @@ class DataSourceV2Suite extends QueryTest with SharedSQLContext {
       }
     }
   }
+
+  test("SPARK-32708: same columns with different ExprIds should be equal after canonicalization ") {

Review comment:
       @cloud-fan I think this test case creates two `DataSourceV2ScanExec` and do the check. It looks ok to me.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] mingjialiu commented on pull request #29564: [WIP][SPARK-32708] Query optimization fails to reuse exchange with DataSourceV2

Posted by GitBox <gi...@apache.org>.
mingjialiu commented on pull request #29564:
URL: https://github.com/apache/spark/pull/29564#issuecomment-690810458


   > > The fix LGTM, can you add a test?
   > 
   > Regarding test coverage,  it's a bit tricky to repro in a unit test. Can I get some pointers on populating different expression ids for the same column? Or test suggestions?
   > 
   > The key to repro is to have the same column assigned different expression Ids.
   > Relative implementation : [preserve old expressionId if column not found](https://github.com/apache/spark/blob/branch-2.4/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/QueryPlan.scala#L285)
   > Explained details in email.
   
   
   
   > The fix LGTM, can you add a test?
   
   Test added. Please review.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29564: [WIP][SPARK-32708] Query optimization fails to reuse exchange with DataSourceV2

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29564:
URL: https://github.com/apache/spark/pull/29564#issuecomment-682243887






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29564: [WIP][SPARK-32708] Query optimization fails to reuse exchange with DataSourceV2

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29564:
URL: https://github.com/apache/spark/pull/29564#issuecomment-690810383






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #29564: [SPARK-32708] Query optimization fails to reuse exchange with DataSourceV2

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #29564:
URL: https://github.com/apache/spark/pull/29564#issuecomment-691146396


   **[Test build #128574 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128574/testReport)** for PR 29564 at commit [`acadafe`](https://github.com/apache/spark/commit/acadafee79d1f3f7c86b182740ad72f4dcfafc5c).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29564: [SPARK-32708] Query optimization fails to reuse exchange with DataSourceV2

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29564:
URL: https://github.com/apache/spark/pull/29564#issuecomment-691240161


   **[Test build #128574 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128574/testReport)** for PR 29564 at commit [`acadafe`](https://github.com/apache/spark/commit/acadafee79d1f3f7c86b182740ad72f4dcfafc5c).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] mingjialiu commented on pull request #29564: [SPARK-32708] Query optimization fails to reuse exchange with DataSourceV2

Posted by GitBox <gi...@apache.org>.
mingjialiu commented on pull request #29564:
URL: https://github.com/apache/spark/pull/29564#issuecomment-691190559


   test **[pyspark.mllib.tests.StreamingLogisticRegressionWithSGDTests.test_training_and_prediction](https://github.com/apache/spark/blob/branch-2.4/python/pyspark/mllib/tests.py#L1487)** is failing. @gatorsmile @cloud-fan @gengliangwang  do you think the failure is related to this change?  If yes, any suggestions on how to fix it?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on a change in pull request #29564: [WIP][SPARK-32708] Query optimization fails to reuse exchange with DataSourceV2

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #29564:
URL: https://github.com/apache/spark/pull/29564#discussion_r486821529



##########
File path: sql/core/src/test/scala/org/apache/spark/sql/sources/v2/DataSourceV2Suite.scala
##########
@@ -393,6 +393,29 @@ class DataSourceV2Suite extends QueryTest with SharedSQLContext {
       }
     }
   }
+
+  test("SPARK-32708: same columns with different ExprIds should be equal after canonicalization ") {

Review comment:
       If we don't have an end-to-end test, how about a low-level UT? Create two `DataSourceV2ScanExec` instances and check `scan1.sameResult(scan2)`.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] mingjialiu closed pull request #29564: [WIP][SPARK-32708] Query optimization fails to reuse exchange with DataSourceV2

Posted by GitBox <gi...@apache.org>.
mingjialiu closed pull request #29564:
URL: https://github.com/apache/spark/pull/29564


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29564: [WIP][SPARK-32708] Query optimization fails to reuse exchange with DataSourceV2

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29564:
URL: https://github.com/apache/spark/pull/29564#issuecomment-690816948






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #29564: [SPARK-32708] Query optimization fails to reuse exchange with DataSourceV2

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29564:
URL: https://github.com/apache/spark/pull/29564#issuecomment-691240572






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #29564: [WIP][SPARK-32708] Query optimization fails to reuse exchange with DataSourceV2

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #29564:
URL: https://github.com/apache/spark/pull/29564#issuecomment-689960384


   **[Test build #128481 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128481/testReport)** for PR 29564 at commit [`a6e4709`](https://github.com/apache/spark/commit/a6e4709fed67416a73f9b8635d8fc7be4e412754).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29564: [WIP][SPARK-32708] Query optimization fails to reuse exchange with DataSourceV2

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29564:
URL: https://github.com/apache/spark/pull/29564#issuecomment-689960384


   **[Test build #128481 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128481/testReport)** for PR 29564 at commit [`a6e4709`](https://github.com/apache/spark/commit/a6e4709fed67416a73f9b8635d8fc7be4e412754).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on pull request #29564: [SPARK-32708] Query optimization fails to reuse exchange with DataSourceV2

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on pull request #29564:
URL: https://github.com/apache/spark/pull/29564#issuecomment-696481548


   Thank you, @gengliangwang !


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29564: [WIP][SPARK-32708] Query optimization fails to reuse exchange with DataSourceV2

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29564:
URL: https://github.com/apache/spark/pull/29564#issuecomment-690867723






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29564: [SPARK-32708] Query optimization fails to reuse exchange with DataSourceV2

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29564:
URL: https://github.com/apache/spark/pull/29564#issuecomment-691240572






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29564: [WIP][SPARK-32708] Query optimization fails to reuse exchange with DataSourceV2

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29564:
URL: https://github.com/apache/spark/pull/29564#issuecomment-690874399






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on a change in pull request #29564: [WIP][SPARK-32708] Query optimization fails to reuse exchange with DataSourceV2

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #29564:
URL: https://github.com/apache/spark/pull/29564#discussion_r486821529



##########
File path: sql/core/src/test/scala/org/apache/spark/sql/sources/v2/DataSourceV2Suite.scala
##########
@@ -393,6 +393,29 @@ class DataSourceV2Suite extends QueryTest with SharedSQLContext {
       }
     }
   }
+
+  test("SPARK-32708: same columns with different ExprIds should be equal after canonicalization ") {

Review comment:
       If we don't have an end-to-end test, how about a low-level UT? Create two `DataSourceV2ScanExec` instances and check `scan1.sameResult(scan2)`.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29564: [WIP][SPARK-32708] Query optimization fails to reuse exchange with DataSourceV2

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29564:
URL: https://github.com/apache/spark/pull/29564#issuecomment-682294491






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #29564: [WIP][SPARK-32708] Query optimization fails to reuse exchange with DataSourceV2

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #29564:
URL: https://github.com/apache/spark/pull/29564#issuecomment-690809986


   **[Test build #128541 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128541/testReport)** for PR 29564 at commit [`98483c8`](https://github.com/apache/spark/commit/98483c82151634760a82fc860ac6de26d4b024ba).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #29564: [WIP][SPARK-32708] Query optimization fails to reuse exchange with DataSourceV2

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29564:
URL: https://github.com/apache/spark/pull/29564#issuecomment-690867723






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #29564: [WIP][SPARK-32708] Query optimization fails to reuse exchange with DataSourceV2

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29564:
URL: https://github.com/apache/spark/pull/29564#issuecomment-682294491






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] mingjialiu removed a comment on pull request #29564: [WIP][SPARK-32708] Query optimization fails to reuse exchange with DataSourceV2

Posted by GitBox <gi...@apache.org>.
mingjialiu removed a comment on pull request #29564:
URL: https://github.com/apache/spark/pull/29564#issuecomment-690599004


   Hi,
   
   Regarding test coverage,  it's a bit tricky to repro in a unit test.
   Can I get some pointers on populating different expression ids for the same
   column? Or test suggestions?
   
      - CAN'T repro example in unit test:
   
       val df = spark.read.format(classOf[AdvancedDataSourceV2].getName).load()
       val q1 = df.select(($"i" + 1).as("k"), ($"i" - 1).as("j")).filter('i >
   5)
       val q2 = df.select(($"i" + 1).as("k"), ($"i" - 1).as("j")).filter('i >
   5)
       val scans1 = getV2ScanExecs(q1.join(q2, "j"))
       assert(scans1(0).sameResult(scans1(1)))
   
      scans1(0).sameResult(scans1(1)) will always return true even if filtered
   columns are not properly canonicalized (as circled in screenshots)
   
   
   
   [image: other_canonicalized.png]
   
   [image: this_canonicalized.png]
   
   
   
   
      - CAN repro query :  SPARK-32708
      <https://issues.apache.org/jira/browse/SPARK-32708>
   
      The key to repro is to have d_day_name and d_year  assigned different
   expression Ids.
      Relative implementation : preserve old expressionId if column not found
   <https://github.com/apache/spark/blob/branch-2.4/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/QueryPlan.scala#L285>
   
   
   Thank you,
   Mingjia
   
   
   
   
   On Wed, Sep 9, 2020 at 11:05 PM Wenchen Fan <no...@github.com>
   wrote:
   
   > The fix LGTM, can you add a test?
   >
   > —
   > You are receiving this because you were mentioned.
   > Reply to this email directly, view it on GitHub
   > <https://github.com/apache/spark/pull/29564#issuecomment-690008778>, or
   > unsubscribe
   > <https://github.com/notifications/unsubscribe-auth/AEMVJMYZD527O7EY27RUISLSFBUCZANCNFSM4QNQGG5Q>
   > .
   >
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29564: [WIP][SPARK-32708] Query optimization fails to reuse exchange with DataSourceV2

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29564:
URL: https://github.com/apache/spark/pull/29564#issuecomment-690867354


   **[Test build #128541 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128541/testReport)** for PR 29564 at commit [`98483c8`](https://github.com/apache/spark/commit/98483c82151634760a82fc860ac6de26d4b024ba).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] mingjialiu commented on pull request #29564: [WIP][SPARK-32708] Query optimization fails to reuse exchange with DataSourceV2

Posted by GitBox <gi...@apache.org>.
mingjialiu commented on pull request #29564:
URL: https://github.com/apache/spark/pull/29564#issuecomment-690811936


   > > The fix LGTM, can you add a test?
   > 
   > Regarding test coverage,  it's a bit tricky to repro in a unit test. Can I get some pointers on populating different expression ids for the same column? Or test suggestions?
   > 
   > The key to repro is to have the same column assigned different expression Ids.
   > Relative implementation : [preserve old expressionId if column not found](https://github.com/apache/spark/blob/branch-2.4/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/QueryPlan.scala#L285)
   > Explained details in email.
   
   Please ignore this message. I figured out that column's expression id is consistent within the same df.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on pull request #29564: [WIP][SPARK-32708] Query optimization fails to reuse exchange with DataSourceV2

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on pull request #29564:
URL: https://github.com/apache/spark/pull/29564#issuecomment-688289169


   If exchange reuse is broken, it means plan equality is broken somewhere. I think `Seq[Expression]` is OK as long as we canonicalize it before comparing it. `FileSourceScanExec` also contains `Seq[Expression]` and it's fine.
   
   Can you look into it more and have a more surgical fix?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org