You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2020/07/14 16:51:04 UTC

[GitHub] [spark] viirya opened a new pull request #29107: [SPARK-32308][SQL] Move by-name resolution logic of unionByName from API code to analysis phase

viirya opened a new pull request #29107:
URL: https://github.com/apache/spark/pull/29107


   <!--
   Thanks for sending a pull request!  Here are some tips for you:
     1. If this is your first time, please read our contributor guidelines: https://spark.apache.org/contributing.html
     2. Ensure you have added or run the appropriate tests for your PR: https://spark.apache.org/developer-tools.html
     3. If the PR is unfinished, add '[WIP]' in your PR title, e.g., '[WIP][SPARK-XXXX] Your PR title ...'.
     4. Be sure to keep the PR description updated to reflect all changes.
     5. Please write your PR title to summarize what this PR proposes.
     6. If possible, provide a concise example to reproduce the issue for a faster review.
     7. If you want to add a new configuration, please read the guideline first for naming configurations in
        'core/src/main/scala/org/apache/spark/internal/config/ConfigEntry.scala'.
   -->
   
   ### What changes were proposed in this pull request?
   <!--
   Please clarify what changes you are proposing. The purpose of this section is to outline the changes and how this PR fixes the issue. 
   If possible, please consider writing useful notes for better and faster reviews in your PR. See the examples below.
     1. If you refactor some codes with changing classes, showing the class hierarchy will help reviewers.
     2. If you fix some SQL features, you can provide some references of other DBMSes.
     3. If there is design documentation, please add the link.
     4. If there is a discussion in the mailing list, please add the link.
   -->
   
   Currently the by-name resolution logic of `unionByName` is put in API code. This patch moves the logic to analysis phase.
   See https://github.com/apache/spark/pull/28996#discussion_r453460284.
   
   ### Why are the changes needed?
   <!--
   Please clarify why the changes are needed. For instance,
     1. If you propose a new API, clarify the use case for a new API.
     2. If you fix a bug, you can clarify why it is a bug.
   -->
   
   Logically we should do resolution in analysis phase. This refactoring cleans up API method and makes consistent resolution.
   
   ### Does this PR introduce _any_ user-facing change?
   <!--
   Note that it means *any* user-facing change including all aspects such as the documentation fix.
   If yes, please clarify the previous behavior and the change this PR proposes - provide the console output, description and/or an example to show the behavior difference if possible.
   If possible, please also clarify if this is a user-facing change compared to the released Spark versions or within the unreleased branches such as master.
   If no, write 'No'.
   -->
   
   No
   
   ### How was this patch tested?
   <!--
   If tests were added, say they were added here. Please make sure to add some test cases that check the changes thoroughly including negative and positive cases if possible.
   If it was tested in a way different from regular unit tests, please clarify how you tested step by step, ideally copy and paste-able, so that other reviewers can test and check, and descendants can verify in the future.
   If tests were not added, please describe why they were not added and/or why it was difficult to add.
   -->
   
   Unit tests.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29107: [SPARK-32308][SQL] Move by-name resolution logic of unionByName from API code to analysis phase

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29107:
URL: https://github.com/apache/spark/pull/29107#issuecomment-660423538


   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/126077/
   Test FAILed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #29107: [SPARK-32308][SQL] Move by-name resolution logic of unionByName from API code to analysis phase

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29107:
URL: https://github.com/apache/spark/pull/29107#issuecomment-658481455






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29107: [SPARK-32308][SQL] Move by-name resolution logic of unionByName from API code to analysis phase

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29107:
URL: https://github.com/apache/spark/pull/29107#issuecomment-662629745






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #29107: [SPARK-32308][SQL] Move by-name resolution logic of unionByName from API code to analysis phase

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29107:
URL: https://github.com/apache/spark/pull/29107#issuecomment-662851219






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #29107: [SPARK-32308][SQL] Move by-name resolution logic of unionByName from API code to analysis phase

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #29107:
URL: https://github.com/apache/spark/pull/29107#issuecomment-662840761


   **[Test build #126388 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/126388/testReport)** for PR 29107 at commit [`2a9e1e4`](https://github.com/apache/spark/commit/2a9e1e44e03ce0551d619aec52d60cd2757c422b).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #29107: [SPARK-32308][SQL] Move by-name resolution logic of unionByName from API code to analysis phase

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #29107:
URL: https://github.com/apache/spark/pull/29107#issuecomment-662286329


   **[Test build #126311 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/126311/testReport)** for PR 29107 at commit [`eca8fc6`](https://github.com/apache/spark/commit/eca8fc612581cabee9aa5c376b538d51a7482931).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] viirya commented on pull request #29107: [SPARK-32308][SQL] Move by-name resolution logic of unionByName from API code to analysis phase

Posted by GitBox <gi...@apache.org>.
viirya commented on pull request #29107:
URL: https://github.com/apache/spark/pull/29107#issuecomment-663348711


   Thanks for review.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29107: [SPARK-32308][SQL] Move by-name resolution logic of unionByName from API code to analysis phase

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29107:
URL: https://github.com/apache/spark/pull/29107#issuecomment-663215454






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29107: [SPARK-32308][SQL] Move by-name resolution logic of unionByName from API code to analysis phase

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29107:
URL: https://github.com/apache/spark/pull/29107#issuecomment-660381500


   **[Test build #126077 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/126077/testReport)** for PR 29107 at commit [`c23898e`](https://github.com/apache/spark/commit/c23898ef1b120ea9e5d1659cdb76502214dce97b).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29107: [SPARK-32308][SQL] Move by-name resolution logic of unionByName from API code to analysis phase

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29107:
URL: https://github.com/apache/spark/pull/29107#issuecomment-662286804






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29107: [SPARK-32308][SQL] Move by-name resolution logic of unionByName from API code to analysis phase

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29107:
URL: https://github.com/apache/spark/pull/29107#issuecomment-660427670


   **[Test build #126093 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/126093/testReport)** for PR 29107 at commit [`c23898e`](https://github.com/apache/spark/commit/c23898ef1b120ea9e5d1659cdb76502214dce97b).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29107: [SPARK-32308][SQL] Move by-name resolution logic of unionByName from API code to analysis phase

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29107:
URL: https://github.com/apache/spark/pull/29107#issuecomment-660440464


   **[Test build #126093 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/126093/testReport)** for PR 29107 at commit [`c23898e`](https://github.com/apache/spark/commit/c23898ef1b120ea9e5d1659cdb76502214dce97b).
    * This patch **fails due to an unknown error code, -9**.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] mukulmurthy commented on pull request #29107: [SPARK-32308][SQL] Move by-name resolution logic of unionByName from API code to analysis phase

Posted by GitBox <gi...@apache.org>.
mukulmurthy commented on pull request #29107:
URL: https://github.com/apache/spark/pull/29107#issuecomment-661311103


   I just linked the JIRA to SPARK-29358 so it's easier to track the lineage.
   
   Is there already another follow up ticket to make this work for nested structs? 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #29107: [SPARK-32308][SQL] Move by-name resolution logic of unionByName from API code to analysis phase

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29107:
URL: https://github.com/apache/spark/pull/29107#issuecomment-662436463






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] mukulmurthy commented on pull request #29107: [SPARK-32308][SQL] Move by-name resolution logic of unionByName from API code to analysis phase

Posted by GitBox <gi...@apache.org>.
mukulmurthy commented on pull request #29107:
URL: https://github.com/apache/spark/pull/29107#issuecomment-661372139


   Awesome, I just created https://issues.apache.org/jira/browse/SPARK-32376.
   
   We've implemented a utility to do this before; I'll sync with my team and see if it's easy to port the code over and open a PR for that unless you're already planning on tackling it. Even if so, we're happy to share our test cases. 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] viirya commented on a change in pull request #29107: [SPARK-32308][SQL] Move by-name resolution logic of unionByName from API code to analysis phase

Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #29107:
URL: https://github.com/apache/spark/pull/29107#discussion_r458922812



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala
##########
@@ -337,6 +337,10 @@ trait CheckAnalysis extends PredicateHelper {
 
           case Tail(limitExpr, _) => checkLimitLikeClause("tail", limitExpr)
 
+          case Union(_, byName, allowMissingCol) if byName || allowMissingCol =>
+            failAnalysis("Union should not be with true `byName` or " +

Review comment:
       I add `!(byName || allowMissingCol)` to the condition at `Union.resolved` now. Then do we still need to check them at `CheckAnalysis` like above? Or just follow other nodes?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29107: [SPARK-32308][SQL] Move by-name resolution logic of unionByName from API code to analysis phase

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29107:
URL: https://github.com/apache/spark/pull/29107#issuecomment-662761539






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #29107: [SPARK-32308][SQL] Move by-name resolution logic of unionByName from API code to analysis phase

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29107:
URL: https://github.com/apache/spark/pull/29107#issuecomment-663215454






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29107: [SPARK-32308][SQL] Move by-name resolution logic of unionByName from API code to analysis phase

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29107:
URL: https://github.com/apache/spark/pull/29107#issuecomment-660440516


   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/126093/
   Test FAILed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #29107: [SPARK-32308][SQL] Move by-name resolution logic of unionByName from API code to analysis phase

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #29107:
URL: https://github.com/apache/spark/pull/29107#issuecomment-658354245


   **[Test build #125853 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125853/testReport)** for PR 29107 at commit [`8e4867a`](https://github.com/apache/spark/commit/8e4867ac334c3b84154f75c364af44d203309cd4).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29107: [SPARK-32308][SQL] Move by-name resolution logic of unionByName from API code to analysis phase

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29107:
URL: https://github.com/apache/spark/pull/29107#issuecomment-663049879






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] viirya commented on pull request #29107: [SPARK-32308][SQL] Move by-name resolution logic of unionByName from API code to analysis phase

Posted by GitBox <gi...@apache.org>.
viirya commented on pull request #29107:
URL: https://github.com/apache/spark/pull/29107#issuecomment-661331758


   > I just linked the JIRA to [SPARK-29358](https://issues.apache.org/jira/browse/SPARK-29358) so it's easier to track the lineage.
   > 
   > Is there already another follow up ticket to make this work for nested structs?
   
   Thanks. I don't see an existing follow up ticket for nested structs.
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29107: [SPARK-32308][SQL] Move by-name resolution logic of unionByName from API code to analysis phase

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29107:
URL: https://github.com/apache/spark/pull/29107#issuecomment-662092632






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] viirya commented on a change in pull request #29107: [SPARK-32308][SQL] Move by-name resolution logic of unionByName from API code to analysis phase

Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #29107:
URL: https://github.com/apache/spark/pull/29107#discussion_r458368672



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
##########
@@ -3676,3 +3678,63 @@ object UpdateOuterReferences extends Rule[LogicalPlan] {
     }
   }
 }
+
+/**
+ * Resolves different children of Union to a common set of columns. Note that this must be
+ * run before `TypeCoercion`, because `TypeCoercion` should be run on correctly resolved
+ * column by name.
+ */
+object ResolveUnion extends Rule[LogicalPlan] {

Review comment:
       sure.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29107: [SPARK-32308][SQL] Move by-name resolution logic of unionByName from API code to analysis phase

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29107:
URL: https://github.com/apache/spark/pull/29107#issuecomment-662092167


   **[Test build #126277 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/126277/testReport)** for PR 29107 at commit [`1839987`](https://github.com/apache/spark/commit/1839987d63fc94a71e064f82b57b400eaa7fb220).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] viirya commented on pull request #29107: [SPARK-32308][SQL] Move by-name resolution logic of unionByName from API code to analysis phase

Posted by GitBox <gi...@apache.org>.
viirya commented on pull request #29107:
URL: https://github.com/apache/spark/pull/29107#issuecomment-661406076


   Thanks @mukulmurthy. I've not worked on it yet. Feel free to open a PR if you are ready to port the code.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29107: [SPARK-32308][SQL] Move by-name resolution logic of unionByName from API code to analysis phase

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29107:
URL: https://github.com/apache/spark/pull/29107#issuecomment-663049326


   **[Test build #126417 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/126417/testReport)** for PR 29107 at commit [`2a9e1e4`](https://github.com/apache/spark/commit/2a9e1e44e03ce0551d619aec52d60cd2757c422b).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #29107: [SPARK-32308][SQL] Move by-name resolution logic of unionByName from API code to analysis phase

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29107:
URL: https://github.com/apache/spark/pull/29107#issuecomment-660381783






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #29107: [SPARK-32308][SQL] Move by-name resolution logic of unionByName from API code to analysis phase

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29107:
URL: https://github.com/apache/spark/pull/29107#issuecomment-660427858






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan closed pull request #29107: [SPARK-32308][SQL] Move by-name resolution logic of unionByName from API code to analysis phase

Posted by GitBox <gi...@apache.org>.
cloud-fan closed pull request #29107:
URL: https://github.com/apache/spark/pull/29107


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29107: [SPARK-32308][SQL] Move by-name resolution logic of unionByName from API code to analysis phase

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29107:
URL: https://github.com/apache/spark/pull/29107#issuecomment-662282455


   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/126294/
   Test FAILed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29107: [SPARK-32308][SQL] Move by-name resolution logic of unionByName from API code to analysis phase

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29107:
URL: https://github.com/apache/spark/pull/29107#issuecomment-658354949






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29107: [SPARK-32308][SQL] Move by-name resolution logic of unionByName from API code to analysis phase

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29107:
URL: https://github.com/apache/spark/pull/29107#issuecomment-662231827


   **[Test build #126294 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/126294/testReport)** for PR 29107 at commit [`eca8fc6`](https://github.com/apache/spark/commit/eca8fc612581cabee9aa5c376b538d51a7482931).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] viirya commented on a change in pull request #29107: [SPARK-32308][SQL] Move by-name resolution logic of unionByName from API code to analysis phase

Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #29107:
URL: https://github.com/apache/spark/pull/29107#discussion_r458916665



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala
##########
@@ -683,7 +683,7 @@ abstract class SparkStrategies extends QueryPlanner[SparkPlan] {
         execution.LocalLimitExec(limit, planLater(child)) :: Nil
       case logical.GlobalLimit(IntegerLiteral(limit), child) =>
         execution.GlobalLimitExec(limit, planLater(child)) :: Nil
-      case logical.Union(unionChildren) =>
+      case logical.Union(unionChildren, _, _) =>

Review comment:
       oh ok.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on a change in pull request #29107: [SPARK-32308][SQL] Move by-name resolution logic of unionByName from API code to analysis phase

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #29107:
URL: https://github.com/apache/spark/pull/29107#discussion_r455725949



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala
##########
@@ -1099,6 +1101,64 @@ object TypeCoercion {
         DateSub(l, Literal(days))
     }
   }
+
+  /**
+   * Coerces different children of Union to a common set of columns. Note that this must be
+   * run before `WidenSetOperationTypes`, because `WidenSetOperationTypes` should be run on
+   * correctly resolved column by name.
+   */
+  object UnionCoercion extends TypeCoercionRule {
+    private def unionTwoSides(
+        left: LogicalPlan, right: LogicalPlan, allowMissingCol: Boolean): LogicalPlan = {
+      val resolver = SQLConf.get.resolver
+      val leftOutputAttrs = left.output
+      val rightOutputAttrs = right.output
+
+      // Builds a project list for `right` based on `left` output names
+      val rightProjectList = leftOutputAttrs.map { lattr =>
+        rightOutputAttrs.find { rattr => resolver(lattr.name, rattr.name) }.getOrElse {
+          if (allowMissingCol) {
+            Alias(Literal(null, lattr.dataType), lattr.name)()
+          } else {
+            throw new AnalysisException(
+              s"""Cannot resolve column name "${lattr.name}" among """ +
+                s"""(${rightOutputAttrs.map(_.name).mkString(", ")})""")
+          }
+        }
+      }
+
+      // Delegates failure checks to `CheckAnalysis`
+      val notFoundAttrs = rightOutputAttrs.diff(rightProjectList)
+      val rightChild = Project(rightProjectList ++ notFoundAttrs, right)
+
+      // Builds a project for `logicalPlan` based on `right` output names, if allowing
+      // missing columns.
+      val leftChild = if (allowMissingCol) {
+        val missingAttrs = notFoundAttrs.map { attr =>
+          Alias(Literal(null, attr.dataType), attr.name)()
+        }
+        if (missingAttrs.nonEmpty) {
+          Project(leftOutputAttrs ++ missingAttrs, left)
+        } else {
+          left
+        }
+      } else {
+        left
+      }
+      Union(leftChild, rightChild)
+    }
+
+    override protected def coerceTypes(plan: LogicalPlan): LogicalPlan = plan resolveOperatorsUp {
+      case e if !e.childrenResolved => e
+
+      case Union(children, byName, allowMissingCol)
+          if byName =>
+        val union = children.reduceLeft { (left: LogicalPlan, right: LogicalPlan) =>
+          unionTwoSides(left, right, allowMissingCol)

Review comment:
       yea maybe a new analyzer rule.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29107: [SPARK-32308][SQL] Move by-name resolution logic of unionByName from API code to analysis phase

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29107:
URL: https://github.com/apache/spark/pull/29107#issuecomment-662436463






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #29107: [SPARK-32308][SQL] Move by-name resolution logic of unionByName from API code to analysis phase

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #29107:
URL: https://github.com/apache/spark/pull/29107#issuecomment-662557134


   **[Test build #126348 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/126348/testReport)** for PR 29107 at commit [`0381f5d`](https://github.com/apache/spark/commit/0381f5db317e848851e661d27d6f7a35eb7d637f).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29107: [SPARK-32308][SQL] Move by-name resolution logic of unionByName from API code to analysis phase

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29107:
URL: https://github.com/apache/spark/pull/29107#issuecomment-662831333






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #29107: [SPARK-32308][SQL] Move by-name resolution logic of unionByName from API code to analysis phase

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29107:
URL: https://github.com/apache/spark/pull/29107#issuecomment-660536317






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on a change in pull request #29107: [SPARK-32308][SQL] Move by-name resolution logic of unionByName from API code to analysis phase

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #29107:
URL: https://github.com/apache/spark/pull/29107#discussion_r458635997



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala
##########
@@ -337,6 +337,10 @@ trait CheckAnalysis extends PredicateHelper {
 
           case Tail(limitExpr, _) => checkLimitLikeClause("tail", limitExpr)
 
+          case Union(_, byName, allowMissingCol) if byName || allowMissingCol =>
+            failAnalysis("Union should not be with true `byName` or " +

Review comment:
       To be safe, shall we override `Union.resolved` to include this?

##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala
##########
@@ -337,6 +337,10 @@ trait CheckAnalysis extends PredicateHelper {
 
           case Tail(limitExpr, _) => checkLimitLikeClause("tail", limitExpr)
 
+          case Union(_, byName, allowMissingCol) if byName || allowMissingCol =>
+            failAnalysis("Union should not be with true `byName` or " +

Review comment:
       To be safe, shall we also override `Union.resolved` to include this?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29107: [SPARK-32308][SQL] Move by-name resolution logic of unionByName from API code to analysis phase

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29107:
URL: https://github.com/apache/spark/pull/29107#issuecomment-658292521


   **[Test build #125845 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125845/testReport)** for PR 29107 at commit [`93c5ea1`](https://github.com/apache/spark/commit/93c5ea1973c062f721693fc3cc8419a336796beb).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29107: [SPARK-32308][SQL] Move by-name resolution logic of unionByName from API code to analysis phase

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29107:
URL: https://github.com/apache/spark/pull/29107#issuecomment-662851223


   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/126388/
   Test FAILed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] viirya commented on pull request #29107: [SPARK-32308][SQL] Move by-name resolution logic of unionByName from API code to analysis phase

Posted by GitBox <gi...@apache.org>.
viirya commented on pull request #29107:
URL: https://github.com/apache/spark/pull/29107#issuecomment-662285441


   retest this please


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29107: [SPARK-32308][SQL] Move by-name resolution logic of unionByName from API code to analysis phase

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29107:
URL: https://github.com/apache/spark/pull/29107#issuecomment-658293153






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on a change in pull request #29107: [SPARK-32308][SQL] Move by-name resolution logic of unionByName from API code to analysis phase

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #29107:
URL: https://github.com/apache/spark/pull/29107#discussion_r459212692



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala
##########
@@ -271,7 +281,7 @@ case class Union(children: Seq[LogicalPlan]) extends LogicalPlan {
         child.output.zip(children.head.output).forall {
           case (l, r) => l.dataType.sameType(r.dataType)
         })
-    children.length > 1 && childrenResolved && allChildrenCompatible
+    children.length > 1 && childrenResolved && allChildrenCompatible && !(byName || allowMissingCol)

Review comment:
       nit: `!(byName || allowMissingCol)` should be checked before `childrenResolved`, as it's cheaper.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29107: [SPARK-32308][SQL] Move by-name resolution logic of unionByName from API code to analysis phase

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29107:
URL: https://github.com/apache/spark/pull/29107#issuecomment-660500244






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] maropu commented on a change in pull request #29107: [SPARK-32308][SQL] Move by-name resolution logic of unionByName from API code to analysis phase

Posted by GitBox <gi...@apache.org>.
maropu commented on a change in pull request #29107:
URL: https://github.com/apache/spark/pull/29107#discussion_r458470991



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveUnion.scala
##########
@@ -0,0 +1,100 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.analysis
+
+import org.apache.spark.sql.AnalysisException
+import org.apache.spark.sql.catalyst.expressions.{Alias, Literal}
+import org.apache.spark.sql.catalyst.optimizer.CombineUnions
+import org.apache.spark.sql.catalyst.plans.logical.{LogicalPlan, Project, Union}
+import org.apache.spark.sql.catalyst.rules.Rule
+import org.apache.spark.sql.internal.SQLConf
+import org.apache.spark.sql.util.SchemaUtils
+
+/**
+ * Resolves different children of Union to a common set of columns.
+ */
+object ResolveUnion extends Rule[LogicalPlan] {
+  private def unionTwoSides(
+      left: LogicalPlan,
+      right: LogicalPlan,
+      allowMissingCol: Boolean): LogicalPlan = {
+    val resolver = SQLConf.get.resolver
+    val leftOutputAttrs = left.output
+    val rightOutputAttrs = right.output
+
+    // Builds a project list for `right` based on `left` output names
+    val rightProjectList = leftOutputAttrs.map { lattr =>
+      rightOutputAttrs.find { rattr => resolver(lattr.name, rattr.name) }.getOrElse {
+        if (allowMissingCol) {
+          Alias(Literal(null, lattr.dataType), lattr.name)()
+        } else {
+          throw new AnalysisException(
+            s"""Cannot resolve column name "${lattr.name}" among """ +

Review comment:
       How about making it consistent (using ` for wrapping a column name)?
   ```
             throw new AnalysisException(
               s"Cannot resolve column name `${lattr.name}` among " +
                 s"(${rightOutputAttrs.map(_.name).mkString(", ")})")
   ```
   https://github.com/apache/spark/pull/29107/files#diff-1d14ac233eac6f233c027dba0bdf871dR341

##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveUnion.scala
##########
@@ -0,0 +1,100 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.analysis
+
+import org.apache.spark.sql.AnalysisException
+import org.apache.spark.sql.catalyst.expressions.{Alias, Literal}
+import org.apache.spark.sql.catalyst.optimizer.CombineUnions
+import org.apache.spark.sql.catalyst.plans.logical.{LogicalPlan, Project, Union}
+import org.apache.spark.sql.catalyst.rules.Rule
+import org.apache.spark.sql.internal.SQLConf
+import org.apache.spark.sql.util.SchemaUtils
+
+/**
+ * Resolves different children of Union to a common set of columns.
+ */
+object ResolveUnion extends Rule[LogicalPlan] {
+  private def unionTwoSides(
+      left: LogicalPlan,
+      right: LogicalPlan,
+      allowMissingCol: Boolean): LogicalPlan = {
+    val resolver = SQLConf.get.resolver
+    val leftOutputAttrs = left.output
+    val rightOutputAttrs = right.output
+
+    // Builds a project list for `right` based on `left` output names
+    val rightProjectList = leftOutputAttrs.map { lattr =>
+      rightOutputAttrs.find { rattr => resolver(lattr.name, rattr.name) }.getOrElse {
+        if (allowMissingCol) {
+          Alias(Literal(null, lattr.dataType), lattr.name)()
+        } else {
+          throw new AnalysisException(
+            s"""Cannot resolve column name "${lattr.name}" among """ +
+              s"""(${rightOutputAttrs.map(_.name).mkString(", ")})""")
+        }
+      }
+    }
+
+    // Delegates failure checks to `CheckAnalysis`
+    val notFoundAttrs = rightOutputAttrs.diff(rightProjectList)
+    val rightChild = Project(rightProjectList ++ notFoundAttrs, right)
+
+    // Builds a project for `logicalPlan` based on `right` output names, if allowing
+    // missing columns.
+    val leftChild = if (allowMissingCol) {
+      val missingAttrs = notFoundAttrs.map { attr =>
+        Alias(Literal(null, attr.dataType), attr.name)()
+      }
+      if (missingAttrs.nonEmpty) {
+        Project(leftOutputAttrs ++ missingAttrs, left)
+      } else {
+        left
+      }
+    } else {
+      left
+    }
+    Union(leftChild, rightChild)
+  }
+
+  // Check column name duplication
+  private def checkColumnNames(left: LogicalPlan, right: LogicalPlan): Unit = {
+    val caseSensitiveAnalysis = SQLConf.get.caseSensitiveAnalysis
+    val leftOutputAttrs = left.output
+    val rightOutputAttrs = right.output
+
+    SchemaUtils.checkColumnNameDuplication(
+      leftOutputAttrs.map(_.name),
+      "in the left attributes",
+      caseSensitiveAnalysis)
+    SchemaUtils.checkColumnNameDuplication(
+      rightOutputAttrs.map(_.name),
+      "in the right attributes",
+      caseSensitiveAnalysis)
+  }
+
+  def apply(plan: LogicalPlan): LogicalPlan = plan resolveOperatorsUp {
+    case e if !e.childrenResolved => e
+
+    case Union(children, byName, allowMissingCol) if byName =>
+      val union = children.reduceLeft { (left: LogicalPlan, right: LogicalPlan) =>

Review comment:
       nit: `{ (left: LogicalPlan, right: LogicalPlan) =>` -> `{ (left, right) =>`

##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala
##########
@@ -342,9 +342,10 @@ object TypeCoercion {
         Intersect(newChildren.head, newChildren.last, isAll)
 
       case s: Union if s.childrenResolved &&
-          s.children.forall(_.output.length == s.children.head.output.length) && !s.resolved =>
+          s.children.forall(_.output.length == s.children.head.output.length) && !s.resolved &&
+          !s.byName =>

Review comment:
       nit: Since `s.byName` is constant, how about evaluating it first?
   ```
         case s @ Union(_, false, _) if s.childrenResolved && !s.byName &&
             s.children.forall(_.output.length == s.children.head.output.length) && !s.resolved =>
   ```




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29107: [SPARK-32308][SQL] Move by-name resolution logic of unionByName from API code to analysis phase

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29107:
URL: https://github.com/apache/spark/pull/29107#issuecomment-658481104


   **[Test build #125853 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125853/testReport)** for PR 29107 at commit [`8e4867a`](https://github.com/apache/spark/commit/8e4867ac334c3b84154f75c364af44d203309cd4).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] viirya commented on a change in pull request #29107: [SPARK-32308][SQL] Move by-name resolution logic of unionByName from API code to analysis phase

Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #29107:
URL: https://github.com/apache/spark/pull/29107#discussion_r458922812



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala
##########
@@ -337,6 +337,10 @@ trait CheckAnalysis extends PredicateHelper {
 
           case Tail(limitExpr, _) => checkLimitLikeClause("tail", limitExpr)
 
+          case Union(_, byName, allowMissingCol) if byName || allowMissingCol =>
+            failAnalysis("Union should not be with true `byName` or " +

Review comment:
       I add `!(byName || allowMissingCol)` to the condition at `Union.resolved` now. Then do we still need to check them at `CheckAnalysis` like above? Or just follow other nodes with special check?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #29107: [SPARK-32308][SQL] Move by-name resolution logic of unionByName from API code to analysis phase

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29107:
URL: https://github.com/apache/spark/pull/29107#issuecomment-660423535






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #29107: [SPARK-32308][SQL] Move by-name resolution logic of unionByName from API code to analysis phase

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #29107:
URL: https://github.com/apache/spark/pull/29107#issuecomment-663049326


   **[Test build #126417 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/126417/testReport)** for PR 29107 at commit [`2a9e1e4`](https://github.com/apache/spark/commit/2a9e1e44e03ce0551d619aec52d60cd2757c422b).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29107: [SPARK-32308][SQL] Move by-name resolution logic of unionByName from API code to analysis phase

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29107:
URL: https://github.com/apache/spark/pull/29107#issuecomment-662192692






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] viirya commented on a change in pull request #29107: [SPARK-32308][SQL] Move by-name resolution logic of unionByName from API code to analysis phase

Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #29107:
URL: https://github.com/apache/spark/pull/29107#discussion_r456742634



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala
##########
@@ -337,6 +337,10 @@ trait CheckAnalysis extends PredicateHelper {
 
           case Tail(limitExpr, _) => checkLimitLikeClause("tail", limitExpr)
 
+          case Union(_, byName, allowMissingCol) if byName || allowMissingCol =>
+            failAnalysis("Union should not be with true `byName` or " +
+              "`allowMissingCol` flags after analysis phase.")

Review comment:
       Usually not. This mainly prevents we accidentally create a Union with byName or allowMissingCol after ResolveUnion rule.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on a change in pull request #29107: [SPARK-32308][SQL] Move by-name resolution logic of unionByName from API code to analysis phase

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #29107:
URL: https://github.com/apache/spark/pull/29107#discussion_r458636985



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala
##########
@@ -683,7 +683,7 @@ abstract class SparkStrategies extends QueryPlanner[SparkPlan] {
         execution.LocalLimitExec(limit, planLater(child)) :: Nil
       case logical.GlobalLimit(IntegerLiteral(limit), child) =>
         execution.GlobalLimitExec(limit, planLater(child)) :: Nil
-      case logical.Union(unionChildren) =>
+      case logical.Union(unionChildren, _, _) =>

Review comment:
       this pattern appears many times (`Union(children, _, _)`), and actually we can rewrite to
   ```
   case union: logical.Union =>
     execution.UnionExec(union.children.map(planLater)) :: Nil
   ```




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29107: [SPARK-32308][SQL] Move by-name resolution logic of unionByName from API code to analysis phase

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29107:
URL: https://github.com/apache/spark/pull/29107#issuecomment-658398511


   Merged build finished. Test FAILed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29107: [SPARK-32308][SQL] Move by-name resolution logic of unionByName from API code to analysis phase

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29107:
URL: https://github.com/apache/spark/pull/29107#issuecomment-660423535


   Merged build finished. Test FAILed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] maropu commented on a change in pull request #29107: [SPARK-32308][SQL] Move by-name resolution logic of unionByName from API code to analysis phase

Posted by GitBox <gi...@apache.org>.
maropu commented on a change in pull request #29107:
URL: https://github.com/apache/spark/pull/29107#discussion_r454844933



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala
##########
@@ -1099,6 +1101,64 @@ object TypeCoercion {
         DateSub(l, Literal(days))
     }
   }
+
+  /**
+   * Coerces different children of Union to a common set of columns. Note that this must be
+   * run before `WidenSetOperationTypes`, because `WidenSetOperationTypes` should be run on
+   * correctly resolved column by name.
+   */
+  object UnionCoercion extends TypeCoercionRule {
+    private def unionTwoSides(
+        left: LogicalPlan, right: LogicalPlan, allowMissingCol: Boolean): LogicalPlan = {
+      val resolver = SQLConf.get.resolver
+      val leftOutputAttrs = left.output
+      val rightOutputAttrs = right.output
+
+      // Builds a project list for `right` based on `left` output names
+      val rightProjectList = leftOutputAttrs.map { lattr =>
+        rightOutputAttrs.find { rattr => resolver(lattr.name, rattr.name) }.getOrElse {
+          if (allowMissingCol) {
+            Alias(Literal(null, lattr.dataType), lattr.name)()
+          } else {
+            throw new AnalysisException(
+              s"""Cannot resolve column name "${lattr.name}" among """ +
+                s"""(${rightOutputAttrs.map(_.name).mkString(", ")})""")
+          }
+        }
+      }
+
+      // Delegates failure checks to `CheckAnalysis`
+      val notFoundAttrs = rightOutputAttrs.diff(rightProjectList)
+      val rightChild = Project(rightProjectList ++ notFoundAttrs, right)
+
+      // Builds a project for `logicalPlan` based on `right` output names, if allowing
+      // missing columns.
+      val leftChild = if (allowMissingCol) {
+        val missingAttrs = notFoundAttrs.map { attr =>
+          Alias(Literal(null, attr.dataType), attr.name)()
+        }
+        if (missingAttrs.nonEmpty) {
+          Project(leftOutputAttrs ++ missingAttrs, left)
+        } else {
+          left
+        }
+      } else {
+        left
+      }
+      Union(leftChild, rightChild)
+    }
+
+    override protected def coerceTypes(plan: LogicalPlan): LogicalPlan = plan resolveOperatorsUp {
+      case e if !e.childrenResolved => e
+
+      case Union(children, byName, allowMissingCol)
+          if byName =>
+        val union = children.reduceLeft { (left: LogicalPlan, right: LogicalPlan) =>
+          unionTwoSides(left, right, allowMissingCol)

Review comment:
       Just a question; the process of `unionTwoSides` is a type-coercion one? Seems like column resolution rather than type coercion.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on a change in pull request #29107: [SPARK-32308][SQL] Move by-name resolution logic of unionByName from API code to analysis phase

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #29107:
URL: https://github.com/apache/spark/pull/29107#discussion_r458962281



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala
##########
@@ -337,6 +337,10 @@ trait CheckAnalysis extends PredicateHelper {
 
           case Tail(limitExpr, _) => checkLimitLikeClause("tail", limitExpr)
 
+          case Union(_, byName, allowMissingCol) if byName || allowMissingCol =>
+            failAnalysis("Union should not be with true `byName` or " +

Review comment:
       Since this is not something users can hit(only bug can trigger this), maybe we can just remove this check here.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on a change in pull request #29107: [SPARK-32308][SQL] Move by-name resolution logic of unionByName from API code to analysis phase

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #29107:
URL: https://github.com/apache/spark/pull/29107#discussion_r458037373



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
##########
@@ -3676,3 +3678,63 @@ object UpdateOuterReferences extends Rule[LogicalPlan] {
     }
   }
 }
+
+/**
+ * Resolves different children of Union to a common set of columns. Note that this must be
+ * run before `TypeCoercion`, because `TypeCoercion` should be run on correctly resolved
+ * column by name.
+ */
+object ResolveUnion extends Rule[LogicalPlan] {
+  private def unionTwoSides(
+      left: LogicalPlan,
+      right: LogicalPlan,
+      allowMissingCol: Boolean): LogicalPlan = {
+    val resolver = SQLConf.get.resolver
+    val leftOutputAttrs = left.output
+    val rightOutputAttrs = right.output
+
+    // Builds a project list for `right` based on `left` output names
+    val rightProjectList = leftOutputAttrs.map { lattr =>
+      rightOutputAttrs.find { rattr => resolver(lattr.name, rattr.name) }.getOrElse {
+        if (allowMissingCol) {
+          Alias(Literal(null, lattr.dataType), lattr.name)()
+        } else {
+          throw new AnalysisException(
+            s"""Cannot resolve column name "${lattr.name}" among """ +
+              s"""(${rightOutputAttrs.map(_.name).mkString(", ")})""")
+        }
+      }
+    }
+
+    // Delegates failure checks to `CheckAnalysis`
+    val notFoundAttrs = rightOutputAttrs.diff(rightProjectList)
+    val rightChild = Project(rightProjectList ++ notFoundAttrs, right)
+
+    // Builds a project for `logicalPlan` based on `right` output names, if allowing
+    // missing columns.
+    val leftChild = if (allowMissingCol) {
+      val missingAttrs = notFoundAttrs.map { attr =>
+        Alias(Literal(null, attr.dataType), attr.name)()
+      }
+      if (missingAttrs.nonEmpty) {
+        Project(leftOutputAttrs ++ missingAttrs, left)
+      } else {
+        left
+      }
+    } else {
+      left
+    }
+    Union(leftChild, rightChild)
+  }
+
+  def apply(plan: LogicalPlan): LogicalPlan = plan resolveOperatorsUp {
+    case e if !e.childrenResolved => e
+
+    case Union(children, byName, allowMissingCol)
+      if byName =>

Review comment:
       +1




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] viirya commented on a change in pull request #29107: [SPARK-32308][SQL] Move by-name resolution logic of unionByName from API code to analysis phase

Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #29107:
URL: https://github.com/apache/spark/pull/29107#discussion_r458909616



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
##########
@@ -1387,7 +1388,7 @@ class Analyzer(
         i.copy(right = dedupRight(left, right))
       case e @ Except(left, right, _) if !e.duplicateResolved =>
         e.copy(right = dedupRight(left, right))
-      case u @ Union(children) if !u.duplicateResolved =>
+      case u @ Union(children, _, _) if !u.duplicateResolved =>

Review comment:
       `duplicateResolved` checks attribute sets from children that look at `exprId` actually. By-name resolution is only for attribute name. If you think it is safer, I can add it here.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] viirya commented on a change in pull request #29107: [SPARK-32308][SQL] Move by-name resolution logic of unionByName from API code to analysis phase

Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #29107:
URL: https://github.com/apache/spark/pull/29107#discussion_r456742673



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala
##########
@@ -337,6 +337,10 @@ trait CheckAnalysis extends PredicateHelper {
 
           case Tail(limitExpr, _) => checkLimitLikeClause("tail", limitExpr)
 
+          case Union(_, byName, allowMissingCol) if byName || allowMissingCol =>
+            failAnalysis("Union should not be with true `byName` or " +
+              "`allowMissingCol` flags after analysis phase.")

Review comment:
       And, yes, prevent a unexpected bug during analysis.

##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
##########
@@ -3676,3 +3678,63 @@ object UpdateOuterReferences extends Rule[LogicalPlan] {
     }
   }
 }
+
+/**
+ * Resolves different children of Union to a common set of columns. Note that this must be
+ * run before `TypeCoercion`, because `TypeCoercion` should be run on correctly resolved
+ * column by name.
+ */
+object ResolveUnion extends Rule[LogicalPlan] {
+  private def unionTwoSides(
+      left: LogicalPlan,
+      right: LogicalPlan,
+      allowMissingCol: Boolean): LogicalPlan = {
+    val resolver = SQLConf.get.resolver
+    val leftOutputAttrs = left.output
+    val rightOutputAttrs = right.output
+
+    // Builds a project list for `right` based on `left` output names
+    val rightProjectList = leftOutputAttrs.map { lattr =>
+      rightOutputAttrs.find { rattr => resolver(lattr.name, rattr.name) }.getOrElse {
+        if (allowMissingCol) {
+          Alias(Literal(null, lattr.dataType), lattr.name)()
+        } else {
+          throw new AnalysisException(
+            s"""Cannot resolve column name "${lattr.name}" among """ +
+              s"""(${rightOutputAttrs.map(_.name).mkString(", ")})""")
+        }
+      }
+    }
+
+    // Delegates failure checks to `CheckAnalysis`
+    val notFoundAttrs = rightOutputAttrs.diff(rightProjectList)
+    val rightChild = Project(rightProjectList ++ notFoundAttrs, right)
+
+    // Builds a project for `logicalPlan` based on `right` output names, if allowing
+    // missing columns.
+    val leftChild = if (allowMissingCol) {
+      val missingAttrs = notFoundAttrs.map { attr =>
+        Alias(Literal(null, attr.dataType), attr.name)()
+      }
+      if (missingAttrs.nonEmpty) {
+        Project(leftOutputAttrs ++ missingAttrs, left)
+      } else {
+        left
+      }
+    } else {
+      left
+    }
+    Union(leftChild, rightChild)
+  }
+
+  def apply(plan: LogicalPlan): LogicalPlan = plan resolveOperatorsUp {
+    case e if !e.childrenResolved => e
+
+    case Union(children, byName, allowMissingCol)
+      if byName =>

Review comment:
       ok




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29107: [SPARK-32308][SQL] Move by-name resolution logic of unionByName from API code to analysis phase

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29107:
URL: https://github.com/apache/spark/pull/29107#issuecomment-662557134


   **[Test build #126348 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/126348/testReport)** for PR 29107 at commit [`0381f5d`](https://github.com/apache/spark/commit/0381f5db317e848851e661d27d6f7a35eb7d637f).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #29107: [SPARK-32308][SQL] Move by-name resolution logic of unionByName from API code to analysis phase

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29107:
URL: https://github.com/apache/spark/pull/29107#issuecomment-662286804






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29107: [SPARK-32308][SQL] Move by-name resolution logic of unionByName from API code to analysis phase

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29107:
URL: https://github.com/apache/spark/pull/29107#issuecomment-662435584


   **[Test build #126311 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/126311/testReport)** for PR 29107 at commit [`eca8fc6`](https://github.com/apache/spark/commit/eca8fc612581cabee9aa5c376b538d51a7482931).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #29107: [SPARK-32308][SQL] Move by-name resolution logic of unionByName from API code to analysis phase

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29107:
URL: https://github.com/apache/spark/pull/29107#issuecomment-662557943






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on a change in pull request #29107: [SPARK-32308][SQL] Move by-name resolution logic of unionByName from API code to analysis phase

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #29107:
URL: https://github.com/apache/spark/pull/29107#discussion_r458962571



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/PropagateEmptyRelation.scala
##########
@@ -50,7 +50,7 @@ object PropagateEmptyRelation extends Rule[LogicalPlan] with PredicateHelper wit
   override def conf: SQLConf = SQLConf.get
 
   def apply(plan: LogicalPlan): LogicalPlan = plan transformUp {
-    case p @ Union(children) if children.exists(isEmptyLocalRelation) =>
+    case p @ Union(children, _, _) if children.exists(isEmptyLocalRelation) =>

Review comment:
       You are right




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on a change in pull request #29107: [SPARK-32308][SQL] Move by-name resolution logic of unionByName from API code to analysis phase

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #29107:
URL: https://github.com/apache/spark/pull/29107#discussion_r458042583



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
##########
@@ -928,19 +928,28 @@ object CombineUnions extends Rule[LogicalPlan] {
   }
 
   private def flattenUnion(union: Union, flattenDistinct: Boolean): Union = {
+    val topByName = union.byName
+    val topAllowMissingCol = union.allowMissingCol
+
     val stack = mutable.Stack[LogicalPlan](union)
     val flattened = mutable.ArrayBuffer.empty[LogicalPlan]
+    // Note that we should only flatten the unions with same byName and allowMissingCol.
+    // Although we do `UnionCoercion` at analysis phase, we manually run `CombineUnions`
+    // in some places like `Dataset.union`. Flattening unions with different resolution
+    // rules (by position and by name) could cause incorrect results.
     while (stack.nonEmpty) {
       stack.pop() match {
-        case Distinct(Union(children)) if flattenDistinct =>
+        case Distinct(Union(children, byName, allowMissingCol))
+            if flattenDistinct && byName == topByName && allowMissingCol == topAllowMissingCol =>
           stack.pushAll(children.reverse)
-        case Union(children) =>
+        case Union(children, byName, allowMissingCol)
+            if byName == topByName && allowMissingCol == topAllowMissingCol =>
           stack.pushAll(children.reverse)
         case child =>
           flattened += child
       }
     }
-    Union(flattened.toSeq)
+    Union(flattened, topByName, topAllowMissingCol)

Review comment:
       ditto: use `copy`




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on a change in pull request #29107: [SPARK-32308][SQL] Move by-name resolution logic of unionByName from API code to analysis phase

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #29107:
URL: https://github.com/apache/spark/pull/29107#discussion_r458044995



##########
File path: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/ResolveUnionSuite.scala
##########
@@ -0,0 +1,69 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.sql.catalyst.analysis
+
+import org.apache.spark.sql.catalyst.expressions._
+import org.apache.spark.sql.catalyst.plans.logical._
+import org.apache.spark.sql.catalyst.rules.RuleExecutor
+import org.apache.spark.sql.types._
+
+class ResolveUnionSuite extends AnalysisTest {
+  test("Resolve Union") {
+    val table1 = LocalRelation(
+      AttributeReference("i", IntegerType)(),
+      AttributeReference("u", DecimalType.SYSTEM_DEFAULT)(),
+      AttributeReference("b", ByteType)(),
+      AttributeReference("d", DoubleType)())
+    val table2 = LocalRelation(
+      AttributeReference("u", DecimalType.SYSTEM_DEFAULT)(),
+      AttributeReference("b", ByteType)(),
+      AttributeReference("d", DoubleType)(),
+      AttributeReference("i", IntegerType)())
+    val table3 = LocalRelation(
+      AttributeReference("u", DecimalType.SYSTEM_DEFAULT)(),
+      AttributeReference("d", DoubleType)(),
+      AttributeReference("i", IntegerType)())
+
+    val rules = Seq(ResolveUnion)
+    val analyzer = new RuleExecutor[LogicalPlan] {
+      override val batches = Seq(Batch("Resolution", Once, rules: _*))
+    }
+
+    // By name resolution
+    val union1 = Union(table1 :: table2 :: Nil, true, false)
+    val analyzed1 = analyzer.execute(union1)
+    val projected1 =
+      Project(Seq(table2.output(3), table2.output(0), table2.output(1), table2.output(2)), table2)
+    val expected1 = Union(table1 :: projected1 :: Nil)
+    comparePlans(analyzed1, expected1)
+
+    // Allow missing column
+    val union2 = Union(table1 :: table3 :: Nil, true, true)
+    val analyzed2 = analyzer.execute(union2)
+    val nullAttr = Alias(Literal(null, ByteType), "b")()
+    val projected2 =
+      Project(Seq(table2.output(3), table2.output(0), nullAttr, table2.output(2)), table3)
+    val expected2 = Union(table1 :: projected2 :: Nil)
+    comparePlans(analyzed2, expected2)
+
+    // By name + Allow missing column
+    val union3 = Union(union1 :: union2 :: Nil)
+    val analyzed3 = analyzer.execute(union3)
+    val expected3 = Union(expected1 :: expected2 :: Nil)

Review comment:
       does this test prove anything? `union1` and `union2` have exactly the same schema.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29107: [SPARK-32308][SQL] Move by-name resolution logic of unionByName from API code to analysis phase

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29107:
URL: https://github.com/apache/spark/pull/29107#issuecomment-662851219


   Merged build finished. Test FAILed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29107: [SPARK-32308][SQL] Move by-name resolution logic of unionByName from API code to analysis phase

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29107:
URL: https://github.com/apache/spark/pull/29107#issuecomment-660500087


   **[Test build #126112 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/126112/testReport)** for PR 29107 at commit [`c23898e`](https://github.com/apache/spark/commit/c23898ef1b120ea9e5d1659cdb76502214dce97b).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #29107: [SPARK-32308][SQL] Move by-name resolution logic of unionByName from API code to analysis phase

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29107:
URL: https://github.com/apache/spark/pull/29107#issuecomment-662761539






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #29107: [SPARK-32308][SQL] Move by-name resolution logic of unionByName from API code to analysis phase

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29107:
URL: https://github.com/apache/spark/pull/29107#issuecomment-658293153






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29107: [SPARK-32308][SQL] Move by-name resolution logic of unionByName from API code to analysis phase

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29107:
URL: https://github.com/apache/spark/pull/29107#issuecomment-662232123






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] viirya commented on a change in pull request #29107: [SPARK-32308][SQL] Move by-name resolution logic of unionByName from API code to analysis phase

Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #29107:
URL: https://github.com/apache/spark/pull/29107#discussion_r458365264



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
##########
@@ -3676,3 +3678,63 @@ object UpdateOuterReferences extends Rule[LogicalPlan] {
     }
   }
 }
+
+/**
+ * Resolves different children of Union to a common set of columns. Note that this must be
+ * run before `TypeCoercion`, because `TypeCoercion` should be run on correctly resolved

Review comment:
       Ok.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29107: [SPARK-32308][SQL] Move by-name resolution logic of unionByName from API code to analysis phase

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29107:
URL: https://github.com/apache/spark/pull/29107#issuecomment-662629194


   **[Test build #126355 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/126355/testReport)** for PR 29107 at commit [`2ab990e`](https://github.com/apache/spark/commit/2ab990e0bab93d8092a6b9090eab8c746462f940).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] viirya commented on a change in pull request #29107: [SPARK-32308][SQL] Move by-name resolution logic of unionByName from API code to analysis phase

Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #29107:
URL: https://github.com/apache/spark/pull/29107#discussion_r454848087



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala
##########
@@ -1099,6 +1101,64 @@ object TypeCoercion {
         DateSub(l, Literal(days))
     }
   }
+
+  /**
+   * Coerces different children of Union to a common set of columns. Note that this must be
+   * run before `WidenSetOperationTypes`, because `WidenSetOperationTypes` should be run on
+   * correctly resolved column by name.
+   */
+  object UnionCoercion extends TypeCoercionRule {
+    private def unionTwoSides(
+        left: LogicalPlan, right: LogicalPlan, allowMissingCol: Boolean): LogicalPlan = {
+      val resolver = SQLConf.get.resolver
+      val leftOutputAttrs = left.output
+      val rightOutputAttrs = right.output
+
+      // Builds a project list for `right` based on `left` output names
+      val rightProjectList = leftOutputAttrs.map { lattr =>
+        rightOutputAttrs.find { rattr => resolver(lattr.name, rattr.name) }.getOrElse {
+          if (allowMissingCol) {
+            Alias(Literal(null, lattr.dataType), lattr.name)()
+          } else {
+            throw new AnalysisException(
+              s"""Cannot resolve column name "${lattr.name}" among """ +
+                s"""(${rightOutputAttrs.map(_.name).mkString(", ")})""")
+          }
+        }
+      }
+
+      // Delegates failure checks to `CheckAnalysis`
+      val notFoundAttrs = rightOutputAttrs.diff(rightProjectList)
+      val rightChild = Project(rightProjectList ++ notFoundAttrs, right)
+
+      // Builds a project for `logicalPlan` based on `right` output names, if allowing
+      // missing columns.
+      val leftChild = if (allowMissingCol) {
+        val missingAttrs = notFoundAttrs.map { attr =>
+          Alias(Literal(null, attr.dataType), attr.name)()
+        }
+        if (missingAttrs.nonEmpty) {
+          Project(leftOutputAttrs ++ missingAttrs, left)
+        } else {
+          left
+        }
+      } else {
+        left
+      }
+      Union(leftChild, rightChild)
+    }
+
+    override protected def coerceTypes(plan: LogicalPlan): LogicalPlan = plan resolveOperatorsUp {
+      case e if !e.childrenResolved => e
+
+      case Union(children, byName, allowMissingCol)
+          if byName =>
+        val union = children.reduceLeft { (left: LogicalPlan, right: LogicalPlan) =>
+          unionTwoSides(left, right, allowMissingCol)

Review comment:
       Seems not like typical type coercion case, but put here based on https://github.com/apache/spark/pull/28996#discussion_r453460284.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on pull request #29107: [SPARK-32308][SQL] Move by-name resolution logic of unionByName from API code to analysis phase

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on pull request #29107:
URL: https://github.com/apache/spark/pull/29107#issuecomment-663045940


   retest this please


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on a change in pull request #29107: [SPARK-32308][SQL] Move by-name resolution logic of unionByName from API code to analysis phase

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #29107:
URL: https://github.com/apache/spark/pull/29107#discussion_r458038241



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
##########
@@ -3676,3 +3678,63 @@ object UpdateOuterReferences extends Rule[LogicalPlan] {
     }
   }
 }
+
+/**
+ * Resolves different children of Union to a common set of columns. Note that this must be
+ * run before `TypeCoercion`, because `TypeCoercion` should be run on correctly resolved
+ * column by name.
+ */
+object ResolveUnion extends Rule[LogicalPlan] {
+  private def unionTwoSides(
+      left: LogicalPlan,
+      right: LogicalPlan,
+      allowMissingCol: Boolean): LogicalPlan = {
+    val resolver = SQLConf.get.resolver
+    val leftOutputAttrs = left.output
+    val rightOutputAttrs = right.output
+
+    // Builds a project list for `right` based on `left` output names
+    val rightProjectList = leftOutputAttrs.map { lattr =>
+      rightOutputAttrs.find { rattr => resolver(lattr.name, rattr.name) }.getOrElse {
+        if (allowMissingCol) {
+          Alias(Literal(null, lattr.dataType), lattr.name)()
+        } else {
+          throw new AnalysisException(
+            s"""Cannot resolve column name "${lattr.name}" among """ +
+              s"""(${rightOutputAttrs.map(_.name).mkString(", ")})""")
+        }
+      }
+    }
+
+    // Delegates failure checks to `CheckAnalysis`
+    val notFoundAttrs = rightOutputAttrs.diff(rightProjectList)
+    val rightChild = Project(rightProjectList ++ notFoundAttrs, right)
+
+    // Builds a project for `logicalPlan` based on `right` output names, if allowing
+    // missing columns.
+    val leftChild = if (allowMissingCol) {
+      val missingAttrs = notFoundAttrs.map { attr =>
+        Alias(Literal(null, attr.dataType), attr.name)()
+      }
+      if (missingAttrs.nonEmpty) {
+        Project(leftOutputAttrs ++ missingAttrs, left)
+      } else {
+        left
+      }
+    } else {
+      left
+    }
+    Union(leftChild, rightChild)

Review comment:
       can we avoid creating intermediate unions?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #29107: [SPARK-32308][SQL] Move by-name resolution logic of unionByName from API code to analysis phase

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #29107:
URL: https://github.com/apache/spark/pull/29107#issuecomment-662092167


   **[Test build #126277 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/126277/testReport)** for PR 29107 at commit [`1839987`](https://github.com/apache/spark/commit/1839987d63fc94a71e064f82b57b400eaa7fb220).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #29107: [SPARK-32308][SQL] Move by-name resolution logic of unionByName from API code to analysis phase

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29107:
URL: https://github.com/apache/spark/pull/29107#issuecomment-662232123






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #29107: [SPARK-32308][SQL] Move by-name resolution logic of unionByName from API code to analysis phase

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29107:
URL: https://github.com/apache/spark/pull/29107#issuecomment-662092632






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #29107: [SPARK-32308][SQL] Move by-name resolution logic of unionByName from API code to analysis phase

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #29107:
URL: https://github.com/apache/spark/pull/29107#issuecomment-662231827


   **[Test build #126294 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/126294/testReport)** for PR 29107 at commit [`eca8fc6`](https://github.com/apache/spark/commit/eca8fc612581cabee9aa5c376b538d51a7482931).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29107: [SPARK-32308][SQL] Move by-name resolution logic of unionByName from API code to analysis phase

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29107:
URL: https://github.com/apache/spark/pull/29107#issuecomment-662282448


   Merged build finished. Test FAILed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on a change in pull request #29107: [SPARK-32308][SQL] Move by-name resolution logic of unionByName from API code to analysis phase

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #29107:
URL: https://github.com/apache/spark/pull/29107#discussion_r458045764



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala
##########
@@ -2085,37 +2084,9 @@ class Dataset[T] private[sql](
       "in the right attributes",

Review comment:
       shall we move the check to the analyzer rule as well?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29107: [SPARK-32308][SQL] Move by-name resolution logic of unionByName from API code to analysis phase

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29107:
URL: https://github.com/apache/spark/pull/29107#issuecomment-660381783






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on a change in pull request #29107: [SPARK-32308][SQL] Move by-name resolution logic of unionByName from API code to analysis phase

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #29107:
URL: https://github.com/apache/spark/pull/29107#discussion_r458044084



##########
File path: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/ResolveUnionSuite.scala
##########
@@ -0,0 +1,69 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.sql.catalyst.analysis
+
+import org.apache.spark.sql.catalyst.expressions._
+import org.apache.spark.sql.catalyst.plans.logical._
+import org.apache.spark.sql.catalyst.rules.RuleExecutor
+import org.apache.spark.sql.types._
+
+class ResolveUnionSuite extends AnalysisTest {
+  test("Resolve Union") {
+    val table1 = LocalRelation(
+      AttributeReference("i", IntegerType)(),
+      AttributeReference("u", DecimalType.SYSTEM_DEFAULT)(),
+      AttributeReference("b", ByteType)(),
+      AttributeReference("d", DoubleType)())
+    val table2 = LocalRelation(
+      AttributeReference("u", DecimalType.SYSTEM_DEFAULT)(),
+      AttributeReference("b", ByteType)(),
+      AttributeReference("d", DoubleType)(),
+      AttributeReference("i", IntegerType)())
+    val table3 = LocalRelation(
+      AttributeReference("u", DecimalType.SYSTEM_DEFAULT)(),
+      AttributeReference("d", DoubleType)(),
+      AttributeReference("i", IntegerType)())
+
+    val rules = Seq(ResolveUnion)
+    val analyzer = new RuleExecutor[LogicalPlan] {
+      override val batches = Seq(Batch("Resolution", Once, rules: _*))
+    }
+
+    // By name resolution
+    val union1 = Union(table1 :: table2 :: Nil, true, false)
+    val analyzed1 = analyzer.execute(union1)
+    val projected1 =
+      Project(Seq(table2.output(3), table2.output(0), table2.output(1), table2.output(2)), table2)
+    val expected1 = Union(table1 :: projected1 :: Nil)
+    comparePlans(analyzed1, expected1)
+
+    // Allow missing column
+    val union2 = Union(table1 :: table3 :: Nil, true, true)
+    val analyzed2 = analyzer.execute(union2)
+    val nullAttr = Alias(Literal(null, ByteType), "b")()
+    val projected2 =
+      Project(Seq(table2.output(3), table2.output(0), nullAttr, table2.output(2)), table3)
+    val expected2 = Union(table1 :: projected2 :: Nil)
+    comparePlans(analyzed2, expected2)
+
+    // By name + Allow missing column

Review comment:
       by position?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on a change in pull request #29107: [SPARK-32308][SQL] Move by-name resolution logic of unionByName from API code to analysis phase

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #29107:
URL: https://github.com/apache/spark/pull/29107#discussion_r458962998



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
##########
@@ -1387,7 +1388,7 @@ class Analyzer(
         i.copy(right = dedupRight(left, right))
       case e @ Except(left, right, _) if !e.duplicateResolved =>
         e.copy(right = dedupRight(left, right))
-      case u @ Union(children) if !u.duplicateResolved =>
+      case u @ Union(children, _, _) if !u.duplicateResolved =>

Review comment:
       I think it's safer to add it, to explicitly define the rule order (by-name resolution should happen before this rule)




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29107: [SPARK-32308][SQL] Move by-name resolution logic of unionByName from API code to analysis phase

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29107:
URL: https://github.com/apache/spark/pull/29107#issuecomment-660536317






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #29107: [SPARK-32308][SQL] Move by-name resolution logic of unionByName from API code to analysis phase

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29107:
URL: https://github.com/apache/spark/pull/29107#issuecomment-662720080






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] viirya commented on a change in pull request #29107: [SPARK-32308][SQL] Move by-name resolution logic of unionByName from API code to analysis phase

Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #29107:
URL: https://github.com/apache/spark/pull/29107#discussion_r458522170



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveUnion.scala
##########
@@ -0,0 +1,100 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.analysis
+
+import org.apache.spark.sql.AnalysisException
+import org.apache.spark.sql.catalyst.expressions.{Alias, Literal}
+import org.apache.spark.sql.catalyst.optimizer.CombineUnions
+import org.apache.spark.sql.catalyst.plans.logical.{LogicalPlan, Project, Union}
+import org.apache.spark.sql.catalyst.rules.Rule
+import org.apache.spark.sql.internal.SQLConf
+import org.apache.spark.sql.util.SchemaUtils
+
+/**
+ * Resolves different children of Union to a common set of columns.
+ */
+object ResolveUnion extends Rule[LogicalPlan] {
+  private def unionTwoSides(
+      left: LogicalPlan,
+      right: LogicalPlan,
+      allowMissingCol: Boolean): LogicalPlan = {
+    val resolver = SQLConf.get.resolver
+    val leftOutputAttrs = left.output
+    val rightOutputAttrs = right.output
+
+    // Builds a project list for `right` based on `left` output names
+    val rightProjectList = leftOutputAttrs.map { lattr =>
+      rightOutputAttrs.find { rattr => resolver(lattr.name, rattr.name) }.getOrElse {
+        if (allowMissingCol) {
+          Alias(Literal(null, lattr.dataType), lattr.name)()
+        } else {
+          throw new AnalysisException(
+            s"""Cannot resolve column name "${lattr.name}" among """ +

Review comment:
       ok.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29107: [SPARK-32308][SQL] Move by-name resolution logic of unionByName from API code to analysis phase

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29107:
URL: https://github.com/apache/spark/pull/29107#issuecomment-660423434


   **[Test build #126077 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/126077/testReport)** for PR 29107 at commit [`c23898e`](https://github.com/apache/spark/commit/c23898ef1b120ea9e5d1659cdb76502214dce97b).
    * This patch **fails PySpark pip packaging tests**.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29107: [SPARK-32308][SQL] Move by-name resolution logic of unionByName from API code to analysis phase

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29107:
URL: https://github.com/apache/spark/pull/29107#issuecomment-658481455






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on a change in pull request #29107: [SPARK-32308][SQL] Move by-name resolution logic of unionByName from API code to analysis phase

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #29107:
URL: https://github.com/apache/spark/pull/29107#discussion_r458037250



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
##########
@@ -3676,3 +3678,63 @@ object UpdateOuterReferences extends Rule[LogicalPlan] {
     }
   }
 }
+
+/**
+ * Resolves different children of Union to a common set of columns. Note that this must be
+ * run before `TypeCoercion`, because `TypeCoercion` should be run on correctly resolved

Review comment:
       It's fragile to rely on rule order. How about we skip the type coercion rule if the union is by name and the name match is not done yet?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #29107: [SPARK-32308][SQL] Move by-name resolution logic of unionByName from API code to analysis phase

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29107:
URL: https://github.com/apache/spark/pull/29107#issuecomment-662282448






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on a change in pull request #29107: [SPARK-32308][SQL] Move by-name resolution logic of unionByName from API code to analysis phase

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #29107:
URL: https://github.com/apache/spark/pull/29107#discussion_r458631494



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
##########
@@ -1387,7 +1388,7 @@ class Analyzer(
         i.copy(right = dedupRight(left, right))
       case e @ Except(left, right, _) if !e.duplicateResolved =>
         e.copy(right = dedupRight(left, right))
-      case u @ Union(children) if !u.duplicateResolved =>
+      case u @ Union(children, _, _) if !u.duplicateResolved =>

Review comment:
       we should do this only when the by-name resolution is done?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29107: [SPARK-32308][SQL] Move by-name resolution logic of unionByName from API code to analysis phase

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29107:
URL: https://github.com/apache/spark/pull/29107#issuecomment-662761103


   **[Test build #126355 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/126355/testReport)** for PR 29107 at commit [`2ab990e`](https://github.com/apache/spark/commit/2ab990e0bab93d8092a6b9090eab8c746462f940).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] viirya commented on a change in pull request #29107: [SPARK-32308][SQL] Move by-name resolution logic of unionByName from API code to analysis phase

Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #29107:
URL: https://github.com/apache/spark/pull/29107#discussion_r458359852



##########
File path: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/ResolveUnionSuite.scala
##########
@@ -0,0 +1,69 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.sql.catalyst.analysis
+
+import org.apache.spark.sql.catalyst.expressions._
+import org.apache.spark.sql.catalyst.plans.logical._
+import org.apache.spark.sql.catalyst.rules.RuleExecutor
+import org.apache.spark.sql.types._
+
+class ResolveUnionSuite extends AnalysisTest {
+  test("Resolve Union") {
+    val table1 = LocalRelation(
+      AttributeReference("i", IntegerType)(),
+      AttributeReference("u", DecimalType.SYSTEM_DEFAULT)(),
+      AttributeReference("b", ByteType)(),
+      AttributeReference("d", DoubleType)())
+    val table2 = LocalRelation(
+      AttributeReference("u", DecimalType.SYSTEM_DEFAULT)(),
+      AttributeReference("b", ByteType)(),
+      AttributeReference("d", DoubleType)(),
+      AttributeReference("i", IntegerType)())
+    val table3 = LocalRelation(
+      AttributeReference("u", DecimalType.SYSTEM_DEFAULT)(),
+      AttributeReference("d", DoubleType)(),
+      AttributeReference("i", IntegerType)())
+
+    val rules = Seq(ResolveUnion)
+    val analyzer = new RuleExecutor[LogicalPlan] {
+      override val batches = Seq(Batch("Resolution", Once, rules: _*))
+    }
+
+    // By name resolution
+    val union1 = Union(table1 :: table2 :: Nil, true, false)
+    val analyzed1 = analyzer.execute(union1)
+    val projected1 =
+      Project(Seq(table2.output(3), table2.output(0), table2.output(1), table2.output(2)), table2)
+    val expected1 = Union(table1 :: projected1 :: Nil)
+    comparePlans(analyzed1, expected1)
+
+    // Allow missing column
+    val union2 = Union(table1 :: table3 :: Nil, true, true)
+    val analyzed2 = analyzer.execute(union2)
+    val nullAttr = Alias(Literal(null, ByteType), "b")()
+    val projected2 =
+      Project(Seq(table2.output(3), table2.output(0), nullAttr, table2.output(2)), table3)
+    val expected2 = Union(table1 :: projected2 :: Nil)
+    comparePlans(analyzed2, expected2)
+
+    // By name + Allow missing column
+    val union3 = Union(union1 :: union2 :: Nil)
+    val analyzed3 = analyzer.execute(union3)
+    val expected3 = Union(expected1 :: expected2 :: Nil)

Review comment:
       Oh, let me change it.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29107: [SPARK-32308][SQL] Move by-name resolution logic of unionByName from API code to analysis phase

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29107:
URL: https://github.com/apache/spark/pull/29107#issuecomment-658398526


   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/125845/
   Test FAILed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #29107: [SPARK-32308][SQL] Move by-name resolution logic of unionByName from API code to analysis phase

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #29107:
URL: https://github.com/apache/spark/pull/29107#issuecomment-662629194


   **[Test build #126355 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/126355/testReport)** for PR 29107 at commit [`2ab990e`](https://github.com/apache/spark/commit/2ab990e0bab93d8092a6b9090eab8c746462f940).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #29107: [SPARK-32308][SQL] Move by-name resolution logic of unionByName from API code to analysis phase

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29107:
URL: https://github.com/apache/spark/pull/29107#issuecomment-662192692






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #29107: [SPARK-32308][SQL] Move by-name resolution logic of unionByName from API code to analysis phase

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #29107:
URL: https://github.com/apache/spark/pull/29107#issuecomment-660500087


   **[Test build #126112 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/126112/testReport)** for PR 29107 at commit [`c23898e`](https://github.com/apache/spark/commit/c23898ef1b120ea9e5d1659cdb76502214dce97b).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #29107: [SPARK-32308][SQL] Move by-name resolution logic of unionByName from API code to analysis phase

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29107:
URL: https://github.com/apache/spark/pull/29107#issuecomment-660440513






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #29107: [SPARK-32308][SQL] Move by-name resolution logic of unionByName from API code to analysis phase

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29107:
URL: https://github.com/apache/spark/pull/29107#issuecomment-662831333






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] viirya commented on a change in pull request #29107: [SPARK-32308][SQL] Move by-name resolution logic of unionByName from API code to analysis phase

Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #29107:
URL: https://github.com/apache/spark/pull/29107#discussion_r458369336



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala
##########
@@ -220,8 +220,15 @@ object Union {
 
 /**
  * Logical plan for unioning two plans, without a distinct. This is UNION ALL in SQL.
+ *
+ * @param byName          Whether resolves columns in the children by column names.
+ * @param allowMissingCol Allows missing columns in children query plans. If it is true,
+ *                        this function allows different set of column names between two Datasets.
  */
-case class Union(children: Seq[LogicalPlan]) extends LogicalPlan {
+case class Union(
+    children: Seq[LogicalPlan],
+    byName: Boolean = false,
+    allowMissingCol: Boolean = false) extends LogicalPlan {

Review comment:
       Added.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] viirya commented on a change in pull request #29107: [SPARK-32308][SQL] Move by-name resolution logic of unionByName from API code to analysis phase

Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #29107:
URL: https://github.com/apache/spark/pull/29107#discussion_r455521016



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala
##########
@@ -1099,6 +1101,64 @@ object TypeCoercion {
         DateSub(l, Literal(days))
     }
   }
+
+  /**
+   * Coerces different children of Union to a common set of columns. Note that this must be
+   * run before `WidenSetOperationTypes`, because `WidenSetOperationTypes` should be run on
+   * correctly resolved column by name.
+   */
+  object UnionCoercion extends TypeCoercionRule {
+    private def unionTwoSides(
+        left: LogicalPlan, right: LogicalPlan, allowMissingCol: Boolean): LogicalPlan = {
+      val resolver = SQLConf.get.resolver
+      val leftOutputAttrs = left.output
+      val rightOutputAttrs = right.output
+
+      // Builds a project list for `right` based on `left` output names
+      val rightProjectList = leftOutputAttrs.map { lattr =>
+        rightOutputAttrs.find { rattr => resolver(lattr.name, rattr.name) }.getOrElse {
+          if (allowMissingCol) {
+            Alias(Literal(null, lattr.dataType), lattr.name)()
+          } else {
+            throw new AnalysisException(
+              s"""Cannot resolve column name "${lattr.name}" among """ +
+                s"""(${rightOutputAttrs.map(_.name).mkString(", ")})""")
+          }
+        }
+      }
+
+      // Delegates failure checks to `CheckAnalysis`
+      val notFoundAttrs = rightOutputAttrs.diff(rightProjectList)
+      val rightChild = Project(rightProjectList ++ notFoundAttrs, right)
+
+      // Builds a project for `logicalPlan` based on `right` output names, if allowing
+      // missing columns.
+      val leftChild = if (allowMissingCol) {
+        val missingAttrs = notFoundAttrs.map { attr =>
+          Alias(Literal(null, attr.dataType), attr.name)()
+        }
+        if (missingAttrs.nonEmpty) {
+          Project(leftOutputAttrs ++ missingAttrs, left)
+        } else {
+          left
+        }
+      } else {
+        left
+      }
+      Union(leftChild, rightChild)
+    }
+
+    override protected def coerceTypes(plan: LogicalPlan): LogicalPlan = plan resolveOperatorsUp {
+      case e if !e.childrenResolved => e
+
+      case Union(children, byName, allowMissingCol)
+          if byName =>
+        val union = children.reduceLeft { (left: LogicalPlan, right: LogicalPlan) =>
+          unionTwoSides(left, right, allowMissingCol)

Review comment:
       If looks not proper after rethinking, we can also move to other rule or create another rule.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29107: [SPARK-32308][SQL] Move by-name resolution logic of unionByName from API code to analysis phase

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29107:
URL: https://github.com/apache/spark/pull/29107#issuecomment-658398100


   **[Test build #125845 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125845/testReport)** for PR 29107 at commit [`93c5ea1`](https://github.com/apache/spark/commit/93c5ea1973c062f721693fc3cc8419a336796beb).
    * This patch **fails Spark unit tests**.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on pull request #29107: [SPARK-32308][SQL] Move by-name resolution logic of unionByName from API code to analysis phase

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on pull request #29107:
URL: https://github.com/apache/spark/pull/29107#issuecomment-658406021


   Thank you, @viirya .
   cc @cloud-fan 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29107: [SPARK-32308][SQL] Move by-name resolution logic of unionByName from API code to analysis phase

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29107:
URL: https://github.com/apache/spark/pull/29107#issuecomment-658354245


   **[Test build #125853 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125853/testReport)** for PR 29107 at commit [`8e4867a`](https://github.com/apache/spark/commit/8e4867ac334c3b84154f75c364af44d203309cd4).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29107: [SPARK-32308][SQL] Move by-name resolution logic of unionByName from API code to analysis phase

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29107:
URL: https://github.com/apache/spark/pull/29107#issuecomment-662719424


   **[Test build #126348 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/126348/testReport)** for PR 29107 at commit [`0381f5d`](https://github.com/apache/spark/commit/0381f5db317e848851e661d27d6f7a35eb7d637f).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29107: [SPARK-32308][SQL] Move by-name resolution logic of unionByName from API code to analysis phase

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29107:
URL: https://github.com/apache/spark/pull/29107#issuecomment-662192271


   **[Test build #126277 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/126277/testReport)** for PR 29107 at commit [`1839987`](https://github.com/apache/spark/commit/1839987d63fc94a71e064f82b57b400eaa7fb220).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on a change in pull request #29107: [SPARK-32308][SQL] Move by-name resolution logic of unionByName from API code to analysis phase

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #29107:
URL: https://github.com/apache/spark/pull/29107#discussion_r458036464



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
##########
@@ -3676,3 +3678,63 @@ object UpdateOuterReferences extends Rule[LogicalPlan] {
     }
   }
 }
+
+/**
+ * Resolves different children of Union to a common set of columns. Note that this must be
+ * run before `TypeCoercion`, because `TypeCoercion` should be run on correctly resolved
+ * column by name.
+ */
+object ResolveUnion extends Rule[LogicalPlan] {

Review comment:
       Can we put it in a new file?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #29107: [SPARK-32308][SQL] Move by-name resolution logic of unionByName from API code to analysis phase

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #29107:
URL: https://github.com/apache/spark/pull/29107#issuecomment-658292521


   **[Test build #125845 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125845/testReport)** for PR 29107 at commit [`93c5ea1`](https://github.com/apache/spark/commit/93c5ea1973c062f721693fc3cc8419a336796beb).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29107: [SPARK-32308][SQL] Move by-name resolution logic of unionByName from API code to analysis phase

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29107:
URL: https://github.com/apache/spark/pull/29107#issuecomment-660440513


   Merged build finished. Test FAILed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] viirya commented on a change in pull request #29107: [SPARK-32308][SQL] Move by-name resolution logic of unionByName from API code to analysis phase

Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #29107:
URL: https://github.com/apache/spark/pull/29107#discussion_r456714860



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala
##########
@@ -1099,6 +1101,64 @@ object TypeCoercion {
         DateSub(l, Literal(days))
     }
   }
+
+  /**
+   * Coerces different children of Union to a common set of columns. Note that this must be
+   * run before `WidenSetOperationTypes`, because `WidenSetOperationTypes` should be run on
+   * correctly resolved column by name.
+   */
+  object UnionCoercion extends TypeCoercionRule {
+    private def unionTwoSides(
+        left: LogicalPlan, right: LogicalPlan, allowMissingCol: Boolean): LogicalPlan = {
+      val resolver = SQLConf.get.resolver
+      val leftOutputAttrs = left.output
+      val rightOutputAttrs = right.output
+
+      // Builds a project list for `right` based on `left` output names
+      val rightProjectList = leftOutputAttrs.map { lattr =>
+        rightOutputAttrs.find { rattr => resolver(lattr.name, rattr.name) }.getOrElse {
+          if (allowMissingCol) {
+            Alias(Literal(null, lattr.dataType), lattr.name)()
+          } else {
+            throw new AnalysisException(
+              s"""Cannot resolve column name "${lattr.name}" among """ +
+                s"""(${rightOutputAttrs.map(_.name).mkString(", ")})""")
+          }
+        }
+      }
+
+      // Delegates failure checks to `CheckAnalysis`
+      val notFoundAttrs = rightOutputAttrs.diff(rightProjectList)
+      val rightChild = Project(rightProjectList ++ notFoundAttrs, right)
+
+      // Builds a project for `logicalPlan` based on `right` output names, if allowing
+      // missing columns.
+      val leftChild = if (allowMissingCol) {
+        val missingAttrs = notFoundAttrs.map { attr =>
+          Alias(Literal(null, attr.dataType), attr.name)()
+        }
+        if (missingAttrs.nonEmpty) {
+          Project(leftOutputAttrs ++ missingAttrs, left)
+        } else {
+          left
+        }
+      } else {
+        left
+      }
+      Union(leftChild, rightChild)
+    }
+
+    override protected def coerceTypes(plan: LogicalPlan): LogicalPlan = plan resolveOperatorsUp {
+      case e if !e.childrenResolved => e
+
+      case Union(children, byName, allowMissingCol)
+          if byName =>
+        val union = children.reduceLeft { (left: LogicalPlan, right: LogicalPlan) =>
+          unionTwoSides(left, right, allowMissingCol)

Review comment:
       ok, moved.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] viirya commented on a change in pull request #29107: [SPARK-32308][SQL] Move by-name resolution logic of unionByName from API code to analysis phase

Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #29107:
URL: https://github.com/apache/spark/pull/29107#discussion_r458916883



##########
File path: sql/core/src/test/scala/org/apache/spark/sql/DataFrameSetOperationsSuite.scala
##########
@@ -428,7 +428,7 @@ class DataFrameSetOperationsSuite extends QueryTest with SharedSparkSession {
     errMsg = intercept[AnalysisException] {
       df1.unionByName(df2)
     }.getMessage
-    assert(errMsg.contains("""Cannot resolve column name "b" among (a, c, d)"""))
+    assert(errMsg.contains("""Cannot resolve column name `b` among (a, c, d)"""))

Review comment:
       ok. let me revert to previous one.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #29107: [SPARK-32308][SQL] Move by-name resolution logic of unionByName from API code to analysis phase

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #29107:
URL: https://github.com/apache/spark/pull/29107#issuecomment-660427670


   **[Test build #126093 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/126093/testReport)** for PR 29107 at commit [`c23898e`](https://github.com/apache/spark/commit/c23898ef1b120ea9e5d1659cdb76502214dce97b).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29107: [SPARK-32308][SQL] Move by-name resolution logic of unionByName from API code to analysis phase

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29107:
URL: https://github.com/apache/spark/pull/29107#issuecomment-662851158


   **[Test build #126388 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/126388/testReport)** for PR 29107 at commit [`2a9e1e4`](https://github.com/apache/spark/commit/2a9e1e44e03ce0551d619aec52d60cd2757c422b).
    * This patch **fails due to an unknown error code, -9**.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] viirya commented on pull request #29107: [SPARK-32308][SQL] Move by-name resolution logic of unionByName from API code to analysis phase

Posted by GitBox <gi...@apache.org>.
viirya commented on pull request #29107:
URL: https://github.com/apache/spark/pull/29107#issuecomment-660499922


   retest this please


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29107: [SPARK-32308][SQL] Move by-name resolution logic of unionByName from API code to analysis phase

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29107:
URL: https://github.com/apache/spark/pull/29107#issuecomment-663214759


   **[Test build #126417 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/126417/testReport)** for PR 29107 at commit [`2a9e1e4`](https://github.com/apache/spark/commit/2a9e1e44e03ce0551d619aec52d60cd2757c422b).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] viirya commented on a change in pull request #29107: [SPARK-32308][SQL] Move by-name resolution logic of unionByName from API code to analysis phase

Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #29107:
URL: https://github.com/apache/spark/pull/29107#discussion_r458368303



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala
##########
@@ -2085,37 +2084,9 @@ class Dataset[T] private[sql](
       "in the right attributes",

Review comment:
       ok. moved.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on a change in pull request #29107: [SPARK-32308][SQL] Move by-name resolution logic of unionByName from API code to analysis phase

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #29107:
URL: https://github.com/apache/spark/pull/29107#discussion_r459212340



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
##########
@@ -1387,10 +1388,11 @@ class Analyzer(
         i.copy(right = dedupRight(left, right))
       case e @ Except(left, right, _) if !e.duplicateResolved =>
         e.copy(right = dedupRight(left, right))
-      case u @ Union(children) if !u.duplicateResolved =>
+      // Only after we finish by-name resolution for Union
+      case u: logical.Union if !u.duplicateResolved && !u.byName =>

Review comment:
       nit: `logical.Union` -> `Union`.
   
   `!u.duplicateResolved && !u.byName` -> `!u.byName && !u.duplicateResolved`




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] maropu commented on a change in pull request #29107: [SPARK-32308][SQL] Move by-name resolution logic of unionByName from API code to analysis phase

Posted by GitBox <gi...@apache.org>.
maropu commented on a change in pull request #29107:
URL: https://github.com/apache/spark/pull/29107#discussion_r454848703



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala
##########
@@ -1099,6 +1101,64 @@ object TypeCoercion {
         DateSub(l, Literal(days))
     }
   }
+
+  /**
+   * Coerces different children of Union to a common set of columns. Note that this must be
+   * run before `WidenSetOperationTypes`, because `WidenSetOperationTypes` should be run on
+   * correctly resolved column by name.
+   */
+  object UnionCoercion extends TypeCoercionRule {
+    private def unionTwoSides(
+        left: LogicalPlan, right: LogicalPlan, allowMissingCol: Boolean): LogicalPlan = {
+      val resolver = SQLConf.get.resolver
+      val leftOutputAttrs = left.output
+      val rightOutputAttrs = right.output
+
+      // Builds a project list for `right` based on `left` output names
+      val rightProjectList = leftOutputAttrs.map { lattr =>
+        rightOutputAttrs.find { rattr => resolver(lattr.name, rattr.name) }.getOrElse {
+          if (allowMissingCol) {
+            Alias(Literal(null, lattr.dataType), lattr.name)()
+          } else {
+            throw new AnalysisException(
+              s"""Cannot resolve column name "${lattr.name}" among """ +
+                s"""(${rightOutputAttrs.map(_.name).mkString(", ")})""")
+          }
+        }
+      }
+
+      // Delegates failure checks to `CheckAnalysis`
+      val notFoundAttrs = rightOutputAttrs.diff(rightProjectList)
+      val rightChild = Project(rightProjectList ++ notFoundAttrs, right)
+
+      // Builds a project for `logicalPlan` based on `right` output names, if allowing
+      // missing columns.
+      val leftChild = if (allowMissingCol) {
+        val missingAttrs = notFoundAttrs.map { attr =>
+          Alias(Literal(null, attr.dataType), attr.name)()
+        }
+        if (missingAttrs.nonEmpty) {
+          Project(leftOutputAttrs ++ missingAttrs, left)
+        } else {
+          left
+        }
+      } else {
+        left
+      }
+      Union(leftChild, rightChild)
+    }
+
+    override protected def coerceTypes(plan: LogicalPlan): LogicalPlan = plan resolveOperatorsUp {
+      case e if !e.childrenResolved => e
+
+      case Union(children, byName, allowMissingCol)
+          if byName =>
+        val union = children.reduceLeft { (left: LogicalPlan, right: LogicalPlan) =>
+          unionTwoSides(left, right, allowMissingCol)

Review comment:
       hm I see.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29107: [SPARK-32308][SQL] Move by-name resolution logic of unionByName from API code to analysis phase

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29107:
URL: https://github.com/apache/spark/pull/29107#issuecomment-662840761


   **[Test build #126388 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/126388/testReport)** for PR 29107 at commit [`2a9e1e4`](https://github.com/apache/spark/commit/2a9e1e44e03ce0551d619aec52d60cd2757c422b).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on a change in pull request #29107: [SPARK-32308][SQL] Move by-name resolution logic of unionByName from API code to analysis phase

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #29107:
URL: https://github.com/apache/spark/pull/29107#discussion_r458638512



##########
File path: sql/core/src/test/scala/org/apache/spark/sql/DataFrameSetOperationsSuite.scala
##########
@@ -428,7 +428,7 @@ class DataFrameSetOperationsSuite extends QueryTest with SharedSparkSession {
     errMsg = intercept[AnalysisException] {
       df1.unionByName(df2)
     }.getMessage
-    assert(errMsg.contains("""Cannot resolve column name "b" among (a, c, d)"""))
+    assert(errMsg.contains("""Cannot resolve column name `b` among (a, c, d)"""))

Review comment:
       Shall we keep the previous error message? In general we should only use backticks when necessary(has special chars)




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on a change in pull request #29107: [SPARK-32308][SQL] Move by-name resolution logic of unionByName from API code to analysis phase

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #29107:
URL: https://github.com/apache/spark/pull/29107#discussion_r458041832



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
##########
@@ -556,15 +556,15 @@ object PushProjectionThroughUnion extends Rule[LogicalPlan] with PredicateHelper
   def apply(plan: LogicalPlan): LogicalPlan = plan transform {
 
     // Push down deterministic projection through UNION ALL
-    case p @ Project(projectList, Union(children)) =>
+    case p @ Project(projectList, Union(children, byName, allowMissingCol)) =>
       assert(children.nonEmpty)
       if (projectList.forall(_.deterministic)) {
         val newFirstChild = Project(projectList, children.head)
         val newOtherChildren = children.tail.map { child =>
           val rewrites = buildRewrites(children.head, child)
           Project(projectList.map(pushToRight(_, rewrites)), child)
         }
-        Union(newFirstChild +: newOtherChildren)
+        Union(newFirstChild +: newOtherChildren, byName, allowMissingCol)

Review comment:
       ditto




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] maropu commented on a change in pull request #29107: [SPARK-32308][SQL] Move by-name resolution logic of unionByName from API code to analysis phase

Posted by GitBox <gi...@apache.org>.
maropu commented on a change in pull request #29107:
URL: https://github.com/apache/spark/pull/29107#discussion_r456741460



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala
##########
@@ -337,6 +337,10 @@ trait CheckAnalysis extends PredicateHelper {
 
           case Tail(limitExpr, _) => checkLimitLikeClause("tail", limitExpr)
 
+          case Union(_, byName, allowMissingCol) if byName || allowMissingCol =>
+            failAnalysis("Union should not be with true `byName` or " +
+              "`allowMissingCol` flags after analysis phase.")

Review comment:
       Just a question; users can see this error message? That's the case of an analyzer bug?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #29107: [SPARK-32308][SQL] Move by-name resolution logic of unionByName from API code to analysis phase

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29107:
URL: https://github.com/apache/spark/pull/29107#issuecomment-663049879






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #29107: [SPARK-32308][SQL] Move by-name resolution logic of unionByName from API code to analysis phase

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29107:
URL: https://github.com/apache/spark/pull/29107#issuecomment-658354949






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] maropu commented on a change in pull request #29107: [SPARK-32308][SQL] Move by-name resolution logic of unionByName from API code to analysis phase

Posted by GitBox <gi...@apache.org>.
maropu commented on a change in pull request #29107:
URL: https://github.com/apache/spark/pull/29107#discussion_r456740147



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
##########
@@ -3676,3 +3678,63 @@ object UpdateOuterReferences extends Rule[LogicalPlan] {
     }
   }
 }
+
+/**
+ * Resolves different children of Union to a common set of columns. Note that this must be
+ * run before `TypeCoercion`, because `TypeCoercion` should be run on correctly resolved
+ * column by name.
+ */
+object ResolveUnion extends Rule[LogicalPlan] {
+  private def unionTwoSides(
+      left: LogicalPlan,
+      right: LogicalPlan,
+      allowMissingCol: Boolean): LogicalPlan = {
+    val resolver = SQLConf.get.resolver
+    val leftOutputAttrs = left.output
+    val rightOutputAttrs = right.output
+
+    // Builds a project list for `right` based on `left` output names
+    val rightProjectList = leftOutputAttrs.map { lattr =>
+      rightOutputAttrs.find { rattr => resolver(lattr.name, rattr.name) }.getOrElse {
+        if (allowMissingCol) {
+          Alias(Literal(null, lattr.dataType), lattr.name)()
+        } else {
+          throw new AnalysisException(
+            s"""Cannot resolve column name "${lattr.name}" among """ +
+              s"""(${rightOutputAttrs.map(_.name).mkString(", ")})""")
+        }
+      }
+    }
+
+    // Delegates failure checks to `CheckAnalysis`
+    val notFoundAttrs = rightOutputAttrs.diff(rightProjectList)
+    val rightChild = Project(rightProjectList ++ notFoundAttrs, right)
+
+    // Builds a project for `logicalPlan` based on `right` output names, if allowing
+    // missing columns.
+    val leftChild = if (allowMissingCol) {
+      val missingAttrs = notFoundAttrs.map { attr =>
+        Alias(Literal(null, attr.dataType), attr.name)()
+      }
+      if (missingAttrs.nonEmpty) {
+        Project(leftOutputAttrs ++ missingAttrs, left)
+      } else {
+        left
+      }
+    } else {
+      left
+    }
+    Union(leftChild, rightChild)
+  }
+
+  def apply(plan: LogicalPlan): LogicalPlan = plan resolveOperatorsUp {
+    case e if !e.childrenResolved => e
+
+    case Union(children, byName, allowMissingCol)
+      if byName =>

Review comment:
       nit: `    case Union(children, byName, allowMissingCol) if byName =>`?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #29107: [SPARK-32308][SQL] Move by-name resolution logic of unionByName from API code to analysis phase

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #29107:
URL: https://github.com/apache/spark/pull/29107#issuecomment-660381500


   **[Test build #126077 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/126077/testReport)** for PR 29107 at commit [`c23898e`](https://github.com/apache/spark/commit/c23898ef1b120ea9e5d1659cdb76502214dce97b).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on a change in pull request #29107: [SPARK-32308][SQL] Move by-name resolution logic of unionByName from API code to analysis phase

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #29107:
URL: https://github.com/apache/spark/pull/29107#discussion_r458041550



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala
##########
@@ -344,7 +344,7 @@ object TypeCoercion {
       case s: Union if s.childrenResolved &&
           s.children.forall(_.output.length == s.children.head.output.length) && !s.resolved =>
         val newChildren: Seq[LogicalPlan] = buildNewChildrenWithWiderTypes(s.children)
-        s.makeCopy(Array(newChildren))
+        Union(newChildren, s.byName, s.allowMissingCol)

Review comment:
       nit: `s.copy(children = newChildren)`




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on a change in pull request #29107: [SPARK-32308][SQL] Move by-name resolution logic of unionByName from API code to analysis phase

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #29107:
URL: https://github.com/apache/spark/pull/29107#discussion_r458041744



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
##########
@@ -497,8 +497,8 @@ object LimitPushDown extends Rule[LogicalPlan] {
     // Note: right now Union means UNION ALL, which does not de-duplicate rows, so it is safe to
     // pushdown Limit through it. Once we add UNION DISTINCT, however, we will not be able to
     // pushdown Limit.
-    case LocalLimit(exp, Union(children)) =>
-      LocalLimit(exp, Union(children.map(maybePushLocalLimit(exp, _))))
+    case LocalLimit(exp, Union(children, byName, allowMissingCol)) =>
+      LocalLimit(exp, Union(children.map(maybePushLocalLimit(exp, _)), byName, allowMissingCol))

Review comment:
       nit: use `copy` would be better.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29107: [SPARK-32308][SQL] Move by-name resolution logic of unionByName from API code to analysis phase

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29107:
URL: https://github.com/apache/spark/pull/29107#issuecomment-660427858






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on a change in pull request #29107: [SPARK-32308][SQL] Move by-name resolution logic of unionByName from API code to analysis phase

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #29107:
URL: https://github.com/apache/spark/pull/29107#discussion_r458044495



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala
##########
@@ -220,8 +220,15 @@ object Union {
 
 /**
  * Logical plan for unioning two plans, without a distinct. This is UNION ALL in SQL.
+ *
+ * @param byName          Whether resolves columns in the children by column names.
+ * @param allowMissingCol Allows missing columns in children query plans. If it is true,
+ *                        this function allows different set of column names between two Datasets.
  */
-case class Union(children: Seq[LogicalPlan]) extends LogicalPlan {
+case class Union(
+    children: Seq[LogicalPlan],
+    byName: Boolean = false,
+    allowMissingCol: Boolean = false) extends LogicalPlan {

Review comment:
       we should add an assert: `allowMissingCol` can only be true of `byName` is true.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on a change in pull request #29107: [SPARK-32308][SQL] Move by-name resolution logic of unionByName from API code to analysis phase

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #29107:
URL: https://github.com/apache/spark/pull/29107#discussion_r458634082



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/PropagateEmptyRelation.scala
##########
@@ -50,7 +50,7 @@ object PropagateEmptyRelation extends Rule[LogicalPlan] with PredicateHelper wit
   override def conf: SQLConf = SQLConf.get
 
   def apply(plan: LogicalPlan): LogicalPlan = plan transformUp {
-    case p @ Union(children) if children.exists(isEmptyLocalRelation) =>
+    case p @ Union(children, _, _) if children.exists(isEmptyLocalRelation) =>

Review comment:
       shall we skip if by-name resolution is not finished? Otherwise the final output schema can be indeterministic, if some children are empty relation and get removed before participating the by-name resolution.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #29107: [SPARK-32308][SQL] Move by-name resolution logic of unionByName from API code to analysis phase

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29107:
URL: https://github.com/apache/spark/pull/29107#issuecomment-662629745






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on a change in pull request #29107: [SPARK-32308][SQL] Move by-name resolution logic of unionByName from API code to analysis phase

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #29107:
URL: https://github.com/apache/spark/pull/29107#discussion_r458040501



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
##########
@@ -3676,3 +3678,63 @@ object UpdateOuterReferences extends Rule[LogicalPlan] {
     }
   }
 }
+
+/**
+ * Resolves different children of Union to a common set of columns. Note that this must be
+ * run before `TypeCoercion`, because `TypeCoercion` should be run on correctly resolved
+ * column by name.
+ */
+object ResolveUnion extends Rule[LogicalPlan] {
+  private def unionTwoSides(
+      left: LogicalPlan,
+      right: LogicalPlan,
+      allowMissingCol: Boolean): LogicalPlan = {
+    val resolver = SQLConf.get.resolver
+    val leftOutputAttrs = left.output
+    val rightOutputAttrs = right.output
+
+    // Builds a project list for `right` based on `left` output names
+    val rightProjectList = leftOutputAttrs.map { lattr =>
+      rightOutputAttrs.find { rattr => resolver(lattr.name, rattr.name) }.getOrElse {
+        if (allowMissingCol) {
+          Alias(Literal(null, lattr.dataType), lattr.name)()
+        } else {
+          throw new AnalysisException(
+            s"""Cannot resolve column name "${lattr.name}" among """ +
+              s"""(${rightOutputAttrs.map(_.name).mkString(", ")})""")
+        }
+      }
+    }
+
+    // Delegates failure checks to `CheckAnalysis`
+    val notFoundAttrs = rightOutputAttrs.diff(rightProjectList)
+    val rightChild = Project(rightProjectList ++ notFoundAttrs, right)
+
+    // Builds a project for `logicalPlan` based on `right` output names, if allowing
+    // missing columns.
+    val leftChild = if (allowMissingCol) {
+      val missingAttrs = notFoundAttrs.map { attr =>
+        Alias(Literal(null, attr.dataType), attr.name)()
+      }
+      if (missingAttrs.nonEmpty) {
+        Project(leftOutputAttrs ++ missingAttrs, left)
+      } else {
+        left
+      }
+    } else {
+      left
+    }
+    Union(leftChild, rightChild)

Review comment:
       nvm, it was the behavior before and seems hard to get rid of it.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29107: [SPARK-32308][SQL] Move by-name resolution logic of unionByName from API code to analysis phase

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29107:
URL: https://github.com/apache/spark/pull/29107#issuecomment-662286329


   **[Test build #126311 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/126311/testReport)** for PR 29107 at commit [`eca8fc6`](https://github.com/apache/spark/commit/eca8fc612581cabee9aa5c376b538d51a7482931).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #29107: [SPARK-32308][SQL] Move by-name resolution logic of unionByName from API code to analysis phase

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29107:
URL: https://github.com/apache/spark/pull/29107#issuecomment-660500244






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29107: [SPARK-32308][SQL] Move by-name resolution logic of unionByName from API code to analysis phase

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29107:
URL: https://github.com/apache/spark/pull/29107#issuecomment-660536155


   **[Test build #126112 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/126112/testReport)** for PR 29107 at commit [`c23898e`](https://github.com/apache/spark/commit/c23898ef1b120ea9e5d1659cdb76502214dce97b).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] viirya commented on pull request #29107: [SPARK-32308][SQL] Move by-name resolution logic of unionByName from API code to analysis phase

Posted by GitBox <gi...@apache.org>.
viirya commented on pull request #29107:
URL: https://github.com/apache/spark/pull/29107#issuecomment-660426922


   retest this please


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #29107: [SPARK-32308][SQL] Move by-name resolution logic of unionByName from API code to analysis phase

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29107:
URL: https://github.com/apache/spark/pull/29107#issuecomment-658398511






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29107: [SPARK-32308][SQL] Move by-name resolution logic of unionByName from API code to analysis phase

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29107:
URL: https://github.com/apache/spark/pull/29107#issuecomment-662720080






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29107: [SPARK-32308][SQL] Move by-name resolution logic of unionByName from API code to analysis phase

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29107:
URL: https://github.com/apache/spark/pull/29107#issuecomment-662282155


   **[Test build #126294 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/126294/testReport)** for PR 29107 at commit [`eca8fc6`](https://github.com/apache/spark/commit/eca8fc612581cabee9aa5c376b538d51a7482931).
    * This patch **fails due to an unknown error code, -9**.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on a change in pull request #29107: [SPARK-32308][SQL] Move by-name resolution logic of unionByName from API code to analysis phase

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #29107:
URL: https://github.com/apache/spark/pull/29107#discussion_r458041744



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
##########
@@ -497,8 +497,8 @@ object LimitPushDown extends Rule[LogicalPlan] {
     // Note: right now Union means UNION ALL, which does not de-duplicate rows, so it is safe to
     // pushdown Limit through it. Once we add UNION DISTINCT, however, we will not be able to
     // pushdown Limit.
-    case LocalLimit(exp, Union(children)) =>
-      LocalLimit(exp, Union(children.map(maybePushLocalLimit(exp, _))))
+    case LocalLimit(exp, Union(children, byName, allowMissingCol)) =>
+      LocalLimit(exp, Union(children.map(maybePushLocalLimit(exp, _)), byName, allowMissingCol))

Review comment:
       nit: use `copy` would be better, as it preserves the tree node tag




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29107: [SPARK-32308][SQL] Move by-name resolution logic of unionByName from API code to analysis phase

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29107:
URL: https://github.com/apache/spark/pull/29107#issuecomment-662557943






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on pull request #29107: [SPARK-32308][SQL] Move by-name resolution logic of unionByName from API code to analysis phase

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on pull request #29107:
URL: https://github.com/apache/spark/pull/29107#issuecomment-663342244


   thanks, merging to master!


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] viirya commented on a change in pull request #29107: [SPARK-32308][SQL] Move by-name resolution logic of unionByName from API code to analysis phase

Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #29107:
URL: https://github.com/apache/spark/pull/29107#discussion_r458915813



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/PropagateEmptyRelation.scala
##########
@@ -50,7 +50,7 @@ object PropagateEmptyRelation extends Rule[LogicalPlan] with PredicateHelper wit
   override def conf: SQLConf = SQLConf.get
 
   def apply(plan: LogicalPlan): LogicalPlan = plan transformUp {
-    case p @ Union(children) if children.exists(isEmptyLocalRelation) =>
+    case p @ Union(children, _, _) if children.exists(isEmptyLocalRelation) =>

Review comment:
       by-name resolution is finished after analysis. As PropagateEmptyRelation is run in optimization, I think by-name resolution should be done.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org