Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2020/07/04 21:39:39 UTC

[GitHub] [spark] viirya opened a new pull request #28996: [SPARK-29358][SQL] Make unionByName optionally fill missing columns with nulls

viirya opened a new pull request #28996:
URL: https://github.com/apache/spark/pull/28996


   
   ### What changes were proposed in this pull request?
   
   This patch proposes to make `unionByName` optionally fill missing columns with nulls when the corresponding config is enabled.
   
   ### Why are the changes needed?
   
   Currently, `unionByName` throws an exception if it detects different column names between the two Datasets. This is a strict requirement, and users sometimes need more flexibility: two Datasets with different subsets of columns should still be unionable by name resolution.
   
   ### Does this PR introduce _any_ user-facing change?
   
   Yes. If `spark.sql.allowMissingColumnsInUnionByName` is enabled, `Dataset.unionByName` allows the two Datasets to have different sets of column names. Columns missing on either side are filled with null values.
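
To make the proposed behavior concrete, here is a minimal sketch in plain Scala collections. It is a toy model, not Spark's implementation: `Table`, `unionByNameModel`, and the `allowMissing` flag are hypothetical names, `None` stands in for SQL NULL, column names are matched exactly (no case-insensitive resolver), and the output column order (left columns first, then right-only columns) is an assumption of the sketch.

```scala
// Toy model of union-by-name with null filling. None stands in for SQL NULL.
// Table and unionByNameModel are hypothetical names, not Spark APIs.
object UnionByNameModel {
  final case class Table(columns: Seq[String], rows: Seq[Seq[Option[Any]]])

  def unionByNameModel(left: Table, right: Table, allowMissing: Boolean): Table = {
    val onlyInLeft = left.columns.filterNot(c => right.columns.contains(c))
    val onlyInRight = right.columns.filterNot(c => left.columns.contains(c))
    if (!allowMissing && (onlyInLeft.nonEmpty || onlyInRight.nonEmpty)) {
      throw new IllegalArgumentException(
        s"Column sets differ: only in left = $onlyInLeft, only in right = $onlyInRight")
    }
    // Assumed output order: left columns, then columns only the right side has.
    val outColumns = left.columns ++ onlyInRight

    // Re-project a table onto outColumns, emitting None for absent columns.
    def align(t: Table): Seq[Seq[Option[Any]]] = {
      val position = t.columns.zipWithIndex.toMap
      t.rows.map(row => outColumns.map(c => position.get(c).flatMap(i => row(i))))
    }

    Table(outColumns, align(left) ++ align(right))
  }
}
```

With `allowMissing = false` the model rejects any difference in column sets, which mirrors the strict behavior described above; with `allowMissing = true` each side is padded before the rows are concatenated.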
   
   ### How was this patch tested?
   
   Unit test.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] viirya commented on a change in pull request #28996: [SPARK-29358][SQL] Make unionByName optionally fill missing columns with nulls

Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #28996:
URL: https://github.com/apache/spark/pull/28996#discussion_r453054212



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala
##########
@@ -2030,7 +2030,25 @@ class Dataset[T] private[sql](
    * @group typedrel
    * @since 2.3.0
    */
-  def unionByName(other: Dataset[T]): Dataset[T] = withSetOperator {
+  def unionByName(other: Dataset[T]): Dataset[T] = unionByName(other, false)
+
+  /**
+   * Returns a new Dataset containing union of rows in this Dataset and another Dataset.
+   *
+   * This is different from both `UNION ALL` and `UNION DISTINCT` in SQL. To do a SQL-style set
+   * union (that does deduplication of elements), use this function followed by a [[distinct]].

Review comment:
       When I read `To do a SQL-style set union`, it sounds as if adding `distinct` gives you a SQL-style union. But this function does not behave like a SQL union at all.
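
The gap this comment points at can be illustrated with a small sketch: a positional union takes the right side's values in column order and ignores names, while a by-name union reorders them first, so appending `distinct` to the latter does not reproduce the former. Everything here is a hypothetical stand-in in plain Scala (`unionByPosition`, `unionByName`, integer-only rows), and it assumes both sides have the same set of column names.

```scala
// Toy comparison of positional and by-name union on integer rows.
// These helpers are illustrative only and are not Spark APIs.
object PositionalVsByName {
  type Row = Seq[Int]

  // Positional union: names are ignored, values are taken in column order.
  def unionByPosition(left: Seq[Row], right: Seq[Row]): Seq[Row] =
    left ++ right

  // By-name union: each right row is reordered to the left side's column order.
  // Assumes every left column name also exists on the right.
  def unionByName(
      leftCols: Seq[String], left: Seq[Row],
      rightCols: Seq[String], right: Seq[Row]): Seq[Row] = {
    val reordered = right.map(row => leftCols.map(c => row(rightCols.indexOf(c))))
    left ++ reordered
  }
}
```

For a left side with columns `(a, b)` and a right side with columns `(b, a)`, the two functions place the right-hand values in different columns, and calling `.distinct` on either result only removes duplicate rows.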






[GitHub] [spark] AmplabJenkins removed a comment on pull request #28996: [SPARK-29358][SQL] Make unionByName optionally fill missing columns with nulls

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28996:
URL: https://github.com/apache/spark/pull/28996#issuecomment-656512088








[GitHub] [spark] AmplabJenkins commented on pull request #28996: [SPARK-29358][SQL] Make unionByName optionally fill missing columns with nulls

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #28996:
URL: https://github.com/apache/spark/pull/28996#issuecomment-654645739








[GitHub] [spark] AmplabJenkins removed a comment on pull request #28996: [SPARK-29358][SQL] Make unionByName optionally fill missing columns with nulls

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28996:
URL: https://github.com/apache/spark/pull/28996#issuecomment-654953315








[GitHub] [spark] viirya commented on pull request #28996: [SPARK-29358][SQL] Make unionByName optionally fill missing columns with nulls

Posted by GitBox <gi...@apache.org>.
viirya commented on pull request #28996:
URL: https://github.com/apache/spark/pull/28996#issuecomment-656757538


   I'll add Python and R in a follow-up.




[GitHub] [spark] cloud-fan commented on a change in pull request #28996: [SPARK-29358][SQL] Make unionByName optionally fill missing columns with nulls

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #28996:
URL: https://github.com/apache/spark/pull/28996#discussion_r453460284



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala
##########
@@ -2048,19 +2088,34 @@ class Dataset[T] private[sql](
     // Builds a project list for `other` based on `logicalPlan` output names
     val rightProjectList = leftOutputAttrs.map { lattr =>
       rightOutputAttrs.find { rattr => resolver(lattr.name, rattr.name) }.getOrElse {
-        throw new AnalysisException(
-          s"""Cannot resolve column name "${lattr.name}" among """ +
-            s"""(${rightOutputAttrs.map(_.name).mkString(", ")})""")
+        if (allowMissingColumns) {

Review comment:
       I think the major problem here is we put the by-name logic in the API method, not in the `Analyzer`. Shall we add a boolean parameter to `Union`, and move the by-name logic to the type coercion rules?
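
One way to picture this suggestion is a logical `Union` node that carries the flags, plus a rewrite rule that turns a by-name union into an ordinary positional one by wrapping each side in a projection. The sketch below is a self-contained toy under that assumption: `Plan`, `Relation`, `Project`, `Union`, and `resolveUnion` are hypothetical stand-ins and do not reflect Catalyst's actual classes or rule signatures.

```scala
// Toy logical plan with a by-name flag on Union and a rule that resolves it.
// All names are illustrative and are not Catalyst APIs.
object ByNameRule {
  sealed trait Plan { def output: Seq[String] }

  final case class Relation(output: Seq[String]) extends Plan

  // Each entry is (output name, source column); None means "emit a NULL literal".
  final case class Project(exprs: Seq[(String, Option[String])], child: Plan) extends Plan {
    def output: Seq[String] = exprs.map(_._1)
  }

  final case class Union(
      left: Plan,
      right: Plan,
      byName: Boolean = false,
      allowMissing: Boolean = false) extends Plan {
    def output: Seq[String] = left.output
  }

  // Rewrites by-name unions into positional unions over aligned projections.
  def resolveUnion(plan: Plan): Plan = plan match {
    case Union(l, r, true, allowMissing) =>
      val left = resolveUnion(l)
      val right = resolveUnion(r)
      val onlyInLeft = left.output.filterNot(n => right.output.contains(n))
      val onlyInRight = right.output.filterNot(n => left.output.contains(n))
      if (!allowMissing && (onlyInLeft.nonEmpty || onlyInRight.nonEmpty)) {
        throw new IllegalArgumentException(
          s"Cannot resolve columns by name: $onlyInLeft vs $onlyInRight")
      }
      val out = left.output ++ onlyInRight
      def align(p: Plan): Plan =
        Project(out.map(n => n -> (if (p.output.contains(n)) Some(n) else None)), p)
      Union(align(left), align(right))
    case Union(l, r, false, allowMissing) =>
      Union(resolveUnion(l), resolveUnion(r), byName = false, allowMissing = allowMissing)
    case Project(exprs, child) => Project(exprs, resolveUnion(child))
    case r: Relation => r
  }
}
```

In this shape the API method only has to construct `Union(..., byName = true, allowMissing = ...)`, and the name matching lives in one rule that any front end could reuse.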






[GitHub] [spark] AmplabJenkins commented on pull request #28996: [SPARK-29358][SQL] Make unionByName optionally fill missing columns with nulls

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #28996:
URL: https://github.com/apache/spark/pull/28996#issuecomment-654367846








[GitHub] [spark] SparkQA commented on pull request #28996: [SPARK-29358][SQL] Make unionByName optionally fill missing columns with nulls

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #28996:
URL: https://github.com/apache/spark/pull/28996#issuecomment-653931886


   **[Test build #124924 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124924/testReport)** for PR 28996 at commit [`6afb8e8`](https://github.com/apache/spark/commit/6afb8e8ba07b73b0df6930075418e8f8299a198b).




[GitHub] [spark] SparkQA commented on pull request #28996: [SPARK-29358][SQL] Make unionByName optionally fill missing columns with nulls

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #28996:
URL: https://github.com/apache/spark/pull/28996#issuecomment-657102089


   **[Test build #125688 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125688/testReport)** for PR 28996 at commit [`8734983`](https://github.com/apache/spark/commit/873498394eaf80b2b302b6fc7aeec410e7113415).




[GitHub] [spark] AmplabJenkins removed a comment on pull request #28996: [SPARK-29358][SQL] Make unionByName optionally fill missing columns with nulls

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28996:
URL: https://github.com/apache/spark/pull/28996#issuecomment-654262786


   Merged build finished. Test FAILed.




[GitHub] [spark] cloud-fan commented on a change in pull request #28996: [SPARK-29358][SQL] Make unionByName optionally fill missing columns with nulls

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #28996:
URL: https://github.com/apache/spark/pull/28996#discussion_r452061943



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
##########
@@ -2656,6 +2656,14 @@ object SQLConf {
       .checkValue(_ > 0, "The difference must be positive.")
       .createWithDefault(4)
 
+  val ALLOW_MISSING_COLUMNS_IN_UNION_BY_NAME =
+    buildConf("spark.sql.allowMissingColumnsInUnionByName")
+    .doc("If this config is enabled, `Dataset.unionByName` allows different set of column names " +
+      "between two Datasets. Missing columns at each side, will be filled with null values.")

Review comment:
       We can add an overload method instead of using default parameter value.






[GitHub] [spark] SparkQA commented on pull request #28996: [SPARK-29358][SQL] Make unionByName optionally fill missing columns with nulls

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #28996:
URL: https://github.com/apache/spark/pull/28996#issuecomment-654059408


   **[Test build #125038 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125038/testReport)** for PR 28996 at commit [`5e4f670`](https://github.com/apache/spark/commit/5e4f67002955fa0536498df6657b5db5b17b0d56).




[GitHub] [spark] AmplabJenkins removed a comment on pull request #28996: [SPARK-29358][SQL] Make unionByName optionally fill missing columns with nulls

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28996:
URL: https://github.com/apache/spark/pull/28996#issuecomment-657101250








[GitHub] [spark] SparkQA commented on pull request #28996: [SPARK-29358][SQL] Make unionByName optionally fill missing columns with nulls

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #28996:
URL: https://github.com/apache/spark/pull/28996#issuecomment-654382695


   **[Test build #125111 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125111/testReport)** for PR 28996 at commit [`5e4f670`](https://github.com/apache/spark/commit/5e4f67002955fa0536498df6657b5db5b17b0d56).




[GitHub] [spark] viirya commented on a change in pull request #28996: [SPARK-29358][SQL] Make unionByName optionally fill missing columns with nulls

Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #28996:
URL: https://github.com/apache/spark/pull/28996#discussion_r453215994



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala
##########
@@ -2030,7 +2030,22 @@ class Dataset[T] private[sql](
    * @group typedrel
    * @since 2.3.0
    */
-  def unionByName(other: Dataset[T]): Dataset[T] = withSetOperator {
+  def unionByName(other: Dataset[T]): Dataset[T] = unionByName(other, false)
+
+  /**
+   * Returns a new Dataset containing union of rows in this Dataset and another Dataset.
+   *
+   * The difference between this function and [[union]] is that this function
+   * resolves columns by name (not by position).
+   *
+   * When the parameter `allowMissingColumns` is true, this function allows different set
+   * of column names between two Datasets. Missing columns at each side, will be filled with

Review comment:
       Good advice.

##########
File path: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala
##########
@@ -2030,7 +2030,22 @@ class Dataset[T] private[sql](
    * @group typedrel
    * @since 2.3.0
    */
-  def unionByName(other: Dataset[T]): Dataset[T] = withSetOperator {
+  def unionByName(other: Dataset[T]): Dataset[T] = unionByName(other, false)
+
+  /**
+   * Returns a new Dataset containing union of rows in this Dataset and another Dataset.
+   *
+   * The difference between this function and [[union]] is that this function
+   * resolves columns by name (not by position).
+   *
+   * When the parameter `allowMissingColumns` is true, this function allows different set
+   * of column names between two Datasets. Missing columns at each side, will be filled with
+   * null values.
+   *

Review comment:
       okay.






[GitHub] [spark] AmplabJenkins removed a comment on pull request #28996: [SPARK-29358][SQL] Make unionByName optionally fill missing columns with nulls

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28996:
URL: https://github.com/apache/spark/pull/28996#issuecomment-656469724


   Merged build finished. Test FAILed.




[GitHub] [spark] AmplabJenkins removed a comment on pull request #28996: [SPARK-29358][SQL] Make unionByName optionally fill missing columns with nulls

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28996:
URL: https://github.com/apache/spark/pull/28996#issuecomment-656469728


   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/125526/




[GitHub] [spark] bart-samwel commented on a change in pull request #28996: [SPARK-29358][SQL] Make unionByName optionally fill missing columns with nulls

Posted by GitBox <gi...@apache.org>.
bart-samwel commented on a change in pull request #28996:
URL: https://github.com/apache/spark/pull/28996#discussion_r450166216



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
##########
@@ -2656,6 +2656,14 @@ object SQLConf {
       .checkValue(_ > 0, "The difference must be positive.")
       .createWithDefault(4)
 
+  val ALLOW_MISSING_COLUMNS_IN_UNION_BY_NAME =
+    buildConf("spark.sql.allowMissingColumnsInUnionByName")
+    .doc("If this config is enabled, `Dataset.unionByName` allows different set of column names " +
+      "between two Datasets. Missing columns at each side, will be filled with null values.")

Review comment:
       I think there's value in forbidding this behavior by default, because allowing it might hide logic errors (unioning two totally different schemas). But this isn't a matter of old versus new behavior, and it shouldn't be decided by a global config: users would need to choose the strict or the permissive behavior *on a case-by-case basis*. Why not add a boolean parameter to `unionByName` to allow/disallow, defaulting to false (disallow)? Or maybe even add this as a separate function `unionByNameAllowMissing`? (Boolean arguments are not terribly informative to readers, so I'd actually lean towards having a separate name for this.)
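
The two API shapes discussed here can be compared side by side in a small sketch. `Frame` is a hypothetical stand-in for `Dataset` that tracks only column names, and the sketch uses an overload rather than a default parameter value, as suggested elsewhere in this thread; none of this is Spark's actual code.

```scala
// Toy stand-in for Dataset that models only the set of column names.
// Frame and its methods are illustrative and are not Spark APIs.
object ApiShapes {
  final case class Frame(columns: Set[String]) {

    // Strict variant: kept as the default so schema mismatches stay visible.
    def unionByName(other: Frame): Frame =
      unionByName(other, allowMissingColumns = false)

    // Shape 1: a per-call boolean flag, chosen case by case by the caller.
    def unionByName(other: Frame, allowMissingColumns: Boolean): Frame = {
      if (!allowMissingColumns && columns != other.columns) {
        throw new IllegalArgumentException(
          s"Column sets differ: $columns vs ${other.columns}")
      }
      Frame(columns ++ other.columns)
    }

    // Shape 2: a separately named method, so the call site documents itself.
    def unionByNameAllowMissing(other: Frame): Frame =
      unionByName(other, allowMissingColumns = true)
  }
}
```

At a call site, `a.unionByName(b, true)` gives the reader little hint about what `true` means unless the argument is named, whereas `a.unionByNameAllowMissing(b)` states the intent directly. Either way the choice is local to the call rather than controlled by a session-wide setting.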






[GitHub] [spark] AmplabJenkins commented on pull request #28996: [SPARK-29358][SQL] Make unionByName optionally fill missing columns with nulls

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #28996:
URL: https://github.com/apache/spark/pull/28996#issuecomment-654314564








[GitHub] [spark] AmplabJenkins removed a comment on pull request #28996: [SPARK-29358][SQL] Make unionByName optionally fill missing columns with nulls

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28996:
URL: https://github.com/apache/spark/pull/28996#issuecomment-654059845








[GitHub] [spark] AmplabJenkins removed a comment on pull request #28996: [SPARK-29358][SQL] Make unionByName optionally fill missing columns with nulls

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28996:
URL: https://github.com/apache/spark/pull/28996#issuecomment-655183548








[GitHub] [spark] AmplabJenkins removed a comment on pull request #28996: [SPARK-29358][SQL] Make unionByName optionally fill missing columns with nulls

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28996:
URL: https://github.com/apache/spark/pull/28996#issuecomment-656509472


   Merged build finished. Test FAILed.




[GitHub] [spark] SparkQA commented on pull request #28996: [SPARK-29358][SQL] Make unionByName optionally fill missing columns with nulls

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #28996:
URL: https://github.com/apache/spark/pull/28996#issuecomment-654952244


   **[Test build #125232 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125232/testReport)** for PR 28996 at commit [`5e4f670`](https://github.com/apache/spark/commit/5e4f67002955fa0536498df6657b5db5b17b0d56).




[GitHub] [spark] AmplabJenkins removed a comment on pull request #28996: [SPARK-29358][SQL] Make unionByName optionally fill missing columns with nulls

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28996:
URL: https://github.com/apache/spark/pull/28996#issuecomment-654007344


   Merged build finished. Test FAILed.




[GitHub] [spark] AmplabJenkins commented on pull request #28996: [SPARK-29358][SQL] Make unionByName optionally fill missing columns with nulls

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #28996:
URL: https://github.com/apache/spark/pull/28996#issuecomment-654007344








[GitHub] [spark] AmplabJenkins removed a comment on pull request #28996: [SPARK-29358][SQL] Make unionByName optionally fill missing columns with nulls

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28996:
URL: https://github.com/apache/spark/pull/28996#issuecomment-654507324


   Merged build finished. Test FAILed.




[GitHub] [spark] SparkQA removed a comment on pull request #28996: [SPARK-29358][SQL] Make unionByName optionally fill missing columns with nulls

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #28996:
URL: https://github.com/apache/spark/pull/28996#issuecomment-654314022


   **[Test build #125098 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125098/testReport)** for PR 28996 at commit [`5e4f670`](https://github.com/apache/spark/commit/5e4f67002955fa0536498df6657b5db5b17b0d56).




[GitHub] [spark] marmbrus commented on a change in pull request #28996: [SPARK-29358][SQL] Make unionByName optionally fill missing columns with nulls

Posted by GitBox <gi...@apache.org>.
marmbrus commented on a change in pull request #28996:
URL: https://github.com/apache/spark/pull/28996#discussion_r453046549



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala
##########
@@ -2030,7 +2030,25 @@ class Dataset[T] private[sql](
    * @group typedrel
    * @since 2.3.0
    */
-  def unionByName(other: Dataset[T]): Dataset[T] = withSetOperator {
+  def unionByName(other: Dataset[T]): Dataset[T] = unionByName(other, false)
+
+  /**
+   * Returns a new Dataset containing union of rows in this Dataset and another Dataset.
+   *
+   * This is different from both `UNION ALL` and `UNION DISTINCT` in SQL. To do a SQL-style set
+   * union (that does deduplication of elements), use this function followed by a [[distinct]].

Review comment:
       Wait, really? When did we change the semantics? What was confusing about that documentation? (It was added because users were confused by the behavior.)






[GitHub] [spark] dongjoon-hyun commented on a change in pull request #28996: [SPARK-29358][SQL] Make unionByName optionally fill missing columns with nulls

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on a change in pull request #28996:
URL: https://github.com/apache/spark/pull/28996#discussion_r453210465



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala
##########
@@ -2030,7 +2030,22 @@ class Dataset[T] private[sql](
    * @group typedrel
    * @since 2.3.0
    */
-  def unionByName(other: Dataset[T]): Dataset[T] = withSetOperator {
+  def unionByName(other: Dataset[T]): Dataset[T] = unionByName(other, false)
+
+  /**
+   * Returns a new Dataset containing union of rows in this Dataset and another Dataset.
+   *
+   * The difference between this function and [[union]] is that this function
+   * resolves columns by name (not by position).
+   *
+   * When the parameter `allowMissingColumns` is true, this function allows different set
+   * of column names between two Datasets. Missing columns at each side, will be filled with
+   * null values.
+   *

Review comment:
       Could you add an illustrative example like 2016 ~ 2029, @viirya?






[GitHub] [spark] AmplabJenkins removed a comment on pull request #28996: [SPARK-29358][SQL] Make unionByName optionally fill missing columns with nulls

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28996:
URL: https://github.com/apache/spark/pull/28996#issuecomment-654314564








[GitHub] [spark] SparkQA commented on pull request #28996: [SPARK-29358][SQL] Make unionByName optionally fill missing columns with nulls

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #28996:
URL: https://github.com/apache/spark/pull/28996#issuecomment-656417137


   **[Test build #125526 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125526/testReport)** for PR 28996 at commit [`df4e8dc`](https://github.com/apache/spark/commit/df4e8dc6a4bed3959b4317e3ff39da9f8aef5548).




[GitHub] [spark] SparkQA commented on pull request #28996: [SPARK-29358][SQL] Make unionByName optionally fill missing columns with nulls

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #28996:
URL: https://github.com/apache/spark/pull/28996#issuecomment-654644097


   **[Test build #125141 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125141/testReport)** for PR 28996 at commit [`5e4f670`](https://github.com/apache/spark/commit/5e4f67002955fa0536498df6657b5db5b17b0d56).
    * This patch **fails due to an unknown error code, -9**.
    * This patch merges cleanly.
    * This patch adds no public classes.




[GitHub] [spark] SparkQA removed a comment on pull request #28996: [SPARK-29358][SQL] Make unionByName optionally fill missing columns with nulls

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #28996:
URL: https://github.com/apache/spark/pull/28996#issuecomment-656417137


   **[Test build #125526 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125526/testReport)** for PR 28996 at commit [`df4e8dc`](https://github.com/apache/spark/commit/df4e8dc6a4bed3959b4317e3ff39da9f8aef5548).




[GitHub] [spark] SparkQA commented on pull request #28996: [SPARK-29358][SQL] Make unionByName optionally fill missing columns with nulls

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #28996:
URL: https://github.com/apache/spark/pull/28996#issuecomment-654759370


   **[Test build #125208 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125208/testReport)** for PR 28996 at commit [`5e4f670`](https://github.com/apache/spark/commit/5e4f67002955fa0536498df6657b5db5b17b0d56).




[GitHub] [spark] SparkQA commented on pull request #28996: [SPARK-29358][SQL] Make unionByName optionally fill missing columns with nulls

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #28996:
URL: https://github.com/apache/spark/pull/28996#issuecomment-654767299


   **[Test build #125208 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125208/testReport)** for PR 28996 at commit [`5e4f670`](https://github.com/apache/spark/commit/5e4f67002955fa0536498df6657b5db5b17b0d56).
    * This patch **fails to generate documentation**.
    * This patch merges cleanly.
    * This patch adds no public classes.




[GitHub] [spark] SparkQA removed a comment on pull request #28996: [SPARK-29358][SQL] Make unionByName optionally fill missing columns with nulls

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #28996:
URL: https://github.com/apache/spark/pull/28996#issuecomment-657102089








[GitHub] [spark] dongjoon-hyun closed pull request #28996: [SPARK-29358][SQL] Make unionByName optionally fill missing columns with nulls

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun closed pull request #28996:
URL: https://github.com/apache/spark/pull/28996


   




[GitHub] [spark] SparkQA commented on pull request #28996: [SPARK-29358][SQL] Make unionByName optionally fill missing columns with nulls

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #28996:
URL: https://github.com/apache/spark/pull/28996#issuecomment-657143519


   **[Test build #125688 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125688/testReport)** for PR 28996 at commit [`8734983`](https://github.com/apache/spark/commit/873498394eaf80b2b302b6fc7aeec410e7113415).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.




[GitHub] [spark] maropu commented on pull request #28996: [SPARK-29358][SQL] Make unionByName optionally fill missing columns with nulls

Posted by GitBox <gi...@apache.org>.
maropu commented on pull request #28996:
URL: https://github.com/apache/spark/pull/28996#issuecomment-654053594


   My bad, my last comment was ambiguous. What I meant was: how about adding some comments about this new behaviour to the API doc so that users can notice it?




[GitHub] [spark] gatorsmile commented on a change in pull request #28996: [SPARK-29358][SQL] Make unionByName optionally fill missing columns with nulls

Posted by GitBox <gi...@apache.org>.
gatorsmile commented on a change in pull request #28996:
URL: https://github.com/apache/spark/pull/28996#discussion_r483418373



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala
##########
@@ -2030,7 +2030,47 @@ class Dataset[T] private[sql](
    * @group typedrel
    * @since 2.3.0
    */
-  def unionByName(other: Dataset[T]): Dataset[T] = withSetOperator {
+  def unionByName(other: Dataset[T]): Dataset[T] = unionByName(other, false)
+
+  /**
+   * Returns a new Dataset containing union of rows in this Dataset and another Dataset.
+   *
+   * The difference between this function and [[union]] is that this function
+   * resolves columns by name (not by position).
+   *
+   * When the parameter `allowMissingColumns` is true, this function allows different set
+   * of column names between two Datasets. Missing columns at each side, will be filled with
+   * null values. The missing columns at left Dataset will be added at the end in the schema
+   * of the union result:
+   *
+   * {{{
+   *   val df1 = Seq((1, 2, 3)).toDF("col0", "col1", "col2")
+   *   val df2 = Seq((4, 5, 6)).toDF("col1", "col0", "col3")
+   *   df1.unionByName(df2, true).show
+   *
+   *   // output: "col3" is missing at left df1 and added at the end of schema.
+   *   // +----+----+----+----+
+   *   // |col0|col1|col2|col3|
+   *   // +----+----+----+----+
+   *   // |   1|   2|   3|null|
+   *   // |   5|   4|null|   6|
+   *   // +----+----+----+----+
+   *
+   *   df2.unionByName(df1, true).show
+   *
+   *   // output: "col2" is missing at left df2 and added at the end of schema.
+   *   // +----+----+----+----+
+   *   // |col1|col0|col3|col2|
+   *   // +----+----+----+----+
+   *   // |   4|   5|   6|null|
+   *   // |   2|   1|null|   3|
+   *   // +----+----+----+----+
+   * }}}
+   *
+   * @group typedrel
+   * @since 3.1.0
+   */
+  def unionByName(other: Dataset[T], allowMissingColumns: Boolean): Dataset[T] = withSetOperator {

Review comment:
       Do we have a JIRA to add the corresponding API for Python? 
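
For readers following the thread, the behaviour described in the Scaladoc quoted above can be modelled without Spark. The following is only a minimal sketch in plain Scala, not the code from this PR: `Table`, `UnionByNameSketch`, the use of `Option[Int]` with `None` standing in for null, exact-match column names, and the `IllegalArgumentException` in strict mode are all assumptions made for illustration.

```scala
// Hypothetical stand-in for a Dataset: column names plus rows of optional values.
// None plays the role of a SQL null.
final case class Table(columns: Seq[String], rows: Seq[Seq[Option[Int]]])

object UnionByNameSketch {
  def unionByName(left: Table, right: Table, allowMissingColumns: Boolean): Table = {
    // Columns that exist on only one side.
    val rightOnly = right.columns.filterNot(left.columns.contains)
    val leftOnly = left.columns.filterNot(right.columns.contains)

    // Strict mode: any mismatch in column names is an error (assumed error type).
    if (!allowMissingColumns && (rightOnly.nonEmpty || leftOnly.nonEmpty)) {
      throw new IllegalArgumentException(
        s"Column sets differ: (${left.columns.mkString(", ")}) vs (${right.columns.mkString(", ")})")
    }

    // The result keeps the left column order; columns missing on the left go at the end.
    val schema = left.columns ++ rightOnly

    // Reorder one side's rows to the result schema, filling absent columns with None.
    def project(t: Table): Seq[Seq[Option[Int]]] = {
      val index = t.columns.zipWithIndex.toMap
      t.rows.map(row => schema.map(name => index.get(name).flatMap(i => row(i))))
    }

    Table(schema, project(left) ++ project(right))
  }
}
```

With the inputs from the Scaladoc example, `df1` as `Table(Seq("col0", "col1", "col2"), ...)` and `df2` as `Table(Seq("col1", "col0", "col3"), ...)`, this model yields the column order `col0, col1, col2, col3`, matching the first table shown in the doc comment.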






[GitHub] [spark] dongjoon-hyun commented on pull request #28996: [SPARK-29358][SQL] Make unionByName optionally fill missing columns with nulls

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on pull request #28996:
URL: https://github.com/apache/spark/pull/28996#issuecomment-657143289


   Merged to master for Apache Spark 3.1.0. Thank you, @viirya and all.
   All UTs had already passed at the last commit.




[GitHub] [spark] AmplabJenkins commented on pull request #28996: [SPARK-29358][SQL] Make unionByName optionally fill missing columns with nulls

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #28996:
URL: https://github.com/apache/spark/pull/28996#issuecomment-656520798








[GitHub] [spark] AmplabJenkins removed a comment on pull request #28996: [SPARK-29358][SQL] Make unionByName optionally fill missing columns with nulls

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28996:
URL: https://github.com/apache/spark/pull/28996#issuecomment-654056095


   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/125022/
   Test FAILed.




[GitHub] [spark] AmplabJenkins removed a comment on pull request #28996: [SPARK-29358][SQL] Make unionByName optionally fill missing columns with nulls

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28996:
URL: https://github.com/apache/spark/pull/28996#issuecomment-654010095








[GitHub] [spark] SparkQA commented on pull request #28996: [SPARK-29358][SQL] Make unionByName optionally fill missing columns with nulls

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #28996:
URL: https://github.com/apache/spark/pull/28996#issuecomment-656509082


   **[Test build #125551 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125551/testReport)** for PR 28996 at commit [`df4e8dc`](https://github.com/apache/spark/commit/df4e8dc6a4bed3959b4317e3ff39da9f8aef5548).
    * This patch **fails Spark unit tests**.
    * This patch merges cleanly.
    * This patch adds no public classes.




[GitHub] [spark] SparkQA removed a comment on pull request #28996: [SPARK-29358][SQL] Make unionByName optionally fill missing columns with nulls

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #28996:
URL: https://github.com/apache/spark/pull/28996#issuecomment-654510421


   **[Test build #125141 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125141/testReport)** for PR 28996 at commit [`5e4f670`](https://github.com/apache/spark/commit/5e4f67002955fa0536498df6657b5db5b17b0d56).




[GitHub] [spark] AmplabJenkins commented on pull request #28996: [SPARK-29358][SQL] Make unionByName optionally fill missing columns with nulls

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #28996:
URL: https://github.com/apache/spark/pull/28996#issuecomment-654056087








[GitHub] [spark] SparkQA commented on pull request #28996: [SPARK-29358][SQL] Make unionByName optionally fill missing columns with nulls

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #28996:
URL: https://github.com/apache/spark/pull/28996#issuecomment-654367565


   **[Test build #125098 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125098/testReport)** for PR 28996 at commit [`5e4f670`](https://github.com/apache/spark/commit/5e4f67002955fa0536498df6657b5db5b17b0d56).
    * This patch **fails Spark unit tests**.
    * This patch merges cleanly.
    * This patch adds no public classes.




[GitHub] [spark] cloud-fan commented on a change in pull request #28996: [SPARK-29358][SQL] Make unionByName optionally fill missing columns with nulls

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #28996:
URL: https://github.com/apache/spark/pull/28996#discussion_r453459502



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala
##########
@@ -2030,7 +2030,25 @@ class Dataset[T] private[sql](
    * @group typedrel
    * @since 2.3.0
    */
-  def unionByName(other: Dataset[T]): Dataset[T] = withSetOperator {
+  def unionByName(other: Dataset[T]): Dataset[T] = unionByName(other, false)
+
+  /**
+   * Returns a new Dataset containing union of rows in this Dataset and another Dataset.
+   *
+   * This is different from both `UNION ALL` and `UNION DISTINCT` in SQL. To do a SQL-style set
+   * union (that does deduplication of elements), use this function followed by a [[distinct]].

Review comment:
       Seems like we mistakenly copied the doc from `union` to `unionByName`.






[GitHub] [spark] SparkQA commented on pull request #28996: [SPARK-29358][SQL] Make unionByName optionally fill missing columns with nulls

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #28996:
URL: https://github.com/apache/spark/pull/28996#issuecomment-654054390


   **[Test build #125022 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125022/testReport)** for PR 28996 at commit [`5e4f670`](https://github.com/apache/spark/commit/5e4f67002955fa0536498df6657b5db5b17b0d56).
    * This patch **fails due to an unknown error code, -9**.
    * This patch merges cleanly.
    * This patch adds no public classes.




[GitHub] [spark] SparkQA commented on pull request #28996: [SPARK-29358][SQL] Make unionByName optionally fill missing columns with nulls

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #28996:
URL: https://github.com/apache/spark/pull/28996#issuecomment-656753866


   **[Test build #125625 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125625/testReport)** for PR 28996 at commit [`e2311fa`](https://github.com/apache/spark/commit/e2311fafc171ad47aaee8bcd98e1cd0e6c745016).




[GitHub] [spark] SparkQA removed a comment on pull request #28996: [SPARK-29358][SQL] Make unionByName optionally fill missing columns with nulls

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #28996:
URL: https://github.com/apache/spark/pull/28996#issuecomment-654759370


   **[Test build #125208 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125208/testReport)** for PR 28996 at commit [`5e4f670`](https://github.com/apache/spark/commit/5e4f67002955fa0536498df6657b5db5b17b0d56).




[GitHub] [spark] AmplabJenkins commented on pull request #28996: [SPARK-29358][SQL] Make unionByName optionally fill missing columns with nulls

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #28996:
URL: https://github.com/apache/spark/pull/28996#issuecomment-656417357








[GitHub] [spark] dongjoon-hyun commented on a change in pull request #28996: [SPARK-29358][SQL] Make unionByName optionally fill missing columns with nulls

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on a change in pull request #28996:
URL: https://github.com/apache/spark/pull/28996#discussion_r454566423



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala
##########
@@ -2048,19 +2088,34 @@ class Dataset[T] private[sql](
     // Builds a project list for `other` based on `logicalPlan` output names
     val rightProjectList = leftOutputAttrs.map { lattr =>
       rightOutputAttrs.find { rattr => resolver(lattr.name, rattr.name) }.getOrElse {
-        throw new AnalysisException(
-          s"""Cannot resolve column name "${lattr.name}" among """ +
-            s"""(${rightOutputAttrs.map(_.name).mkString(", ")})""")
+        if (allowMissingColumns) {

Review comment:
       Thanks, @cloud-fan.






[GitHub] [spark] AmplabJenkins removed a comment on pull request #28996: [SPARK-29358][SQL] Make unionByName optionally fill missing columns with nulls

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28996:
URL: https://github.com/apache/spark/pull/28996#issuecomment-656509482


   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/125551/
   Test FAILed.




[GitHub] [spark] viirya commented on pull request #28996: [SPARK-29358][SQL] Make unionByName optionally fill missing columns with nulls

Posted by GitBox <gi...@apache.org>.
viirya commented on pull request #28996:
URL: https://github.com/apache/spark/pull/28996#issuecomment-654004956


   > > To do a SQL-style set union (that does deduplication of elements), use this function followed by a [[distinct]].
   
   Read with the previous sentence, I think the doc means that this API doesn't deduplicate elements. The doc explains that this API resolves columns by name, not by position like `union`.
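
The distinction can be illustrated with a minimal sketch in plain Scala, with no Spark dependency. `Table`, `unionByPosition`, and `unionByNameSketch` are hypothetical names invented for this illustration; they are not Spark APIs and only mimic the documented behavior (resolve by name, keep duplicates).

```scala
// Hypothetical stand-in for a DataFrame: column names plus positional rows.
final case class Table(columns: Seq[String], rows: Seq[Seq[Any]])

object UnionSketch {
  // Positional union: right rows are appended as-is; left column names win.
  def unionByPosition(left: Table, right: Table): Table = {
    require(left.columns.size == right.columns.size, "column counts differ")
    Table(left.columns, left.rows ++ right.rows)
  }

  // By-name union: each right row is reordered to follow the left column order.
  // Neither variant removes duplicate rows.
  def unionByNameSketch(left: Table, right: Table): Table = {
    val rightIndex = right.columns.zipWithIndex.toMap
    val reordered = right.rows.map { row =>
      left.columns.map { name =>
        val i = rightIndex.getOrElse(name,
          throw new IllegalArgumentException(s"Cannot resolve column name '$name'"))
        row(i)
      }
    }
    Table(left.columns, left.rows ++ reordered)
  }
}
```

With `Table(Seq("a", "b"), Seq(Seq(1, 2)))` on the left and `Table(Seq("b", "a"), Seq(Seq(2, 1)))` on the right, the positional variant keeps the right row as `(2, 1)`, while the by-name variant reorders it to `(1, 2)` and keeps it as a duplicate. Calling `.distinct` on the resulting rows corresponds to the extra step the API doc mentions for a SQL-style set union.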
   
   




[GitHub] [spark] AmplabJenkins removed a comment on pull request #28996: [SPARK-29358][SQL] Make unionByName optionally fill missing columns with nulls

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28996:
URL: https://github.com/apache/spark/pull/28996#issuecomment-656520805


   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/125563/
   Test FAILed.




[GitHub] [spark] SparkQA removed a comment on pull request #28996: [SPARK-29358][SQL] Make unionByName optionally fill missing columns with nulls

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #28996:
URL: https://github.com/apache/spark/pull/28996#issuecomment-656511780


   **[Test build #125563 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125563/testReport)** for PR 28996 at commit [`df4e8dc`](https://github.com/apache/spark/commit/df4e8dc6a4bed3959b4317e3ff39da9f8aef5548).




[GitHub] [spark] maropu commented on a change in pull request #28996: [SPARK-29358][SQL] Make unionByName optionally fill missing columns with nulls

Posted by GitBox <gi...@apache.org>.
maropu commented on a change in pull request #28996:
URL: https://github.com/apache/spark/pull/28996#discussion_r449869961



##########
File path: sql/core/src/test/scala/org/apache/spark/sql/DataFrameSetOperationsSuite.scala
##########
@@ -506,4 +506,23 @@ class DataFrameSetOperationsSuite extends QueryTest with SharedSparkSession {
     check(lit(2).cast("int"), $"c" === 2, Seq(Row(1, 1, 2, 0), Row(1, 1, 2, 1)))
     check(lit(2).cast("int"), $"c" =!= 2, Seq())
   }
+
+  test("SPARK-29358: Make unionByName optionally fill missing columns with nulls") {
+    withSQLConf(SQLConf.ALLOW_MISSING_COLUMNS_IN_UNION_BY_NAME.key -> "true") {
+      var df1 = Seq(1, 2, 3).toDF("a")
+      var df2 = Seq(3, 1, 2).toDF("b")
+      val df3 = Seq(2, 3, 1).toDF("c")
+      val unionDf = df1.unionByName(df2.unionByName(df3))
+      checkAnswer(unionDf,
+        Row(1, null, null) :: Row(2, null, null) :: Row(3, null, null) :: // df1
+          Row(null, 3, null) :: Row(null, 1, null) :: Row(null, 2, null) :: // df2
+          Row(null, null, 2) :: Row(null, null, 3) :: Row(null, null, 1) :: Nil // df3
+      )
+
+      df1 = Seq((1, 2)).toDF("a", "c")
+      df2 = Seq((3, 4, 5)).toDF("a", "b", "c")
+      checkAnswer(df1.unionByName(df2),

Review comment:
       Is this operation asymmetric? That is, is `df1.unionByName(df2)` accepted while `df2.unionByName(df1)` is not?
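
For reference, here is a sketch of what a symmetric fill could look like, written in plain Scala without Spark. It does not show what the patch does at this commit. `Frame` and `unionFillingNulls` are hypothetical names, `None` stands in for SQL NULL, and the output column order (left columns first, then right-only columns) is an assumption made for the illustration.

```scala
// Hypothetical stand-in for a DataFrame; None plays the role of SQL NULL.
final case class Frame(columns: Seq[String], rows: Seq[Seq[Option[Int]]])

object FillSketch {
  // Re-project `frame` onto the `target` columns; absent columns become None.
  private def project(frame: Frame, target: Seq[String]): Seq[Seq[Option[Int]]] = {
    val index = frame.columns.zipWithIndex.toMap
    frame.rows.map(row => target.map(name => index.get(name).flatMap(i => row(i))))
  }

  // Assumed ordering: left columns first, then columns only the right side has.
  def unionFillingNulls(left: Frame, right: Frame): Frame = {
    val rightOnly = right.columns.filterNot(c => left.columns.contains(c))
    val target = left.columns ++ rightOnly
    Frame(target, project(left, target) ++ project(right, target))
  }
}
```

Under these assumptions, the `(a, c)` and `(a, b, c)` frames from the test can be unioned in either direction; the two results differ only in column order.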






[GitHub] [spark] SparkQA removed a comment on pull request #28996: [SPARK-29358][SQL] Make unionByName optionally fill missing columns with nulls

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #28996:
URL: https://github.com/apache/spark/pull/28996#issuecomment-656472034


   **[Test build #125551 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125551/testReport)** for PR 28996 at commit [`df4e8dc`](https://github.com/apache/spark/commit/df4e8dc6a4bed3959b4317e3ff39da9f8aef5548).




[GitHub] [spark] AmplabJenkins removed a comment on pull request #28996: [SPARK-29358][SQL] Make unionByName optionally fill missing columns with nulls

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28996:
URL: https://github.com/apache/spark/pull/28996#issuecomment-654007350


   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/124924/
   Test FAILed.




[GitHub] [spark] AmplabJenkins removed a comment on pull request #28996: [SPARK-29358][SQL] Make unionByName optionally fill missing columns with nulls

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28996:
URL: https://github.com/apache/spark/pull/28996#issuecomment-656520798


   Merged build finished. Test FAILed.




[GitHub] [spark] SparkQA removed a comment on pull request #28996: [SPARK-29358][SQL] Make unionByName optionally fill missing columns with nulls

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #28996:
URL: https://github.com/apache/spark/pull/28996#issuecomment-654059408


   **[Test build #125038 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125038/testReport)** for PR 28996 at commit [`5e4f670`](https://github.com/apache/spark/commit/5e4f67002955fa0536498df6657b5db5b17b0d56).




[GitHub] [spark] SparkQA commented on pull request #28996: [SPARK-29358][SQL] Make unionByName optionally fill missing columns with nulls

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #28996:
URL: https://github.com/apache/spark/pull/28996#issuecomment-655182001


   **[Test build #125232 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125232/testReport)** for PR 28996 at commit [`5e4f670`](https://github.com/apache/spark/commit/5e4f67002955fa0536498df6657b5db5b17b0d56).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.




[GitHub] [spark] cloud-fan commented on a change in pull request #28996: [SPARK-29358][SQL] Make unionByName optionally fill missing columns with nulls

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #28996:
URL: https://github.com/apache/spark/pull/28996#discussion_r450162816



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
##########
@@ -2656,6 +2656,14 @@ object SQLConf {
       .checkValue(_ > 0, "The difference must be positive.")
       .createWithDefault(4)
 
+  val ALLOW_MISSING_COLUMNS_IN_UNION_BY_NAME =
+    buildConf("spark.sql.allowMissingColumnsInUnionByName")
+    .doc("If this config is enabled, `Dataset.unionByName` allows different set of column names " +
+      "between two Datasets. Missing columns at each side, will be filled with null values.")

Review comment:
       This doesn't seem to be a breaking change: the case failed before, and now we allow it by filling missing columns with nulls. Do we really need a config? cc @gatorsmile @bart-samwel






[GitHub] [spark] SparkQA commented on pull request #28996: [SPARK-29358][SQL] Make unionByName optionally fill missing columns with nulls

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #28996:
URL: https://github.com/apache/spark/pull/28996#issuecomment-656511780


   **[Test build #125563 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125563/testReport)** for PR 28996 at commit [`df4e8dc`](https://github.com/apache/spark/commit/df4e8dc6a4bed3959b4317e3ff39da9f8aef5548).




[GitHub] [spark] viirya commented on pull request #28996: [SPARK-29358][SQL] Make unionByName optionally fill missing columns with nulls

Posted by GitBox <gi...@apache.org>.
viirya commented on pull request #28996:
URL: https://github.com/apache/spark/pull/28996#issuecomment-653920644


   `unionByName` is not a SQL-style union, as the API doc says.




[GitHub] [spark] HyukjinKwon commented on a change in pull request #28996: [SPARK-29358][SQL] Make unionByName optionally fill missing columns with nulls

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on a change in pull request #28996:
URL: https://github.com/apache/spark/pull/28996#discussion_r452068789



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
##########
@@ -2656,6 +2656,14 @@ object SQLConf {
       .checkValue(_ > 0, "The difference must be positive.")
       .createWithDefault(4)
 
+  val ALLOW_MISSING_COLUMNS_IN_UNION_BY_NAME =
+    buildConf("spark.sql.allowMissingColumnsInUnionByName")
+    .doc("If this config is enabled, `Dataset.unionByName` allows different set of column names " +
+      "between two Datasets. Missing columns at each side, will be filled with null values.")

Review comment:
       +1 for overloading.






[GitHub] [spark] viirya commented on a change in pull request #28996: [SPARK-29358][SQL] Make unionByName optionally fill missing columns with nulls

Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #28996:
URL: https://github.com/apache/spark/pull/28996#discussion_r451985674



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
##########
@@ -2656,6 +2656,14 @@ object SQLConf {
       .checkValue(_ > 0, "The difference must be positive.")
       .createWithDefault(4)
 
+  val ALLOW_MISSING_COLUMNS_IN_UNION_BY_NAME =
+    buildConf("spark.sql.allowMissingColumnsInUnionByName")
+    .doc("If this config is enabled, `Dataset.unionByName` allows different set of column names " +
+      "between two Datasets. Missing columns at each side, will be filled with null values.")

Review comment:
       But wouldn't a parameter with a default value be awkward for Java callers?
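
The trade-off between a default argument and explicit overloads can be sketched as follows. This is an illustration only: `ApiShapes`, `WithDefault`, and `WithOverloads` are hypothetical, and `other: String` is a placeholder for the real `Dataset` argument so that the snippet compiles without Spark.

```scala
object ApiShapes {
  // Shape 1: one method with a default argument. Scala callers may omit the
  // flag, but scalac exposes the default only through a synthetic
  // `unionByName$default$2` method, so Java callers must pass both arguments.
  final class WithDefault {
    def unionByName(other: String, allowMissingColumns: Boolean = false): String =
      s"union($other, allowMissingColumns=$allowMissingColumns)"
  }

  // Shape 2: explicit overloads. Both signatures are ordinary methods, so
  // Java and Scala callers alike can use the one-argument form directly.
  final class WithOverloads {
    def unionByName(other: String): String =
      unionByName(other, allowMissingColumns = false)

    def unionByName(other: String, allowMissingColumns: Boolean): String =
      s"union($other, allowMissingColumns=$allowMissingColumns)"
  }
}
```

From Scala the two shapes behave the same at the call site; the difference only shows up for Java callers, which is the concern raised here.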






[GitHub] [spark] viirya commented on pull request #28996: [SPARK-29358][SQL] Make unionByName optionally fill missing columns with nulls

Posted by GitBox <gi...@apache.org>.
viirya commented on pull request #28996:
URL: https://github.com/apache/spark/pull/28996#issuecomment-654381621


   retest this please...




[GitHub] [spark] dongjoon-hyun commented on a change in pull request #28996: [SPARK-29358][SQL] Make unionByName optionally fill missing columns with nulls

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on a change in pull request #28996:
URL: https://github.com/apache/spark/pull/28996#discussion_r453744272



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala
##########
@@ -2048,19 +2088,34 @@ class Dataset[T] private[sql](
     // Builds a project list for `other` based on `logicalPlan` output names
     val rightProjectList = leftOutputAttrs.map { lattr =>
       rightOutputAttrs.find { rattr => resolver(lattr.name, rattr.name) }.getOrElse {
-        throw new AnalysisException(
-          s"""Cannot resolve column name "${lattr.name}" among """ +
-            s"""(${rightOutputAttrs.map(_.name).mkString(", ")})""")
+        if (allowMissingColumns) {

Review comment:
       @cloud-fan, `unionByName` (and its by-name logic) has been here since Apache Spark 2.3.0.
   It would be great if we could proceed with that refactoring suggestion as a separate JIRA.






[GitHub] [spark] SparkQA commented on pull request #28996: [SPARK-29358][SQL] Make unionByName optionally fill missing columns with nulls

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #28996:
URL: https://github.com/apache/spark/pull/28996#issuecomment-657143955


   **[Test build #125687 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125687/testReport)** for PR 28996 at commit [`f0bf462`](https://github.com/apache/spark/commit/f0bf462556af20d416b85a2882a89ffef873ad80).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.




[GitHub] [spark] SparkQA commented on pull request #28996: [SPARK-29358][SQL] Make unionByName optionally fill missing columns with nulls

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #28996:
URL: https://github.com/apache/spark/pull/28996#issuecomment-656520178


   **[Test build #125563 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125563/testReport)** for PR 28996 at commit [`df4e8dc`](https://github.com/apache/spark/commit/df4e8dc6a4bed3959b4317e3ff39da9f8aef5548).
    * This patch **fails due to an unknown error code, -9**.
    * This patch merges cleanly.
    * This patch adds no public classes.




[GitHub] [spark] maropu commented on a change in pull request #28996: [SPARK-29358][SQL] Make unionByName optionally fill missing columns with nulls

Posted by GitBox <gi...@apache.org>.
maropu commented on a change in pull request #28996:
URL: https://github.com/apache/spark/pull/28996#discussion_r449933207



##########
File path: sql/core/src/test/scala/org/apache/spark/sql/DataFrameSetOperationsSuite.scala
##########
@@ -506,4 +506,23 @@ class DataFrameSetOperationsSuite extends QueryTest with SharedSparkSession {
     check(lit(2).cast("int"), $"c" === 2, Seq(Row(1, 1, 2, 0), Row(1, 1, 2, 1)))
     check(lit(2).cast("int"), $"c" =!= 2, Seq())
   }
+
+  test("SPARK-29358: Make unionByName optionally fill missing columns with nulls") {
+    withSQLConf(SQLConf.ALLOW_MISSING_COLUMNS_IN_UNION_BY_NAME.key -> "true") {
+      var df1 = Seq(1, 2, 3).toDF("a")
+      var df2 = Seq(3, 1, 2).toDF("b")
+      val df3 = Seq(2, 3, 1).toDF("c")
+      val unionDf = df1.unionByName(df2.unionByName(df3))
+      checkAnswer(unionDf,
+        Row(1, null, null) :: Row(2, null, null) :: Row(3, null, null) :: // df1
+          Row(null, 3, null) :: Row(null, 1, null) :: Row(null, 2, null) :: // df2
+          Row(null, null, 2) :: Row(null, null, 3) :: Row(null, null, 1) :: Nil // df3
+      )
+
+      df1 = Seq((1, 2)).toDF("a", "c")
+      df2 = Seq((3, 4, 5)).toDF("a", "b", "c")
+      checkAnswer(df1.unionByName(df2),

Review comment:
       Could you add tests for both cases?






[GitHub] [spark] AmplabJenkins commented on pull request #28996: [SPARK-29358][SQL] Make unionByName optionally fill missing columns with nulls

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #28996:
URL: https://github.com/apache/spark/pull/28996#issuecomment-655183548








[GitHub] [spark] SparkQA commented on pull request #28996: [SPARK-29358][SQL] Make unionByName optionally fill missing columns with nulls

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #28996:
URL: https://github.com/apache/spark/pull/28996#issuecomment-656525135


   **[Test build #125572 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125572/testReport)** for PR 28996 at commit [`df4e8dc`](https://github.com/apache/spark/commit/df4e8dc6a4bed3959b4317e3ff39da9f8aef5548).




[GitHub] [spark] AmplabJenkins commented on pull request #28996: [SPARK-29358][SQL] Make unionByName optionally fill missing columns with nulls

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #28996:
URL: https://github.com/apache/spark/pull/28996#issuecomment-656472252








[GitHub] [spark] viirya commented on pull request #28996: [SPARK-29358][SQL] Make unionByName optionally fill missing columns with nulls

Posted by GitBox <gi...@apache.org>.
viirya commented on pull request #28996:
URL: https://github.com/apache/spark/pull/28996#issuecomment-656523020


   retest this please




[GitHub] [spark] AmplabJenkins removed a comment on pull request #28996: [SPARK-29358][SQL] Make unionByName optionally fill missing columns with nulls

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28996:
URL: https://github.com/apache/spark/pull/28996#issuecomment-656745260


   Merged build finished. Test FAILed.




[GitHub] [spark] AmplabJenkins removed a comment on pull request #28996: [SPARK-29358][SQL] Make unionByName optionally fill missing columns with nulls

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28996:
URL: https://github.com/apache/spark/pull/28996#issuecomment-654262794


   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/125038/
   Test FAILed.




[GitHub] [spark] AmplabJenkins commented on pull request #28996: [SPARK-29358][SQL] Make unionByName optionally fill missing columns with nulls

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #28996:
URL: https://github.com/apache/spark/pull/28996#issuecomment-657144208








[GitHub] [spark] viirya commented on a change in pull request #28996: [SPARK-29358][SQL] Make unionByName optionally fill missing columns with nulls

Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #28996:
URL: https://github.com/apache/spark/pull/28996#discussion_r449933371



##########
File path: sql/core/src/test/scala/org/apache/spark/sql/DataFrameSetOperationsSuite.scala
##########
@@ -506,4 +506,23 @@ class DataFrameSetOperationsSuite extends QueryTest with SharedSparkSession {
     check(lit(2).cast("int"), $"c" === 2, Seq(Row(1, 1, 2, 0), Row(1, 1, 2, 1)))
     check(lit(2).cast("int"), $"c" =!= 2, Seq())
   }
+
+  test("SPARK-29358: Make unionByName optionally fill missing columns with nulls") {
+    withSQLConf(SQLConf.ALLOW_MISSING_COLUMNS_IN_UNION_BY_NAME.key -> "true") {
+      var df1 = Seq(1, 2, 3).toDF("a")
+      var df2 = Seq(3, 1, 2).toDF("b")
+      val df3 = Seq(2, 3, 1).toDF("c")
+      val unionDf = df1.unionByName(df2.unionByName(df3))
+      checkAnswer(unionDf,
+        Row(1, null, null) :: Row(2, null, null) :: Row(3, null, null) :: // df1
+          Row(null, 3, null) :: Row(null, 1, null) :: Row(null, 2, null) :: // df2
+          Row(null, null, 2) :: Row(null, null, 3) :: Row(null, null, 1) :: Nil // df3
+      )
+
+      df1 = Seq((1, 2)).toDF("a", "c")
+      df2 = Seq((3, 4, 5)).toDF("a", "b", "c")
+      checkAnswer(df1.unionByName(df2),

Review comment:
       sure.






[GitHub] [spark] viirya commented on a change in pull request #28996: [SPARK-29358][SQL] Make unionByName optionally fill missing columns with nulls

Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #28996:
URL: https://github.com/apache/spark/pull/28996#discussion_r450312370



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
##########
@@ -2656,6 +2656,14 @@ object SQLConf {
       .checkValue(_ > 0, "The difference must be positive.")
       .createWithDefault(4)
 
+  val ALLOW_MISSING_COLUMNS_IN_UNION_BY_NAME =
+    buildConf("spark.sql.allowMissingColumnsInUnionByName")
+    .doc("If this config is enabled, `Dataset.unionByName` allows different set of column names " +
+      "between two Datasets. Missing columns at each side, will be filled with null values.")

Review comment:
       Adding a boolean parameter with a default value sounds better than a new method `unionByNameAllowMissing`, because I guess we are more conservative about adding new methods to Dataset. But I have a concern about calling a method with a default parameter from Java.
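
       On the Java concern: Scala default arguments compile to synthetic `name$default$N` methods, so a Java caller still has to pass every argument explicitly. The usual alternative is a pair of overloads. The block below is a minimal sketch of that shape only; `Data` is a hypothetical stand-in for `Dataset` that tracks column names, and raising an error via `require` in strict mode is an assumption of the sketch, not Spark's behavior.

```scala
object OverloadSketch {
  // Hypothetical stand-in for Dataset: it tracks column names only.
  final class Data(val columns: Seq[String]) {

    // One-argument overload: callable from Java with no extra argument.
    def unionByName(other: Data): Data =
      unionByName(other, allowMissingColumns = false)

    // Two-argument overload: the flag is explicit, with no Scala default value.
    def unionByName(other: Data, allowMissingColumns: Boolean): Data = {
      val rightOnly = other.columns.filterNot(c => columns.contains(c))
      val leftOnly = columns.filterNot(c => other.columns.contains(c))
      // Strict mode (assumption of this sketch): reject differing column sets.
      require(
        allowMissingColumns || (leftOnly.isEmpty && rightOnly.isEmpty),
        "Column sets differ: " + columns.mkString(", ") +
          " vs " + other.columns.mkString(", "))
      // Left columns keep their order; right-only columns are appended.
      new Data(columns ++ rightOnly)
    }
  }
}
```

       From Java, both `a.unionByName(b)` and `a.unionByName(b, true)` then resolve as ordinary method calls.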






[GitHub] [spark] SparkQA removed a comment on pull request #28996: [SPARK-29358][SQL] Make unionByName optionally fill missing columns with nulls

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #28996:
URL: https://github.com/apache/spark/pull/28996#issuecomment-654952244


   **[Test build #125232 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125232/testReport)** for PR 28996 at commit [`5e4f670`](https://github.com/apache/spark/commit/5e4f67002955fa0536498df6657b5db5b17b0d56).




[GitHub] [spark] viirya commented on a change in pull request #28996: [SPARK-29358][SQL] Make unionByName optionally fill missing columns with nulls

Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #28996:
URL: https://github.com/apache/spark/pull/28996#discussion_r449933260



##########
File path: sql/core/src/test/scala/org/apache/spark/sql/DataFrameSetOperationsSuite.scala
##########
@@ -506,4 +506,23 @@ class DataFrameSetOperationsSuite extends QueryTest with SharedSparkSession {
     check(lit(2).cast("int"), $"c" === 2, Seq(Row(1, 1, 2, 0), Row(1, 1, 2, 1)))
     check(lit(2).cast("int"), $"c" =!= 2, Seq())
   }
+
+  test("SPARK-29358: Make unionByName optionally fill missing columns with nulls") {
+    withSQLConf(SQLConf.ALLOW_MISSING_COLUMNS_IN_UNION_BY_NAME.key -> "true") {
+      var df1 = Seq(1, 2, 3).toDF("a")
+      var df2 = Seq(3, 1, 2).toDF("b")
+      val df3 = Seq(2, 3, 1).toDF("c")
+      val unionDf = df1.unionByName(df2.unionByName(df3))

Review comment:
       This shows the behavior when this config is enabled. Yeah, it looks like merging the schemas of different datasets.
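
       To make the schema-merge reading concrete, here is a small sketch of how the output column order could come out for a nested union like the one in the test. It models column names only; `mergeColumns` is a hypothetical helper, and the "left columns first, right-only columns appended" rule is an assumption of the sketch, based on how the test's expected rows are laid out.

```scala
object SchemaMergeSketch {
  // Left columns first, then right-only columns in their original order.
  def mergeColumns(left: Seq[String], right: Seq[String]): Seq[String] =
    left ++ right.filterNot(c => left.contains(c))

  // Mirrors df1.unionByName(df2.unionByName(df3)) for columns "a", "b", "c".
  val nested: Seq[String] =
    mergeColumns(Seq("a"), mergeColumns(Seq("b"), Seq("c")))
}
```

       Under this model the nested union yields columns `a`, `b`, `c`, which matches the `Row(1, null, null)` shape the test expects for `df1`.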






[GitHub] [spark] dongjoon-hyun commented on a change in pull request #28996: [SPARK-29358][SQL] Make unionByName optionally fill missing columns with nulls

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on a change in pull request #28996:
URL: https://github.com/apache/spark/pull/28996#discussion_r453744272



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala
##########
@@ -2048,19 +2088,34 @@ class Dataset[T] private[sql](
     // Builds a project list for `other` based on `logicalPlan` output names
     val rightProjectList = leftOutputAttrs.map { lattr =>
       rightOutputAttrs.find { rattr => resolver(lattr.name, rattr.name) }.getOrElse {
-        throw new AnalysisException(
-          s"""Cannot resolve column name "${lattr.name}" among """ +
-            s"""(${rightOutputAttrs.map(_.name).mkString(", ")})""")
+        if (allowMissingColumns) {

Review comment:
       @cloud-fan, `unionByName` (and its by-name logic) has been here since Apache Spark 2.3.0.
   Shall we proceed with that refactoring suggestion as a separate JIRA?






[GitHub] [spark] AmplabJenkins removed a comment on pull request #28996: [SPARK-29358][SQL] Make unionByName optionally fill missing columns with nulls

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28996:
URL: https://github.com/apache/spark/pull/28996#issuecomment-654767354


   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/125208/
   Test FAILed.




[GitHub] [spark] cloud-fan commented on a change in pull request #28996: [SPARK-29358][SQL] Make unionByName optionally fill missing columns with nulls

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #28996:
URL: https://github.com/apache/spark/pull/28996#discussion_r453434564



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala
##########
@@ -2048,19 +2088,34 @@ class Dataset[T] private[sql](
     // Builds a project list for `other` based on `logicalPlan` output names
     val rightProjectList = leftOutputAttrs.map { lattr =>
       rightOutputAttrs.find { rattr => resolver(lattr.name, rattr.name) }.getOrElse {
-        throw new AnalysisException(
-          s"""Cannot resolve column name "${lattr.name}" among """ +
-            s"""(${rightOutputAttrs.map(_.name).mkString(", ")})""")
+        if (allowMissingColumns) {

Review comment:
       Does it work with nested columns?






[GitHub] [spark] AmplabJenkins removed a comment on pull request #28996: [SPARK-29358][SQL] Make unionByName optionally fill missing columns with nulls

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28996:
URL: https://github.com/apache/spark/pull/28996#issuecomment-654645739


   Merged build finished. Test FAILed.




[GitHub] [spark] maropu commented on pull request #28996: [SPARK-29358][SQL] Make unionByName optionally fill missing columns with nulls

Posted by GitBox <gi...@apache.org>.
maropu commented on pull request #28996:
URL: https://github.com/apache/spark/pull/28996#issuecomment-653953269


   Could you update the API doc, too? If the option is enabled, does the following statement still hold?
   
   >    To do a SQL-style set union (that does deduplication of elements), use this function followed by a [[distinct]].
   




[GitHub] [spark] SparkQA commented on pull request #28996: [SPARK-29358][SQL] Make unionByName optionally fill missing columns with nulls

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #28996:
URL: https://github.com/apache/spark/pull/28996#issuecomment-654009977


   **[Test build #125022 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125022/testReport)** for PR 28996 at commit [`5e4f670`](https://github.com/apache/spark/commit/5e4f67002955fa0536498df6657b5db5b17b0d56).




[GitHub] [spark] maropu commented on a change in pull request #28996: [SPARK-29358][SQL] Make unionByName optionally fill missing columns with nulls

Posted by GitBox <gi...@apache.org>.
maropu commented on a change in pull request #28996:
URL: https://github.com/apache/spark/pull/28996#discussion_r449933102



##########
File path: sql/core/src/test/scala/org/apache/spark/sql/DataFrameSetOperationsSuite.scala
##########
@@ -506,4 +506,23 @@ class DataFrameSetOperationsSuite extends QueryTest with SharedSparkSession {
     check(lit(2).cast("int"), $"c" === 2, Seq(Row(1, 1, 2, 0), Row(1, 1, 2, 1)))
     check(lit(2).cast("int"), $"c" =!= 2, Seq())
   }
+
+  test("SPARK-29358: Make unionByName optionally fill missing columns with nulls") {
+    withSQLConf(SQLConf.ALLOW_MISSING_COLUMNS_IN_UNION_BY_NAME.key -> "true") {
+      var df1 = Seq(1, 2, 3).toDF("a")
+      var df2 = Seq(3, 1, 2).toDF("b")
+      val df3 = Seq(2, 3, 1).toDF("c")
+      val unionDf = df1.unionByName(df2.unionByName(df3))

Review comment:
       Do we need to support the union when there is no common column? The behaviour looks like `mergeSchema` in Parquet.






[GitHub] [spark] AmplabJenkins commented on pull request #28996: [SPARK-29358][SQL] Make unionByName optionally fill missing columns with nulls

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #28996:
URL: https://github.com/apache/spark/pull/28996#issuecomment-656512088








[GitHub] [spark] AmplabJenkins removed a comment on pull request #28996: [SPARK-29358][SQL] Make unionByName optionally fill missing columns with nulls

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28996:
URL: https://github.com/apache/spark/pull/28996#issuecomment-654510513








[GitHub] [spark] shaneknapp commented on pull request #28996: [SPARK-29358][SQL] Make unionByName optionally fill missing columns with nulls

Posted by GitBox <gi...@apache.org>.
shaneknapp commented on pull request #28996:
URL: https://github.com/apache/spark/pull/28996#issuecomment-654508986


   test this please




[GitHub] [spark] viirya commented on a change in pull request #28996: [SPARK-29358][SQL] Make unionByName optionally fill missing columns with nulls

Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #28996:
URL: https://github.com/apache/spark/pull/28996#discussion_r453447856



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala
##########
@@ -2048,19 +2088,34 @@ class Dataset[T] private[sql](
     // Builds a project list for `other` based on `logicalPlan` output names
     val rightProjectList = leftOutputAttrs.map { lattr =>
       rightOutputAttrs.find { rattr => resolver(lattr.name, rattr.name) }.getOrElse {
-        throw new AnalysisException(
-          s"""Cannot resolve column name "${lattr.name}" among """ +
-            s"""(${rightOutputAttrs.map(_.name).mkString(", ")})""")
+        if (allowMissingColumns) {

Review comment:
       No, currently it doesn't.






[GitHub] [spark] gatorsmile commented on a change in pull request #28996: [SPARK-29358][SQL] Make unionByName optionally fill missing columns with nulls

Posted by GitBox <gi...@apache.org>.
gatorsmile commented on a change in pull request #28996:
URL: https://github.com/apache/spark/pull/28996#discussion_r483418810



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala
##########
@@ -2030,7 +2030,47 @@ class Dataset[T] private[sql](
    * @group typedrel
    * @since 2.3.0
    */
-  def unionByName(other: Dataset[T]): Dataset[T] = withSetOperator {
+  def unionByName(other: Dataset[T]): Dataset[T] = unionByName(other, false)
+
+  /**
+   * Returns a new Dataset containing union of rows in this Dataset and another Dataset.
+   *
+   * The difference between this function and [[union]] is that this function
+   * resolves columns by name (not by position).
+   *
+   * When the parameter `allowMissingColumns` is true, this function allows different set
+   * of column names between two Datasets. Missing columns at each side, will be filled with
+   * null values. The missing columns at left Dataset will be added at the end in the schema
+   * of the union result:
+   *
+   * {{{
+   *   val df1 = Seq((1, 2, 3)).toDF("col0", "col1", "col2")
+   *   val df2 = Seq((4, 5, 6)).toDF("col1", "col0", "col3")
+   *   df1.unionByName(df2, true).show
+   *
+   *   // output: "col3" is missing at left df1 and added at the end of schema.
+   *   // +----+----+----+----+
+   *   // |col0|col1|col2|col3|
+   *   // +----+----+----+----+
+   *   // |   1|   2|   3|null|
+   *   // |   5|   4|null|   6|
+   *   // +----+----+----+----+
+   *
+   *   df2.unionByName(df1, true).show
+   *
+   *   // output: "col2" is missing at left df2 and added at the end of schema.
+   *   // +----+----+----+----+
+   *   // |col1|col0|col3|col2|
+   *   // +----+----+----+----+
+   *   // |   4|   5|   6|null|
+   *   // |   2|   1|null|   3|
+   *   // +----+----+----+----+
+   * }}}
+   *
+   * @group typedrel
+   * @since 3.1.0
+   */
+  def unionByName(other: Dataset[T], allowMissingColumns: Boolean): Dataset[T] = withSetOperator {

Review comment:
       This is a good beginner task for new contributors.






[GitHub] [spark] SparkQA commented on pull request #28996: [SPARK-29358][SQL] Make unionByName optionally fill missing columns with nulls

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #28996:
URL: https://github.com/apache/spark/pull/28996#issuecomment-656472034


   **[Test build #125551 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125551/testReport)** for PR 28996 at commit [`df4e8dc`](https://github.com/apache/spark/commit/df4e8dc6a4bed3959b4317e3ff39da9f8aef5548).




[GitHub] [spark] cloud-fan commented on a change in pull request #28996: [SPARK-29358][SQL] Make unionByName optionally fill missing columns with nulls

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #28996:
URL: https://github.com/apache/spark/pull/28996#discussion_r453460284



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala
##########
@@ -2048,19 +2088,34 @@ class Dataset[T] private[sql](
     // Builds a project list for `other` based on `logicalPlan` output names
     val rightProjectList = leftOutputAttrs.map { lattr =>
       rightOutputAttrs.find { rattr => resolver(lattr.name, rattr.name) }.getOrElse {
-        throw new AnalysisException(
-          s"""Cannot resolve column name "${lattr.name}" among """ +
-            s"""(${rightOutputAttrs.map(_.name).mkString(", ")})""")
+        if (allowMissingColumns) {

Review comment:
       I think the main problem here is that we put the by-name logic in the API method, not in the `Analyzer`. Shall we add two boolean parameters (`byName` and `allowMissingCol`) to `Union` and move the by-name logic into the type coercion rules?
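
For readers following the hunk above, the lookup under discussion can be modeled without Spark, wherever it ends up living (API method or analyzer rule). The following is a toy sketch over plain column names, not Catalyst code: `ByNameProjection`, `rightProjection`, and the `resolver` function argument are hypothetical names chosen for illustration, and `None` stands in for "fill this column with null".

```scala
object ByNameProjection {
  // For each left column, find the matching right column by name.
  // Some(name) means "project this right column"; None means "fill with null".
  // `resolver` is a hypothetical stand-in for the name-matching function.
  def rightProjection(
      left: Seq[String],
      right: Seq[String],
      allowMissingColumns: Boolean,
      resolver: (String, String) => Boolean): Seq[Option[String]] =
    left.map { l =>
      right.find(r => resolver(l, r)) match {
        case found @ Some(_) => found
        case None if allowMissingColumns => None
        case None =>
          throw new IllegalArgumentException(
            s"""Cannot resolve column name "$l" among (${right.mkString(", ")})""")
      }
    }
}
```

With `allowMissingColumns` set to false the sketch keeps the strict behavior and fails on the first unresolved name; with it set to true the unresolved positions are marked for null filling instead.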






[GitHub] [spark] viirya commented on pull request #28996: [SPARK-29358][SQL] Make unionByName optionally fill missing columns with nulls

Posted by GitBox <gi...@apache.org>.
viirya commented on pull request #28996:
URL: https://github.com/apache/spark/pull/28996#issuecomment-653815728


   @marmbrus @HyukjinKwon @cloud-fan 




[GitHub] [spark] AmplabJenkins commented on pull request #28996: [SPARK-29358][SQL] Make unionByName optionally fill missing columns with nulls

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #28996:
URL: https://github.com/apache/spark/pull/28996#issuecomment-653932114








[GitHub] [spark] AmplabJenkins commented on pull request #28996: [SPARK-29358][SQL] Make unionByName optionally fill missing columns with nulls

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #28996:
URL: https://github.com/apache/spark/pull/28996#issuecomment-654767345








[GitHub] [spark] viirya commented on a change in pull request #28996: [SPARK-29358][SQL] Make unionByName optionally fill missing columns with nulls

Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #28996:
URL: https://github.com/apache/spark/pull/28996#discussion_r453744066



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala
##########
@@ -2048,19 +2088,34 @@ class Dataset[T] private[sql](
     // Builds a project list for `other` based on `logicalPlan` output names
     val rightProjectList = leftOutputAttrs.map { lattr =>
       rightOutputAttrs.find { rattr => resolver(lattr.name, rattr.name) }.getOrElse {
-        throw new AnalysisException(
-          s"""Cannot resolve column name "${lattr.name}" among """ +
-            s"""(${rightOutputAttrs.map(_.name).mkString(", ")})""")
+        if (allowMissingColumns) {

Review comment:
       Ok. I will do it in another PR.






[GitHub] [spark] viirya commented on a change in pull request #28996: [SPARK-29358][SQL] Make unionByName optionally fill missing columns with nulls

Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #28996:
URL: https://github.com/apache/spark/pull/28996#discussion_r453215942



##########
File path: sql/core/src/test/scala/org/apache/spark/sql/DataFrameSetOperationsSuite.scala
##########
@@ -506,4 +506,23 @@ class DataFrameSetOperationsSuite extends QueryTest with SharedSparkSession {
     check(lit(2).cast("int"), $"c" === 2, Seq(Row(1, 1, 2, 0), Row(1, 1, 2, 1)))
     check(lit(2).cast("int"), $"c" =!= 2, Seq())
   }
+
+  test("SPARK-29358: Make unionByName optionally fill missing columns with nulls") {
+    var df1 = Seq(1, 2, 3).toDF("a")
+    var df2 = Seq(3, 1, 2).toDF("b")
+    val df3 = Seq(2, 3, 1).toDF("c")
+    val unionDf = df1.unionByName(df2.unionByName(df3, true), true)
+    checkAnswer(unionDf,
+      Row(1, null, null) :: Row(2, null, null) :: Row(3, null, null) :: // df1
+        Row(null, 3, null) :: Row(null, 1, null) :: Row(null, 2, null) :: // df2
+        Row(null, null, 2) :: Row(null, null, 3) :: Row(null, null, 1) :: Nil // df3
+    )
+
+    df1 = Seq((1, 2)).toDF("a", "c")
+    df2 = Seq((3, 4, 5)).toDF("a", "b", "c")
+    checkAnswer(df1.unionByName(df2, true),
+      Row(1, 2, null) :: Row(3, 5, 4) :: Nil)
+    checkAnswer(df2.unionByName(df1, true),
+      Row(3, 4, 5) :: Row(1, null, 2) :: Nil)

Review comment:
       sure.






[GitHub] [spark] viirya commented on a change in pull request #28996: [SPARK-29358][SQL] Make unionByName optionally fill missing columns with nulls

Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #28996:
URL: https://github.com/apache/spark/pull/28996#discussion_r453000366



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala
##########
@@ -2030,7 +2030,25 @@ class Dataset[T] private[sql](
    * @group typedrel
    * @since 2.3.0
    */
-  def unionByName(other: Dataset[T]): Dataset[T] = withSetOperator {
+  def unionByName(other: Dataset[T]): Dataset[T] = unionByName(other, false)
+
+  /**
+   * Returns a new Dataset containing union of rows in this Dataset and another Dataset.
+   *
+   * This is different from both `UNION ALL` and `UNION DISTINCT` in SQL. To do a SQL-style set
+   * union (that does deduplication of elements), use this function followed by a [[distinct]].

Review comment:
       Actually, the doc of the original `unionByName` has this section too:
   
   > This is different from both `UNION ALL` and `UNION DISTINCT` in SQL. To do a SQL-style set
   > union (that does deduplication of elements), use this function followed by a [[distinct]].
   
   Re-reading this doc, I find it a bit confusing even with the original `unionByName` behavior. Do you think we should remove "To do a SQL-style set union (that does deduplication of elements), use this function followed by a [[distinct]]."?
   






[GitHub] [spark] AmplabJenkins commented on pull request #28996: [SPARK-29358][SQL] Make unionByName optionally fill missing columns with nulls

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #28996:
URL: https://github.com/apache/spark/pull/28996#issuecomment-654507324








[GitHub] [spark] viirya commented on pull request #28996: [SPARK-29358][SQL] Make unionByName optionally fill missing columns with nulls

Posted by GitBox <gi...@apache.org>.
viirya commented on pull request #28996:
URL: https://github.com/apache/spark/pull/28996#issuecomment-654949758


   retest this please




[GitHub] [spark] viirya commented on a change in pull request #28996: [SPARK-29358][SQL] Make unionByName optionally fill missing columns with nulls

Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #28996:
URL: https://github.com/apache/spark/pull/28996#discussion_r483419615



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala
##########
@@ -2030,7 +2030,47 @@ class Dataset[T] private[sql](
    * @group typedrel
    * @since 2.3.0
    */
-  def unionByName(other: Dataset[T]): Dataset[T] = withSetOperator {
+  def unionByName(other: Dataset[T]): Dataset[T] = unionByName(other, false)
+
+  /**
+   * Returns a new Dataset containing union of rows in this Dataset and another Dataset.
+   *
+   * The difference between this function and [[union]] is that this function
+   * resolves columns by name (not by position).
+   *
+   * When the parameter `allowMissingColumns` is true, this function allows different set
+   * of column names between two Datasets. Missing columns at each side, will be filled with
+   * null values. The missing columns at left Dataset will be added at the end in the schema
+   * of the union result:
+   *
+   * {{{
+   *   val df1 = Seq((1, 2, 3)).toDF("col0", "col1", "col2")
+   *   val df2 = Seq((4, 5, 6)).toDF("col1", "col0", "col3")
+   *   df1.unionByName(df2, true).show
+   *
+   *   // output: "col3" is missing at left df1 and added at the end of schema.
+   *   // +----+----+----+----+
+   *   // |col0|col1|col2|col3|
+   *   // +----+----+----+----+
+   *   // |   1|   2|   3|null|
+   *   // |   5|   4|null|   6|
+   *   // +----+----+----+----+
+   *
+   *   df2.unionByName(df1, true).show
+   *
+   *   // output: "col2" is missing at left df2 and added at the end of schema.
+   *   // +----+----+----+----+
+   *   // |col1|col0|col3|col2|
+   *   // +----+----+----+----+
+   *   // |   4|   5|   6|null|
+   *   // |   2|   1|null|   3|
+   *   // +----+----+----+----+
+   * }}}
+   *
+   * @group typedrel
+   * @since 3.1.0
+   */
+  def unionByName(other: Dataset[T], allowMissingColumns: Boolean): Dataset[T] = withSetOperator {

Review comment:
       I should create a follow-up PR for Python and R, but it would also be fine as a beginner task.






[GitHub] [spark] AmplabJenkins commented on pull request #28996: [SPARK-29358][SQL] Make unionByName optionally fill missing columns with nulls

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #28996:
URL: https://github.com/apache/spark/pull/28996#issuecomment-656753660








[GitHub] [spark] AmplabJenkins commented on pull request #28996: [SPARK-29358][SQL] Make unionByName optionally fill missing columns with nulls

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #28996:
URL: https://github.com/apache/spark/pull/28996#issuecomment-656509472








[GitHub] [spark] SparkQA removed a comment on pull request #28996: [SPARK-29358][SQL] Make unionByName optionally fill missing columns with nulls

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #28996:
URL: https://github.com/apache/spark/pull/28996#issuecomment-656753866


   **[Test build #125625 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125625/testReport)** for PR 28996 at commit [`e2311fa`](https://github.com/apache/spark/commit/e2311fafc171ad47aaee8bcd98e1cd0e6c745016).




[GitHub] [spark] AmplabJenkins commented on pull request #28996: [SPARK-29358][SQL] Make unionByName optionally fill missing columns with nulls

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #28996:
URL: https://github.com/apache/spark/pull/28996#issuecomment-656943786








[GitHub] [spark] SparkQA commented on pull request #28996: [SPARK-29358][SQL] Make unionByName optionally fill missing columns with nulls

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #28996:
URL: https://github.com/apache/spark/pull/28996#issuecomment-654314022


   **[Test build #125098 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125098/testReport)** for PR 28996 at commit [`5e4f670`](https://github.com/apache/spark/commit/5e4f67002955fa0536498df6657b5db5b17b0d56).




[GitHub] [spark] viirya commented on a change in pull request #28996: [SPARK-29358][SQL] Make unionByName optionally fill missing columns with nulls

Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #28996:
URL: https://github.com/apache/spark/pull/28996#discussion_r450312370



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
##########
@@ -2656,6 +2656,14 @@ object SQLConf {
       .checkValue(_ > 0, "The difference must be positive.")
       .createWithDefault(4)
 
+  val ALLOW_MISSING_COLUMNS_IN_UNION_BY_NAME =
+    buildConf("spark.sql.allowMissingColumnsInUnionByName")
+    .doc("If this config is enabled, `Dataset.unionByName` allows different set of column names " +
+      "between two Datasets. Missing columns at each side, will be filled with null values.")

Review comment:
       Adding a boolean parameter sounds better than a new method `unionByNameAllowMissing`, because I guess we are more conservative about adding new methods to `Dataset`. But I am concerned about calling a method with a default parameter from Java.
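
The Java concern comes from how Scala compiles default arguments: they become synthetic helper methods, so a Java caller sees only the full-arity method and has to pass every argument. Explicit overloads avoid this, which is the shape the `Dataset.scala` hunk quoted elsewhere in this thread uses (`unionByName(other)` forwarding to `unionByName(other, false)`). A minimal sketch of that pattern in plain Scala follows; `OverloadSketch`, `Columns`, and `unionNames` are hypothetical names and this is not Spark code.

```scala
object OverloadSketch {
  final class Columns(val names: Seq[String]) {
    // One-argument overload: callable from Java without supplying the flag.
    def unionNames(other: Columns): Seq[String] = unionNames(other, false)

    // Two-argument overload: returns the column names of the by-name union,
    // appending names that exist only in `other` at the end.
    def unionNames(other: Columns, allowMissing: Boolean): Seq[String] = {
      val onlyInOther = other.names.filterNot(names.contains)
      val onlyInThis = names.filterNot(other.names.contains)
      require(
        allowMissing || (onlyInOther.isEmpty && onlyInThis.isEmpty),
        "column sets differ and allowMissing is false")
      names ++ onlyInOther
    }
  }
}
```

Both overloads are ordinary methods in the compiled class, so Java and Scala callers can use either one directly.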






[GitHub] [spark] AmplabJenkins removed a comment on pull request #28996: [SPARK-29358][SQL] Make unionByName optionally fill missing columns with nulls

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28996:
URL: https://github.com/apache/spark/pull/28996#issuecomment-656525585








[GitHub] [spark] AmplabJenkins commented on pull request #28996: [SPARK-29358][SQL] Make unionByName optionally fill missing columns with nulls

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #28996:
URL: https://github.com/apache/spark/pull/28996#issuecomment-654262786








[GitHub] [spark] cloud-fan commented on a change in pull request #28996: [SPARK-29358][SQL] Make unionByName optionally fill missing columns with nulls

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #28996:
URL: https://github.com/apache/spark/pull/28996#discussion_r451359616



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
##########
@@ -2656,6 +2656,14 @@ object SQLConf {
       .checkValue(_ > 0, "The difference must be positive.")
       .createWithDefault(4)
 
+  val ALLOW_MISSING_COLUMNS_IN_UNION_BY_NAME =
+    buildConf("spark.sql.allowMissingColumnsInUnionByName")
+    .doc("If this config is enabled, `Dataset.unionByName` allows different set of column names " +
+      "between two Datasets. Missing columns at each side, will be filled with null values.")

Review comment:
       Seems like `Dataset` already has many APIs taking a boolean parameter. I'm OK with adding an `allowMissingColumns` parameter to `union`.






[GitHub] [spark] SparkQA commented on pull request #28996: [SPARK-29358][SQL] Make unionByName optionally fill missing columns with nulls

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #28996:
URL: https://github.com/apache/spark/pull/28996#issuecomment-656469396


   **[Test build #125526 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125526/testReport)** for PR 28996 at commit [`df4e8dc`](https://github.com/apache/spark/commit/df4e8dc6a4bed3959b4317e3ff39da9f8aef5548).
    * This patch **fails Spark unit tests**.
    * This patch merges cleanly.
    * This patch adds no public classes.




[GitHub] [spark] maropu commented on pull request #28996: [SPARK-29358][SQL] Make unionByName optionally fill missing columns with nulls

Posted by GitBox <gi...@apache.org>.
maropu commented on pull request #28996:
URL: https://github.com/apache/spark/pull/28996#issuecomment-653881086


   Is this syntactic sugar for `df1.unionByName(df2.withColumn("c", lit(null)))`? The fused operation does not look like a SQL union, so how about adding a new API if this is useful for users?
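
To make the comparison concrete, here is a small sketch that builds both forms side by side. It assumes Spark 3.1.0 or later (where the two-argument `unionByName` exists) and a `SparkSession` supplied by the caller; `ManualVersusBuiltIn` and `both` are hypothetical names, and the explicit `cast("int")` on the null literal is an addition for this example that is not part of the expression in the question.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.lit

object ManualVersusBuiltIn {
  // Returns (manually padded union, union with allowMissingColumns = true).
  def both(spark: SparkSession): (DataFrame, DataFrame) = {
    import spark.implicits._
    val df1 = Seq((1, 2)).toDF("a", "c")
    val df2 = Seq(3).toDF("a")

    // Manual form: add the missing column to df2 as a typed null first.
    val manual = df1.unionByName(df2.withColumn("c", lit(null).cast("int")))

    // Built-in form: the second argument is allowMissingColumns.
    val builtIn = df1.unionByName(df2, true)

    (manual, builtIn)
  }
}
```

In this one-sided case both forms produce the same rows. The manual form requires the caller to know which side lacks which columns, whereas the flag pads whichever side is missing a column.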




[GitHub] [spark] SparkQA commented on pull request #28996: [SPARK-29358][SQL] Make unionByName optionally fill missing columns with nulls

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #28996:
URL: https://github.com/apache/spark/pull/28996#issuecomment-654510421


   **[Test build #125141 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125141/testReport)** for PR 28996 at commit [`5e4f670`](https://github.com/apache/spark/commit/5e4f67002955fa0536498df6657b5db5b17b0d56).




[GitHub] [spark] viirya edited a comment on pull request #28996: [SPARK-29358][SQL] Make unionByName optionally fill missing columns with nulls

Posted by GitBox <gi...@apache.org>.
viirya edited a comment on pull request #28996:
URL: https://github.com/apache/spark/pull/28996#issuecomment-654004956


   > > To do a SQL-style set union (that does deduplication of elements), use this function followed by a [[distinct]].
   
   Read together with the previous sentence, I think the doc means that this API doesn't deduplicate elements. The doc explains that this API resolves columns by name, not by position like `union`. This config doesn't change that behavior.
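
A short sketch of the two points made here, by-name resolution and no deduplication, assuming a `SparkSession` supplied by the caller; `UnionByNameSemantics` and `rowCounts` are hypothetical names used only for this illustration.

```scala
import org.apache.spark.sql.SparkSession

object UnionByNameSemantics {
  // Returns (row count after unionByName, row count after unionByName + distinct).
  def rowCounts(spark: SparkSession): (Long, Long) = {
    import spark.implicits._
    val left = Seq((1, "x")).toDF("id", "tag")
    // Same logical row, but with the columns declared in a different order.
    val right = Seq(("x", 1)).toDF("tag", "id")

    // Columns are matched by name, so both rows end up as (id = 1, tag = "x").
    val unioned = left.unionByName(right)

    // The duplicate row is kept until distinct() is applied.
    (unioned.count(), unioned.distinct().count())
  }
}
```

The first count includes the duplicate row and the second does not, which is the "use this function followed by a distinct" advice from the doc.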
   
   




[GitHub] [spark] viirya commented on a change in pull request #28996: [SPARK-29358][SQL] Make unionByName optionally fill missing columns with nulls

Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #28996:
URL: https://github.com/apache/spark/pull/28996#discussion_r450312370



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
##########
@@ -2656,6 +2656,14 @@ object SQLConf {
       .checkValue(_ > 0, "The difference must be positive.")
       .createWithDefault(4)
 
+  val ALLOW_MISSING_COLUMNS_IN_UNION_BY_NAME =
+    buildConf("spark.sql.allowMissingColumnsInUnionByName")
+    .doc("If this config is enabled, `Dataset.unionByName` allows different set of column names " +
+      "between two Datasets. Missing columns at each side, will be filled with null values.")

Review comment:
       Adding a boolean parameter sounds better than a new method `unionByNameAllowMissing`. But I am concerned about calling a method with a default parameter from Java.






[GitHub] [spark] viirya commented on pull request #28996: [SPARK-29358][SQL] Make unionByName optionally fill missing columns with nulls

Posted by GitBox <gi...@apache.org>.
viirya commented on pull request #28996:
URL: https://github.com/apache/spark/pull/28996#issuecomment-654311560


   retest this please




[GitHub] [spark] viirya commented on a change in pull request #28996: [SPARK-29358][SQL] Make unionByName optionally fill missing columns with nulls

Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #28996:
URL: https://github.com/apache/spark/pull/28996#discussion_r449904117



##########
File path: sql/core/src/test/scala/org/apache/spark/sql/DataFrameSetOperationsSuite.scala
##########
@@ -506,4 +506,23 @@ class DataFrameSetOperationsSuite extends QueryTest with SharedSparkSession {
     check(lit(2).cast("int"), $"c" === 2, Seq(Row(1, 1, 2, 0), Row(1, 1, 2, 1)))
     check(lit(2).cast("int"), $"c" =!= 2, Seq())
   }
+
+  test("SPARK-29358: Make unionByName optionally fill missing columns with nulls") {
+    withSQLConf(SQLConf.ALLOW_MISSING_COLUMNS_IN_UNION_BY_NAME.key -> "true") {
+      var df1 = Seq(1, 2, 3).toDF("a")
+      var df2 = Seq(3, 1, 2).toDF("b")
+      val df3 = Seq(2, 3, 1).toDF("c")
+      val unionDf = df1.unionByName(df2.unionByName(df3))
+      checkAnswer(unionDf,
+        Row(1, null, null) :: Row(2, null, null) :: Row(3, null, null) :: // df1
+          Row(null, 3, null) :: Row(null, 1, null) :: Row(null, 2, null) :: // df2
+          Row(null, null, 2) :: Row(null, null, 3) :: Row(null, null, 1) :: Nil // df3
+      )
+
+      df1 = Seq((1, 2)).toDF("a", "c")
+      df2 = Seq((3, 4, 5)).toDF("a", "b", "c")
+      checkAnswer(df1.unionByName(df2),

Review comment:
       `df2.unionByName(df1)` is also okay.






[GitHub] [spark] AmplabJenkins commented on pull request #28996: [SPARK-29358][SQL] Make unionByName optionally fill missing columns with nulls

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #28996:
URL: https://github.com/apache/spark/pull/28996#issuecomment-657102194








[GitHub] [spark] AmplabJenkins removed a comment on pull request #28996: [SPARK-29358][SQL] Make unionByName optionally fill missing columns with nulls

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28996:
URL: https://github.com/apache/spark/pull/28996#issuecomment-657102194








[GitHub] [spark] HyukjinKwon commented on pull request #28996: [SPARK-29358][SQL] Make unionByName optionally fill missing columns with nulls

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on pull request #28996:
URL: https://github.com/apache/spark/pull/28996#issuecomment-656471541


   retest this please




[GitHub] [spark] AmplabJenkins removed a comment on pull request #28996: [SPARK-29358][SQL] Make unionByName optionally fill missing columns with nulls

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28996:
URL: https://github.com/apache/spark/pull/28996#issuecomment-656745273








[GitHub] [spark] SparkQA commented on pull request #28996: [SPARK-29358][SQL] Make unionByName optionally fill missing columns with nulls

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #28996:
URL: https://github.com/apache/spark/pull/28996#issuecomment-654259977


   **[Test build #125038 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125038/testReport)** for PR 28996 at commit [`5e4f670`](https://github.com/apache/spark/commit/5e4f67002955fa0536498df6657b5db5b17b0d56).
    * This patch **fails Spark unit tests**.
    * This patch merges cleanly.
    * This patch adds no public classes.




[GitHub] [spark] AmplabJenkins removed a comment on pull request #28996: [SPARK-29358][SQL] Make unionByName optionally fill missing columns with nulls

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28996:
URL: https://github.com/apache/spark/pull/28996#issuecomment-654508634








[GitHub] [spark] AmplabJenkins removed a comment on pull request #28996: [SPARK-29358][SQL] Make unionByName optionally fill missing columns with nulls

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28996:
URL: https://github.com/apache/spark/pull/28996#issuecomment-654367846


   Merged build finished. Test FAILed.




[GitHub] [spark] AmplabJenkins commented on pull request #28996: [SPARK-29358][SQL] Make unionByName optionally fill missing columns with nulls

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #28996:
URL: https://github.com/apache/spark/pull/28996#issuecomment-657143774








[GitHub] [spark] HyukjinKwon commented on a change in pull request #28996: [SPARK-29358][SQL] Make unionByName optionally fill missing columns with nulls

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on a change in pull request #28996:
URL: https://github.com/apache/spark/pull/28996#discussion_r483511215



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala
##########
@@ -2030,7 +2030,47 @@ class Dataset[T] private[sql](
    * @group typedrel
    * @since 2.3.0
    */
-  def unionByName(other: Dataset[T]): Dataset[T] = withSetOperator {
+  def unionByName(other: Dataset[T]): Dataset[T] = unionByName(other, false)
+
+  /**
+   * Returns a new Dataset containing union of rows in this Dataset and another Dataset.
+   *
+   * The difference between this function and [[union]] is that this function
+   * resolves columns by name (not by position).
+   *
+   * When the parameter `allowMissingColumns` is true, this function allows different set
+   * of column names between two Datasets. Missing columns at each side, will be filled with
+   * null values. The missing columns at left Dataset will be added at the end in the schema
+   * of the union result:
+   *
+   * {{{
+   *   val df1 = Seq((1, 2, 3)).toDF("col0", "col1", "col2")
+   *   val df2 = Seq((4, 5, 6)).toDF("col1", "col0", "col3")
+   *   df1.unionByName(df2, true).show
+   *
+   *   // output: "col3" is missing at left df1 and added at the end of schema.
+   *   // +----+----+----+----+
+   *   // |col0|col1|col2|col3|
+   *   // +----+----+----+----+
+   *   // |   1|   2|   3|null|
+   *   // |   5|   4|null|   6|
+   *   // +----+----+----+----+
+   *
+   *   df2.unionByName(df1, true).show
+   *
+   *   // output: "col2" is missing at left df2 and added at the end of schema.
+   *   // +----+----+----+----+
+   *   // |col1|col0|col3|col2|
+   *   // +----+----+----+----+
+   *   // |   4|   5|   6|null|
+   *   // |   2|   1|null|   3|
+   *   // +----+----+----+----+
+   * }}}
+   *
+   * @group typedrel
+   * @since 3.1.0
+   */
+  def unionByName(other: Dataset[T], allowMissingColumns: Boolean): Dataset[T] = withSetOperator {

Review comment:
       I filed SPARK-32798 and SPARK-32799.






[GitHub] [spark] AmplabJenkins removed a comment on pull request #28996: [SPARK-29358][SQL] Make unionByName optionally fill missing columns with nulls

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28996:
URL: https://github.com/apache/spark/pull/28996#issuecomment-654645750


   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/125141/
   Test FAILed.




[GitHub] [spark] maropu commented on pull request #28996: [SPARK-29358][SQL] Make unionByName optionally fill missing columns with nulls

Posted by GitBox <gi...@apache.org>.
maropu commented on pull request #28996:
URL: https://github.com/apache/spark/pull/28996#issuecomment-654056460


   retest this please




[GitHub] [spark] AmplabJenkins commented on pull request #28996: [SPARK-29358][SQL] Make unionByName optionally fill missing columns with nulls

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #28996:
URL: https://github.com/apache/spark/pull/28996#issuecomment-654760111








[GitHub] [spark] AmplabJenkins removed a comment on pull request #28996: [SPARK-29358][SQL] Make unionByName optionally fill missing columns with nulls

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28996:
URL: https://github.com/apache/spark/pull/28996#issuecomment-654367852


   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/125098/
   Test FAILed.




[GitHub] [spark] HyukjinKwon commented on pull request #28996: [SPARK-29358][SQL] Make unionByName optionally fill missing columns with nulls

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on pull request #28996:
URL: https://github.com/apache/spark/pull/28996#issuecomment-654756926


   retest this please




[GitHub] [spark] AmplabJenkins commented on pull request #28996: [SPARK-29358][SQL] Make unionByName optionally fill missing columns with nulls

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #28996:
URL: https://github.com/apache/spark/pull/28996#issuecomment-656745260








[GitHub] [spark] dongjoon-hyun commented on a change in pull request #28996: [SPARK-29358][SQL] Make unionByName optionally fill missing columns with nulls

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on a change in pull request #28996:
URL: https://github.com/apache/spark/pull/28996#discussion_r453211013



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala
##########
@@ -2030,7 +2030,22 @@ class Dataset[T] private[sql](
    * @group typedrel
    * @since 2.3.0
    */
-  def unionByName(other: Dataset[T]): Dataset[T] = withSetOperator {
+  def unionByName(other: Dataset[T]): Dataset[T] = unionByName(other, false)
+
+  /**
+   * Returns a new Dataset containing union of rows in this Dataset and another Dataset.
+   *
+   * The difference between this function and [[union]] is that this function
+   * resolves columns by name (not by position).
+   *
+   * When the parameter `allowMissingColumns` is true, this function allows different set
+   * of column names between two Datasets. Missing columns at each side, will be filled with

Review comment:
       It's worth documenting the order sensitivity a little more. Previously it was simple, because the result followed the schema of the original Dataset (the left side). With the new option, the number of missing columns added at the end is determined by `other` (the right side).
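
The ordering rule described here can be sketched in plain Scala without Spark. The names `resultSchema`, `Row`, and this `unionByName` helper are hypothetical and only model the behaviour shown in the Scaladoc example from the diff: the left columns keep their order, the right-only columns are appended, and absent values are filled with a null placeholder (`None` here).

```scala
object UnionOrderSketch {
  // A row is modelled as column name -> value; an absent column becomes None.
  type Row = Map[String, Int]

  // Result schema: all left columns in order, then right-only columns in right order.
  def resultSchema(left: Seq[String], right: Seq[String]): Seq[String] =
    left ++ right.filterNot(left.contains)

  // Project every row onto the result schema, filling absent columns with None.
  def unionByName(
      leftCols: Seq[String], leftRows: Seq[Row],
      rightCols: Seq[String], rightRows: Seq[Row]): (Seq[String], Seq[Seq[Option[Int]]]) = {
    val schema = resultSchema(leftCols, rightCols)
    val rows = (leftRows ++ rightRows).map(row => schema.map(row.get))
    (schema, rows)
  }
}
```

For columns `col0, col1, col2` on the left and `col1, col0, col3` on the right, the sketch yields the schema `col0, col1, col2, col3`; swapping the two sides yields `col1, col0, col3, col2`, which is why the result is order sensitive.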






[GitHub] [spark] AmplabJenkins commented on pull request #28996: [SPARK-29358][SQL] Make unionByName optionally fill missing columns with nulls

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #28996:
URL: https://github.com/apache/spark/pull/28996#issuecomment-657101250








[GitHub] [spark] SparkQA commented on pull request #28996: [SPARK-29358][SQL] Make unionByName optionally fill missing columns with nulls

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #28996:
URL: https://github.com/apache/spark/pull/28996#issuecomment-657101103


   **[Test build #125687 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125687/testReport)** for PR 28996 at commit [`f0bf462`](https://github.com/apache/spark/commit/f0bf462556af20d416b85a2882a89ffef873ad80).




[GitHub] [spark] viirya commented on pull request #28996: [SPARK-29358][SQL] Make unionByName optionally fill missing columns with nulls

Posted by GitBox <gi...@apache.org>.
viirya commented on pull request #28996:
URL: https://github.com/apache/spark/pull/28996#issuecomment-654508185


   retest this please...




[GitHub] [spark] AmplabJenkins removed a comment on pull request #28996: [SPARK-29358][SQL] Make unionByName optionally fill missing columns with nulls

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28996:
URL: https://github.com/apache/spark/pull/28996#issuecomment-654767345


   Merged build finished. Test FAILed.




[GitHub] [spark] SparkQA commented on pull request #28996: [SPARK-29358][SQL] Make unionByName optionally fill missing columns with nulls

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #28996:
URL: https://github.com/apache/spark/pull/28996#issuecomment-656942423


   **[Test build #125625 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125625/testReport)** for PR 28996 at commit [`e2311fa`](https://github.com/apache/spark/commit/e2311fafc171ad47aaee8bcd98e1cd0e6c745016).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.




[GitHub] [spark] dongjoon-hyun commented on a change in pull request #28996: [SPARK-29358][SQL] Make unionByName optionally fill missing columns with nulls

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on a change in pull request #28996:
URL: https://github.com/apache/spark/pull/28996#discussion_r453211198



##########
File path: sql/core/src/test/scala/org/apache/spark/sql/DataFrameSetOperationsSuite.scala
##########
@@ -506,4 +506,23 @@ class DataFrameSetOperationsSuite extends QueryTest with SharedSparkSession {
     check(lit(2).cast("int"), $"c" === 2, Seq(Row(1, 1, 2, 0), Row(1, 1, 2, 1)))
     check(lit(2).cast("int"), $"c" =!= 2, Seq())
   }
+
+  test("SPARK-29358: Make unionByName optionally fill missing columns with nulls") {
+    var df1 = Seq(1, 2, 3).toDF("a")
+    var df2 = Seq(3, 1, 2).toDF("b")
+    val df3 = Seq(2, 3, 1).toDF("c")
+    val unionDf = df1.unionByName(df2.unionByName(df3, true), true)
+    checkAnswer(unionDf,
+      Row(1, null, null) :: Row(2, null, null) :: Row(3, null, null) :: // df1
+        Row(null, 3, null) :: Row(null, 1, null) :: Row(null, 2, null) :: // df2
+        Row(null, null, 2) :: Row(null, null, 3) :: Row(null, null, 1) :: Nil // df3
+    )
+
+    df1 = Seq((1, 2)).toDF("a", "c")
+    df2 = Seq((3, 4, 5)).toDF("a", "b", "c")
+    checkAnswer(df1.unionByName(df2, true),
+      Row(1, 2, null) :: Row(3, 5, 4) :: Nil)
+    checkAnswer(df2.unionByName(df1, true),
+      Row(3, 4, 5) :: Row(1, null, 2) :: Nil)

Review comment:
       @viirya, can we have both case-sensitive and case-insensitive test coverage?
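
Why both modes matter can be sketched in plain Scala. The `Resolver` type below mirrors the shape of the `resolver(lattr.name, rattr.name)` call in the diff, but the two resolver values and the `missingOnLeft` helper are hypothetical illustrations, not Spark's code. The assumption is that name matching follows the session's case-sensitivity setting (`spark.sql.caseSensitive`).

```scala
object ResolverSketch {
  // Same shape as the resolver in the diff: decides whether two names match.
  type Resolver = (String, String) => Boolean

  val caseSensitive: Resolver = (a, b) => a == b
  val caseInsensitive: Resolver = (a, b) => a.equalsIgnoreCase(b)

  // Right-side columns that have no match on the left under the given resolver.
  def missingOnLeft(left: Seq[String], right: Seq[String], resolver: Resolver): Seq[String] =
    right.filterNot(r => left.exists(l => resolver(l, r)))
}
```

With left columns `a, B` and right columns `A, b, c`, the case-insensitive resolver reports only `c` as missing, while the case-sensitive one reports `A`, `b`, and `c`, so the two modes produce different result schemas and need separate tests.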






[GitHub] [spark] AmplabJenkins removed a comment on pull request #28996: [SPARK-29358][SQL] Make unionByName optionally fill missing columns with nulls

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28996:
URL: https://github.com/apache/spark/pull/28996#issuecomment-654507330


   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/125111/
   Test FAILed.




[GitHub] [spark] AmplabJenkins commented on pull request #28996: [SPARK-29358][SQL] Make unionByName optionally fill missing columns with nulls

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #28996:
URL: https://github.com/apache/spark/pull/28996#issuecomment-654383278








[GitHub] [spark] SparkQA removed a comment on pull request #28996: [SPARK-29358][SQL] Make unionByName optionally fill missing columns with nulls

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #28996:
URL: https://github.com/apache/spark/pull/28996#issuecomment-654009977


   **[Test build #125022 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125022/testReport)** for PR 28996 at commit [`5e4f670`](https://github.com/apache/spark/commit/5e4f67002955fa0536498df6657b5db5b17b0d56).




[GitHub] [spark] viirya commented on pull request #28996: [SPARK-29358][SQL] Make unionByName optionally fill missing columns with nulls

Posted by GitBox <gi...@apache.org>.
viirya commented on pull request #28996:
URL: https://github.com/apache/spark/pull/28996#issuecomment-656511207


   retest this please




[GitHub] [spark] AmplabJenkins commented on pull request #28996: [SPARK-29358][SQL] Make unionByName optionally fill missing columns with nulls

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #28996:
URL: https://github.com/apache/spark/pull/28996#issuecomment-654953315








[GitHub] [spark] cloud-fan commented on pull request #28996: [SPARK-29358][SQL] Make unionByName optionally fill missing columns with nulls

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on pull request #28996:
URL: https://github.com/apache/spark/pull/28996#issuecomment-656532102


   Shall we add the same API to PySpark and R?




[GitHub] [spark] cloud-fan commented on a change in pull request #28996: [SPARK-29358][SQL] Make unionByName optionally fill missing columns with nulls

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #28996:
URL: https://github.com/apache/spark/pull/28996#discussion_r453777434



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala
##########
@@ -2048,19 +2088,34 @@ class Dataset[T] private[sql](
     // Builds a project list for `other` based on `logicalPlan` output names
     val rightProjectList = leftOutputAttrs.map { lattr =>
       rightOutputAttrs.find { rattr => resolver(lattr.name, rattr.name) }.getOrElse {
-        throw new AnalysisException(
-          s"""Cannot resolve column name "${lattr.name}" among """ +
-            s"""(${rightOutputAttrs.map(_.name).mkString(", ")})""")
+        if (allowMissingColumns) {

Review comment:
       Yeah, it's better to have a new JIRA.






[GitHub] [spark] AmplabJenkins removed a comment on pull request #28996: [SPARK-29358][SQL] Make unionByName optionally fill missing columns with nulls

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28996:
URL: https://github.com/apache/spark/pull/28996#issuecomment-653932114








[GitHub] [spark] AmplabJenkins removed a comment on pull request #28996: [SPARK-29358][SQL] Make unionByName optionally fill missing columns with nulls

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28996:
URL: https://github.com/apache/spark/pull/28996#issuecomment-654056087


   Merged build finished. Test FAILed.

