You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2021/09/16 19:29:39 UTC

[GitHub] [spark] viirya opened a new pull request #34025: [SPARK-36673][SQL] Fix incorrect schema of nested types of union

viirya opened a new pull request #34025:
URL: https://github.com/apache/spark/pull/34025


   <!--
   Thanks for sending a pull request!  Here are some tips for you:
     1. If this is your first time, please read our contributor guidelines: https://spark.apache.org/contributing.html
     2. Ensure you have added or run the appropriate tests for your PR: https://spark.apache.org/developer-tools.html
     3. If the PR is unfinished, add '[WIP]' in your PR title, e.g., '[WIP][SPARK-XXXX] Your PR title ...'.
     4. Be sure to keep the PR description updated to reflect all changes.
     5. Please write your PR title to summarize what this PR proposes.
     6. If possible, provide a concise example to reproduce the issue for a faster review.
     7. If you want to add a new configuration, please read the guideline first for naming configurations in
        'core/src/main/scala/org/apache/spark/internal/config/ConfigEntry.scala'.
     8. If you want to add or modify an error type or message, please read the guideline first in
        'core/src/main/resources/error/README.md'.
   -->
   
   ### What changes were proposed in this pull request?
   <!--
   Please clarify what changes you are proposing. The purpose of this section is to outline the changes and how this PR fixes the issue. 
   If possible, please consider writing useful notes for better and faster reviews in your PR. See the examples below.
     1. If you refactor some codes with changing classes, showing the class hierarchy will help reviewers.
     2. If you fix some SQL features, you can provide some references of other DBMSes.
     3. If there is design documentation, please add the link.
     4. If there is a discussion in the mailing list, please add the link.
   -->
   
   This patch proposes to fix incorrect schema of `union`.
   
   ### Why are the changes needed?
   <!--
   Please clarify why the changes are needed. For instance,
     1. If you propose a new API, clarify the use case for a new API.
     2. If you fix a bug, you can clarify why it is a bug.
   -->
   
   The current `union` result of nested struct columns is incorrect. By definition of `union` API, it should resolve columns by position, not by name. Right now when determining the `output` (aka. the schema) of union plan, we use `merge` API which actually merges two structs (simply think it as concatenate fields from two structs if not overlapping). The merging behavior doesn't match the `union` definition.
   
   So currently we get incorrect schema but the query result is correct. We should fix the incorrect schema.
   
   ### Does this PR introduce _any_ user-facing change?
   <!--
   Note that it means *any* user-facing change including all aspects such as the documentation fix.
   If yes, please clarify the previous behavior and the change this PR proposes - provide the console output, description and/or an example to show the behavior difference if possible.
   If possible, please also clarify if this is a user-facing change compared to the released Spark versions or within the unreleased branches such as master.
   If no, write 'No'.
   -->
   
   Yes, fixing a bug of incorrect schema.
   
   ### How was this patch tested?
   <!--
   If tests were added, say they were added here. Please make sure to add some test cases that check the changes thoroughly including negative and positive cases if possible.
   If it was tested in a way different from regular unit tests, please clarify how you tested step by step, ideally copy and paste-able, so that other reviewers can test and check, and descendants can verify in the future.
   If tests were not added, please describe why they were not added and/or why it was difficult to add.
   -->
   
   Added unit test.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #34025: [SPARK-36673][SQL] Fix incorrect schema of nested types of union

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34025:
URL: https://github.com/apache/spark/pull/34025#issuecomment-921320041


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47874/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on a change in pull request #34025: [SPARK-36673][SQL] Fix incorrect schema of nested types of union

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #34025:
URL: https://github.com/apache/spark/pull/34025#discussion_r710680268



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/types/StructType.scala
##########
@@ -559,6 +559,42 @@ object StructType extends AbstractDataType {
       case _ => dt
     }
 
+  /**
+   * This works a little similarly to `merge`, but it does not actually merge two DataTypes.
+   * This method just merges nullability.
+   */
+  private[sql] def mergeNullability(left: DataType, right: DataType): DataType =
+    (left, right) match {
+      case (ArrayType(leftElementType, leftContainsNull),
+          ArrayType(rightElementType, rightContainsNull)) =>
+        ArrayType(
+          mergeNullability(leftElementType, rightElementType),
+          leftContainsNull || rightContainsNull)
+
+      case (MapType(leftKeyType, leftValueType, leftContainsNull),
+          MapType(rightKeyType, rightValueType, rightContainsNull)) =>
+        MapType(
+          mergeNullability(leftKeyType, rightKeyType),
+          mergeNullability(leftValueType, rightValueType),
+          leftContainsNull || rightContainsNull)
+
+      case (StructType(leftFields), StructType(rightFields)) =>
+        require(leftFields.size == rightFields.size, "To merge nullability, " +
+          "two structs must have same number of fields.")

Review comment:
       is this error user-facing?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] viirya commented on a change in pull request #34025: [SPARK-36673][SQL] Fix incorrect schema of nested types of union

Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #34025:
URL: https://github.com/apache/spark/pull/34025#discussion_r710529665



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/types/StructType.scala
##########
@@ -559,6 +559,42 @@ object StructType extends AbstractDataType {
       case _ => dt
     }
 
+  /**
+   * This works a little similarly to `merge`, but it does not actually merge two DataTypes.
+   * This method just merges nullability.
+   */
+  private[sql] def mergeNullability(left: DataType, right: DataType): DataType =
+    (left, right) match {
+      case (ArrayType(leftElementType, leftContainsNull),
+          ArrayType(rightElementType, rightContainsNull)) =>
+        ArrayType(
+          merge(leftElementType, rightElementType),
+          leftContainsNull || rightContainsNull)
+
+      case (MapType(leftKeyType, leftValueType, leftContainsNull),
+          MapType(rightKeyType, rightValueType, rightContainsNull)) =>
+        MapType(
+          merge(leftKeyType, rightKeyType),
+          merge(leftValueType, rightValueType),
+          leftContainsNull || rightContainsNull)
+
+      case (StructType(leftFields), StructType(rightFields)) =>
+        require(leftFields.size == rightFields.size, "To merge nullability," +
+          "two structs must have same number of fields.")
+
+        val newFields = leftFields.zip(rightFields).map {

Review comment:
       Let me add test case for it.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] viirya commented on a change in pull request #34025: [SPARK-36673][SQL] Fix incorrect schema of nested types of union

Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #34025:
URL: https://github.com/apache/spark/pull/34025#discussion_r710528051



##########
File path: sql/core/src/test/scala/org/apache/spark/sql/DataFrameSetOperationsSuite.scala
##########
@@ -1018,6 +1018,23 @@ class DataFrameSetOperationsSuite extends QueryTest with SharedSparkSession {
     unionDF = df1.unionByName(df2)
     checkAnswer(unionDF, expected)
   }
+
+  test("SPARK-36673: Incorrect Unions of struct") {

Review comment:
       From the JIRA description, this is not with incorrect `nullability` but incorrect merged schema for union.
   
   For example, the added test case should have only one column in the nested struct after union, but currently we got 2 columns.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #34025: [SPARK-36673][SQL] Fix incorrect schema of nested types of union

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #34025:
URL: https://github.com/apache/spark/pull/34025#issuecomment-921664931


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/47911/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #34025: [SPARK-36673][SQL] Fix incorrect schema of nested types of union

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #34025:
URL: https://github.com/apache/spark/pull/34025#issuecomment-921234775


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/47870/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #34025: [SPARK-36673][SQL] Fix incorrect schema of nested types of union

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #34025:
URL: https://github.com/apache/spark/pull/34025#issuecomment-921290240


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/143363/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #34025: [SPARK-36673][SQL] Fix incorrect schema of nested types of union

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34025:
URL: https://github.com/apache/spark/pull/34025#issuecomment-921289788


   **[Test build #143363 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143363/testReport)** for PR 34025 at commit [`924b413`](https://github.com/apache/spark/commit/924b4130244ec73df200dc6f3b9509dd8cafba71).
    * This patch **fails Spark unit tests**.
    * This patch merges cleanly.
    * This patch adds no public classes.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #34025: [SPARK-36673][SQL] Fix incorrect schema of nested types of union

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #34025:
URL: https://github.com/apache/spark/pull/34025#issuecomment-921445382


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/47893/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #34025: [SPARK-36673][SQL] Fix incorrect schema of nested types of union

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34025:
URL: https://github.com/apache/spark/pull/34025#issuecomment-921481445


   **[Test build #143390 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143390/testReport)** for PR 34025 at commit [`051a152`](https://github.com/apache/spark/commit/051a152c3ad9a5e7d30c0c1f98b69d4448a0aa66).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #34025: [SPARK-36673][SQL] Fix incorrect schema of nested types of union

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #34025:
URL: https://github.com/apache/spark/pull/34025#issuecomment-921679712


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/143390/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #34025: [SPARK-36673][SQL] Fix incorrect schema of nested types of union

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34025:
URL: https://github.com/apache/spark/pull/34025#issuecomment-921366528


   **[Test build #143375 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143375/testReport)** for PR 34025 at commit [`f443f3c`](https://github.com/apache/spark/commit/f443f3c02b0cd3549aa4bca2d0fa2b8d8c5560d4).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon commented on a change in pull request #34025: [SPARK-36673][SQL] Fix incorrect schema of nested types of union

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on a change in pull request #34025:
URL: https://github.com/apache/spark/pull/34025#discussion_r710728629



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/types/StructType.scala
##########
@@ -559,52 +559,82 @@ object StructType extends AbstractDataType {
       case _ => dt
     }
 
+  /**
+   * This leverages `merge` to merge data types for UNION operator by specializing
+   * the handling of struct types to follow UNION semantics.
+   */
+  private[sql] def unionLikeMerge(left: DataType, right: DataType): DataType =
+    mergeInternal(left, right, (s1: StructType, s2: StructType) => {
+      val leftFields = s1.fields
+      val rightFields = s2.fields
+      require(leftFields.size == rightFields.size, "To merge nullability, " +
+        "two structs must have same number of fields.")
+
+      val newFields = leftFields.zip(rightFields).map {
+        case (leftField@StructField(_, leftType, leftNullable, _),
+        _@StructField(_, rightType, rightNullable, _)) =>
+          leftField.copy(
+            dataType = unionLikeMerge(leftType, rightType),
+            nullable = leftNullable || rightNullable)
+      }.toSeq
+      StructType(newFields)
+    })
+
   private[sql] def merge(left: DataType, right: DataType): DataType =
+    mergeInternal(left, right, (s1: StructType, s2: StructType) => {
+      val leftFields = s1.fields
+      val rightFields = s2.fields
+      val newFields = mutable.ArrayBuffer.empty[StructField]
+
+      val rightMapped = fieldsMap(rightFields)
+      leftFields.foreach {
+        case leftField @ StructField(leftName, leftType, leftNullable, _) =>
+          rightMapped.get(leftName)
+            .map { case rightField @ StructField(rightName, rightType, rightNullable, _) =>
+              try {
+                leftField.copy(
+                  dataType = merge(leftType, rightType),
+                  nullable = leftNullable || rightNullable)
+              } catch {
+                case NonFatal(e) =>
+                  throw QueryExecutionErrors.failedMergingFieldsError(leftName, rightName, e)
+              }
+            }
+            .orElse {
+              Some(leftField)
+            }
+            .foreach(newFields += _)
+      }
+
+      val leftMapped = fieldsMap(leftFields)
+      rightFields
+        .filterNot(f => leftMapped.get(f.name).nonEmpty)
+        .foreach { f =>
+          newFields += f
+        }
+
+      StructType(newFields.toSeq)
+    })
+
+  private[sql] def mergeInternal(

Review comment:
       nit: 
   ```suggestion
     private def mergeInternal(
   ```




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on a change in pull request #34025: [SPARK-36673][SQL] Fix incorrect schema of nested types of union

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on a change in pull request #34025:
URL: https://github.com/apache/spark/pull/34025#discussion_r710443830



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/types/StructType.scala
##########
@@ -559,6 +559,42 @@ object StructType extends AbstractDataType {
       case _ => dt
     }
 
+  /**
+   * This works a little similarly to `merge`, but it does not actually merge two DataTypes.
+   * This method just merges nullability.
+   */
+  private[sql] def mergeNullability(left: DataType, right: DataType): DataType =
+    (left, right) match {
+      case (ArrayType(leftElementType, leftContainsNull),
+          ArrayType(rightElementType, rightContainsNull)) =>
+        ArrayType(
+          merge(leftElementType, rightElementType),
+          leftContainsNull || rightContainsNull)
+
+      case (MapType(leftKeyType, leftValueType, leftContainsNull),
+          MapType(rightKeyType, rightValueType, rightContainsNull)) =>
+        MapType(
+          merge(leftKeyType, rightKeyType),
+          merge(leftValueType, rightValueType),
+          leftContainsNull || rightContainsNull)
+
+      case (StructType(leftFields), StructType(rightFields)) =>
+        require(leftFields.size == rightFields.size, "To merge nullability," +
+          "two structs must have same number of fields.")
+
+        val newFields = leftFields.zip(rightFields).map {
+          case (leftField@StructField(_, leftType, leftNullable, _),

Review comment:
       Shall we add some space around `@`?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] viirya commented on a change in pull request #34025: [SPARK-36673][SQL] Fix incorrect schema of nested types of union

Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #34025:
URL: https://github.com/apache/spark/pull/34025#discussion_r710447876



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/types/StructType.scala
##########
@@ -559,6 +559,42 @@ object StructType extends AbstractDataType {
       case _ => dt
     }
 
+  /**
+   * This works a little similarly to `merge`, but it does not actually merge two DataTypes.
+   * This method just merges nullability.
+   */
+  private[sql] def mergeNullability(left: DataType, right: DataType): DataType =
+    (left, right) match {
+      case (ArrayType(leftElementType, leftContainsNull),
+          ArrayType(rightElementType, rightContainsNull)) =>
+        ArrayType(
+          merge(leftElementType, rightElementType),
+          leftContainsNull || rightContainsNull)
+
+      case (MapType(leftKeyType, leftValueType, leftContainsNull),
+          MapType(rightKeyType, rightValueType, rightContainsNull)) =>
+        MapType(
+          merge(leftKeyType, rightKeyType),
+          merge(leftValueType, rightValueType),
+          leftContainsNull || rightContainsNull)
+
+      case (StructType(leftFields), StructType(rightFields)) =>
+        require(leftFields.size == rightFields.size, "To merge nullability," +
+          "two structs must have same number of fields.")
+
+        val newFields = leftFields.zip(rightFields).map {
+          case (leftField@StructField(_, leftType, leftNullable, _),
+              _@StructField(_, rightType, rightNullable, _)) =>
+            leftField.copy(
+              dataType = mergeNullability(leftType, rightType),
+              nullable = leftNullable || rightNullable)
+        }.toSeq
+        StructType(newFields)
+
+      case (leftType, _) =>
+        leftType

Review comment:
       `union` needs the schema of its children to be compatible. `CheckAnalysis` does the final check.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #34025: [SPARK-36673][SQL] Fix incorrect schema of nested types of union

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #34025:
URL: https://github.com/apache/spark/pull/34025#issuecomment-921290240


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/143363/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #34025: [SPARK-36673][SQL] Fix incorrect schema of nested types of union

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34025:
URL: https://github.com/apache/spark/pull/34025#issuecomment-921418922


   **[Test build #143375 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143375/testReport)** for PR 34025 at commit [`f443f3c`](https://github.com/apache/spark/commit/f443f3c02b0cd3549aa4bca2d0fa2b8d8c5560d4).
    * This patch **fails Spark unit tests**.
    * This patch merges cleanly.
    * This patch adds no public classes.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on a change in pull request #34025: [SPARK-36673][SQL] Fix incorrect schema of nested types of union

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on a change in pull request #34025:
URL: https://github.com/apache/spark/pull/34025#discussion_r710441656



##########
File path: sql/core/src/test/scala/org/apache/spark/sql/DataFrameSetOperationsSuite.scala
##########
@@ -1018,6 +1018,23 @@ class DataFrameSetOperationsSuite extends QueryTest with SharedSparkSession {
     unionDF = df1.unionByName(df2)
     checkAnswer(unionDF, expected)
   }
+
+  test("SPARK-36673: Incorrect Unions of struct") {
+    val df1 = spark.range(2).withColumn("nested", struct(expr("id * 5 AS INNER")))
+    val df2 = spark.range(2).withColumn("nested", struct(expr("id * 5 AS inner")))

Review comment:
       Nice. So, this is also testing case sensitivity?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #34025: [SPARK-36673][SQL] Fix incorrect schema of nested types of union

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #34025:
URL: https://github.com/apache/spark/pull/34025#issuecomment-921201477


   **[Test build #143363 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143363/testReport)** for PR 34025 at commit [`924b413`](https://github.com/apache/spark/commit/924b4130244ec73df200dc6f3b9509dd8cafba71).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #34025: [SPARK-36673][SQL] Fix incorrect schema of nested types of union

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #34025:
URL: https://github.com/apache/spark/pull/34025#issuecomment-921679712


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/143390/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #34025: [SPARK-36673][SQL] Fix incorrect schema of nested types of union

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34025:
URL: https://github.com/apache/spark/pull/34025#issuecomment-921550558


   **[Test build #143382 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143382/testReport)** for PR 34025 at commit [`bfc304a`](https://github.com/apache/spark/commit/bfc304a858a55edafa65a2f747f830ae1547e0bd).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #34025: [SPARK-36673][SQL] Fix incorrect schema of nested types of union

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #34025:
URL: https://github.com/apache/spark/pull/34025#issuecomment-921320074


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/47874/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] viirya commented on a change in pull request #34025: [SPARK-36673][SQL] Fix incorrect schema of nested types of union

Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #34025:
URL: https://github.com/apache/spark/pull/34025#discussion_r710697692



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/types/StructType.scala
##########
@@ -559,6 +559,42 @@ object StructType extends AbstractDataType {
       case _ => dt
     }
 
+  /**
+   * This works a little similarly to `merge`, but it does not actually merge two DataTypes.
+   * This method just merges nullability.
+   */
+  private[sql] def mergeNullability(left: DataType, right: DataType): DataType =
+    (left, right) match {
+      case (ArrayType(leftElementType, leftContainsNull),
+          ArrayType(rightElementType, rightContainsNull)) =>
+        ArrayType(
+          mergeNullability(leftElementType, rightElementType),
+          leftContainsNull || rightContainsNull)
+
+      case (MapType(leftKeyType, leftValueType, leftContainsNull),
+          MapType(rightKeyType, rightValueType, rightContainsNull)) =>
+        MapType(
+          mergeNullability(leftKeyType, rightKeyType),
+          mergeNullability(leftValueType, rightValueType),
+          leftContainsNull || rightContainsNull)
+
+      case (StructType(leftFields), StructType(rightFields)) =>
+        require(leftFields.size == rightFields.size, "To merge nullability, " +
+          "two structs must have same number of fields.")

Review comment:
       We should not hit this normally, I think. `CheckAnalysis` should already prevent incompatible data types.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #34025: [SPARK-36673][SQL] Fix incorrect schema of nested types of union

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #34025:
URL: https://github.com/apache/spark/pull/34025#issuecomment-921385346


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/47882/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #34025: [SPARK-36673][SQL] Fix incorrect schema of nested types of union

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34025:
URL: https://github.com/apache/spark/pull/34025#issuecomment-921522894


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47898/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] viirya commented on a change in pull request #34025: [SPARK-36673][SQL] Fix incorrect schema of nested types of union

Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #34025:
URL: https://github.com/apache/spark/pull/34025#discussion_r710532702



##########
File path: sql/core/src/test/scala/org/apache/spark/sql/DataFrameSetOperationsSuite.scala
##########
@@ -1018,6 +1018,23 @@ class DataFrameSetOperationsSuite extends QueryTest with SharedSparkSession {
     unionDF = df1.unionByName(df2)
     checkAnswer(unionDF, expected)
   }
+
+  test("SPARK-36673: Incorrect Unions of struct") {
+    val df1 = spark.range(2).withColumn("nested", struct(expr("id * 5 AS INNER")))
+    val df2 = spark.range(2).withColumn("nested", struct(expr("id * 5 AS inner")))
+
+    val df = df1.union(df2)

Review comment:
       I added a test case. It works well too. This only changes the output schema of union, it doesn't impact how we resolve names for `unionByName`.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #34025: [SPARK-36673][SQL] Fix incorrect schema of nested types of union

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #34025:
URL: https://github.com/apache/spark/pull/34025#issuecomment-921365001


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/143367/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #34025: [SPARK-36673][SQL] Fix incorrect schema of nested types of union

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34025:
URL: https://github.com/apache/spark/pull/34025#issuecomment-921677871


   **[Test build #143390 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143390/testReport)** for PR 34025 at commit [`051a152`](https://github.com/apache/spark/commit/051a152c3ad9a5e7d30c0c1f98b69d4448a0aa66).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on a change in pull request #34025: [SPARK-36673][SQL] Fix incorrect schema of nested types of union

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #34025:
URL: https://github.com/apache/spark/pull/34025#discussion_r710846709



##########
File path: sql/core/src/test/scala/org/apache/spark/sql/DataFrameSetOperationsSuite.scala
##########
@@ -1018,6 +1018,64 @@ class DataFrameSetOperationsSuite extends QueryTest with SharedSparkSession {
     unionDF = df1.unionByName(df2)
     checkAnswer(unionDF, expected)
   }
+
+  test("SPARK-36673: Only merge nullability for Unions of struct") {
+    val df1 = spark.range(2).withColumn("nested", struct(expr("id * 5 AS INNER")))
+    val df2 = spark.range(2).withColumn("nested", struct(expr("id * 5 AS inner")))
+
+    val union1 = df1.union(df2)
+    val union2 = df1.unionByName(df2)
+
+    val schema = StructType(Seq(StructField("id", LongType, false),
+      StructField("nested", StructType(Seq(StructField("INNER", LongType, false))), false)))
+
+    Seq(union1, union2).foreach { df =>
+      assert(df.schema == schema)
+      assert(df.queryExecution.optimizedPlan.schema == schema)
+      assert(df.queryExecution.executedPlan.schema == schema)
+
+      checkAnswer(df, Row(0, Row(0)) :: Row(1, Row(5)) :: Row(0, Row(0)) :: Row(1, Row(5)) :: Nil)
+      checkAnswer(df.select("nested.*"), Row(0) :: Row(5) :: Row(0) :: Row(5) :: Nil)
+    }
+  }
+
+  test("SPARK-36673: Only merge nullability for unionByName of struct") {
+    val df1 = spark.range(2).withColumn("nested", struct(expr("id * 5 AS INNER")))
+    val df2 = spark.range(2).withColumn("nested", struct(expr("id * 5 AS inner")))
+
+    val df = df1.unionByName(df2)
+
+    val schema = StructType(Seq(StructField("id", LongType, false),
+      StructField("nested", StructType(Seq(StructField("INNER", LongType, false))), false)))
+
+    assert(df.schema == schema)
+    assert(df.queryExecution.optimizedPlan.schema == schema)
+    assert(df.queryExecution.executedPlan.schema == schema)
+
+    checkAnswer(df, Row(0, Row(0)) :: Row(1, Row(5)) :: Row(0, Row(0)) :: Row(1, Row(5)) :: Nil)
+    checkAnswer(df.select("nested.*"), Row(0) :: Row(5) :: Row(0) :: Row(5) :: Nil)
+  }
+
+  test("SPARK-36673: Union of structs with different orders") {
+    val df1 = spark.range(2).withColumn("nested",
+      struct(expr("id * 5 AS inner1"), struct(expr("id * 10 AS inner2"))))
+    val df2 = spark.range(2).withColumn("nested",
+      struct(expr("id * 5 AS inner2"), struct(expr("id * 10 AS inner1"))))

Review comment:
       Yea, sounds like we can just update the check in `CheckAnalysis` to consider the by-ordinal semantic. We can fix it separately.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on a change in pull request #34025: [SPARK-36673][SQL] Fix incorrect schema of nested types of union

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #34025:
URL: https://github.com/apache/spark/pull/34025#discussion_r710838554



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/types/StructType.scala
##########
@@ -559,52 +559,82 @@ object StructType extends AbstractDataType {
       case _ => dt
     }
 
+  /**
+   * This leverages `merge` to merge data types for UNION operator by specializing
+   * the handling of struct types to follow UNION semantics.
+   */
+  private[sql] def unionLikeMerge(left: DataType, right: DataType): DataType =
+    mergeInternal(left, right, (s1: StructType, s2: StructType) => {
+      val leftFields = s1.fields
+      val rightFields = s2.fields
+      require(leftFields.size == rightFields.size, "To merge nullability, " +
+        "two structs must have same number of fields.")
+
+      val newFields = leftFields.zip(rightFields).map {
+        case (leftField@StructField(_, leftType, leftNullable, _),
+        _@StructField(_, rightType, rightNullable, _)) =>

Review comment:
       nit:
   ```
   case (leftField, rightField) =>
     leftField.copy(
       dataType = unionLikeMerge(leftField.dataType, rightField.dataType),
       nullable = ...
   ```




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] viirya commented on a change in pull request #34025: [SPARK-36673][SQL] Fix incorrect schema of nested types of union

Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #34025:
URL: https://github.com/apache/spark/pull/34025#discussion_r710529346



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/types/StructType.scala
##########
@@ -559,6 +559,42 @@ object StructType extends AbstractDataType {
       case _ => dt
     }
 
+  /**
+   * This works a little similarly to `merge`, but it does not actually merge two DataTypes.
+   * This method just merges nullability.
+   */
+  private[sql] def mergeNullability(left: DataType, right: DataType): DataType =
+    (left, right) match {
+      case (ArrayType(leftElementType, leftContainsNull),
+          ArrayType(rightElementType, rightContainsNull)) =>
+        ArrayType(
+          merge(leftElementType, rightElementType),
+          leftContainsNull || rightContainsNull)
+
+      case (MapType(leftKeyType, leftValueType, leftContainsNull),
+          MapType(rightKeyType, rightValueType, rightContainsNull)) =>
+        MapType(
+          merge(leftKeyType, rightKeyType),
+          merge(leftValueType, rightValueType),
+          leftContainsNull || rightContainsNull)
+
+      case (StructType(leftFields), StructType(rightFields)) =>
+        require(leftFields.size == rightFields.size, "To merge nullability," +
+          "two structs must have same number of fields.")
+
+        val newFields = leftFields.zip(rightFields).map {
+          case (leftField@StructField(_, leftType, leftNullable, _),

Review comment:
       Was copied. Fixed now.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] viirya commented on a change in pull request #34025: [SPARK-36673][SQL] Fix incorrect schema of nested types of union

Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #34025:
URL: https://github.com/apache/spark/pull/34025#discussion_r710528528



##########
File path: sql/core/src/test/scala/org/apache/spark/sql/DataFrameSetOperationsSuite.scala
##########
@@ -1018,6 +1018,23 @@ class DataFrameSetOperationsSuite extends QueryTest with SharedSparkSession {
     unionDF = df1.unionByName(df2)
     checkAnswer(unionDF, expected)
   }
+
+  test("SPARK-36673: Incorrect Unions of struct") {

Review comment:
       And yea, current test title is too vague. Let me be specific here.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #34025: [SPARK-36673][SQL] Fix incorrect schema of nested types of union

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #34025:
URL: https://github.com/apache/spark/pull/34025#issuecomment-921664931


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/47911/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #34025: [SPARK-36673][SQL] Fix incorrect schema of nested types of union

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34025:
URL: https://github.com/apache/spark/pull/34025#issuecomment-921225791


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47870/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] xkrogen commented on a change in pull request #34025: [SPARK-36673][SQL] Fix incorrect schema of nested types of union

Posted by GitBox <gi...@apache.org>.
xkrogen commented on a change in pull request #34025:
URL: https://github.com/apache/spark/pull/34025#discussion_r710450359



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/types/StructType.scala
##########
@@ -559,6 +559,42 @@ object StructType extends AbstractDataType {
       case _ => dt
     }
 
+  /**
+   * This works a little similarly to `merge`, but it does not actually merge two DataTypes.
+   * This method just merges nullability.
+   */
+  private[sql] def mergeNullability(left: DataType, right: DataType): DataType =
+    (left, right) match {
+      case (ArrayType(leftElementType, leftContainsNull),
+          ArrayType(rightElementType, rightContainsNull)) =>
+        ArrayType(
+          merge(leftElementType, rightElementType),
+          leftContainsNull || rightContainsNull)
+
+      case (MapType(leftKeyType, leftValueType, leftContainsNull),
+          MapType(rightKeyType, rightValueType, rightContainsNull)) =>
+        MapType(
+          merge(leftKeyType, rightKeyType),
+          merge(leftValueType, rightValueType),
+          leftContainsNull || rightContainsNull)
+
+      case (StructType(leftFields), StructType(rightFields)) =>
+        require(leftFields.size == rightFields.size, "To merge nullability," +
+          "two structs must have same number of fields.")
+
+        val newFields = leftFields.zip(rightFields).map {
+          case (leftField@StructField(_, leftType, leftNullable, _),
+              _@StructField(_, rightType, rightNullable, _)) =>
+            leftField.copy(
+              dataType = mergeNullability(leftType, rightType),
+              nullable = leftNullable || rightNullable)
+        }.toSeq
+        StructType(newFields)
+
+      case (leftType, _) =>
+        leftType

Review comment:
       ack, makes sense. Maybe we should mention in the Scaladoc that compatibility verification is _not_ done, since this is different from how `merge` works?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #34025: [SPARK-36673][SQL] Fix incorrect schema of nested types of union

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #34025:
URL: https://github.com/apache/spark/pull/34025#issuecomment-921433297


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/47889/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #34025: [SPARK-36673][SQL] Fix incorrect schema of nested types of union

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #34025:
URL: https://github.com/apache/spark/pull/34025#issuecomment-921419472


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/143375/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #34025: [SPARK-36673][SQL] Fix incorrect schema of nested types of union

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34025:
URL: https://github.com/apache/spark/pull/34025#issuecomment-921426361


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47889/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #34025: [SPARK-36673][SQL] Fix incorrect schema of nested types of union

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #34025:
URL: https://github.com/apache/spark/pull/34025#issuecomment-921551880


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/143382/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #34025: [SPARK-36673][SQL] Fix incorrect schema of nested types of union

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34025:
URL: https://github.com/apache/spark/pull/34025#issuecomment-921364363


   **[Test build #143367 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143367/testReport)** for PR 34025 at commit [`f443f3c`](https://github.com/apache/spark/commit/f443f3c02b0cd3549aa4bca2d0fa2b8d8c5560d4).
    * This patch **fails Spark unit tests**.
    * This patch merges cleanly.
    * This patch adds no public classes.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #34025: [SPARK-36673][SQL] Fix incorrect schema of nested types of union

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #34025:
URL: https://github.com/apache/spark/pull/34025#issuecomment-921234775


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/47870/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #34025: [SPARK-36673][SQL] Fix incorrect schema of nested types of union

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #34025:
URL: https://github.com/apache/spark/pull/34025#issuecomment-921522916


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/47898/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] viirya commented on a change in pull request #34025: [SPARK-36673][SQL] Fix incorrect schema of nested types of union

Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #34025:
URL: https://github.com/apache/spark/pull/34025#discussion_r710696143



##########
File path: sql/core/src/test/scala/org/apache/spark/sql/DataFrameSetOperationsSuite.scala
##########
@@ -1018,6 +1018,64 @@ class DataFrameSetOperationsSuite extends QueryTest with SharedSparkSession {
     unionDF = df1.unionByName(df2)
     checkAnswer(unionDF, expected)
   }
+
+  test("SPARK-36673: Only merge nullability for Unions of struct") {
+    val df1 = spark.range(2).withColumn("nested", struct(expr("id * 5 AS INNER")))
+    val df2 = spark.range(2).withColumn("nested", struct(expr("id * 5 AS inner")))
+
+    val union1 = df1.union(df2)
+    val union2 = df1.unionByName(df2)
+
+    val schema = StructType(Seq(StructField("id", LongType, false),
+      StructField("nested", StructType(Seq(StructField("INNER", LongType, false))), false)))
+
+    Seq(union1, union2).foreach { df =>
+      assert(df.schema == schema)
+      assert(df.queryExecution.optimizedPlan.schema == schema)
+      assert(df.queryExecution.executedPlan.schema == schema)
+
+      checkAnswer(df, Row(0, Row(0)) :: Row(1, Row(5)) :: Row(0, Row(0)) :: Row(1, Row(5)) :: Nil)
+      checkAnswer(df.select("nested.*"), Row(0) :: Row(5) :: Row(0) :: Row(5) :: Nil)
+    }
+  }
+
+  test("SPARK-36673: Only merge nullability for unionByName of struct") {
+    val df1 = spark.range(2).withColumn("nested", struct(expr("id * 5 AS INNER")))
+    val df2 = spark.range(2).withColumn("nested", struct(expr("id * 5 AS inner")))
+
+    val df = df1.unionByName(df2)
+
+    val schema = StructType(Seq(StructField("id", LongType, false),
+      StructField("nested", StructType(Seq(StructField("INNER", LongType, false))), false)))
+
+    assert(df.schema == schema)
+    assert(df.queryExecution.optimizedPlan.schema == schema)
+    assert(df.queryExecution.executedPlan.schema == schema)
+
+    checkAnswer(df, Row(0, Row(0)) :: Row(1, Row(5)) :: Row(0, Row(0)) :: Row(1, Row(5)) :: Nil)
+    checkAnswer(df.select("nested.*"), Row(0) :: Row(5) :: Row(0) :: Row(5) :: Nil)
+  }
+
+  test("SPARK-36673: Union of structs with different orders") {
+    val df1 = spark.range(2).withColumn("nested",
+      struct(expr("id * 5 AS inner1"), struct(expr("id * 10 AS inner2"))))
+    val df2 = spark.range(2).withColumn("nested",
+      struct(expr("id * 5 AS inner2"), struct(expr("id * 10 AS inner1"))))

Review comment:
       This test is more to reflect/verify what the current behavior is (i.e. the added test works as it is before this change). That is good point about how we should treat union of nested columns. As mentioned in previous comment, we have `CheckAnalysis` as the guard to prevent invalid UNION by checking the compatibility of datatypes. And for now we compare the fields of structs one by one (by position). If the any fields at same position cannot be resolved, `CheckAnalysis` thinks it as mismatched data types and stops the query.
   
   This PR is mainly for fixing the merging behavior for UNION which is not correct.
   
   We can work on the mentioned issue separately.
   
   
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] viirya commented on a change in pull request #34025: [SPARK-36673][SQL] Fix incorrect schema of nested types of union

Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #34025:
URL: https://github.com/apache/spark/pull/34025#discussion_r710447019



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/types/StructType.scala
##########
@@ -559,6 +559,42 @@ object StructType extends AbstractDataType {
       case _ => dt
     }
 
+  /**
+   * This works a little similarly to `merge`, but it does not actually merge two DataTypes.
+   * This method just merges nullability.
+   */
+  private[sql] def mergeNullability(left: DataType, right: DataType): DataType =
+    (left, right) match {
+      case (ArrayType(leftElementType, leftContainsNull),
+          ArrayType(rightElementType, rightContainsNull)) =>
+        ArrayType(
+          merge(leftElementType, rightElementType),
+          leftContainsNull || rightContainsNull)
+
+      case (MapType(leftKeyType, leftValueType, leftContainsNull),
+          MapType(rightKeyType, rightValueType, rightContainsNull)) =>
+        MapType(
+          merge(leftKeyType, rightKeyType),
+          merge(leftValueType, rightValueType),

Review comment:
       Oops, missed it. Thanks.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #34025: [SPARK-36673][SQL] Fix incorrect schema of nested types of union

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #34025:
URL: https://github.com/apache/spark/pull/34025#issuecomment-921365001


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/143367/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #34025: [SPARK-36673][SQL] Fix incorrect schema of nested types of union

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #34025:
URL: https://github.com/apache/spark/pull/34025#issuecomment-921433297


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/47889/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #34025: [SPARK-36673][SQL] Fix incorrect schema of nested types of union

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #34025:
URL: https://github.com/apache/spark/pull/34025#issuecomment-921551880


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/143382/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] viirya commented on a change in pull request #34025: [SPARK-36673][SQL] Fix incorrect schema of nested types of union

Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #34025:
URL: https://github.com/apache/spark/pull/34025#discussion_r710696838



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/types/StructType.scala
##########
@@ -559,6 +559,42 @@ object StructType extends AbstractDataType {
       case _ => dt
     }
 
+  /**
+   * This works a little similarly to `merge`, but it does not actually merge two DataTypes.
+   * This method just merges nullability.
+   */
+  private[sql] def mergeNullability(left: DataType, right: DataType): DataType =
+    (left, right) match {
+      case (ArrayType(leftElementType, leftContainsNull),
+          ArrayType(rightElementType, rightContainsNull)) =>
+        ArrayType(
+          mergeNullability(leftElementType, rightElementType),
+          leftContainsNull || rightContainsNull)
+
+      case (MapType(leftKeyType, leftValueType, leftContainsNull),
+          MapType(rightKeyType, rightValueType, rightContainsNull)) =>
+        MapType(
+          mergeNullability(leftKeyType, rightKeyType),
+          mergeNullability(leftValueType, rightValueType),
+          leftContainsNull || rightContainsNull)
+
+      case (StructType(leftFields), StructType(rightFields)) =>
+        require(leftFields.size == rightFields.size, "To merge nullability, " +
+          "two structs must have same number of fields.")
+
+        val newFields = leftFields.zip(rightFields).map {
+          case (leftField @ StructField(_, leftType, leftNullable, _),
+              _ @ StructField(_, rightType, rightNullable, _)) =>
+            leftField.copy(
+              dataType = mergeNullability(leftType, rightType),
+              nullable = leftNullable || rightNullable)
+        }.toSeq
+        StructType(newFields)
+
+      case (leftType, _) =>
+        leftType
+    }
+
   private[sql] def merge(left: DataType, right: DataType): DataType =

Review comment:
       Nice!




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #34025: [SPARK-36673][SQL] Fix incorrect schema of nested types of union

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34025:
URL: https://github.com/apache/spark/pull/34025#issuecomment-921599635


   **[Test build #143403 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143403/testReport)** for PR 34025 at commit [`9bed2e6`](https://github.com/apache/spark/commit/9bed2e665a9f7e72c7abfb7277bbd631397dce39).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #34025: [SPARK-36673][SQL] Fix incorrect schema of nested types of union

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34025:
URL: https://github.com/apache/spark/pull/34025#issuecomment-921445343


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47893/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #34025: [SPARK-36673][SQL] Fix incorrect schema of nested types of union

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #34025:
URL: https://github.com/apache/spark/pull/34025#issuecomment-921385346


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/47882/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #34025: [SPARK-36673][SQL] Fix incorrect schema of nested types of union

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34025:
URL: https://github.com/apache/spark/pull/34025#issuecomment-921316802


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47874/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #34025: [SPARK-36673][SQL] Fix incorrect schema of nested types of union

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34025:
URL: https://github.com/apache/spark/pull/34025#issuecomment-921643044


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47911/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] xkrogen commented on a change in pull request #34025: [SPARK-36673][SQL] Fix incorrect schema of nested types of union

Posted by GitBox <gi...@apache.org>.
xkrogen commented on a change in pull request #34025:
URL: https://github.com/apache/spark/pull/34025#discussion_r710439691



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/types/StructType.scala
##########
@@ -559,6 +559,42 @@ object StructType extends AbstractDataType {
       case _ => dt
     }
 
+  /**
+   * This works a little similarly to `merge`, but it does not actually merge two DataTypes.
+   * This method just merges nullability.
+   */
+  private[sql] def mergeNullability(left: DataType, right: DataType): DataType =
+    (left, right) match {
+      case (ArrayType(leftElementType, leftContainsNull),
+          ArrayType(rightElementType, rightContainsNull)) =>
+        ArrayType(
+          merge(leftElementType, rightElementType),

Review comment:
       doesn't this need to be `mergeNullability` ?

##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/types/StructType.scala
##########
@@ -559,6 +559,42 @@ object StructType extends AbstractDataType {
       case _ => dt
     }
 
+  /**
+   * This works a little similarly to `merge`, but it does not actually merge two DataTypes.
+   * This method just merges nullability.
+   */
+  private[sql] def mergeNullability(left: DataType, right: DataType): DataType =
+    (left, right) match {
+      case (ArrayType(leftElementType, leftContainsNull),
+          ArrayType(rightElementType, rightContainsNull)) =>
+        ArrayType(
+          merge(leftElementType, rightElementType),
+          leftContainsNull || rightContainsNull)
+
+      case (MapType(leftKeyType, leftValueType, leftContainsNull),
+          MapType(rightKeyType, rightValueType, rightContainsNull)) =>
+        MapType(
+          merge(leftKeyType, rightKeyType),
+          merge(leftValueType, rightValueType),
+          leftContainsNull || rightContainsNull)
+
+      case (StructType(leftFields), StructType(rightFields)) =>
+        require(leftFields.size == rightFields.size, "To merge nullability," +

Review comment:
       minor typo nit: missing a space after the comma in the message

##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/types/StructType.scala
##########
@@ -559,6 +559,42 @@ object StructType extends AbstractDataType {
       case _ => dt
     }
 
+  /**
+   * This works a little similarly to `merge`, but it does not actually merge two DataTypes.
+   * This method just merges nullability.
+   */
+  private[sql] def mergeNullability(left: DataType, right: DataType): DataType =
+    (left, right) match {
+      case (ArrayType(leftElementType, leftContainsNull),
+          ArrayType(rightElementType, rightContainsNull)) =>
+        ArrayType(
+          merge(leftElementType, rightElementType),
+          leftContainsNull || rightContainsNull)
+
+      case (MapType(leftKeyType, leftValueType, leftContainsNull),
+          MapType(rightKeyType, rightValueType, rightContainsNull)) =>
+        MapType(
+          merge(leftKeyType, rightKeyType),
+          merge(leftValueType, rightValueType),
+          leftContainsNull || rightContainsNull)
+
+      case (StructType(leftFields), StructType(rightFields)) =>
+        require(leftFields.size == rightFields.size, "To merge nullability," +
+          "two structs must have same number of fields.")
+
+        val newFields = leftFields.zip(rightFields).map {
+          case (leftField@StructField(_, leftType, leftNullable, _),
+              _@StructField(_, rightType, rightNullable, _)) =>
+            leftField.copy(
+              dataType = mergeNullability(leftType, rightType),
+              nullable = leftNullable || rightNullable)
+        }.toSeq
+        StructType(newFields)
+
+      case (leftType, _) =>
+        leftType

Review comment:
       Previously `merge` would do type compatibility checking:
   https://github.com/apache/spark/blob/924b4130244ec73df200dc6f3b9509dd8cafba71/sql/catalyst/src/main/scala/org/apache/spark/sql/types/StructType.scala#L663-L667
   Do we need to restore this logic, or does `union` handle doing type compatibility checking elsewhere?

##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/types/StructType.scala
##########
@@ -559,6 +559,42 @@ object StructType extends AbstractDataType {
       case _ => dt
     }
 
+  /**
+   * This works a little similarly to `merge`, but it does not actually merge two DataTypes.
+   * This method just merges nullability.
+   */
+  private[sql] def mergeNullability(left: DataType, right: DataType): DataType =
+    (left, right) match {
+      case (ArrayType(leftElementType, leftContainsNull),
+          ArrayType(rightElementType, rightContainsNull)) =>
+        ArrayType(
+          merge(leftElementType, rightElementType),
+          leftContainsNull || rightContainsNull)
+
+      case (MapType(leftKeyType, leftValueType, leftContainsNull),
+          MapType(rightKeyType, rightValueType, rightContainsNull)) =>
+        MapType(
+          merge(leftKeyType, rightKeyType),
+          merge(leftValueType, rightValueType),

Review comment:
       doesn't this need to be `mergeNullability` ?

##########
File path: sql/core/src/test/scala/org/apache/spark/sql/DataFrameSetOperationsSuite.scala
##########
@@ -1018,6 +1018,23 @@ class DataFrameSetOperationsSuite extends QueryTest with SharedSparkSession {
     unionDF = df1.unionByName(df2)
     checkAnswer(unionDF, expected)
   }
+
+  test("SPARK-36673: Incorrect Unions of struct") {
+    val df1 = spark.range(2).withColumn("nested", struct(expr("id * 5 AS INNER")))
+    val df2 = spark.range(2).withColumn("nested", struct(expr("id * 5 AS inner")))
+
+    val df = df1.union(df2)

Review comment:
       does `unionByName` work properly in this situation with your new changes?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #34025: [SPARK-36673][SQL] Fix incorrect schema of nested types of union

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #34025:
URL: https://github.com/apache/spark/pull/34025#issuecomment-921401718


   **[Test build #143382 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143382/testReport)** for PR 34025 at commit [`bfc304a`](https://github.com/apache/spark/commit/bfc304a858a55edafa65a2f747f830ae1547e0bd).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] viirya commented on a change in pull request #34025: [SPARK-36673][SQL] Fix incorrect schema of nested types of union

Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #34025:
URL: https://github.com/apache/spark/pull/34025#discussion_r710528528



##########
File path: sql/core/src/test/scala/org/apache/spark/sql/DataFrameSetOperationsSuite.scala
##########
@@ -1018,6 +1018,23 @@ class DataFrameSetOperationsSuite extends QueryTest with SharedSparkSession {
     unionDF = df1.unionByName(df2)
     checkAnswer(unionDF, expected)
   }
+
+  test("SPARK-36673: Incorrect Unions of struct") {

Review comment:
       And current test title is too vague. Let me be specific here.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #34025: [SPARK-36673][SQL] Fix incorrect schema of nested types of union

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34025:
URL: https://github.com/apache/spark/pull/34025#issuecomment-921294563


   **[Test build #143367 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143367/testReport)** for PR 34025 at commit [`f443f3c`](https://github.com/apache/spark/commit/f443f3c02b0cd3549aa4bca2d0fa2b8d8c5560d4).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #34025: [SPARK-36673][SQL] Fix incorrect schema of nested types of union

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #34025:
URL: https://github.com/apache/spark/pull/34025#issuecomment-921294563


   **[Test build #143367 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143367/testReport)** for PR 34025 at commit [`f443f3c`](https://github.com/apache/spark/commit/f443f3c02b0cd3549aa4bca2d0fa2b8d8c5560d4).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] viirya commented on a change in pull request #34025: [SPARK-36673][SQL] Fix incorrect schema of nested types of union

Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #34025:
URL: https://github.com/apache/spark/pull/34025#discussion_r710536085



##########
File path: sql/core/src/test/scala/org/apache/spark/sql/DataFrameSetOperationsSuite.scala
##########
@@ -1018,6 +1018,23 @@ class DataFrameSetOperationsSuite extends QueryTest with SharedSparkSession {
     unionDF = df1.unionByName(df2)
     checkAnswer(unionDF, expected)
   }
+
+  test("SPARK-36673: Incorrect Unions of struct") {
+    val df1 = spark.range(2).withColumn("nested", struct(expr("id * 5 AS INNER")))
+    val df2 = spark.range(2).withColumn("nested", struct(expr("id * 5 AS inner")))

Review comment:
       Currently it is always case sensitive for the nested fields in structs during merging (ref. `StructType.merge`). But for `union`, we don't actually want to do a struct merging. For `union`, we don't merge the schema of its children for all cases but resolve columns between them (by position or by name). 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #34025: [SPARK-36673][SQL] Fix incorrect schema of nested types of union

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34025:
URL: https://github.com/apache/spark/pull/34025#issuecomment-921401718


   **[Test build #143382 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143382/testReport)** for PR 34025 at commit [`bfc304a`](https://github.com/apache/spark/commit/bfc304a858a55edafa65a2f747f830ae1547e0bd).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] viirya commented on a change in pull request #34025: [SPARK-36673][SQL] Fix incorrect schema of nested types of union

Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #34025:
URL: https://github.com/apache/spark/pull/34025#discussion_r711668057



##########
File path: sql/core/src/test/scala/org/apache/spark/sql/DataFrameSetOperationsSuite.scala
##########
@@ -1018,6 +1018,64 @@ class DataFrameSetOperationsSuite extends QueryTest with SharedSparkSession {
     unionDF = df1.unionByName(df2)
     checkAnswer(unionDF, expected)
   }
+
+  test("SPARK-36673: Only merge nullability for Unions of struct") {
+    val df1 = spark.range(2).withColumn("nested", struct(expr("id * 5 AS INNER")))
+    val df2 = spark.range(2).withColumn("nested", struct(expr("id * 5 AS inner")))
+
+    val union1 = df1.union(df2)
+    val union2 = df1.unionByName(df2)
+
+    val schema = StructType(Seq(StructField("id", LongType, false),
+      StructField("nested", StructType(Seq(StructField("INNER", LongType, false))), false)))
+
+    Seq(union1, union2).foreach { df =>
+      assert(df.schema == schema)
+      assert(df.queryExecution.optimizedPlan.schema == schema)
+      assert(df.queryExecution.executedPlan.schema == schema)
+
+      checkAnswer(df, Row(0, Row(0)) :: Row(1, Row(5)) :: Row(0, Row(0)) :: Row(1, Row(5)) :: Nil)
+      checkAnswer(df.select("nested.*"), Row(0) :: Row(5) :: Row(0) :: Row(5) :: Nil)
+    }
+  }
+
+  test("SPARK-36673: Only merge nullability for unionByName of struct") {
+    val df1 = spark.range(2).withColumn("nested", struct(expr("id * 5 AS INNER")))
+    val df2 = spark.range(2).withColumn("nested", struct(expr("id * 5 AS inner")))
+
+    val df = df1.unionByName(df2)
+
+    val schema = StructType(Seq(StructField("id", LongType, false),
+      StructField("nested", StructType(Seq(StructField("INNER", LongType, false))), false)))
+
+    assert(df.schema == schema)
+    assert(df.queryExecution.optimizedPlan.schema == schema)
+    assert(df.queryExecution.executedPlan.schema == schema)
+
+    checkAnswer(df, Row(0, Row(0)) :: Row(1, Row(5)) :: Row(0, Row(0)) :: Row(1, Row(5)) :: Nil)
+    checkAnswer(df.select("nested.*"), Row(0) :: Row(5) :: Row(0) :: Row(5) :: Nil)
+  }
+
+  test("SPARK-36673: Union of structs with different orders") {
+    val df1 = spark.range(2).withColumn("nested",
+      struct(expr("id * 5 AS inner1"), struct(expr("id * 10 AS inner2"))))
+    val df2 = spark.range(2).withColumn("nested",
+      struct(expr("id * 5 AS inner2"), struct(expr("id * 10 AS inner1"))))

Review comment:
       Opened #34038 for that.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #34025: [SPARK-36673][SQL] Fix incorrect schema of nested types of union

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34025:
URL: https://github.com/apache/spark/pull/34025#issuecomment-921383846


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47882/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan closed pull request #34025: [SPARK-36673][SQL] Fix incorrect schema of nested types of union

Posted by GitBox <gi...@apache.org>.
cloud-fan closed pull request #34025:
URL: https://github.com/apache/spark/pull/34025


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #34025: [SPARK-36673][SQL] Fix incorrect schema of nested types of union

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34025:
URL: https://github.com/apache/spark/pull/34025#issuecomment-921519864


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47898/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #34025: [SPARK-36673][SQL] Fix incorrect schema of nested types of union

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #34025:
URL: https://github.com/apache/spark/pull/34025#issuecomment-921423017


   **[Test build #143386 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143386/testReport)** for PR 34025 at commit [`ad4ec57`](https://github.com/apache/spark/commit/ad4ec57867beddc3e45bb739e26c4f87f2385ee4).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #34025: [SPARK-36673][SQL] Fix incorrect schema of nested types of union

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #34025:
URL: https://github.com/apache/spark/pull/34025#issuecomment-921778994


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/143403/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #34025: [SPARK-36673][SQL] Fix incorrect schema of nested types of union

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34025:
URL: https://github.com/apache/spark/pull/34025#issuecomment-921381412


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47882/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] mridulm commented on pull request #34025: [SPARK-36673][SQL] Fix incorrect schema of nested types of union

Posted by GitBox <gi...@apache.org>.
mridulm commented on pull request #34025:
URL: https://github.com/apache/spark/pull/34025#issuecomment-921327091


   +CC @shardulm94 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #34025: [SPARK-36673][SQL] Fix incorrect schema of nested types of union

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #34025:
URL: https://github.com/apache/spark/pull/34025#issuecomment-921778994


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/143403/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] viirya commented on a change in pull request #34025: [SPARK-36673][SQL] Fix incorrect schema of nested types of union

Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #34025:
URL: https://github.com/apache/spark/pull/34025#discussion_r710522747



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/types/StructType.scala
##########
@@ -559,6 +559,42 @@ object StructType extends AbstractDataType {
       case _ => dt
     }
 
+  /**
+   * This works a little similarly to `merge`, but it does not actually merge two DataTypes.
+   * This method just merges nullability.
+   */
+  private[sql] def mergeNullability(left: DataType, right: DataType): DataType =
+    (left, right) match {
+      case (ArrayType(leftElementType, leftContainsNull),
+          ArrayType(rightElementType, rightContainsNull)) =>
+        ArrayType(
+          merge(leftElementType, rightElementType),
+          leftContainsNull || rightContainsNull)
+
+      case (MapType(leftKeyType, leftValueType, leftContainsNull),
+          MapType(rightKeyType, rightValueType, rightContainsNull)) =>
+        MapType(
+          merge(leftKeyType, rightKeyType),
+          merge(leftValueType, rightValueType),
+          leftContainsNull || rightContainsNull)
+
+      case (StructType(leftFields), StructType(rightFields)) =>
+        require(leftFields.size == rightFields.size, "To merge nullability," +
+          "two structs must have same number of fields.")
+
+        val newFields = leftFields.zip(rightFields).map {

Review comment:
       For two structs with different order (same or not same types), `CheckAnalysis` will fail such union query. E.g.,
   
   ```
   [info]   org.apache.spark.sql.AnalysisException: Union can only be performed on tables with the compatible column types. struct<inner2:bigint,col2:struct<inner1:bigint>> <> struct<inner1:bigi
   nt,col2:struct<inner2:bigint>> at the second column of the second table;              
   ```
   or
   
   ```
   [info]   org.apache.spark.sql.AnalysisException: Union can only be performed on tables with the compatible column types. struct<col1:string,col2:struct<inner1:bigint>> <> struct<inner1:bigint
   ,col2:struct<col1:string>> at the second column of the second table;                                
   ```




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on a change in pull request #34025: [SPARK-36673][SQL] Fix incorrect schema of nested types of union

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on a change in pull request #34025:
URL: https://github.com/apache/spark/pull/34025#discussion_r710445138



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/types/StructType.scala
##########
@@ -559,6 +559,42 @@ object StructType extends AbstractDataType {
       case _ => dt
     }
 
+  /**
+   * This works a little similarly to `merge`, but it does not actually merge two DataTypes.
+   * This method just merges nullability.
+   */
+  private[sql] def mergeNullability(left: DataType, right: DataType): DataType =
+    (left, right) match {
+      case (ArrayType(leftElementType, leftContainsNull),
+          ArrayType(rightElementType, rightContainsNull)) =>
+        ArrayType(
+          merge(leftElementType, rightElementType),
+          leftContainsNull || rightContainsNull)
+
+      case (MapType(leftKeyType, leftValueType, leftContainsNull),
+          MapType(rightKeyType, rightValueType, rightContainsNull)) =>
+        MapType(
+          merge(leftKeyType, rightKeyType),
+          merge(leftValueType, rightValueType),
+          leftContainsNull || rightContainsNull)
+
+      case (StructType(leftFields), StructType(rightFields)) =>
+        require(leftFields.size == rightFields.size, "To merge nullability," +
+          "two structs must have same number of fields.")
+
+        val newFields = leftFields.zip(rightFields).map {

Review comment:
       This looks like a regression because this code assume that two structs have identical orders. The previous code doesn't. Could you compare with `merge` and add a test case for the different ordering?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #34025: [SPARK-36673][SQL] Fix incorrect schema of nested types of union

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #34025:
URL: https://github.com/apache/spark/pull/34025#issuecomment-921366528


   **[Test build #143375 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143375/testReport)** for PR 34025 at commit [`f443f3c`](https://github.com/apache/spark/commit/f443f3c02b0cd3549aa4bca2d0fa2b8d8c5560d4).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #34025: [SPARK-36673][SQL] Fix incorrect schema of nested types of union

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #34025:
URL: https://github.com/apache/spark/pull/34025#issuecomment-921419472


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/143375/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #34025: [SPARK-36673][SQL] Fix incorrect schema of nested types of union

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34025:
URL: https://github.com/apache/spark/pull/34025#issuecomment-921440847


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47893/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] viirya commented on pull request #34025: [SPARK-36673][SQL] Fix incorrect schema of nested types of union

Posted by GitBox <gi...@apache.org>.
viirya commented on pull request #34025:
URL: https://github.com/apache/spark/pull/34025#issuecomment-921365279


   retest this please


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #34025: [SPARK-36673][SQL] Fix incorrect schema of nested types of union

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #34025:
URL: https://github.com/apache/spark/pull/34025#issuecomment-921445382


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/47893/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon commented on a change in pull request #34025: [SPARK-36673][SQL] Fix incorrect schema of nested types of union

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on a change in pull request #34025:
URL: https://github.com/apache/spark/pull/34025#discussion_r710665992



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/types/StructType.scala
##########
@@ -559,6 +559,42 @@ object StructType extends AbstractDataType {
       case _ => dt
     }
 
+  /**
+   * This works a little similarly to `merge`, but it does not actually merge two DataTypes.
+   * This method just merges nullability.
+   */
+  private[sql] def mergeNullability(left: DataType, right: DataType): DataType =
+    (left, right) match {
+      case (ArrayType(leftElementType, leftContainsNull),
+          ArrayType(rightElementType, rightContainsNull)) =>
+        ArrayType(
+          mergeNullability(leftElementType, rightElementType),
+          leftContainsNull || rightContainsNull)
+
+      case (MapType(leftKeyType, leftValueType, leftContainsNull),
+          MapType(rightKeyType, rightValueType, rightContainsNull)) =>
+        MapType(
+          mergeNullability(leftKeyType, rightKeyType),
+          mergeNullability(leftValueType, rightValueType),
+          leftContainsNull || rightContainsNull)
+
+      case (StructType(leftFields), StructType(rightFields)) =>
+        require(leftFields.size == rightFields.size, "To merge nullability, " +
+          "two structs must have same number of fields.")
+
+        val newFields = leftFields.zip(rightFields).map {
+          case (leftField @ StructField(_, leftType, leftNullable, _),
+              _ @ StructField(_, rightType, rightNullable, _)) =>
+            leftField.copy(
+              dataType = mergeNullability(leftType, rightType),
+              nullable = leftNullable || rightNullable)
+        }.toSeq
+        StructType(newFields)
+
+      case (leftType, _) =>
+        leftType
+    }
+
   private[sql] def merge(left: DataType, right: DataType): DataType =

Review comment:
       What about something like this?
   
   ```diff
   diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala
   index 5be328e0486..50e8d64feba 100644
   --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala
   +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala
   @@ -307,7 +307,7 @@ case class Union(
        children.map(_.output).transpose.map { attrs =>
          val firstAttr = attrs.head
          val nullable = attrs.exists(_.nullable)
   -      val newDt = attrs.map(_.dataType).reduce(StructType.mergeNullability)
   +      val newDt = attrs.map(_.dataType).reduce(StructType.unionLikeMerge)
          if (firstAttr.dataType == newDt) {
            firstAttr.withNullability(nullable)
          } else {
   diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/types/StructType.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/types/StructType.scala
   index 7c0f9cc51e9..7d571bf9b1d 100644
   --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/types/StructType.scala
   +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/types/StructType.scala
   @@ -560,25 +560,12 @@ object StructType extends AbstractDataType {
        }
   
      /**
   -   * This works a little similarly to `merge`, but it does not actually merge two DataTypes.
   -   * This method just merges nullability.
   +   * Merge both struct types but it follows UNION semantic blah blah
       */
   -  private[sql] def mergeNullability(left: DataType, right: DataType): DataType =
   -    (left, right) match {
   -      case (ArrayType(leftElementType, leftContainsNull),
   -          ArrayType(rightElementType, rightContainsNull)) =>
   -        ArrayType(
   -          mergeNullability(leftElementType, rightElementType),
   -          leftContainsNull || rightContainsNull)
   -
   -      case (MapType(leftKeyType, leftValueType, leftContainsNull),
   -          MapType(rightKeyType, rightValueType, rightContainsNull)) =>
   -        MapType(
   -          mergeNullability(leftKeyType, rightKeyType),
   -          mergeNullability(leftValueType, rightValueType),
   -          leftContainsNull || rightContainsNull)
   -
   -      case (StructType(leftFields), StructType(rightFields)) =>
   +  private[sql] def unionLikeMerge(left: DataType, right: DataType): DataType =
   +    mergeInternal(left, right, (s1: StructType, s2: StructType) => {
   +      val leftFields = s1.fields
   +      val rightFields = s2.fields
            require(leftFields.size == rightFields.size, "To merge nullability, " +
              "two structs must have same number of fields.")
   
   @@ -586,32 +573,17 @@ object StructType extends AbstractDataType {
              case (leftField @ StructField(_, leftType, leftNullable, _),
                  _ @ StructField(_, rightType, rightNullable, _)) =>
                leftField.copy(
   -              dataType = mergeNullability(leftType, rightType),
   +              dataType = unionLikeMerge(leftType, rightType),
                  nullable = leftNullable || rightNullable)
            }.toSeq
            StructType(newFields)
   -
   -      case (leftType, _) =>
   -        leftType
   -    }
   +    })
   
      private[sql] def merge(left: DataType, right: DataType): DataType =
   -    (left, right) match {
   -      case (ArrayType(leftElementType, leftContainsNull),
   -      ArrayType(rightElementType, rightContainsNull)) =>
   -        ArrayType(
   -          merge(leftElementType, rightElementType),
   -          leftContainsNull || rightContainsNull)
   -
   -      case (MapType(leftKeyType, leftValueType, leftContainsNull),
   -      MapType(rightKeyType, rightValueType, rightContainsNull)) =>
   -        MapType(
   -          merge(leftKeyType, rightKeyType),
   -          merge(leftValueType, rightValueType),
   -          leftContainsNull || rightContainsNull)
   -
   -      case (StructType(leftFields), StructType(rightFields)) =>
   -        val newFields = mutable.ArrayBuffer.empty[StructField]
   +    mergeInternal(left, right, (s1: StructType, s2: StructType) => {
   +      val leftFields = s1.fields
   +      val rightFields = s2.fields
   +      val newFields = mutable.ArrayBuffer.empty[StructField]
   
            val rightMapped = fieldsMap(rightFields)
            leftFields.foreach {
   @@ -641,6 +613,27 @@ object StructType extends AbstractDataType {
              }
   
            StructType(newFields.toSeq)
   +    })
   +
   +  private def mergeInternal(
   +      left: DataType,
   +      right: DataType,
   +      mergeStruct: (StructType, StructType) => StructType): DataType = {
   +    (left, right) match {
   +      case (ArrayType(leftElementType, leftContainsNull),
   +      ArrayType(rightElementType, rightContainsNull)) =>
   +        ArrayType(
   +          merge(leftElementType, rightElementType),
   +          leftContainsNull || rightContainsNull)
   +
   +      case (MapType(leftKeyType, leftValueType, leftContainsNull),
   +      MapType(rightKeyType, rightValueType, rightContainsNull)) =>
   +        MapType(
   +          merge(leftKeyType, rightKeyType),
   +          merge(leftValueType, rightValueType),
   +          leftContainsNull || rightContainsNull)
   +
   +      case (s1: StructType, s2: StructType) => mergeStruct(s1, s2)
   
          case (DecimalType.Fixed(leftPrecision, leftScale),
            DecimalType.Fixed(rightPrecision, rightScale)) =>
   @@ -666,6 +659,7 @@ object StructType extends AbstractDataType {
          case _ =>
            throw QueryExecutionErrors.cannotMergeIncompatibleDataTypesError(left, right)
        }
   +  }
   
      private[sql] def fieldsMap(fields: Array[StructField]): Map[String, StructField] = {
        // Mimics the optimization of breakOut, not present in Scala 2.13, while working in 2.12
   diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/basicPhysicalOperators.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/basicPhysicalOperators.scala
   index b5ae4b3fac8..8e0080a2469 100644
   --- a/sql/core/src/main/scala/org/apache/spark/sql/execution/basicPhysicalOperators.scala
   +++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/basicPhysicalOperators.scala
   @@ -684,7 +684,7 @@ case class UnionExec(children: Seq[SparkPlan]) extends SparkPlan {
        children.map(_.output).transpose.map { attrs =>
          val firstAttr = attrs.head
          val nullable = attrs.exists(_.nullable)
   -      val newDt = attrs.map(_.dataType).reduce(StructType.mergeNullability)
   +      val newDt = attrs.map(_.dataType).reduce(StructType.unionLikeMerge)
          if (firstAttr.dataType == newDt) {
            firstAttr.withNullability(nullable)
          } else {
   ```




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #34025: [SPARK-36673][SQL] Fix incorrect schema of nested types of union

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34025:
URL: https://github.com/apache/spark/pull/34025#issuecomment-921777534


   **[Test build #143403 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143403/testReport)** for PR 34025 at commit [`9bed2e6`](https://github.com/apache/spark/commit/9bed2e665a9f7e72c7abfb7277bbd631397dce39).
    * This patch **fails SparkR unit tests**.
    * This patch merges cleanly.
    * This patch adds no public classes.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #34025: [SPARK-36673][SQL] Fix incorrect schema of nested types of union

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #34025:
URL: https://github.com/apache/spark/pull/34025#issuecomment-921320074


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/47874/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #34025: [SPARK-36673][SQL] Fix incorrect schema of nested types of union

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #34025:
URL: https://github.com/apache/spark/pull/34025#issuecomment-921522916


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/47898/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #34025: [SPARK-36673][SQL] Fix incorrect schema of nested types of union

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #34025:
URL: https://github.com/apache/spark/pull/34025#issuecomment-921624065


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/143386/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #34025: [SPARK-36673][SQL] Fix incorrect schema of nested types of union

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #34025:
URL: https://github.com/apache/spark/pull/34025#issuecomment-921481445


   **[Test build #143390 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143390/testReport)** for PR 34025 at commit [`051a152`](https://github.com/apache/spark/commit/051a152c3ad9a5e7d30c0c1f98b69d4448a0aa66).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] viirya commented on a change in pull request #34025: [SPARK-36673][SQL] Fix incorrect schema of nested types of union

Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #34025:
URL: https://github.com/apache/spark/pull/34025#discussion_r711180321



##########
File path: sql/core/src/test/scala/org/apache/spark/sql/DataFrameSetOperationsSuite.scala
##########
@@ -1018,6 +1018,64 @@ class DataFrameSetOperationsSuite extends QueryTest with SharedSparkSession {
     unionDF = df1.unionByName(df2)
     checkAnswer(unionDF, expected)
   }
+
+  test("SPARK-36673: Only merge nullability for Unions of struct") {
+    val df1 = spark.range(2).withColumn("nested", struct(expr("id * 5 AS INNER")))
+    val df2 = spark.range(2).withColumn("nested", struct(expr("id * 5 AS inner")))
+
+    val union1 = df1.union(df2)
+    val union2 = df1.unionByName(df2)
+
+    val schema = StructType(Seq(StructField("id", LongType, false),
+      StructField("nested", StructType(Seq(StructField("INNER", LongType, false))), false)))
+
+    Seq(union1, union2).foreach { df =>
+      assert(df.schema == schema)
+      assert(df.queryExecution.optimizedPlan.schema == schema)
+      assert(df.queryExecution.executedPlan.schema == schema)
+
+      checkAnswer(df, Row(0, Row(0)) :: Row(1, Row(5)) :: Row(0, Row(0)) :: Row(1, Row(5)) :: Nil)
+      checkAnswer(df.select("nested.*"), Row(0) :: Row(5) :: Row(0) :: Row(5) :: Nil)
+    }
+  }
+
+  test("SPARK-36673: Only merge nullability for unionByName of struct") {

Review comment:
       Oh, yea, I combined them together but forgot to remove "SPARK-36673: Only merge nullability for unionByName of struct". Let me open a followup to remove it. Thanks.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] viirya commented on a change in pull request #34025: [SPARK-36673][SQL] Fix incorrect schema of nested types of union

Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #34025:
URL: https://github.com/apache/spark/pull/34025#discussion_r710539301



##########
File path: sql/core/src/test/scala/org/apache/spark/sql/DataFrameSetOperationsSuite.scala
##########
@@ -1018,6 +1018,23 @@ class DataFrameSetOperationsSuite extends QueryTest with SharedSparkSession {
     unionDF = df1.unionByName(df2)
     checkAnswer(unionDF, expected)
   }
+
+  test("SPARK-36673: Incorrect Unions of struct") {
+    val df1 = spark.range(2).withColumn("nested", struct(expr("id * 5 AS INNER")))
+    val df2 = spark.range(2).withColumn("nested", struct(expr("id * 5 AS inner")))

Review comment:
       That's said, for `union`, we don't actually care about case sensitivity as by definition we resolve column by position.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #34025: [SPARK-36673][SQL] Fix incorrect schema of nested types of union

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34025:
URL: https://github.com/apache/spark/pull/34025#issuecomment-921201477


   **[Test build #143363 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143363/testReport)** for PR 34025 at commit [`924b413`](https://github.com/apache/spark/commit/924b4130244ec73df200dc6f3b9509dd8cafba71).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] viirya commented on pull request #34025: [SPARK-36673][SQL] Fix incorrect schema of nested types of union

Posted by GitBox <gi...@apache.org>.
viirya commented on pull request #34025:
URL: https://github.com/apache/spark/pull/34025#issuecomment-921199105


   cc @cloud-fan 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on a change in pull request #34025: [SPARK-36673][SQL] Fix incorrect schema of nested types of union

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #34025:
URL: https://github.com/apache/spark/pull/34025#discussion_r710681748



##########
File path: sql/core/src/test/scala/org/apache/spark/sql/DataFrameSetOperationsSuite.scala
##########
@@ -1018,6 +1018,64 @@ class DataFrameSetOperationsSuite extends QueryTest with SharedSparkSession {
     unionDF = df1.unionByName(df2)
     checkAnswer(unionDF, expected)
   }
+
+  test("SPARK-36673: Only merge nullability for Unions of struct") {
+    val df1 = spark.range(2).withColumn("nested", struct(expr("id * 5 AS INNER")))
+    val df2 = spark.range(2).withColumn("nested", struct(expr("id * 5 AS inner")))
+
+    val union1 = df1.union(df2)
+    val union2 = df1.unionByName(df2)
+
+    val schema = StructType(Seq(StructField("id", LongType, false),
+      StructField("nested", StructType(Seq(StructField("INNER", LongType, false))), false)))
+
+    Seq(union1, union2).foreach { df =>
+      assert(df.schema == schema)
+      assert(df.queryExecution.optimizedPlan.schema == schema)
+      assert(df.queryExecution.executedPlan.schema == schema)
+
+      checkAnswer(df, Row(0, Row(0)) :: Row(1, Row(5)) :: Row(0, Row(0)) :: Row(1, Row(5)) :: Nil)
+      checkAnswer(df.select("nested.*"), Row(0) :: Row(5) :: Row(0) :: Row(5) :: Nil)
+    }
+  }
+
+  test("SPARK-36673: Only merge nullability for unionByName of struct") {
+    val df1 = spark.range(2).withColumn("nested", struct(expr("id * 5 AS INNER")))
+    val df2 = spark.range(2).withColumn("nested", struct(expr("id * 5 AS inner")))
+
+    val df = df1.unionByName(df2)
+
+    val schema = StructType(Seq(StructField("id", LongType, false),
+      StructField("nested", StructType(Seq(StructField("INNER", LongType, false))), false)))
+
+    assert(df.schema == schema)
+    assert(df.queryExecution.optimizedPlan.schema == schema)
+    assert(df.queryExecution.executedPlan.schema == schema)
+
+    checkAnswer(df, Row(0, Row(0)) :: Row(1, Row(5)) :: Row(0, Row(0)) :: Row(1, Row(5)) :: Nil)
+    checkAnswer(df.select("nested.*"), Row(0) :: Row(5) :: Row(0) :: Row(5) :: Nil)
+  }
+
+  test("SPARK-36673: Union of structs with different orders") {
+    val df1 = spark.range(2).withColumn("nested",
+      struct(expr("id * 5 AS inner1"), struct(expr("id * 10 AS inner2"))))
+    val df2 = spark.range(2).withColumn("nested",
+      struct(expr("id * 5 AS inner2"), struct(expr("id * 10 AS inner1"))))

Review comment:
       I think we should think more about how union by ordinal should work.
   
   if we have 2 tables t1: [inner1, inner2] and t2: [inner2, inner1], we can union them. But if they are inner fields, we can't union. AFAIK we are trying to make nested columns the first-class citizen and this inner-field-only limitation doesn't make much sense.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #34025: [SPARK-36673][SQL] Fix incorrect schema of nested types of union

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34025:
URL: https://github.com/apache/spark/pull/34025#issuecomment-921604885


   **[Test build #143386 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143386/testReport)** for PR 34025 at commit [`ad4ec57`](https://github.com/apache/spark/commit/ad4ec57867beddc3e45bb739e26c4f87f2385ee4).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on a change in pull request #34025: [SPARK-36673][SQL] Fix incorrect schema of nested types of union

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on a change in pull request #34025:
URL: https://github.com/apache/spark/pull/34025#discussion_r710442316



##########
File path: sql/core/src/test/scala/org/apache/spark/sql/DataFrameSetOperationsSuite.scala
##########
@@ -1018,6 +1018,23 @@ class DataFrameSetOperationsSuite extends QueryTest with SharedSparkSession {
     unionDF = df1.unionByName(df2)
     checkAnswer(unionDF, expected)
   }
+
+  test("SPARK-36673: Incorrect Unions of struct") {

Review comment:
       Could you be more specific if this aims to `nullability` issue?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #34025: [SPARK-36673][SQL] Fix incorrect schema of nested types of union

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34025:
URL: https://github.com/apache/spark/pull/34025#issuecomment-921231350


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47870/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #34025: [SPARK-36673][SQL] Fix incorrect schema of nested types of union

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34025:
URL: https://github.com/apache/spark/pull/34025#issuecomment-921423017






-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on pull request #34025: [SPARK-36673][SQL] Fix incorrect schema of nested types of union

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on pull request #34025:
URL: https://github.com/apache/spark/pull/34025#issuecomment-921803733


   thanks, merging to master/3.2!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] viirya commented on a change in pull request #34025: [SPARK-36673][SQL] Fix incorrect schema of nested types of union

Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #34025:
URL: https://github.com/apache/spark/pull/34025#discussion_r711180321



##########
File path: sql/core/src/test/scala/org/apache/spark/sql/DataFrameSetOperationsSuite.scala
##########
@@ -1018,6 +1018,64 @@ class DataFrameSetOperationsSuite extends QueryTest with SharedSparkSession {
     unionDF = df1.unionByName(df2)
     checkAnswer(unionDF, expected)
   }
+
+  test("SPARK-36673: Only merge nullability for Unions of struct") {
+    val df1 = spark.range(2).withColumn("nested", struct(expr("id * 5 AS INNER")))
+    val df2 = spark.range(2).withColumn("nested", struct(expr("id * 5 AS inner")))
+
+    val union1 = df1.union(df2)
+    val union2 = df1.unionByName(df2)
+
+    val schema = StructType(Seq(StructField("id", LongType, false),
+      StructField("nested", StructType(Seq(StructField("INNER", LongType, false))), false)))
+
+    Seq(union1, union2).foreach { df =>
+      assert(df.schema == schema)
+      assert(df.queryExecution.optimizedPlan.schema == schema)
+      assert(df.queryExecution.executedPlan.schema == schema)
+
+      checkAnswer(df, Row(0, Row(0)) :: Row(1, Row(5)) :: Row(0, Row(0)) :: Row(1, Row(5)) :: Nil)
+      checkAnswer(df.select("nested.*"), Row(0) :: Row(5) :: Row(0) :: Row(5) :: Nil)
+    }
+  }
+
+  test("SPARK-36673: Only merge nullability for unionByName of struct") {

Review comment:
       Oh, yea, I combined them together but forgot to remove `SPARK-36673: Only merge nullability for unionByName of struct`. Let me open a followup to remove it. Thanks.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] xkrogen commented on a change in pull request #34025: [SPARK-36673][SQL] Fix incorrect schema of nested types of union

Posted by GitBox <gi...@apache.org>.
xkrogen commented on a change in pull request #34025:
URL: https://github.com/apache/spark/pull/34025#discussion_r711156296



##########
File path: sql/core/src/test/scala/org/apache/spark/sql/DataFrameSetOperationsSuite.scala
##########
@@ -1018,6 +1018,64 @@ class DataFrameSetOperationsSuite extends QueryTest with SharedSparkSession {
     unionDF = df1.unionByName(df2)
     checkAnswer(unionDF, expected)
   }
+
+  test("SPARK-36673: Only merge nullability for Unions of struct") {
+    val df1 = spark.range(2).withColumn("nested", struct(expr("id * 5 AS INNER")))
+    val df2 = spark.range(2).withColumn("nested", struct(expr("id * 5 AS inner")))
+
+    val union1 = df1.union(df2)
+    val union2 = df1.unionByName(df2)
+
+    val schema = StructType(Seq(StructField("id", LongType, false),
+      StructField("nested", StructType(Seq(StructField("INNER", LongType, false))), false)))
+
+    Seq(union1, union2).foreach { df =>
+      assert(df.schema == schema)
+      assert(df.queryExecution.optimizedPlan.schema == schema)
+      assert(df.queryExecution.executedPlan.schema == schema)
+
+      checkAnswer(df, Row(0, Row(0)) :: Row(1, Row(5)) :: Row(0, Row(0)) :: Row(1, Row(5)) :: Nil)
+      checkAnswer(df.select("nested.*"), Row(0) :: Row(5) :: Row(0) :: Row(5) :: Nil)
+    }
+  }
+
+  test("SPARK-36673: Only merge nullability for unionByName of struct") {

Review comment:
       Isn't this test a duplicate of `SPARK-36673: Only merge nullability for Unions of struct` ? In that test, we do both `union` and `unionByName`. It seems here we have the exact same test scenario, just limited to `unionByName`, which is already covered.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #34025: [SPARK-36673][SQL] Fix incorrect schema of nested types of union

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #34025:
URL: https://github.com/apache/spark/pull/34025#issuecomment-921624065


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/143386/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #34025: [SPARK-36673][SQL] Fix incorrect schema of nested types of union

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #34025:
URL: https://github.com/apache/spark/pull/34025#issuecomment-921599635


   **[Test build #143403 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143403/testReport)** for PR 34025 at commit [`9bed2e6`](https://github.com/apache/spark/commit/9bed2e665a9f7e72c7abfb7277bbd631397dce39).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #34025: [SPARK-36673][SQL] Fix incorrect schema of nested types of union

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34025:
URL: https://github.com/apache/spark/pull/34025#issuecomment-921650076


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47911/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org