You are viewing a plain text version of this content. The canonical link for it is here.

Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2020/07/25 09:50:30 UTC

[GitHub] [spark] MaxGekk opened a new pull request #29234: [SPARK-32431][SQL][TESTS] Check of consistent error for nested and top-level duplicate columns

MaxGekk opened a new pull request #29234:
URL: https://github.com/apache/spark/pull/29234


   ### What changes were proposed in this pull request?
   Add tests to check that errors for nested and top-level duplicate columns are the same for JSON, Avro, Parquet and ORC datasources.
   
   
   ### Why are the changes needed?
   To improve test coverage.
   
   ### Does this PR introduce _any_ user-facing change?
   No
   
   ### How was this patch tested?
   Run modified test suites:
   ```
   $ build/sbt "sql/test:testOnly org.apache.spark.sql.FileBasedDataSourceSuite"
   $ build/sbt "avro/test:testOnly org.apache.spark.sql.avro.*"
   ```


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins commented on pull request #29234: [SPARK-32431][SQL] Check duplicate nested columns in read from in-built datasources

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #29234:
URL: https://github.com/apache/spark/pull/29234#issuecomment-665520954






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on pull request #29234: [SPARK-32431][SQL] Check duplicate nested columns in read from in-built datasources

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #29234:
URL: https://github.com/apache/spark/pull/29234#issuecomment-665941972






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on pull request #29234: [SPARK-32431][SQL][TESTS] Check of consistent error for nested and top-level duplicate columns

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #29234:
URL: https://github.com/apache/spark/pull/29234#issuecomment-663862336






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] MaxGekk commented on pull request #29234: [SPARK-32431][SQL] Check duplicate nested columns in read from in-built datasources

Posted by GitBox <gi...@apache.org>.

MaxGekk commented on pull request #29234:
URL: https://github.com/apache/spark/pull/29234#issuecomment-666132055


   @HyukjinKwon @cloud-fan Could you review this PR.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on pull request #29234: [SPARK-32431][SQL] Check duplicate nested columns in read from in-built datasources

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #29234:
URL: https://github.com/apache/spark/pull/29234#issuecomment-665520179






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins commented on pull request #29234: [SPARK-32431][SQL][TESTS] Check of consistent error for nested and top-level duplicate columns

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #29234:
URL: https://github.com/apache/spark/pull/29234#issuecomment-663862336






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on pull request #29234: [SPARK-32431][SQL][TESTS] Check of consistent error for nested and top-level duplicate columns

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #29234:
URL: https://github.com/apache/spark/pull/29234#issuecomment-663862149


   **[Test build #126539 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/126539/testReport)** for PR 29234 at commit [`cc971f4`](https://github.com/apache/spark/commit/cc971f42df4ca2f5e7c7466303940dfab943aba9).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins commented on pull request #29234: [SPARK-32431][SQL] Check duplicate nested columns in read from in-built datasources

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #29234:
URL: https://github.com/apache/spark/pull/29234#issuecomment-665942755






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA removed a comment on pull request #29234: [SPARK-32431][SQL][TESTS] Check of consistent error for nested and top-level duplicate columns

Posted by GitBox <gi...@apache.org>.

SparkQA removed a comment on pull request #29234:
URL: https://github.com/apache/spark/pull/29234#issuecomment-663836149


   **[Test build #126539 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/126539/testReport)** for PR 29234 at commit [`cc971f4`](https://github.com/apache/spark/commit/cc971f42df4ca2f5e7c7466303940dfab943aba9).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins commented on pull request #29234: [SPARK-32431][SQL][TESTS] Check of consistent error for nested and top-level duplicate columns

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #29234:
URL: https://github.com/apache/spark/pull/29234#issuecomment-663836223






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] HyukjinKwon commented on a change in pull request #29234: [SPARK-32431][SQL] Check duplicate nested columns in read from in-built datasources

Posted by GitBox <gi...@apache.org>.

HyukjinKwon commented on a change in pull request #29234:
URL: https://github.com/apache/spark/pull/29234#discussion_r461401131



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/util/SchemaUtils.scala
##########
@@ -42,7 +44,27 @@ private[spark] object SchemaUtils {
    */
   def checkSchemaColumnNameDuplication(
       schema: StructType, colType: String, caseSensitiveAnalysis: Boolean = false): Unit = {
-    checkColumnNameDuplication(schema.map(_.name), colType, caseSensitiveAnalysis)
+    val queue = new Queue[StructType]()
+    queue.enqueue(schema)
+    do {
+      val struct = queue.dequeue()
+      checkColumnNameDuplication(struct.map(_.name), colType, caseSensitiveAnalysis)
+      val nestedStructs = struct.map(_.dataType).collect { case st: StructType => st }
+      queue.enqueue(nestedStructs: _*)
+    } while (queue.nonEmpty)

Review comment:
       Should we just do this in a recursive way? I think the performance here doesn't matter, and we usually does that.

##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/util/SchemaUtils.scala
##########
@@ -42,7 +44,27 @@ private[spark] object SchemaUtils {
    */
   def checkSchemaColumnNameDuplication(
       schema: StructType, colType: String, caseSensitiveAnalysis: Boolean = false): Unit = {
-    checkColumnNameDuplication(schema.map(_.name), colType, caseSensitiveAnalysis)
+    val queue = new Queue[StructType]()
+    queue.enqueue(schema)
+    do {
+      val struct = queue.dequeue()
+      checkColumnNameDuplication(struct.map(_.name), colType, caseSensitiveAnalysis)
+      val nestedStructs = struct.map(_.dataType).collect { case st: StructType => st }
+      queue.enqueue(nestedStructs: _*)
+    } while (queue.nonEmpty)

Review comment:
       Should we just do this in a recursive way? I think the performance here doesn't matter, and we usually do that.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] cloud-fan commented on a change in pull request #29234: [SPARK-32431][SQL] Check duplicate nested columns in read from in-built datasources

Posted by GitBox <gi...@apache.org>.

cloud-fan commented on a change in pull request #29234:
URL: https://github.com/apache/spark/pull/29234#discussion_r461445888



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/util/SchemaUtils.scala
##########
@@ -42,7 +44,27 @@ private[spark] object SchemaUtils {
    */
   def checkSchemaColumnNameDuplication(
       schema: StructType, colType: String, caseSensitiveAnalysis: Boolean = false): Unit = {
-    checkColumnNameDuplication(schema.map(_.name), colType, caseSensitiveAnalysis)
+    val queue = new Queue[StructType]()
+    queue.enqueue(schema)
+    do {
+      val struct = queue.dequeue()
+      checkColumnNameDuplication(struct.map(_.name), colType, caseSensitiveAnalysis)
+      val nestedStructs = struct.map(_.dataType).collect { case st: StructType => st }
+      queue.enqueue(nestedStructs: _*)
+    } while (queue.nonEmpty)

Review comment:
       +1

##########
File path: external/avro/src/test/scala/org/apache/spark/sql/avro/AvroSuite.scala
##########
@@ -1800,6 +1800,44 @@ abstract class AvroSuite extends QueryTest with SharedSparkSession {
       assert(version === SPARK_VERSION_SHORT)
     }
   }
+
+  test("SPARK-32431: consistent error for nested and top-level duplicate columns") {

Review comment:
       is it an existing problem that avro needs to duplicate the file source tests?

##########
File path: sql/core/src/test/scala/org/apache/spark/sql/FileBasedDataSourceSuite.scala
##########
@@ -43,10 +43,56 @@ import org.apache.spark.sql.internal.SQLConf
 import org.apache.spark.sql.test.SharedSparkSession
 import org.apache.spark.sql.types._
 
+trait NestedDataSourceSuiteBase extends QueryTest with SharedSparkSession{
+  protected val nestedDataSources: Seq[String]
+
+  test("SPARK-32431: consistent error for nested and top-level duplicate columns") {
+    Seq(
+      Seq("id AS lowercase", "id + 1 AS camelCase") ->
+        new StructType()
+          .add("LowerCase", LongType)
+          .add("camelcase", LongType)
+          .add("CamelCase", LongType),
+      Seq("NAMED_STRUCT('lowercase', id, 'camelCase', id + 1) AS StructColumn") ->
+        new StructType().add("StructColumn",
+          new StructType()
+            .add("LowerCase", LongType)
+            .add("camelcase", LongType)
+            .add("CamelCase", LongType))
+    ).foreach { case (selectExpr: Seq[String], caseInsensitiveSchema: StructType) =>
+      withSQLConf(SQLConf.CASE_SENSITIVE.key -> "false") {

Review comment:
       shall we test both v1 and v2?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] cloud-fan closed pull request #29234: [SPARK-32431][SQL] Check duplicate nested columns in read from in-built datasources

Posted by GitBox <gi...@apache.org>.

cloud-fan closed pull request #29234:
URL: https://github.com/apache/spark/pull/29234


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] gatorsmile commented on a change in pull request #29234: [SPARK-32431][SQL] Check duplicate nested columns in read from in-built datasources

Posted by GitBox <gi...@apache.org>.

gatorsmile commented on a change in pull request #29234:
URL: https://github.com/apache/spark/pull/29234#discussion_r484593626



##########
File path: docs/sql-migration-guide.md
##########
@@ -29,6 +29,8 @@ license: |
   - In Spark 3.1, SQL UI data adopts the `formatted` mode for the query plan explain results. To restore the behavior before Spark 3.0, you can set `spark.sql.ui.explainMode` to `extended`.
   
   - In Spark 3.1, `from_unixtime`, `unix_timestamp`,`to_unix_timestamp`, `to_timestamp` and `to_date` will fail if the specified datetime pattern is invalid. In Spark 3.0 or earlier, they result `NULL`.
+  
+  - In Spark 3.1, the Parquet, ORC, Avro and JSON datasources throw the exception `org.apache.spark.sql.AnalysisException: Found duplicate column(s) in the data schema` in read if they detect duplicate names in top-level columns as well in nested structures. The datasources take into account the SQL config `spark.sql.caseSensitive` while detecting column name duplicates.

Review comment:
       @MaxGekk Also update this for the changes made in https://github.com/apache/spark/pull/29317?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA removed a comment on pull request #29234: [SPARK-32431][SQL][TESTS] Check of consistent error for nested and top-level duplicate columns

Posted by GitBox <gi...@apache.org>.

SparkQA removed a comment on pull request #29234:
URL: https://github.com/apache/spark/pull/29234#issuecomment-664585913






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on pull request #29234: [SPARK-32431][SQL][TESTS] Check of consistent error for nested and top-level duplicate columns

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #29234:
URL: https://github.com/apache/spark/pull/29234#issuecomment-663836223






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on pull request #29234: [SPARK-32431][SQL] Check duplicate nested columns in read from in-built datasources

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #29234:
URL: https://github.com/apache/spark/pull/29234#issuecomment-665520954






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] MaxGekk commented on a change in pull request #29234: [SPARK-32431][SQL] Check duplicate nested columns in read from in-built datasources

Posted by GitBox <gi...@apache.org>.

MaxGekk commented on a change in pull request #29234:
URL: https://github.com/apache/spark/pull/29234#discussion_r461693321



##########
File path: external/avro/src/test/scala/org/apache/spark/sql/avro/AvroSuite.scala
##########
@@ -1800,6 +1800,44 @@ abstract class AvroSuite extends QueryTest with SharedSparkSession {
       assert(version === SPARK_VERSION_SHORT)
     }
   }
+
+  test("SPARK-32431: consistent error for nested and top-level duplicate columns") {

Review comment:
       I put the common test to `NestedDataSourceSuiteBase`

##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/util/SchemaUtils.scala
##########
@@ -42,7 +44,27 @@ private[spark] object SchemaUtils {
    */
   def checkSchemaColumnNameDuplication(
       schema: StructType, colType: String, caseSensitiveAnalysis: Boolean = false): Unit = {
-    checkColumnNameDuplication(schema.map(_.name), colType, caseSensitiveAnalysis)
+    val queue = new Queue[StructType]()
+    queue.enqueue(schema)
+    do {
+      val struct = queue.dequeue()
+      checkColumnNameDuplication(struct.map(_.name), colType, caseSensitiveAnalysis)
+      val nestedStructs = struct.map(_.dataType).collect { case st: StructType => st }
+      queue.enqueue(nestedStructs: _*)
+    } while (queue.nonEmpty)

Review comment:
       I don't think the recursive functions is good approach. To have tail recursion, we should split the function on 2 parts because the function should call itself at the end:
   ```scala
     def checkSchemaColumnNameDuplication(
         schema: StructType, colType: String, caseSensitiveAnalysis: Boolean = false): Unit = {
       checkSchemaColumnNameDuplication(Seq(schema), colType, caseSensitiveAnalysis)
     }
   
     @tailrec
     private def checkSchemaColumnNameDuplication(
        schemas: Seq[StructType], colType: String, caseSensitiveAnalysis: Boolean): Unit = {
       if (schemas.nonEmpty) {
         val structs = schemas.flatMap { fields =>
           checkColumnNameDuplication(fields.names, colType, caseSensitiveAnalysis)
           fields.collect { case StructField(_, st: StructType, _, _) => st }
         }
         checkSchemaColumnNameDuplication(structs, colType, caseSensitiveAnalysis)
       }
     }
   ```
   
   From my point of view, the code is:
   1. less readable
   2. less optimal
   3. error prone due stack overflow
   
   I would prefer to leave the code as is.

##########
File path: sql/core/src/test/scala/org/apache/spark/sql/FileBasedDataSourceSuite.scala
##########
@@ -43,10 +43,56 @@ import org.apache.spark.sql.internal.SQLConf
 import org.apache.spark.sql.test.SharedSparkSession
 import org.apache.spark.sql.types._
 
+trait NestedDataSourceSuiteBase extends QueryTest with SharedSparkSession{
+  protected val nestedDataSources: Seq[String]
+
+  test("SPARK-32431: consistent error for nested and top-level duplicate columns") {
+    Seq(
+      Seq("id AS lowercase", "id + 1 AS camelCase") ->
+        new StructType()
+          .add("LowerCase", LongType)
+          .add("camelcase", LongType)
+          .add("CamelCase", LongType),
+      Seq("NAMED_STRUCT('lowercase', id, 'camelCase', id + 1) AS StructColumn") ->
+        new StructType().add("StructColumn",
+          new StructType()
+            .add("LowerCase", LongType)
+            .add("camelcase", LongType)
+            .add("CamelCase", LongType))
+    ).foreach { case (selectExpr: Seq[String], caseInsensitiveSchema: StructType) =>
+      withSQLConf(SQLConf.CASE_SENSITIVE.key -> "false") {

Review comment:
       AvroSuite tests both. We could run entire `FileBasedDataSourceSuite` for v1 and v2.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on pull request #29234: [SPARK-32431][SQL][TESTS] Check of consistent error for nested and top-level duplicate columns

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #29234:
URL: https://github.com/apache/spark/pull/29234#issuecomment-663836149


   **[Test build #126539 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/126539/testReport)** for PR 29234 at commit [`cc971f4`](https://github.com/apache/spark/commit/cc971f42df4ca2f5e7c7466303940dfab943aba9).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] MaxGekk commented on a change in pull request #29234: [SPARK-32431][SQL] Check duplicate nested columns in read from in-built datasources

Posted by GitBox <gi...@apache.org>.

MaxGekk commented on a change in pull request #29234:
URL: https://github.com/apache/spark/pull/29234#discussion_r462130640



##########
File path: sql/core/src/test/scala/org/apache/spark/sql/FileBasedDataSourceSuite.scala
##########
@@ -43,10 +43,56 @@ import org.apache.spark.sql.internal.SQLConf
 import org.apache.spark.sql.test.SharedSparkSession
 import org.apache.spark.sql.types._
 
+trait NestedDataSourceSuiteBase extends QueryTest with SharedSparkSession{
+  protected val nestedDataSources: Seq[String]
+
+  test("SPARK-32431: consistent error for nested and top-level duplicate columns") {
+    Seq(
+      Seq("id AS lowercase", "id + 1 AS camelCase") ->
+        new StructType()
+          .add("LowerCase", LongType)
+          .add("camelcase", LongType)
+          .add("CamelCase", LongType),
+      Seq("NAMED_STRUCT('lowercase', id, 'camelCase', id + 1) AS StructColumn") ->
+        new StructType().add("StructColumn",
+          new StructType()
+            .add("LowerCase", LongType)
+            .add("camelcase", LongType)
+            .add("CamelCase", LongType))
+    ).foreach { case (selectExpr: Seq[String], caseInsensitiveSchema: StructType) =>
+      withSQLConf(SQLConf.CASE_SENSITIVE.key -> "false") {

Review comment:
       I checked both v1 and v2, see `NestedDataSourceV1Suite` and `NestedDataSourceV2Suite`

##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/util/SchemaUtils.scala
##########
@@ -42,7 +44,27 @@ private[spark] object SchemaUtils {
    */
   def checkSchemaColumnNameDuplication(
       schema: StructType, colType: String, caseSensitiveAnalysis: Boolean = false): Unit = {
-    checkColumnNameDuplication(schema.map(_.name), colType, caseSensitiveAnalysis)
+    val queue = new Queue[StructType]()
+    queue.enqueue(schema)
+    do {
+      val struct = queue.dequeue()
+      checkColumnNameDuplication(struct.map(_.name), colType, caseSensitiveAnalysis)
+      val nestedStructs = struct.map(_.dataType).collect { case st: StructType => st }
+      queue.enqueue(nestedStructs: _*)
+    } while (queue.nonEmpty)

Review comment:
       I replaced the queue based implementation by recursive one. @HyukjinKwon Please, take a look at it.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] cloud-fan commented on pull request #29234: [SPARK-32431][SQL] Check duplicate nested columns in read from in-built datasources

Posted by GitBox <gi...@apache.org>.

cloud-fan commented on pull request #29234:
URL: https://github.com/apache/spark/pull/29234#issuecomment-666142201


   thanks, merging to master!


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on pull request #29234: [SPARK-32431][SQL][TESTS] Check of consistent error for nested and top-level duplicate columns

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #29234:
URL: https://github.com/apache/spark/pull/29234#issuecomment-664585913






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA removed a comment on pull request #29234: [SPARK-32431][SQL] Check duplicate nested columns in read from in-built datasources

Posted by GitBox <gi...@apache.org>.

SparkQA removed a comment on pull request #29234:
URL: https://github.com/apache/spark/pull/29234#issuecomment-665941972


   **[Test build #126788 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/126788/testReport)** for PR 29234 at commit [`918c77c`](https://github.com/apache/spark/commit/918c77c996d632f76d9594724944550a32ce42a3).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] viirya commented on a change in pull request #29234: [SPARK-32431][SQL] Check duplicate nested columns in read from in-built datasources

Posted by GitBox <gi...@apache.org>.

viirya commented on a change in pull request #29234:
URL: https://github.com/apache/spark/pull/29234#discussion_r462557526



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/util/SchemaUtils.scala
##########
@@ -42,7 +42,27 @@ private[spark] object SchemaUtils {
    */
   def checkSchemaColumnNameDuplication(
       schema: StructType, colType: String, caseSensitiveAnalysis: Boolean = false): Unit = {
-    checkColumnNameDuplication(schema.map(_.name), colType, caseSensitiveAnalysis)
+    val fields = schema.fields
+    checkColumnNameDuplication(fields.map(_.name), colType, caseSensitiveAnalysis)
+    fields.map(_.dataType).foreach {
+      case st: StructType => checkSchemaColumnNameDuplication(st, colType, caseSensitiveAnalysis)
+      case _ =>

Review comment:
       How about `ArrayType(StructType)` and `MapType`?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on pull request #29234: [SPARK-32431][SQL] Check duplicate nested columns in read from in-built datasources

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #29234:
URL: https://github.com/apache/spark/pull/29234#issuecomment-665942755






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA removed a comment on pull request #29234: [SPARK-32431][SQL] Check duplicate nested columns in read from in-built datasources

Posted by GitBox <gi...@apache.org>.

SparkQA removed a comment on pull request #29234:
URL: https://github.com/apache/spark/pull/29234#issuecomment-665520179






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on pull request #29234: [SPARK-32431][SQL][TESTS] Check of consistent error for nested and top-level duplicate columns

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #29234:
URL: https://github.com/apache/spark/pull/29234#issuecomment-664582455






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins commented on pull request #29234: [SPARK-32431][SQL][TESTS] Check of consistent error for nested and top-level duplicate columns

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #29234:
URL: https://github.com/apache/spark/pull/29234#issuecomment-664586447






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org