You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by "stefankandic (via GitHub)" <gi...@apache.org> on 2024/02/14 17:01:00 UTC

[PR] [SPARK-47015][Collation] Disable partitioning on collated columns [spark]

stefankandic opened a new pull request, #45104:
URL: https://github.com/apache/spark/pull/45104

   <!--
   Thanks for sending a pull request!  Here are some tips for you:
     1. If this is your first time, please read our contributor guidelines: https://spark.apache.org/contributing.html
     2. Ensure you have added or run the appropriate tests for your PR: https://spark.apache.org/developer-tools.html
     3. If the PR is unfinished, add '[WIP]' in your PR title, e.g., '[WIP][SPARK-XXXX] Your PR title ...'.
     4. Be sure to keep the PR description updated to reflect all changes.
     5. Please write your PR title to summarize what this PR proposes.
     6. If possible, provide a concise example to reproduce the issue for a faster review.
     7. If you want to add a new configuration, please read the guideline first for naming configurations in
        'core/src/main/scala/org/apache/spark/internal/config/ConfigEntry.scala'.
     8. If you want to add or modify an error type or message, please read the guideline first in
        'common/utils/src/main/resources/error/README.md'.
   -->
   
   ### What changes were proposed in this pull request?
   <!--
   Please clarify what changes you are proposing. The purpose of this section is to outline the changes and how this PR fixes the issue. 
   If possible, please consider writing useful notes for better and faster reviews in your PR. See the examples below.
     1. If you refactor some codes with changing classes, showing the class hierarchy will help reviewers.
     2. If you fix some SQL features, you can provide some references of other DBMSes.
     3. If there is design documentation, please add the link.
     4. If there is a discussion in the mailing list, please add the link.
   -->
   Disable hive style partitioning on columns that are non default collated
   
   ### Why are the changes needed?
   <!--
   Please clarify why the changes are needed. For instance,
     1. If you propose a new API, clarify the use case for a new API.
     2. If you fix a bug, you can clarify why it is a bug.
   -->
   With current implementation partitioning on columns that have either accent or case insensitive collation would lead to incorrect results.
   
   ### Does this PR introduce _any_ user-facing change?
   <!--
   Note that it means *any* user-facing change including all aspects such as the documentation fix.
   If yes, please clarify the previous behavior and the change this PR proposes - provide the console output, description and/or an example to show the behavior difference if possible.
   If possible, please also clarify if this is a user-facing change compared to the released Spark versions or within the unreleased branches such as master.
   If no, write 'No'.
   -->
   Only compared to the latest master version
   
   ### How was this patch tested?
   <!--
   If tests were added, say they were added here. Please make sure to add some test cases that check the changes thoroughly including negative and positive cases if possible.
   If it was tested in a way different from regular unit tests, please clarify how you tested step by step, ideally copy and paste-able, so that other reviewers can test and check, and descendants can verify in the future.
   If tests were not added, please describe why they were not added and/or why it was difficult to add.
   If benchmark tests were added, please run the benchmarks in GitHub Actions for the consistent environment, and the instructions could accord to: https://spark.apache.org/developer-tools.html#github-workflow-benchmarks.
   -->
   With new UTs
   
   ### Was this patch authored or co-authored using generative AI tooling?
   <!--
   If generative AI tooling has been used in the process of authoring this patch, please include the
   phrase: 'Generated-by: ' followed by the name of the tool and its version.
   If no, write 'No'.
   Please refer to the [ASF Generative Tooling Guidance](https://www.apache.org/legal/generative-tooling.html) for details.
   -->
   No


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47015][SQL] Disable partitioning on collated columns [spark]

Posted by "stefankandic (via GitHub)" <gi...@apache.org>.
stefankandic commented on PR #45104:
URL: https://github.com/apache/spark/pull/45104#issuecomment-1952875446

   > how about bucket columns? We generate the bucket id from the string value and assume all the semantically-same string values should generate the same bucket id, which isn't true for string with collation.
   
   @mihailom-db this seems like a fairly straightforward task so can you take a look at it when you have the time? Should be not much different than this change


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47015][SQL] Disable partitioning on collated columns [spark]

Posted by "stefankandic (via GitHub)" <gi...@apache.org>.
stefankandic commented on code in PR #45104:
URL: https://github.com/apache/spark/pull/45104#discussion_r1507789713


##########
common/utils/src/main/resources/error/error-classes.json:
##########
@@ -5142,7 +5143,7 @@
   },
   "_LEGACY_ERROR_TEMP_1183" : {
     "message" : [
-      "Cannot use interval type in the table schema."
+      "Cannot use \"INTERVAL\" type in the table schema."

Review Comment:
   it isn't, there is a test `SQLInsertTestSuite.scala` checking for both cases where interval is used in a schema or in a partition column, so when I updated the error class for partitioning to use sql type it broke the test for checking for interval type in the schema



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47015][SQL] Disable partitioning on collated columns [spark]

Posted by "stefankandic (via GitHub)" <gi...@apache.org>.
stefankandic commented on code in PR #45104:
URL: https://github.com/apache/spark/pull/45104#discussion_r1503072440


##########
sql/core/src/test/scala/org/apache/spark/sql/CollationSuite.scala:
##########
@@ -174,4 +174,36 @@ class CollationSuite extends QueryTest with SharedSparkSession {
           Row(expected))
     }
   }
+
+
+  test("disable partition on collated string column") {
+    def createTable(partitionColumns: String*): Unit = {
+      val tableName = "test_partition_tbl"
+      withTable(tableName) {
+        sql(
+          s"""
+             |CREATE TABLE $tableName

Review Comment:
   it seems that we don't even check if partitioning is allowed or not for v2 table creation ([code](https://github.com/apache/spark/blob/a939a7d0fd9c6b23c879cbee05275c6fbc939e38/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/rules.scala#L251)), is this a bug?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47015][SQL] Disable partitioning on collated columns [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #45104:
URL: https://github.com/apache/spark/pull/45104#discussion_r1504419139


##########
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningUtils.scala:
##########
@@ -574,6 +573,16 @@ object PartitioningUtils extends SQLConfHelper {
     }
   }
 
+  /**
+   * Checks whether a given data type can be used as a partition column.
+   */
+  def canPartitionOn(dateType: DataType): Boolean = dateType match {
+    // non default collated strings should not be used as partition columns
+    // as we cannot implement string collation semantic with directory names
+    case st: StringType => st.isDefaultCollation
+    case other => other.isInstanceOf[AtomicType]

Review Comment:
   we missed the `if !a.isInstanceOf[VariantType]` part



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47015][SQL] Disable partitioning on collated columns [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #45104:
URL: https://github.com/apache/spark/pull/45104#discussion_r1495894928


##########
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningUtils.scala:
##########
@@ -574,6 +573,16 @@ object PartitioningUtils extends SQLConfHelper {
     }
   }
 
+  /**
+   * Checks whether a given data type can be used as a partition column.
+   */
+  def canPartitionOn(dateType: DataType): Boolean = dateType match {
+    // non default collated strings should not be used as partition columns
+    // as it could lead to incorrect results

Review Comment:
   let's be more specific. "... as we cannot implement string collation semantic with directory names."



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47015][SQL] Disable partitioning on collated columns [spark]

Posted by "srielau (via GitHub)" <gi...@apache.org>.
srielau commented on code in PR #45104:
URL: https://github.com/apache/spark/pull/45104#discussion_r1508421890


##########
common/utils/src/main/resources/error/error-classes.json:
##########
@@ -2304,6 +2304,12 @@
     },
     "sqlState" : "22023"
   },
+  "INVALID_PARTITION_COLUMN_DATA_TYPE" : {
+    "message" : [
+      "Cannot use <type> for partition column."
+    ],
+    "sqlState" : "42601"

Review Comment:
   Actually the 0A000 was more correct.
   0A000 means a feature is not supported. It implies that in some future it may be and the user hit a limit rather than just writing something inherently wrong.
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47015][Collation] Disable partitioning on collated columns [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on PR #45104:
URL: https://github.com/apache/spark/pull/45104#issuecomment-1945425646

   how about bucket columns? We generate the bucket id from the string value and assume all the semantically-same string values should generate the same bucket id, which isn't true for string with collation.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47015][SQL] Disable partitioning on collated columns [spark]

Posted by "MaxGekk (via GitHub)" <gi...@apache.org>.
MaxGekk commented on code in PR #45104:
URL: https://github.com/apache/spark/pull/45104#discussion_r1502960414


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryCompilationErrors.scala:
##########
@@ -1742,8 +1742,8 @@ private[sql] object QueryCompilationErrors extends QueryErrorsBase with Compilat
 
   def cannotUseDataTypeForPartitionColumnError(field: StructField): Throwable = {
     new AnalysisException(
-      errorClass = "_LEGACY_ERROR_TEMP_1153",
-      messageParameters = Map("field" -> field.dataType.toString))
+      errorClass = "CANNOT_USE_DATA_TYPE_FOR_PARTITION_COLUMN",
+      messageParameters = Map("type" -> field.dataType.toString))

Review Comment:
   Could you convert `dataType` using `toSQLType`, please.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47015][SQL] Disable partitioning on collated columns [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #45104:
URL: https://github.com/apache/spark/pull/45104#discussion_r1505592354


##########
common/utils/src/main/resources/error/error-classes.json:
##########
@@ -2304,6 +2304,12 @@
     },
     "sqlState" : "22023"
   },
+  "INVALID_PARTITION_COLUMN_DATA_TYPE" : {
+    "message" : [
+      "Cannot use <type> for partition column."
+    ],
+    "sqlState" : "0A000"

Review Comment:
   cc @srielau for the sqlstate.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47015][SQL] Disable partitioning on collated columns [spark]

Posted by "MaxGekk (via GitHub)" <gi...@apache.org>.
MaxGekk commented on code in PR #45104:
URL: https://github.com/apache/spark/pull/45104#discussion_r1507766991


##########
common/utils/src/main/resources/error/error-classes.json:
##########
@@ -5142,7 +5143,7 @@
   },
   "_LEGACY_ERROR_TEMP_1183" : {
     "message" : [
-      "Cannot use interval type in the table schema."
+      "Cannot use \"INTERVAL\" type in the table schema."

Review Comment:
   BTW, it is unrelated changes, it seems.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47015][SQL] Disable partitioning on collated columns [spark]

Posted by "MaxGekk (via GitHub)" <gi...@apache.org>.
MaxGekk commented on code in PR #45104:
URL: https://github.com/apache/spark/pull/45104#discussion_r1507773660


##########
common/utils/src/main/resources/error/error-classes.json:
##########
@@ -5142,7 +5143,7 @@
   },
   "_LEGACY_ERROR_TEMP_1183" : {
     "message" : [
-      "Cannot use interval type in the table schema."
+      "Cannot use \"INTERVAL\" type in the table schema."

Review Comment:
   If it is not related, please, revert this and the modified tests.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47015][SQL] Disable partitioning on collated columns [spark]

Posted by "MaxGekk (via GitHub)" <gi...@apache.org>.
MaxGekk commented on code in PR #45104:
URL: https://github.com/apache/spark/pull/45104#discussion_r1507819005


##########
common/utils/src/main/resources/error/error-classes.json:
##########
@@ -5142,7 +5143,7 @@
   },
   "_LEGACY_ERROR_TEMP_1183" : {
     "message" : [
-      "Cannot use interval type in the table schema."
+      "Cannot use \"INTERVAL\" type in the table schema."

Review Comment:
   ok. I see.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47015][SQL] Disable partitioning on collated columns [spark]

Posted by "MaxGekk (via GitHub)" <gi...@apache.org>.
MaxGekk commented on PR #45104:
URL: https://github.com/apache/spark/pull/45104#issuecomment-1971449799

   +1, LGTM. Merging to master.
   Thank you, @stefankandic and @dbatomic @cloud-fan for review.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47015][SQL] Disable partitioning on collated columns [spark]

Posted by "dbatomic (via GitHub)" <gi...@apache.org>.
dbatomic commented on code in PR #45104:
URL: https://github.com/apache/spark/pull/45104#discussion_r1495056341


##########
sql/core/src/test/scala/org/apache/spark/sql/CollationSuite.scala:
##########
@@ -174,4 +174,36 @@ class CollationSuite extends QueryTest with SharedSparkSession {
           Row(expected))
     }
   }
+
+
+  test("disable partition on collated string column") {
+    def createTable(partitionColumns: String*): Unit = {
+      val tableName = "test_partition_tbl"
+      withTable(tableName) {
+        sql(
+          s"""
+             |CREATE TABLE $tableName
+             |(id INT, c1 STRING COLLATE 'UNICODE', c2 string)
+             |USING parquet
+             |PARTITIONED BY (${partitionColumns.mkString(",")})
+             |""".stripMargin)
+      }
+
+      // should work fine on non collated columns

Review Comment:
   Nit - on default collated columns :)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47015][SQL] Disable partitioning on collated columns [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #45104:
URL: https://github.com/apache/spark/pull/45104#discussion_r1495893377


##########
common/utils/src/main/resources/error/error-classes.json:
##########
@@ -388,6 +388,11 @@
     ],
     "sqlState" : "42846"
   },
+  "CANNOT_USE_DATA_TYPE_FOR_PARTITION_COLUMN" : {

Review Comment:
   INVALID_PARTITION_COLUMN_DATA_TYPE?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47015][SQL] Disable partitioning on collated columns [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #45104:
URL: https://github.com/apache/spark/pull/45104#discussion_r1504421790


##########
sql/core/src/test/scala/org/apache/spark/sql/CollationSuite.scala:
##########
@@ -174,4 +174,36 @@ class CollationSuite extends QueryTest with SharedSparkSession {
           Row(expected))
     }
   }
+
+
+  test("disable partition on collated string column") {
+    def createTable(partitionColumns: String*): Unit = {
+      val tableName = "test_partition_tbl"
+      withTable(tableName) {
+        sql(
+          s"""
+             |CREATE TABLE $tableName

Review Comment:
   nvm, v2 table does not use directory names to implement partitioning. We don't need to test it.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47015][SQL] Disable partitioning on collated columns [spark]

Posted by "stefankandic (via GitHub)" <gi...@apache.org>.
stefankandic commented on code in PR #45104:
URL: https://github.com/apache/spark/pull/45104#discussion_r1508671400


##########
common/utils/src/main/resources/error/error-classes.json:
##########
@@ -2304,6 +2304,12 @@
     },
     "sqlState" : "22023"
   },
+  "INVALID_PARTITION_COLUMN_DATA_TYPE" : {
+    "message" : [
+      "Cannot use <type> for partition column."
+    ],
+    "sqlState" : "42601"

Review Comment:
   thanks for the response! will fix it in a separate PR since this one is already closed



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47015][SQL] Disable partitioning on collated columns [spark]

Posted by "dbatomic (via GitHub)" <gi...@apache.org>.
dbatomic commented on code in PR #45104:
URL: https://github.com/apache/spark/pull/45104#discussion_r1508901070


##########
common/utils/src/main/resources/error/error-classes.json:
##########
@@ -2304,6 +2304,12 @@
     },
     "sqlState" : "22023"
   },
+  "INVALID_PARTITION_COLUMN_DATA_TYPE" : {
+    "message" : [
+      "Cannot use <type> for partition column."
+    ],
+    "sqlState" : "42601"

Review Comment:
   Also let's stay consistent with error messages. E.g. in [string functions PR ](https://github.com/apache/spark/pull/45216) @uros-db is also using `0A000` for as `sqlState` so let us align there.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47015][Collation] Disable partitioning on collated columns [spark]

Posted by "MaxGekk (via GitHub)" <gi...@apache.org>.
MaxGekk commented on code in PR #45104:
URL: https://github.com/apache/spark/pull/45104#discussion_r1489873175


##########
sql/core/src/test/scala/org/apache/spark/sql/CollationSuite.scala:
##########
@@ -174,4 +174,36 @@ class CollationSuite extends QueryTest with SharedSparkSession {
           Row(expected))
     }
   }
+
+
+  test("disable partition on collated string column") {
+    def createTable(partitionColumns: String*): Unit = {
+      val tableName = "test_partition_tbl"
+      withTable(tableName) {
+        sql(
+          s"""
+             |CREATE TABLE $tableName
+             |(id INT, c1 STRING COLLATE 'UNICODE', c2 string)
+             |USING parquet
+             |PARTITIONED BY (${partitionColumns.mkString(",")})
+             |""".stripMargin)
+      }
+
+      // should work fine on non collated columns
+      createTable("id")
+      createTable("c2")
+      createTable("id", "c2")
+
+      Seq(Seq("c1"), Seq("c1", "id"), Seq("c1", "c2")).foreach { partitionColumns =>
+        checkError(
+          exception = intercept[AnalysisException] {
+            createTable(partitionColumns: _*)
+          },
+          errorClass = "_LEGACY_ERROR_TEMP_1153",

Review Comment:
   As you are here, could you assign proper name for the error class and improve error message format (if it is needed).



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47015][SQL] Disable partitioning on collated columns [spark]

Posted by "MaxGekk (via GitHub)" <gi...@apache.org>.
MaxGekk commented on code in PR #45104:
URL: https://github.com/apache/spark/pull/45104#discussion_r1507766991


##########
common/utils/src/main/resources/error/error-classes.json:
##########
@@ -5142,7 +5143,7 @@
   },
   "_LEGACY_ERROR_TEMP_1183" : {
     "message" : [
-      "Cannot use interval type in the table schema."
+      "Cannot use \"INTERVAL\" type in the table schema."

Review Comment:
   BTW, it is an unrelated change, it seems.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47015][SQL] Disable partitioning on collated columns [spark]

Posted by "MaxGekk (via GitHub)" <gi...@apache.org>.
MaxGekk closed pull request #45104: [SPARK-47015][SQL] Disable partitioning on collated columns
URL: https://github.com/apache/spark/pull/45104


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47015][SQL] Disable partitioning on collated columns [spark]

Posted by "stefankandic (via GitHub)" <gi...@apache.org>.
stefankandic commented on code in PR #45104:
URL: https://github.com/apache/spark/pull/45104#discussion_r1504436255


##########
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningUtils.scala:
##########
@@ -574,6 +573,16 @@ object PartitioningUtils extends SQLConfHelper {
     }
   }
 
+  /**
+   * Checks whether a given data type can be used as a partition column.
+   */
+  def canPartitionOn(dateType: DataType): Boolean = dateType match {
+    // non default collated strings should not be used as partition columns
+    // as we cannot implement string collation semantic with directory names
+    case st: StringType => st.isDefaultCollation
+    case other => other.isInstanceOf[AtomicType]

Review Comment:
   thanks!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47015][SQL] Disable partitioning on collated columns [spark]

Posted by "dbatomic (via GitHub)" <gi...@apache.org>.
dbatomic commented on code in PR #45104:
URL: https://github.com/apache/spark/pull/45104#discussion_r1495055910


##########
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningUtils.scala:
##########
@@ -574,6 +573,14 @@ object PartitioningUtils extends SQLConfHelper {
     }
   }
 
+  /**
+   * Checks whether a given data type can be used as a partition column.
+   */
+  def canPartitionOn(dateType: DataType): Boolean = dateType match {
+    case st: StringType => st.isDefaultCollation

Review Comment:
   Hey, I know that Spark codebase is not big on comments, but maybe we should add a comment why partitioning is disabled + jira ticket to enable the partitioning :) ?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47015][SQL] Disable partitioning on collated columns [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #45104:
URL: https://github.com/apache/spark/pull/45104#discussion_r1495897055


##########
sql/core/src/test/scala/org/apache/spark/sql/CollationSuite.scala:
##########
@@ -174,4 +174,36 @@ class CollationSuite extends QueryTest with SharedSparkSession {
           Row(expected))
     }
   }
+
+
+  test("disable partition on collated string column") {
+    def createTable(partitionColumns: String*): Unit = {
+      val tableName = "test_partition_tbl"
+      withTable(tableName) {
+        sql(
+          s"""
+             |CREATE TABLE $tableName

Review Comment:
   can we also test v2 tables? See `DataSourceV2SQLSuite` about how we create and test v2 tables using v2 catalogs.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47015][SQL] Disable partitioning on collated columns [spark]

Posted by "stefankandic (via GitHub)" <gi...@apache.org>.
stefankandic commented on code in PR #45104:
URL: https://github.com/apache/spark/pull/45104#discussion_r1494460485


##########
sql/core/src/test/scala/org/apache/spark/sql/CollationSuite.scala:
##########
@@ -174,4 +174,36 @@ class CollationSuite extends QueryTest with SharedSparkSession {
           Row(expected))
     }
   }
+
+
+  test("disable partition on collated string column") {
+    def createTable(partitionColumns: String*): Unit = {
+      val tableName = "test_partition_tbl"
+      withTable(tableName) {
+        sql(
+          s"""
+             |CREATE TABLE $tableName
+             |(id INT, c1 STRING COLLATE 'UNICODE', c2 string)
+             |USING parquet
+             |PARTITIONED BY (${partitionColumns.mkString(",")})
+             |""".stripMargin)
+      }
+
+      // should work fine on non collated columns
+      createTable("id")
+      createTable("c2")
+      createTable("id", "c2")
+
+      Seq(Seq("c1"), Seq("c1", "id"), Seq("c1", "c2")).foreach { partitionColumns =>
+        checkError(
+          exception = intercept[AnalysisException] {
+            createTable(partitionColumns: _*)
+          },
+          errorClass = "_LEGACY_ERROR_TEMP_1153",

Review Comment:
   sure



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47015][SQL] Disable partitioning on collated columns [spark]

Posted by "stefankandic (via GitHub)" <gi...@apache.org>.
stefankandic commented on code in PR #45104:
URL: https://github.com/apache/spark/pull/45104#discussion_r1495832184


##########
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningUtils.scala:
##########
@@ -574,6 +573,14 @@ object PartitioningUtils extends SQLConfHelper {
     }
   }
 
+  /**
+   * Checks whether a given data type can be used as a partition column.
+   */
+  def canPartitionOn(dateType: DataType): Boolean = dateType match {
+    case st: StringType => st.isDefaultCollation

Review Comment:
   added the comment but not sure what you meant regarding the jira ticket to enable the partitioning - there is no plan to do this AFAIK and also I don't see how it would even be possible unless we revamp the entire partitioning logic - ie. not write the actual value to the file system



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47015][SQL] Disable partitioning on collated columns [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #45104:
URL: https://github.com/apache/spark/pull/45104#discussion_r1505593595


##########
common/utils/src/main/resources/error/error-classes.json:
##########
@@ -2304,6 +2304,12 @@
     },
     "sqlState" : "22023"
   },
+  "INVALID_PARTITION_COLUMN_DATA_TYPE" : {
+    "message" : [
+      "Cannot use <type> for partition column."
+    ],
+    "sqlState" : "0A000"

Review Comment:
   it seems many other INVALID_XXX errors pick `42` as the start of the sqlstate.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org