You are viewing a plain text version of this content. The canonical link for it is here.

Posted to reviews@spark.apache.org by "stefankandic (via GitHub)" <gi...@apache.org> on 2024/02/14 17:01:00 UTC

[PR] [SPARK-47015][Collation] Disable partitioning on collated columns [spark]

stefankandic opened a new pull request, #45104:
URL: https://github.com/apache/spark/pull/45104

### What changes were proposed in this pull request?

Disable hive style partitioning on columns that are non default collated

### Why are the changes needed?

With current implementation partitioning on columns that have either accent or case insensitive collation would lead to incorrect results.

### Does this PR introduce _any_ user-facing change?

Only compared to the latest master version

### How was this patch tested?

With new UTs

### Was this patch authored or co-authored using generative AI tooling?

No

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47015][SQL] Disable partitioning on collated columns [spark]

Posted by "stefankandic (via GitHub)" <gi...@apache.org>.

stefankandic commented on PR #45104:
URL: https://github.com/apache/spark/pull/45104#issuecomment-1952875446

   > how about bucket columns? We generate the bucket id from the string value and assume all the semantically-same string values should generate the same bucket id, which isn't true for string with collation.
   
   @mihailom-db this seems like a fairly straightforward task so can you take a look at it when you have the time? Should be not much different than this change


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47015][SQL] Disable partitioning on collated columns [spark]

Posted by "stefankandic (via GitHub)" <gi...@apache.org>.

stefankandic commented on code in PR #45104:
URL: https://github.com/apache/spark/pull/45104#discussion_r1507789713


##########
common/utils/src/main/resources/error/error-classes.json:
##########
@@ -5142,7 +5143,7 @@
   },
   "_LEGACY_ERROR_TEMP_1183" : {
     "message" : [
-      "Cannot use interval type in the table schema."
+      "Cannot use \"INTERVAL\" type in the table schema."

Review Comment:
   it isn't, there is a test `SQLInsertTestSuite.scala` checking for both cases where interval is used in a schema or in a partition column, so when I updated the error class for partitioning to use sql type it broke the test for checking for interval type in the schema



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47015][SQL] Disable partitioning on collated columns [spark]

Posted by "stefankandic (via GitHub)" <gi...@apache.org>.

stefankandic commented on code in PR #45104:
URL: https://github.com/apache/spark/pull/45104#discussion_r1503072440


##########
sql/core/src/test/scala/org/apache/spark/sql/CollationSuite.scala:
##########
@@ -174,4 +174,36 @@ class CollationSuite extends QueryTest with SharedSparkSession {
           Row(expected))
     }
   }
+
+
+  test("disable partition on collated string column") {
+    def createTable(partitionColumns: String*): Unit = {
+      val tableName = "test_partition_tbl"
+      withTable(tableName) {
+        sql(
+          s"""
+             |CREATE TABLE $tableName

Review Comment:
   it seems that we don't even check if partitioning is allowed or not for v2 table creation ([code](https://github.com/apache/spark/blob/a939a7d0fd9c6b23c879cbee05275c6fbc939e38/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/rules.scala#L251)), is this a bug?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47015][SQL] Disable partitioning on collated columns [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.

cloud-fan commented on code in PR #45104:
URL: https://github.com/apache/spark/pull/45104#discussion_r1504419139


##########
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningUtils.scala:
##########
@@ -574,6 +573,16 @@ object PartitioningUtils extends SQLConfHelper {
     }
   }
 
+  /**
+   * Checks whether a given data type can be used as a partition column.
+   */
+  def canPartitionOn(dateType: DataType): Boolean = dateType match {
+    // non default collated strings should not be used as partition columns
+    // as we cannot implement string collation semantic with directory names
+    case st: StringType => st.isDefaultCollation
+    case other => other.isInstanceOf[AtomicType]

Review Comment:
   we missed the `if !a.isInstanceOf[VariantType]` part



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47015][SQL] Disable partitioning on collated columns [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.

cloud-fan commented on code in PR #45104:
URL: https://github.com/apache/spark/pull/45104#discussion_r1495894928


##########
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningUtils.scala:
##########
@@ -574,6 +573,16 @@ object PartitioningUtils extends SQLConfHelper {
     }
   }
 
+  /**
+   * Checks whether a given data type can be used as a partition column.
+   */
+  def canPartitionOn(dateType: DataType): Boolean = dateType match {
+    // non default collated strings should not be used as partition columns
+    // as it could lead to incorrect results

Review Comment:
   let's be more specific. "... as we cannot implement string collation semantic with directory names."



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47015][SQL] Disable partitioning on collated columns [spark]

Posted by "srielau (via GitHub)" <gi...@apache.org>.

srielau commented on code in PR #45104:
URL: https://github.com/apache/spark/pull/45104#discussion_r1508421890


##########
common/utils/src/main/resources/error/error-classes.json:
##########
@@ -2304,6 +2304,12 @@
     },
     "sqlState" : "22023"
   },
+  "INVALID_PARTITION_COLUMN_DATA_TYPE" : {
+    "message" : [
+      "Cannot use <type> for partition column."
+    ],
+    "sqlState" : "42601"

Review Comment:
   Actually the 0A000 was more correct.
   0A000 means a feature is not supported. It implies that in some future it may be and the user hit a limit rather than just writing something inherently wrong.
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47015][Collation] Disable partitioning on collated columns [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.

cloud-fan commented on PR #45104:
URL: https://github.com/apache/spark/pull/45104#issuecomment-1945425646

   how about bucket columns? We generate the bucket id from the string value and assume all the semantically-same string values should generate the same bucket id, which isn't true for string with collation.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47015][SQL] Disable partitioning on collated columns [spark]

Posted by "MaxGekk (via GitHub)" <gi...@apache.org>.

MaxGekk commented on code in PR #45104:
URL: https://github.com/apache/spark/pull/45104#discussion_r1502960414


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryCompilationErrors.scala:
##########
@@ -1742,8 +1742,8 @@ private[sql] object QueryCompilationErrors extends QueryErrorsBase with Compilat
 
   def cannotUseDataTypeForPartitionColumnError(field: StructField): Throwable = {
     new AnalysisException(
-      errorClass = "_LEGACY_ERROR_TEMP_1153",
-      messageParameters = Map("field" -> field.dataType.toString))
+      errorClass = "CANNOT_USE_DATA_TYPE_FOR_PARTITION_COLUMN",
+      messageParameters = Map("type" -> field.dataType.toString))

Review Comment:
   Could you convert `dataType` using `toSQLType`, please.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47015][SQL] Disable partitioning on collated columns [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.

cloud-fan commented on code in PR #45104:
URL: https://github.com/apache/spark/pull/45104#discussion_r1505592354


##########
common/utils/src/main/resources/error/error-classes.json:
##########
@@ -2304,6 +2304,12 @@
     },
     "sqlState" : "22023"
   },
+  "INVALID_PARTITION_COLUMN_DATA_TYPE" : {
+    "message" : [
+      "Cannot use <type> for partition column."
+    ],
+    "sqlState" : "0A000"

Review Comment:
   cc @srielau for the sqlstate.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47015][SQL] Disable partitioning on collated columns [spark]

Posted by "MaxGekk (via GitHub)" <gi...@apache.org>.

MaxGekk commented on code in PR #45104:
URL: https://github.com/apache/spark/pull/45104#discussion_r1507766991


##########
common/utils/src/main/resources/error/error-classes.json:
##########
@@ -5142,7 +5143,7 @@
   },
   "_LEGACY_ERROR_TEMP_1183" : {
     "message" : [
-      "Cannot use interval type in the table schema."
+      "Cannot use \"INTERVAL\" type in the table schema."

Review Comment:
   BTW, it is unrelated changes, it seems.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47015][SQL] Disable partitioning on collated columns [spark]

Posted by "MaxGekk (via GitHub)" <gi...@apache.org>.

MaxGekk commented on code in PR #45104:
URL: https://github.com/apache/spark/pull/45104#discussion_r1507773660


##########
common/utils/src/main/resources/error/error-classes.json:
##########
@@ -5142,7 +5143,7 @@
   },
   "_LEGACY_ERROR_TEMP_1183" : {
     "message" : [
-      "Cannot use interval type in the table schema."
+      "Cannot use \"INTERVAL\" type in the table schema."

Review Comment:
   If it is not related, please, revert this and the modified tests.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47015][SQL] Disable partitioning on collated columns [spark]

Posted by "MaxGekk (via GitHub)" <gi...@apache.org>.

MaxGekk commented on code in PR #45104:
URL: https://github.com/apache/spark/pull/45104#discussion_r1507819005


##########
common/utils/src/main/resources/error/error-classes.json:
##########
@@ -5142,7 +5143,7 @@
   },
   "_LEGACY_ERROR_TEMP_1183" : {
     "message" : [
-      "Cannot use interval type in the table schema."
+      "Cannot use \"INTERVAL\" type in the table schema."

Review Comment:
   ok. I see.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47015][SQL] Disable partitioning on collated columns [spark]

Posted by "MaxGekk (via GitHub)" <gi...@apache.org>.

MaxGekk commented on PR #45104:
URL: https://github.com/apache/spark/pull/45104#issuecomment-1971449799

   +1, LGTM. Merging to master.
   Thank you, @stefankandic and @dbatomic @cloud-fan for review.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47015][SQL] Disable partitioning on collated columns [spark]

Posted by "dbatomic (via GitHub)" <gi...@apache.org>.

dbatomic commented on code in PR #45104:
URL: https://github.com/apache/spark/pull/45104#discussion_r1495056341


##########
sql/core/src/test/scala/org/apache/spark/sql/CollationSuite.scala:
##########
@@ -174,4 +174,36 @@ class CollationSuite extends QueryTest with SharedSparkSession {
           Row(expected))
     }
   }
+
+
+  test("disable partition on collated string column") {
+    def createTable(partitionColumns: String*): Unit = {
+      val tableName = "test_partition_tbl"
+      withTable(tableName) {
+        sql(
+          s"""
+             |CREATE TABLE $tableName
+             |(id INT, c1 STRING COLLATE 'UNICODE', c2 string)
+             |USING parquet
+             |PARTITIONED BY (${partitionColumns.mkString(",")})
+             |""".stripMargin)
+      }
+
+      // should work fine on non collated columns

Review Comment:
   Nit - on default collated columns :)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47015][SQL] Disable partitioning on collated columns [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.

cloud-fan commented on code in PR #45104:
URL: https://github.com/apache/spark/pull/45104#discussion_r1495893377


##########
common/utils/src/main/resources/error/error-classes.json:
##########
@@ -388,6 +388,11 @@
     ],
     "sqlState" : "42846"
   },
+  "CANNOT_USE_DATA_TYPE_FOR_PARTITION_COLUMN" : {

Review Comment:
   INVALID_PARTITION_COLUMN_DATA_TYPE?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47015][SQL] Disable partitioning on collated columns [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.

cloud-fan commented on code in PR #45104:
URL: https://github.com/apache/spark/pull/45104#discussion_r1504421790


##########
sql/core/src/test/scala/org/apache/spark/sql/CollationSuite.scala:
##########
@@ -174,4 +174,36 @@ class CollationSuite extends QueryTest with SharedSparkSession {
           Row(expected))
     }
   }
+
+
+  test("disable partition on collated string column") {
+    def createTable(partitionColumns: String*): Unit = {
+      val tableName = "test_partition_tbl"
+      withTable(tableName) {
+        sql(
+          s"""
+             |CREATE TABLE $tableName

Review Comment:
   nvm, v2 table does not use directory names to implement partitioning. We don't need to test it.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47015][SQL] Disable partitioning on collated columns [spark]

Posted by "stefankandic (via GitHub)" <gi...@apache.org>.

stefankandic commented on code in PR #45104:
URL: https://github.com/apache/spark/pull/45104#discussion_r1508671400


##########
common/utils/src/main/resources/error/error-classes.json:
##########
@@ -2304,6 +2304,12 @@
     },
     "sqlState" : "22023"
   },
+  "INVALID_PARTITION_COLUMN_DATA_TYPE" : {
+    "message" : [
+      "Cannot use <type> for partition column."
+    ],
+    "sqlState" : "42601"

Review Comment:
   thanks for the response! will fix it in a separate PR since this one is already closed



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47015][SQL] Disable partitioning on collated columns [spark]

Posted by "dbatomic (via GitHub)" <gi...@apache.org>.

dbatomic commented on code in PR #45104:
URL: https://github.com/apache/spark/pull/45104#discussion_r1508901070


##########
common/utils/src/main/resources/error/error-classes.json:
##########
@@ -2304,6 +2304,12 @@
     },
     "sqlState" : "22023"
   },
+  "INVALID_PARTITION_COLUMN_DATA_TYPE" : {
+    "message" : [
+      "Cannot use <type> for partition column."
+    ],
+    "sqlState" : "42601"

Review Comment:
   Also let's stay consistent with error messages. E.g. in [string functions PR ](https://github.com/apache/spark/pull/45216) @uros-db is also using `0A000` for as `sqlState` so let us align there.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47015][Collation] Disable partitioning on collated columns [spark]

Posted by "MaxGekk (via GitHub)" <gi...@apache.org>.

MaxGekk commented on code in PR #45104:
URL: https://github.com/apache/spark/pull/45104#discussion_r1489873175


##########
sql/core/src/test/scala/org/apache/spark/sql/CollationSuite.scala:
##########
@@ -174,4 +174,36 @@ class CollationSuite extends QueryTest with SharedSparkSession {
           Row(expected))
     }
   }
+
+
+  test("disable partition on collated string column") {
+    def createTable(partitionColumns: String*): Unit = {
+      val tableName = "test_partition_tbl"
+      withTable(tableName) {
+        sql(
+          s"""
+             |CREATE TABLE $tableName
+             |(id INT, c1 STRING COLLATE 'UNICODE', c2 string)
+             |USING parquet
+             |PARTITIONED BY (${partitionColumns.mkString(",")})
+             |""".stripMargin)
+      }
+
+      // should work fine on non collated columns
+      createTable("id")
+      createTable("c2")
+      createTable("id", "c2")
+
+      Seq(Seq("c1"), Seq("c1", "id"), Seq("c1", "c2")).foreach { partitionColumns =>
+        checkError(
+          exception = intercept[AnalysisException] {
+            createTable(partitionColumns: _*)
+          },
+          errorClass = "_LEGACY_ERROR_TEMP_1153",

Review Comment:
   As you are here, could you assign proper name for the error class and improve error message format (if it is needed).



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47015][SQL] Disable partitioning on collated columns [spark]

Posted by "MaxGekk (via GitHub)" <gi...@apache.org>.

MaxGekk commented on code in PR #45104:
URL: https://github.com/apache/spark/pull/45104#discussion_r1507766991


##########
common/utils/src/main/resources/error/error-classes.json:
##########
@@ -5142,7 +5143,7 @@
   },
   "_LEGACY_ERROR_TEMP_1183" : {
     "message" : [
-      "Cannot use interval type in the table schema."
+      "Cannot use \"INTERVAL\" type in the table schema."

Review Comment:
   BTW, it is an unrelated change, it seems.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47015][SQL] Disable partitioning on collated columns [spark]

Posted by "MaxGekk (via GitHub)" <gi...@apache.org>.

MaxGekk closed pull request #45104: [SPARK-47015][SQL] Disable partitioning on collated columns
URL: https://github.com/apache/spark/pull/45104


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47015][SQL] Disable partitioning on collated columns [spark]

Posted by "stefankandic (via GitHub)" <gi...@apache.org>.

stefankandic commented on code in PR #45104:
URL: https://github.com/apache/spark/pull/45104#discussion_r1504436255


##########
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningUtils.scala:
##########
@@ -574,6 +573,16 @@ object PartitioningUtils extends SQLConfHelper {
     }
   }
 
+  /**
+   * Checks whether a given data type can be used as a partition column.
+   */
+  def canPartitionOn(dateType: DataType): Boolean = dateType match {
+    // non default collated strings should not be used as partition columns
+    // as we cannot implement string collation semantic with directory names
+    case st: StringType => st.isDefaultCollation
+    case other => other.isInstanceOf[AtomicType]

Review Comment:
   thanks!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47015][SQL] Disable partitioning on collated columns [spark]

Posted by "dbatomic (via GitHub)" <gi...@apache.org>.

dbatomic commented on code in PR #45104:
URL: https://github.com/apache/spark/pull/45104#discussion_r1495055910


##########
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningUtils.scala:
##########
@@ -574,6 +573,14 @@ object PartitioningUtils extends SQLConfHelper {
     }
   }
 
+  /**
+   * Checks whether a given data type can be used as a partition column.
+   */
+  def canPartitionOn(dateType: DataType): Boolean = dateType match {
+    case st: StringType => st.isDefaultCollation

Review Comment:
   Hey, I know that Spark codebase is not big on comments, but maybe we should add a comment why partitioning is disabled + jira ticket to enable the partitioning :) ?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47015][SQL] Disable partitioning on collated columns [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.

cloud-fan commented on code in PR #45104:
URL: https://github.com/apache/spark/pull/45104#discussion_r1495897055


##########
sql/core/src/test/scala/org/apache/spark/sql/CollationSuite.scala:
##########
@@ -174,4 +174,36 @@ class CollationSuite extends QueryTest with SharedSparkSession {
           Row(expected))
     }
   }
+
+
+  test("disable partition on collated string column") {
+    def createTable(partitionColumns: String*): Unit = {
+      val tableName = "test_partition_tbl"
+      withTable(tableName) {
+        sql(
+          s"""
+             |CREATE TABLE $tableName

Review Comment:
   can we also test v2 tables? See `DataSourceV2SQLSuite` about how we create and test v2 tables using v2 catalogs.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47015][SQL] Disable partitioning on collated columns [spark]

Posted by "stefankandic (via GitHub)" <gi...@apache.org>.

stefankandic commented on code in PR #45104:
URL: https://github.com/apache/spark/pull/45104#discussion_r1494460485


##########
sql/core/src/test/scala/org/apache/spark/sql/CollationSuite.scala:
##########
@@ -174,4 +174,36 @@ class CollationSuite extends QueryTest with SharedSparkSession {
           Row(expected))
     }
   }
+
+
+  test("disable partition on collated string column") {
+    def createTable(partitionColumns: String*): Unit = {
+      val tableName = "test_partition_tbl"
+      withTable(tableName) {
+        sql(
+          s"""
+             |CREATE TABLE $tableName
+             |(id INT, c1 STRING COLLATE 'UNICODE', c2 string)
+             |USING parquet
+             |PARTITIONED BY (${partitionColumns.mkString(",")})
+             |""".stripMargin)
+      }
+
+      // should work fine on non collated columns
+      createTable("id")
+      createTable("c2")
+      createTable("id", "c2")
+
+      Seq(Seq("c1"), Seq("c1", "id"), Seq("c1", "c2")).foreach { partitionColumns =>
+        checkError(
+          exception = intercept[AnalysisException] {
+            createTable(partitionColumns: _*)
+          },
+          errorClass = "_LEGACY_ERROR_TEMP_1153",

Review Comment:
   sure



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47015][SQL] Disable partitioning on collated columns [spark]

Posted by "stefankandic (via GitHub)" <gi...@apache.org>.

stefankandic commented on code in PR #45104:
URL: https://github.com/apache/spark/pull/45104#discussion_r1495832184


##########
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningUtils.scala:
##########
@@ -574,6 +573,14 @@ object PartitioningUtils extends SQLConfHelper {
     }
   }
 
+  /**
+   * Checks whether a given data type can be used as a partition column.
+   */
+  def canPartitionOn(dateType: DataType): Boolean = dateType match {
+    case st: StringType => st.isDefaultCollation

Review Comment:
   added the comment but not sure what you meant regarding the jira ticket to enable the partitioning - there is no plan to do this AFAIK and also I don't see how it would even be possible unless we revamp the entire partitioning logic - ie. not write the actual value to the file system



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47015][SQL] Disable partitioning on collated columns [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.

cloud-fan commented on code in PR #45104:
URL: https://github.com/apache/spark/pull/45104#discussion_r1505593595


##########
common/utils/src/main/resources/error/error-classes.json:
##########
@@ -2304,6 +2304,12 @@
     },
     "sqlState" : "22023"
   },
+  "INVALID_PARTITION_COLUMN_DATA_TYPE" : {
+    "message" : [
+      "Cannot use <type> for partition column."
+    ],
+    "sqlState" : "0A000"

Review Comment:
   it seems many other INVALID_XXX errors pick `42` as the start of the sqlstate.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org