You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by "mihailom-db (via GitHub)" <gi...@apache.org> on 2024/02/26 13:51:07 UTC

[PR] [SPARK-47169][SQL] Disable bucketing on collated columns [spark]

mihailom-db opened a new pull request, #45260:
URL: https://github.com/apache/spark/pull/45260

   ### What changes were proposed in this pull request?
   Disable bucketing on columns that are non default collated.
   
   ### Why are the changes needed?
   With current implementation bucketIds are generated from a string value where each unique string guarantees unique id, but when collation is turned on, this is not the case.
   
   
   ### Does this PR introduce _any_ user-facing change?
   Yes, it adds new error class.
   
   
   ### How was this patch tested?
   ```
   ./build/mvn -Dtest=none -DwildcardSuites=org.apache.spark.SparkThrowableSuite test
   ./build/mvn -Dtest=none -DwildcardSuites=org.apache.spark.sql.CollationSuite test
   ```
   
   
   ### Was this patch authored or co-authored using generative AI tooling?
   No.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [WIP][SPARK-47169][SQL] Disable bucketing on collated columns [spark]

Posted by "mihailom-db (via GitHub)" <gi...@apache.org>.
mihailom-db commented on code in PR #45260:
URL: https://github.com/apache/spark/pull/45260#discussion_r1519359745


##########
common/utils/src/main/resources/error/error-classes.json:
##########
@@ -1752,6 +1752,12 @@
     },
     "sqlState" : "22003"
   },
+  "INVALID_BUCKET_COLUMN_DATA_TYPE" : {
+    "message" : [
+      "Cannot use <type> for bucket column."

Review Comment:
   Changed. Is this sufficient?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47169][SQL] Disable bucketing on collated columns [WIP] [spark]

Posted by "MaxGekk (via GitHub)" <gi...@apache.org>.
MaxGekk commented on code in PR #45260:
URL: https://github.com/apache/spark/pull/45260#discussion_r1506454082


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryCompilationErrors.scala:
##########
@@ -1863,6 +1863,12 @@ private[sql] object QueryCompilationErrors extends QueryErrorsBase with Compilat
         "normalizedPartCols" -> normalizedPartCols.mkString(", ")))
   }
 
+  def invalidBucketColumnDataTypeError(dataType: String): Throwable = {
+    new AnalysisException(
+      errorClass = "INVALID_BUCKET_COLUMN_DATA_TYPE",
+      messageParameters = Map("type" -> dataType))

Review Comment:
   Please, wrap `dataType` by `toSQLType`. It converts any data type to string in unified way across all errors.



##########
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/rules.scala:
##########
@@ -359,6 +359,13 @@ case class PreprocessTableCreation(catalog: SessionCatalog) extends Rule[Logical
           case other => failAnalysis(s"Cannot use ${other.catalogString} for sorting column")
         }
 
+        schema.filter(f => normalizedBucketSpec.bucketColumnNames.contains(f.name))
+          .foreach { field =>
+          if (!BucketingUtils.canBucketOn(field.dataType)) {
+            throw QueryCompilationErrors.invalidBucketColumnDataTypeError(field.dataType.toString)

Review Comment:
   Just pass `field.dataType`



##########
common/utils/src/main/resources/error/error-classes.json:
##########
@@ -1752,6 +1752,12 @@
     },
     "sqlState" : "22003"
   },
+  "INVALID_BUCKET_COLUMN_DATA_TYPE" : {
+    "message" : [
+      "Cannot use <type> for bucket column."

Review Comment:
   Could you help users, and explain why it cannot (or which types can be used) otherwise not clear how users should deal with the error.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [WIP][SPARK-47169][SQL] Disable bucketing on collated columns [spark]

Posted by "MaxGekk (via GitHub)" <gi...@apache.org>.
MaxGekk commented on PR #45260:
URL: https://github.com/apache/spark/pull/45260#issuecomment-1986752104

   @mihailom-db Could you enable and trigger GitHub action in your PR.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47169][SQL] Disable bucketing on collated columns [spark]

Posted by "MaxGekk (via GitHub)" <gi...@apache.org>.
MaxGekk commented on PR #45260:
URL: https://github.com/apache/spark/pull/45260#issuecomment-1991397620

   +1, LGTM. Merging to master.
   Thank you, @mihailom-db.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47169][SQL] Disable bucketing on collated columns [spark]

Posted by "MaxGekk (via GitHub)" <gi...@apache.org>.
MaxGekk commented on code in PR #45260:
URL: https://github.com/apache/spark/pull/45260#discussion_r1521272118


##########
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/rules.scala:
##########
@@ -362,6 +362,13 @@ case class PreprocessTableCreation(catalog: SessionCatalog) extends Rule[Logical
           case other => failAnalysis(s"Cannot use ${other.catalogString} for sorting column")
         }
 
+        schema.filter(f => normalizedBucketSpec.bucketColumnNames.contains(f.name))
+          .foreach { field =>
+          if (!BucketingUtils.canBucketOn(field.dataType)) {
+            throw QueryCompilationErrors.invalidBucketColumnDataTypeError(field.dataType)
+          }
+        }

Review Comment:
   Fix indentation:
   ```suggestion
             .foreach { field =>
               if (!BucketingUtils.canBucketOn(field.dataType)) {
                 throw QueryCompilationErrors.invalidBucketColumnDataTypeError(field.dataType)
               }
             }
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47169][SQL] Disable bucketing on collated columns [spark]

Posted by "MaxGekk (via GitHub)" <gi...@apache.org>.
MaxGekk closed pull request #45260: [SPARK-47169][SQL] Disable bucketing on collated columns
URL: https://github.com/apache/spark/pull/45260


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org