You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by "mihailom-db (via GitHub)" <gi...@apache.org> on 2024/02/27 11:47:48 UTC

[PR] [SPARK-47102][SQL][COLLATION] Add COLLATION_ENABLED config flag [spark]

mihailom-db opened a new pull request, #45285:
URL: https://github.com/apache/spark/pull/45285

   ### What changes were proposed in this pull request?
   This PR adds `COLLATION_ENABLED` config to `SQLConf` and introduces new error class `COLLATION_SUPPORT_NOT_ENABLED` to appropriately report error on usage of feature under development. 
   
   ### Why are the changes needed?
   We want to make collations configurable on this flag. These changes disable usage of `collate` and `collation` functions, along with any `COLLATE` syntax when the flag is set to false. By default, the flag is set to false.
   
   
   ### Does this PR introduce _any_ user-facing change?
   Yes. It introduces new error along with an appropriate message.
   
   
   ### How was this patch tested?
   ```
   ./build/mvn -Dtest=none -DwildcardSuites=org.apache.spark.sql.errors.QueryCompilationErrorsSuite test
   ./build/mvn -Dtest=none -DwildcardSuites=org.apache.spark.sql.catalyst.expressions.CollationExpressionSuite test
   ./build/mvn -Dtest=none -DwildcardSuites=org.apache.spark.sql.CollationSuite test
   ```
   
   
   ### Was this patch authored or co-authored using generative AI tooling?
   No.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47102][SQL][COLLATION] Add COLLATION_ENABLED config flag [spark]

Posted by "mihailom-db (via GitHub)" <gi...@apache.org>.
mihailom-db commented on PR #45285:
URL: https://github.com/apache/spark/pull/45285#issuecomment-1970706405

   Current SQLConf.COLLATION_ENABLED cannot be used to prevent errors of invalid collation name in TypeContext. How should I proceed with this? In my view, if customer sends a query with word COLLATE, but invalid string literal, I would expect to get back error Collation not supported before getting error of Invalid Collation name. I investigated this and am pretty sure SQLConf is not the right way to go if we want to include this config for this error. @cloud-fan @MaxGekk 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47102][SQL][COLLATION] Add COLLATION_ENABLED config flag [spark]

Posted by "mihailom-db (via GitHub)" <gi...@apache.org>.
mihailom-db commented on code in PR #45285:
URL: https://github.com/apache/spark/pull/45285#discussion_r1505536348


##########
sql/core/src/test/scala/org/apache/spark/sql/CollationSuite.scala:
##########
@@ -32,6 +33,11 @@ import org.apache.spark.sql.types.StringType
 class CollationSuite extends DatasourceV2SQLBase {
   protected val v2Source = classOf[FakeV2ProviderWithCustomSchema].getName
 
+  protected override def beforeAll(): Unit = {
+    System.setProperty(IS_TESTING.key, "true")

Review Comment:
   Do you maybe have some insight on ordering of class/trait extensions in Scala?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47102][SQL][COLLATION] Add COLLATION_ENABLED config flag [spark]

Posted by "mihailom-db (via GitHub)" <gi...@apache.org>.
mihailom-db commented on code in PR #45285:
URL: https://github.com/apache/spark/pull/45285#discussion_r1507274792


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collationExpressions.scala:
##########
@@ -67,7 +73,13 @@ object CollateExpressionBuilder extends ExpressionBuilder {
  */
 case class Collate(child: Expression, collationName: String)
   extends UnaryExpression with ExpectsInputTypes {
-  private val collationId = CollationFactory.collationNameToId(collationName)
+  private val collationId = if (!SQLConf.get.collationEnabled) {
+    // This is here if someone enters a query which only has invalid collation name
+    throw QueryCompilationErrors.collationNotEnabledError()

Review Comment:
   Problem still persists for query of the type ```CREATE TABLE t(col STRING COLLATE 'UNKNOWN_COLLATION_NAME') USING parquet```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47102][SQL][COLLATION] Add COLLATION_ENABLED config flag [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #45285:
URL: https://github.com/apache/spark/pull/45285#discussion_r1506075900


##########
sql/core/src/main/scala/org/apache/spark/sql/internal/BaseSessionStateBuilder.scala:
##########
@@ -223,6 +223,7 @@ abstract class BaseSessionStateBuilder(
         HiveOnlyCheck +:
         TableCapabilityCheck +:
         CommandCheck +:
+        CollationCheck +:

Review Comment:
   need to do the same in `HiveSessionStateBuilder`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47102][SQL][COLLATION] Add COLLATION_ENABLED config flag [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #45285:
URL: https://github.com/apache/spark/pull/45285#discussion_r1505370756


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collationExpressions.scala:
##########
@@ -90,6 +100,9 @@ case class Collate(child: Expression, collationName: String)
   since = "4.0.0",
   group = "string_funcs")
 case class Collation(child: Expression) extends UnaryExpression with RuntimeReplaceable {
+  if (!SQLConf.get.collationEnabled) {

Review Comment:
   It might be too much for a temporary config. Maybe let's add a new check rule in `BaseSessionStateBuilder.analyzer.extendedCheckRules`, and fail if these two expressions are found in the plan tree.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47102][SQL][COLLATION] Add COLLATION_ENABLED config flag [spark]

Posted by "mihailom-db (via GitHub)" <gi...@apache.org>.
mihailom-db commented on code in PR #45285:
URL: https://github.com/apache/spark/pull/45285#discussion_r1507252815


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collationExpressions.scala:
##########
@@ -67,7 +73,13 @@ object CollateExpressionBuilder extends ExpressionBuilder {
  */
 case class Collate(child: Expression, collationName: String)
   extends UnaryExpression with ExpectsInputTypes {
-  private val collationId = CollationFactory.collationNameToId(collationName)
+  private val collationId = if (!SQLConf.get.collationEnabled) {
+    // This is here if someone enters a query which only has invalid collation name
+    throw QueryCompilationErrors.collationNotEnabledError()

Review Comment:
   Concretely the call of `AstBuilder.visitCollate` happens



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47102][SQL][COLLATION] Add COLLATION_ENABLED config flag [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #45285:
URL: https://github.com/apache/spark/pull/45285#discussion_r1507242125


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collationExpressions.scala:
##########
@@ -67,7 +73,13 @@ object CollateExpressionBuilder extends ExpressionBuilder {
  */
 case class Collate(child: Expression, collationName: String)
   extends UnaryExpression with ExpectsInputTypes {
-  private val collationId = CollationFactory.collationNameToId(collationName)
+  private val collationId = if (!SQLConf.get.collationEnabled) {
+    // This is here if someone enters a query which only has invalid collation name
+    throw QueryCompilationErrors.collationNotEnabledError()

Review Comment:
   how can we hit this branch?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47102][SQL][COLLATION] Add COLLATION_ENABLED config flag [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #45285:
URL: https://github.com/apache/spark/pull/45285#discussion_r1505514617


##########
sql/core/src/test/scala/org/apache/spark/sql/CollationSuite.scala:
##########
@@ -32,6 +33,11 @@ import org.apache.spark.sql.types.StringType
 class CollationSuite extends DatasourceV2SQLBase {
   protected val v2Source = classOf[FakeV2ProviderWithCustomSchema].getName
 
+  protected override def beforeAll(): Unit = {
+    System.setProperty(IS_TESTING.key, "true")

Review Comment:
   all spark test suites should be a subclass of `SparkFunSuite` which sets this flag.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47102][SQL][COLLATION] Add COLLATION_ENABLED config flag [spark]

Posted by "mihailom-db (via GitHub)" <gi...@apache.org>.
mihailom-db commented on code in PR #45285:
URL: https://github.com/apache/spark/pull/45285#discussion_r1505706359


##########
sql/core/src/test/scala/org/apache/spark/sql/CollationSuite.scala:
##########
@@ -32,6 +33,11 @@ import org.apache.spark.sql.types.StringType
 class CollationSuite extends DatasourceV2SQLBase {
   protected val v2Source = classOf[FakeV2ProviderWithCustomSchema].getName
 
+  protected override def beforeAll(): Unit = {
+    System.setProperty(IS_TESTING.key, "true")

Review Comment:
   I have just debugged without `CollationSuite.beforeAll()` override. Order of calling was `SharedSparkSession.beforeAll()` -> `SQLConf.COLLATION_ENABLED` setting -> `SparkFunSuite.beforeAll()`. This is causing the proposed test to write `true` even though it was not set for the actual purpose and that is before SQLConf invocation, so that the desired config is set with the right value. Could you check that compilation does the same for you? I am not sure why the compilation proceeds in the following manner, but if needed I could investigate more?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47102][SQL][COLLATION] Add COLLATION_ENABLED config flag [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #45285:
URL: https://github.com/apache/spark/pull/45285#discussion_r1510851292


##########
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/rules.scala:
##########
@@ -592,3 +593,18 @@ case class QualifyLocationWithWarehouse(catalog: SessionCatalog) extends Rule[Lo
       c.copy(tableDesc = newTable)
   }
 }
+
+object CollationCheck extends (LogicalPlan => Unit) {
+  def apply(plan: LogicalPlan): Unit = {
+    plan.foreach {

Review Comment:
   This is not resolved yet. `SubqueryExpression` is an expression that contains query plans, we need to recurse into the query plan from it.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47102][SQL][COLLATION] Add COLLATION_ENABLED config flag [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #45285:
URL: https://github.com/apache/spark/pull/45285#discussion_r1510853996


##########
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/rules.scala:
##########
@@ -592,3 +593,18 @@ case class QualifyLocationWithWarehouse(catalog: SessionCatalog) extends Rule[Lo
       c.copy(tableDesc = newTable)
   }
 }
+
+object CollationCheck extends (LogicalPlan => Unit) {
+  def apply(plan: LogicalPlan): Unit = {
+    plan.foreach {

Review Comment:
   oh nvm, the framework calls check rules for subquery expressions already.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47102][SQL][COLLATION] Add COLLATION_ENABLED config flag [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #45285:
URL: https://github.com/apache/spark/pull/45285#discussion_r1507245924


##########
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/rules.scala:
##########
@@ -592,3 +593,18 @@ case class QualifyLocationWithWarehouse(catalog: SessionCatalog) extends Rule[Lo
       c.copy(tableDesc = newTable)
   }
 }
+
+object CollationCheck extends (LogicalPlan => Unit) {
+  def apply(plan: LogicalPlan): Unit = {
+    plan.foreach {

Review Comment:
   oh, then we need to match `SubqueryExpression` manually here.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47102][SQL][COLLATION] Add COLLATION_ENABLED config flag [spark]

Posted by "mihailom-db (via GitHub)" <gi...@apache.org>.
mihailom-db commented on code in PR #45285:
URL: https://github.com/apache/spark/pull/45285#discussion_r1505527557


##########
sql/core/src/test/scala/org/apache/spark/sql/CollationSuite.scala:
##########
@@ -32,6 +33,11 @@ import org.apache.spark.sql.types.StringType
 class CollationSuite extends DatasourceV2SQLBase {
   protected val v2Source = classOf[FakeV2ProviderWithCustomSchema].getName
 
+  protected override def beforeAll(): Unit = {
+    System.setProperty(IS_TESTING.key, "true")

Review Comment:
   I am not sure what happens here, but I would presume that `SharedSparkSession` overrules the `SparkFunSuite` as the former is used as an interface after the latter. What I want to say is that without me explicitly saying this part of the code, IS_TESTING is not set to true and I suspect it is because of the `SharedSparkSession` which does not set that key.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47102][SQL][COLLATION] Add COLLATION_ENABLED config flag [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #45285:
URL: https://github.com/apache/spark/pull/45285#discussion_r1505618055


##########
sql/core/src/test/scala/org/apache/spark/sql/CollationSuite.scala:
##########
@@ -32,6 +33,11 @@ import org.apache.spark.sql.types.StringType
 class CollationSuite extends DatasourceV2SQLBase {
   protected val v2Source = classOf[FakeV2ProviderWithCustomSchema].getName
 
+  protected override def beforeAll(): Unit = {
+    System.setProperty(IS_TESTING.key, "true")

Review Comment:
   I tried this locally in this suite and it prints true correctly
   ```
     test("aa") {
       println("##########    " + Utils.isTesting)
     }
   ```
   
   can you double check?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47102][SQL][COLLATION] Add COLLATION_ENABLED config flag [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #45285:
URL: https://github.com/apache/spark/pull/45285#discussion_r1504410342


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collationExpressions.scala:
##########
@@ -67,6 +74,9 @@ object CollateExpressionBuilder extends ExpressionBuilder {
  */
 case class Collate(child: Expression, collationName: String)
   extends UnaryExpression with ExpectsInputTypes {
+  if (!SQLConf.get.collationEnabled) {

Review Comment:
   `Expression` is private API. The only way end-user can use this expression is through SQL function lookup, so the check in `CollateExpressionBuilder` should be sufficient.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47102][SQL][COLLATION] Add COLLATION_ENABLED config flag [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #45285:
URL: https://github.com/apache/spark/pull/45285#discussion_r1504413598


##########
sql/core/src/test/scala/org/apache/spark/sql/CollationSuite.scala:
##########
@@ -32,6 +33,11 @@ import org.apache.spark.sql.types.StringType
 class CollationSuite extends DatasourceV2SQLBase {
   protected val v2Source = classOf[FakeV2ProviderWithCustomSchema].getName
 
+  protected override def beforeAll(): Unit = {
+    System.setProperty(IS_TESTING.key, "true")

Review Comment:
   hmm is it necessary? The Spark test framework should have done it already.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47102][SQL][COLLATION] Add COLLATION_ENABLED config flag [spark]

Posted by "MaxGekk (via GitHub)" <gi...@apache.org>.
MaxGekk commented on code in PR #45285:
URL: https://github.com/apache/spark/pull/45285#discussion_r1504888862


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryCompilationErrors.scala:
##########
@@ -297,6 +297,14 @@ private[sql] object QueryCompilationErrors extends QueryErrorsBase with Compilat
     )
   }
 
+  def collationNotEnabledError(): Throwable = {
+    new AnalysisException(
+      errorClass = "UNSUPPORTED_FEATURE.COLLATION",
+      messageParameters = Map(
+        "collationEnabled" -> SQLConf.COLLATION_ENABLED.key)

Review Comment:
   Please, quote the SQL config by `toSQLConf`



##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collationExpressions.scala:
##########
@@ -90,6 +100,9 @@ case class Collate(child: Expression, collationName: String)
   since = "4.0.0",
   group = "string_funcs")
 case class Collation(child: Expression) extends UnaryExpression with RuntimeReplaceable {
+  if (!SQLConf.get.collationEnabled) {

Review Comment:
   Could you add the parameter `collationEnabled` to the case class, and avoid the check in the constructor. This will avoid an issue fixed by this PR: https://github.com/apache/spark/pull/27658 (see `Why are the changes needed?`) 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47102][SQL][COLLATION] Add COLLATION_ENABLED config flag [spark]

Posted by "mihailom-db (via GitHub)" <gi...@apache.org>.
mihailom-db commented on code in PR #45285:
URL: https://github.com/apache/spark/pull/45285#discussion_r1507250569


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collationExpressions.scala:
##########
@@ -67,7 +73,13 @@ object CollateExpressionBuilder extends ExpressionBuilder {
  */
 case class Collate(child: Expression, collationName: String)
   extends UnaryExpression with ExpectsInputTypes {
-  private val collationId = CollationFactory.collationNameToId(collationName)
+  private val collationId = if (!SQLConf.get.collationEnabled) {
+    // This is here if someone enters a query which only has invalid collation name
+    throw QueryCompilationErrors.collationNotEnabledError()

Review Comment:
   Problem is when we use collation function in the format of the query above, the check for collationName id is preformed before creating Collate expression.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47102][SQL][COLLATION] Add COLLATION_ENABLED config flag [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #45285:
URL: https://github.com/apache/spark/pull/45285#discussion_r1508930731


##########
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/rules.scala:
##########
@@ -592,3 +593,28 @@ case class QualifyLocationWithWarehouse(catalog: SessionCatalog) extends Rule[Lo
       c.copy(tableDesc = newTable)
   }
 }
+
+object CollationCheck extends (LogicalPlan => Unit) {
+  def apply(plan: LogicalPlan): Unit = {
+    plan.foreach {
+      case operator: LogicalPlan =>
+        operator transformExpressionsUp {

Review Comment:
   we are performing checks instead of updating the tree, so we can simply do `operator.expressions.foreach(_.foreach(hasCollationExpression))`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47102][SQL][COLLATION] Add COLLATION_ENABLED config flag [spark]

Posted by "mihailom-db (via GitHub)" <gi...@apache.org>.
mihailom-db commented on code in PR #45285:
URL: https://github.com/apache/spark/pull/45285#discussion_r1507153207


##########
sql/core/src/test/scala/org/apache/spark/sql/errors/QueryCompilationErrorsSuite.scala:
##########
@@ -964,6 +964,52 @@ class QueryCompilationErrorsSuite
         "className" -> "org.apache.spark.sql.catalyst.expressions.UnsafeRow"))
   }
 
+  test("SPARK-47102: Collate in CollateClauseContext when COLLATION_ENABLED is false") {

Review Comment:
   Some tests need context because of the change I introduced in the builder stage, so I incorporated your feedback in two separate tests.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47102][SQL][COLLATION] Add COLLATION_ENABLED config flag [spark]

Posted by "mihailom-db (via GitHub)" <gi...@apache.org>.
mihailom-db commented on code in PR #45285:
URL: https://github.com/apache/spark/pull/45285#discussion_r1508556060


##########
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/rules.scala:
##########
@@ -592,3 +593,18 @@ case class QualifyLocationWithWarehouse(catalog: SessionCatalog) extends Rule[Lo
       c.copy(tableDesc = newTable)
   }
 }
+
+object CollationCheck extends (LogicalPlan => Unit) {
+  def apply(plan: LogicalPlan): Unit = {
+    plan.foreach {

Review Comment:
   I think I added it now, could you check if this is what you were expecting?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47102][SQL][COLLATION] Add COLLATION_ENABLED config flag [spark]

Posted by "MaxGekk (via GitHub)" <gi...@apache.org>.
MaxGekk commented on code in PR #45285:
URL: https://github.com/apache/spark/pull/45285#discussion_r1506256531


##########
sql/core/src/test/scala/org/apache/spark/sql/errors/QueryCompilationErrorsSuite.scala:
##########
@@ -964,6 +964,52 @@ class QueryCompilationErrorsSuite
         "className" -> "org.apache.spark.sql.catalyst.expressions.UnsafeRow"))
   }
 
+  test("SPARK-47102: Collate in CollateClauseContext when COLLATION_ENABLED is false") {

Review Comment:
   I think you can fold all those 4 tests to one like:
   ```scala
     test("SPARK-47102: the collation feature is off") {
       withSQLConf(SQLConf.COLLATION_ENABLED.key -> "false") {
         Seq(
           "CREATE TABLE t(col STRING COLLATE 'UNICODE_CI') USING parquet",
           "SELECT collate('aaa', 'UNICODE_CI')",
           "SELECT 'aaa' COLLATE 'UNICODE_CI'",
           "select collation('aaa')"
         ).foreach { sqlText =>
           checkError(
             exception = intercept[AnalysisException](sql(sqlText)),
             errorClass = "UNSUPPORTED_FEATURE.COLLATION")
         }
       }
     }
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47102][SQL] Add the `COLLATION_ENABLED` config flag [spark]

Posted by "MaxGekk (via GitHub)" <gi...@apache.org>.
MaxGekk closed pull request #45285: [SPARK-47102][SQL] Add the `COLLATION_ENABLED` config flag
URL: https://github.com/apache/spark/pull/45285


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47102][SQL][COLLATION] Add COLLATION_ENABLED config flag [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #45285:
URL: https://github.com/apache/spark/pull/45285#discussion_r1504406480


##########
common/utils/src/main/resources/error/error-classes.json:
##########
@@ -3854,6 +3854,11 @@
           "Catalog <catalogName> does not support <operation>."
         ]
       },
+      "COLLATION" : {
+        "message" : [
+          "Collation support is under development and not all features are supported. To enable existing features set <collationEnabled> to `true`."

Review Comment:
   when we turn off this feature, it means we do not want to expose it to the end users and it's weird if we still tell them how to enable it.
   
   How about just say `String collation is not supported yet.`?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47102][SQL][COLLATION] Add COLLATION_ENABLED config flag [spark]

Posted by "MaxGekk (via GitHub)" <gi...@apache.org>.
MaxGekk commented on PR #45285:
URL: https://github.com/apache/spark/pull/45285#issuecomment-1977197356

   @mihailom-db The test failure is related to your changes:
   ```
   [info]   Cause: org.scalatest.exceptions.TestFailedException: Function 'collate', Expression class 'org.apache.spark.sql.catalyst.expressions.CollateExpressionBuilder' "...ql.collation.enabled[	]true" did not equal "...ql.collation.enabled[ ]true"
   ```
   Please, fix it.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47102][SQL][COLLATION] Add COLLATION_ENABLED config flag [spark]

Posted by "mihailom-db (via GitHub)" <gi...@apache.org>.
mihailom-db commented on code in PR #45285:
URL: https://github.com/apache/spark/pull/45285#discussion_r1507151474


##########
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/rules.scala:
##########
@@ -592,3 +593,18 @@ case class QualifyLocationWithWarehouse(catalog: SessionCatalog) extends Rule[Lo
       c.copy(tableDesc = newTable)
   }
 }
+
+object CollationCheck extends (LogicalPlan => Unit) {
+  def apply(plan: LogicalPlan): Unit = {
+    plan.foreach {

Review Comment:
   I tried doing this, but I get error that transformAllExpressionsWithSubqueries cannot be called in analyzer.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47102][SQL][COLLATION] Add COLLATION_ENABLED config flag [spark]

Posted by "mihailom-db (via GitHub)" <gi...@apache.org>.
mihailom-db commented on code in PR #45285:
URL: https://github.com/apache/spark/pull/45285#discussion_r1507244351


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collationExpressions.scala:
##########
@@ -67,7 +73,13 @@ object CollateExpressionBuilder extends ExpressionBuilder {
  */
 case class Collate(child: Expression, collationName: String)
   extends UnaryExpression with ExpectsInputTypes {
-  private val collationId = CollationFactory.collationNameToId(collationName)
+  private val collationId = if (!SQLConf.get.collationEnabled) {
+    // This is here if someone enters a query which only has invalid collation name
+    throw QueryCompilationErrors.collationNotEnabledError()

Review Comment:
   Query "SELECT 'aaa' COLLATE 'UNKNOWN_COLLATION_NAME'"



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47102][SQL][COLLATION] Add COLLATION_ENABLED config flag [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #45285:
URL: https://github.com/apache/spark/pull/45285#discussion_r1507243652


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala:
##########
@@ -3198,7 +3198,13 @@ class AstBuilder extends DataTypeAstBuilder with SQLConfHelper with Logging {
         commentSpec = Some(spec)
       }
     }
-
+    val dataType = typedVisit[DataType](ctx.dataType)
+    if (dataType.existsRecursively(
+        dt => dt.isInstanceOf[StringType]) &&

Review Comment:
   we should check `isDefaultCollation` here.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47102][SQL] Add the `COLLATION_ENABLED` config flag [spark]

Posted by "MaxGekk (via GitHub)" <gi...@apache.org>.
MaxGekk commented on PR #45285:
URL: https://github.com/apache/spark/pull/45285#issuecomment-1978621347

   +1, LGTM. Merging to master.
   Thank you, @mihailom-db and @cloud-fan @dbatomic @mkaravel for review.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47102][SQL][COLLATION] Add COLLATION_ENABLED config flag [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #45285:
URL: https://github.com/apache/spark/pull/45285#discussion_r1504415553


##########
sql/core/src/test/scala/org/apache/spark/sql/errors/QueryCompilationErrorsSuite.scala:
##########
@@ -964,6 +964,35 @@ class QueryCompilationErrorsSuite
         "className" -> "org.apache.spark.sql.catalyst.expressions.UnsafeRow"))
   }
 
+  test("SPARK-47102: Collation in CollateContext when COLLATION_ENABLED is false") {
+    withSQLConf(SQLConf.COLLATION_ENABLED.key -> "false") {
+      checkError(
+        exception = intercept[AnalysisException] {
+          sql(s"select 'aaa' collate 'UNICODE_CI'")

Review Comment:
   how about the type parsing like `CREATE TABLE t(col STRING COLLATE ...)`?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47102][SQL][COLLATION] Add COLLATION_ENABLED config flag [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #45285:
URL: https://github.com/apache/spark/pull/45285#discussion_r1506074914


##########
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/rules.scala:
##########
@@ -592,3 +593,18 @@ case class QualifyLocationWithWarehouse(catalog: SessionCatalog) extends Rule[Lo
       c.copy(tableDesc = newTable)
   }
 }
+
+object CollationCheck extends (LogicalPlan => Unit) {
+  def apply(plan: LogicalPlan): Unit = {
+    plan.foreach {

Review Comment:
   we can call `transformAllExpressionsWithSubqueries`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47102][SQL][COLLATION] Add COLLATION_ENABLED config flag [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #45285:
URL: https://github.com/apache/spark/pull/45285#discussion_r1506068842


##########
common/utils/src/main/resources/error/error-classes.json:
##########
@@ -3854,6 +3854,11 @@
           "Catalog <catalogName> does not support <operation>."
         ]
       },
+      "COLLATION" : {
+        "message" : [
+          "Collation feature is not yet supported."

Review Comment:
   The full message is `The feature is not supported: Collation feature is not yet supported.`. Seems we can omit the second `feature`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47102][SQL][COLLATION] Add COLLATION_ENABLED config flag [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #45285:
URL: https://github.com/apache/spark/pull/45285#discussion_r1506078176


##########
sql/core/src/test/scala/org/apache/spark/sql/CollationSuite.scala:
##########
@@ -32,6 +33,11 @@ import org.apache.spark.sql.types.StringType
 class CollationSuite extends DatasourceV2SQLBase {
   protected val v2Source = classOf[FakeV2ProviderWithCustomSchema].getName
 
+  protected override def beforeAll(): Unit = {
+    System.setProperty(IS_TESTING.key, "true")

Review Comment:
   What's wrong with this order of calling? `Utils.isTesting` returns true properly and we are fine?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47102][SQL][COLLATION] Add COLLATION_ENABLED config flag [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #45285:
URL: https://github.com/apache/spark/pull/45285#discussion_r1507247915


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collationExpressions.scala:
##########
@@ -67,7 +73,13 @@ object CollateExpressionBuilder extends ExpressionBuilder {
  */
 case class Collate(child: Expression, collationName: String)
   extends UnaryExpression with ExpectsInputTypes {
-  private val collationId = CollationFactory.collationNameToId(collationName)
+  private val collationId = if (!SQLConf.get.collationEnabled) {
+    // This is here if someone enters a query which only has invalid collation name
+    throw QueryCompilationErrors.collationNotEnabledError()

Review Comment:
   shouldn't we fail very early in the parser side if collation is not enabled?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47102][SQL][COLLATION] Add COLLATION_ENABLED config flag [spark]

Posted by "mihailom-db (via GitHub)" <gi...@apache.org>.
mihailom-db commented on code in PR #45285:
URL: https://github.com/apache/spark/pull/45285#discussion_r1507150465


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala:
##########
@@ -3198,7 +3198,11 @@ class AstBuilder extends DataTypeAstBuilder with SQLConfHelper with Logging {
         commentSpec = Some(spec)
       }
     }
-
+    val dataType = typedVisit[DataType](ctx.dataType)
+    if (!SQLConf.get.collationEnabled &&
+      dataType.isInstanceOf[StringType] && !dataType.asInstanceOf[StringType].isDefaultCollation) {

Review Comment:
   I only managed to find only existsRecursively, is that good?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47102][SQL][COLLATION] Add COLLATION_ENABLED config flag [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #45285:
URL: https://github.com/apache/spark/pull/45285#discussion_r1508927759


##########
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/rules.scala:
##########
@@ -592,3 +593,18 @@ case class QualifyLocationWithWarehouse(catalog: SessionCatalog) extends Rule[Lo
       c.copy(tableDesc = newTable)
   }
 }
+
+object CollationCheck extends (LogicalPlan => Unit) {
+  def apply(plan: LogicalPlan): Unit = {
+    plan.foreach {

Review Comment:
   where do you match `SubqueryExpression`?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47102][SQL][COLLATION] Add COLLATION_ENABLED config flag [spark]

Posted by "mihailom-db (via GitHub)" <gi...@apache.org>.
mihailom-db commented on code in PR #45285:
URL: https://github.com/apache/spark/pull/45285#discussion_r1507269159


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collationExpressions.scala:
##########
@@ -67,7 +73,13 @@ object CollateExpressionBuilder extends ExpressionBuilder {
  */
 case class Collate(child: Expression, collationName: String)
   extends UnaryExpression with ExpectsInputTypes {
-  private val collationId = CollationFactory.collationNameToId(collationName)
+  private val collationId = if (!SQLConf.get.collationEnabled) {
+    // This is here if someone enters a query which only has invalid collation name
+    throw QueryCompilationErrors.collationNotEnabledError()

Review Comment:
   I managed to fix this by setting `collationName` to lazy val in visitCollate. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47102][SQL][COLLATION] Add COLLATION_ENABLED config flag [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #45285:
URL: https://github.com/apache/spark/pull/45285#discussion_r1506081331


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala:
##########
@@ -753,6 +753,14 @@ object SQLConf {
       .checkValue(_ > 0, "The initial number of partitions must be positive.")
       .createOptional
 
+  val COLLATION_ENABLED =

Review Comment:
   we should make it lazy val, in case the default value (Utils.isTesting) gets materialized before the test environment is set up.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47102][SQL][COLLATION] Add COLLATION_ENABLED config flag [spark]

Posted by "mkaravel (via GitHub)" <gi...@apache.org>.
mkaravel commented on code in PR #45285:
URL: https://github.com/apache/spark/pull/45285#discussion_r1504599200


##########
common/utils/src/main/resources/error/error-classes.json:
##########
@@ -3854,6 +3854,11 @@
           "Catalog <catalogName> does not support <operation>."
         ]
       },
+      "COLLATION" : {
+        "message" : [
+          "Collation support is under development and not all features are supported. To enable existing features set <collationEnabled> to `true`."

Review Comment:
   +1 on what @cloud-fan suggest. Let's make it simple and just mention that it is unsupported.



##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collationExpressions.scala:
##########
@@ -40,6 +41,12 @@ import org.apache.spark.sql.types._
   group = "string_funcs")
 object CollateExpressionBuilder extends ExpressionBuilder {
   override def build(funcName: String, expressions: Seq[Expression]): Expression = {
+    // Needed in order to make sure no parse errors appear when Collation Support is not enabled
+    // because this might suggest to the customer that feature is enabled until they reproduce
+    // the right syntax
+    if (!SQLConf.get.collationEnabled) {
+      throw QueryCompilationErrors.collationNotEnabledError()
+    }

Review Comment:
   When is this executed? How often?
   How come we do not block this at the analyzer level?



##########
sql/core/src/test/scala/org/apache/spark/sql/errors/QueryCompilationErrorsSuite.scala:
##########
@@ -964,6 +964,35 @@ class QueryCompilationErrorsSuite
         "className" -> "org.apache.spark.sql.catalyst.expressions.UnsafeRow"))
   }
 
+  test("SPARK-47102: Collation in CollateContext when COLLATION_ENABLED is false") {
+    withSQLConf(SQLConf.COLLATION_ENABLED.key -> "false") {
+      checkError(
+        exception = intercept[AnalysisException] {
+          sql(s"select 'aaa' collate 'UNICODE_CI'")

Review Comment:
   +1 on adding such a test.



##########
sql/core/src/test/scala/org/apache/spark/sql/CollationSuite.scala:
##########
@@ -32,6 +33,11 @@ import org.apache.spark.sql.types.StringType
 class CollationSuite extends DatasourceV2SQLBase {
   protected val v2Source = classOf[FakeV2ProviderWithCustomSchema].getName
 
+  protected override def beforeAll(): Unit = {
+    System.setProperty(IS_TESTING.key, "true")

Review Comment:
   We have already set the SQL config to `Utils.isTesting`. This should not be necessary at all.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47102][SQL][COLLATION] Add COLLATION_ENABLED config flag [spark]

Posted by "mihailom-db (via GitHub)" <gi...@apache.org>.
mihailom-db commented on code in PR #45285:
URL: https://github.com/apache/spark/pull/45285#discussion_r1505476279


##########
sql/core/src/test/scala/org/apache/spark/sql/CollationSuite.scala:
##########
@@ -32,6 +33,11 @@ import org.apache.spark.sql.types.StringType
 class CollationSuite extends DatasourceV2SQLBase {
   protected val v2Source = classOf[FakeV2ProviderWithCustomSchema].getName
 
+  protected override def beforeAll(): Unit = {
+    System.setProperty(IS_TESTING.key, "true")

Review Comment:
   Actually, I have just checked, `DatasourceV2SQLBase` extends `QueryTest` and is implementing `SharedSparkSession`. `QueryTest` does not implement `beforeAll` and `SharedSparkSession` does, but without setting IS_TESTING key.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47102][SQL][COLLATION] Add COLLATION_ENABLED config flag [spark]

Posted by "mihailom-db (via GitHub)" <gi...@apache.org>.
mihailom-db commented on code in PR #45285:
URL: https://github.com/apache/spark/pull/45285#discussion_r1505984306


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collationExpressions.scala:
##########
@@ -40,6 +41,12 @@ import org.apache.spark.sql.types._
   group = "string_funcs")
 object CollateExpressionBuilder extends ExpressionBuilder {
   override def build(funcName: String, expressions: Seq[Expression]): Expression = {
+    // Needed in order to make sure no parse errors appear when Collation Support is not enabled
+    // because this might suggest to the customer that feature is enabled until they reproduce
+    // the right syntax
+    if (!SQLConf.get.collationEnabled) {
+      throw QueryCompilationErrors.collationNotEnabledError()
+    }

Review Comment:
   Changed now, but needed to keep some check in the builder, as analyzer did not fail if collation name was wrong, and we would give confusing messages to the user in that case. I presume we do not want to give them suggestions on how to fix syntax if collation is disabled.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47102][SQL][COLLATION] Add COLLATION_ENABLED config flag [spark]

Posted by "mihailom-db (via GitHub)" <gi...@apache.org>.
mihailom-db commented on code in PR #45285:
URL: https://github.com/apache/spark/pull/45285#discussion_r1507245967


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collationExpressions.scala:
##########
@@ -67,7 +73,13 @@ object CollateExpressionBuilder extends ExpressionBuilder {
  */
 case class Collate(child: Expression, collationName: String)
   extends UnaryExpression with ExpectsInputTypes {
-  private val collationId = CollationFactory.collationNameToId(collationName)
+  private val collationId = if (!SQLConf.get.collationEnabled) {
+    // This is here if someone enters a query which only has invalid collation name
+    throw QueryCompilationErrors.collationNotEnabledError()

Review Comment:
   Comment below is related to this problem and I am not sure how to proceed on it.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47102][SQL][COLLATION] Add COLLATION_ENABLED config flag [spark]

Posted by "mihailom-db (via GitHub)" <gi...@apache.org>.
mihailom-db commented on code in PR #45285:
URL: https://github.com/apache/spark/pull/45285#discussion_r1505536348


##########
sql/core/src/test/scala/org/apache/spark/sql/CollationSuite.scala:
##########
@@ -32,6 +33,11 @@ import org.apache.spark.sql.types.StringType
 class CollationSuite extends DatasourceV2SQLBase {
   protected val v2Source = classOf[FakeV2ProviderWithCustomSchema].getName
 
+  protected override def beforeAll(): Unit = {
+    System.setProperty(IS_TESTING.key, "true")

Review Comment:
   Do you maybe have some insight in ordering of class/trait extensions in Scala?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47102][SQL][COLLATION] Add COLLATION_ENABLED config flag [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #45285:
URL: https://github.com/apache/spark/pull/45285#discussion_r1508931880


##########
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/rules.scala:
##########
@@ -592,3 +593,28 @@ case class QualifyLocationWithWarehouse(catalog: SessionCatalog) extends Rule[Lo
       c.copy(tableDesc = newTable)
   }
 }
+
+object CollationCheck extends (LogicalPlan => Unit) {
+  def apply(plan: LogicalPlan): Unit = {
+    plan.foreach {
+      case operator: LogicalPlan =>
+        operator transformExpressionsUp {
+          case e@(_: Collation | _: Collate) =>
+            if (!SQLConf.get.collationEnabled) {
+              throw QueryCompilationErrors.collationNotEnabledError()
+            }
+            e
+          case other =>
+            if (other.children.exists(hasCollationExpression)) {
+              if (!SQLConf.get.collationEnabled) {
+                throw QueryCompilationErrors.collationNotEnabledError()
+              }
+            }
+            other
+        }
+    }
+  }
+
+  private def hasCollationExpression(expression: Expression): Boolean =

Review Comment:
   I think it should be `isCollationExpression`, as it checks the given expression itself, not its children.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47102][SQL][COLLATION] Add COLLATION_ENABLED config flag [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #45285:
URL: https://github.com/apache/spark/pull/45285#discussion_r1506070480


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala:
##########
@@ -3198,7 +3198,11 @@ class AstBuilder extends DataTypeAstBuilder with SQLConfHelper with Logging {
         commentSpec = Some(spec)
       }
     }
-
+    val dataType = typedVisit[DataType](ctx.dataType)
+    if (!SQLConf.get.collationEnabled &&
+      dataType.isInstanceOf[StringType] && !dataType.asInstanceOf[StringType].isDefaultCollation) {

Review Comment:
   we should use `dataType.exists`, to check nested field data type as well.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47102][SQL] Add the `COLLATION_ENABLED` config flag [spark]

Posted by "MaxGekk (via GitHub)" <gi...@apache.org>.
MaxGekk commented on PR #45285:
URL: https://github.com/apache/spark/pull/45285#issuecomment-1978618455

   The failed GA [Run / Build modules: pyspark-connect](https://github.com/uros-db/spark/actions/runs/8154382989/job/22292454313#logs) has been passed already a couple comments before. Highly likely, it is not related to the changes.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47102][SQL][COLLATION] Add COLLATION_ENABLED config flag [spark]

Posted by "MaxGekk (via GitHub)" <gi...@apache.org>.
MaxGekk commented on code in PR #45285:
URL: https://github.com/apache/spark/pull/45285#discussion_r1504206127


##########
common/utils/src/main/resources/error/error-classes.json:
##########
@@ -3854,6 +3854,11 @@
           "Catalog <catalogName> does not support <operation>."
         ]
       },
+      "COLLATION_SUPPORT_NOT_ENABLED" : {

Review Comment:
   Just feature name is enough:
   ```suggestion
         "COLLATION" : {
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47102][SQL][COLLATION] Add COLLATION_ENABLED config flag [spark]

Posted by "mihailom-db (via GitHub)" <gi...@apache.org>.
mihailom-db commented on code in PR #45285:
URL: https://github.com/apache/spark/pull/45285#discussion_r1505436616


##########
sql/core/src/test/scala/org/apache/spark/sql/CollationSuite.scala:
##########
@@ -32,6 +33,11 @@ import org.apache.spark.sql.types.StringType
 class CollationSuite extends DatasourceV2SQLBase {
   protected val v2Source = classOf[FakeV2ProviderWithCustomSchema].getName
 
+  protected override def beforeAll(): Unit = {
+    System.setProperty(IS_TESTING.key, "true")

Review Comment:
   Previously this Suite extended some other class which did not set this key to true, new merges created a conflict for me. Will remove it.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47102][SQL][COLLATION] Add COLLATION_ENABLED config flag [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #45285:
URL: https://github.com/apache/spark/pull/45285#discussion_r1505618055


##########
sql/core/src/test/scala/org/apache/spark/sql/CollationSuite.scala:
##########
@@ -32,6 +33,11 @@ import org.apache.spark.sql.types.StringType
 class CollationSuite extends DatasourceV2SQLBase {
   protected val v2Source = classOf[FakeV2ProviderWithCustomSchema].getName
 
+  protected override def beforeAll(): Unit = {
+    System.setProperty(IS_TESTING.key, "true")

Review Comment:
   I tried this and it prints true correctly
   ```
     test("aa") {
       println("##########    " + Utils.isTesting)
     }
   ```
   
   can you double check?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47102][SQL][COLLATION] Add COLLATION_ENABLED config flag [spark]

Posted by "dbatomic (via GitHub)" <gi...@apache.org>.
dbatomic commented on code in PR #45285:
URL: https://github.com/apache/spark/pull/45285#discussion_r1509146815


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collationExpressions.scala:
##########
@@ -40,6 +41,11 @@ import org.apache.spark.sql.types._
   group = "string_funcs")
 object CollateExpressionBuilder extends ExpressionBuilder {
   override def build(funcName: String, expressions: Seq[Expression]): Expression = {
+    // We need to throw this error first, as we do not want user to see

Review Comment:
   Is this comment helpful? Throw error first comparing to what?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47102][SQL][COLLATION] Add COLLATION_ENABLED config flag [spark]

Posted by "mihailom-db (via GitHub)" <gi...@apache.org>.
mihailom-db commented on code in PR #45285:
URL: https://github.com/apache/spark/pull/45285#discussion_r1510854919


##########
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/rules.scala:
##########
@@ -592,3 +593,18 @@ case class QualifyLocationWithWarehouse(catalog: SessionCatalog) extends Rule[Lo
       c.copy(tableDesc = newTable)
   }
 }
+
+object CollationCheck extends (LogicalPlan => Unit) {
+  def apply(plan: LogicalPlan): Unit = {
+    plan.foreach {

Review Comment:
   Aha, I thought we avoided `SubqueryExpression` needs with the addition of operator.expressions.foreach. Will look into it now.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org