You are viewing a plain text version of this content. The canonical link for it is here.

Posted to reviews@spark.apache.org by "mihailom-db (via GitHub)" <gi...@apache.org> on 2024/03/19 13:50:17 UTC

[PR] [SPARK-47431][SQL][COLLATION] Add session level default Collation [spark]

mihailom-db opened a new pull request, #45592:
URL: https://github.com/apache/spark/pull/45592

   ### What changes were proposed in this pull request?
   This PR adds DEFAULT_COLLATION configuration to `SqlApiConf` and makes sure literals are created with default collation. This PR also renames misused isDefaultCollation in the code.
   
   ### Why are the changes needed?
   These changes are needed to keep clean and consistent code. Also this is closely related to casting rules. Default collation is defined as session level collation and not as UTF8_BINARY. We had UTF8_BINARY as default as implementation does not call ICU.
   
   
   ### Does this PR introduce _any_ user-facing change?
   Yes. Users can define different default collation, but casting in functions is still not supported.
   
   
   ### How was this patch tested?
   Added test to `CollationSuite`.
   
   
   ### Was this patch authored or co-authored using generative AI tooling?
   No.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47431][SQL] Add session level default Collation [spark]

Posted by "mihailom-db (via GitHub)" <gi...@apache.org>.

mihailom-db commented on code in PR #45592:
URL: https://github.com/apache/spark/pull/45592#discussion_r1539087381


##########
sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/ColumnType.scala:
##########
@@ -493,7 +493,7 @@ private[columnar] trait DirectCopyColumnType[JvmType] extends ColumnType[JvmType
 }
 
 private[columnar] object STRING
-  extends NativeColumnType(PhysicalStringType(CollationFactory.DEFAULT_COLLATION_ID), 8)
+  extends NativeColumnType(PhysicalStringType(SqlApiConf.get.defaultStringType.collationId), 8)

Review Comment:
   @dbatomic I changed it, but am not sure if it should be default or just utf8_binary



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47431][SQL][COLLATION] Add session level default Collation [spark]

Posted by "dbatomic (via GitHub)" <gi...@apache.org>.

dbatomic commented on code in PR #45592:
URL: https://github.com/apache/spark/pull/45592#discussion_r1531938884


##########
sql/api/src/main/scala/org/apache/spark/sql/catalyst/util/SparkStringTypeUtils.scala:
##########
@@ -0,0 +1,53 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.util
+
+import org.apache.spark.sql.internal.SqlApiConf
+
+trait SparkStringTypeUtils {
+  /**
+   * Check if default collation is up-to-date with a config.
+   */
+  private def refreshDefaultCollation(): Unit = {
+    if (SqlApiConf.get.defaultCollation != DEFAULT_COLLATION) {

Review Comment:
   Again, this might be slow.
   1) @cloud-fan / @MaxGekk - is this the right way to react on config updates?
   2) If we keep it like this, it would be good to make sure that `isDefaultCollation` is never called during query execution phase since this will likely be too slow. Maybe we can even add - `AnalysisHelper.inAnalyzer` check here?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47431][SQL][COLLATION] Add session level default Collation [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.

cloud-fan commented on code in PR #45592:
URL: https://github.com/apache/spark/pull/45592#discussion_r1537860683


##########
sql/core/src/test/scala/org/apache/spark/sql/CollationSuite.scala:
##########
@@ -638,7 +641,33 @@ class CollationSuite extends DatasourceV2SQLBase with AdaptiveSparkPlanHelper {
       parameters = Map(
         "fieldName" -> "c2",
         "expressionStr" -> "UCASE(struct1.a)",
-        "reason" -> "generation expression cannot contain non-default collated string type"))
+        "reason" ->
+          "generation expression cannot contain non-binary orderable collated string type"))
+  }
+
+  test("SPARK-47431: Default collation set to UNICODE, literal test") {
+    withSQLConf(SqlApiConf.DEFAULT_COLLATION -> "UNICODE") {
+      checkAnswer(sql(s"SELECT collation('aa')"), Seq(Row("UNICODE")))
+    }
+  }
+
+  test("SPARK-47431: Check if config changes value") {
+    sql("SET spark.sql.session.collation.default=UNICODE")
+    checkAnswer(sql(s"SELECT collation('aa')"), Seq(Row("UNICODE")))
+    sql("SET spark.sql.session.collation.default=UTF8_BINARY")
+    checkAnswer(sql(s"SELECT collation('aa')"), Seq(Row("UTF8_BINARY")))
+  }
+
+  test("SPARK-47431: Default collation set to UNICODE, column type test") {
+    withSQLConf(SqlApiConf.DEFAULT_COLLATION -> "UNICODE") {
+      sql(
+        s"""
+           |CREATE TABLE t(c1 STRING)

Review Comment:
   oh, it's because we store table schema as JSON string in the metastore, and parse it back later. The data type parsing does respect the default collation setting.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47431][SQL][COLLATION] Add session level default Collation [spark]

Posted by "mihailom-db (via GitHub)" <gi...@apache.org>.

mihailom-db commented on code in PR #45592:
URL: https://github.com/apache/spark/pull/45592#discussion_r1537210767


##########
sql/api/src/main/scala/org/apache/spark/sql/catalyst/parser/DataTypeAstBuilder.scala:
##########
@@ -74,7 +74,8 @@ class DataTypeAstBuilder extends SqlBaseParserBaseVisitor[AnyRef] {
       case (TIMESTAMP_LTZ, Nil) => TimestampType
       case (STRING, Nil) =>
         typeCtx.children.asScala.toSeq match {
-          case Seq(_) => StringType
+          case Seq(_) =>
+            StringType(CollationFactory.collationNameToId(SqlApiConf.get.defaultCollation))

Review Comment:
   Yeah, I was concerned about this too, as some tests did fail because of this change.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47431][SQL][COLLATION] Add session level default Collation [spark]

Posted by "dbatomic (via GitHub)" <gi...@apache.org>.

dbatomic commented on code in PR #45592:
URL: https://github.com/apache/spark/pull/45592#discussion_r1535246864


##########
sql/api/src/main/scala/org/apache/spark/sql/catalyst/util/SparkStringTypeUtils.scala:
##########
@@ -0,0 +1,53 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.util
+
+import org.apache.spark.sql.internal.SqlApiConf
+
+trait SparkStringTypeUtils {

Review Comment:
   You don't need Spark in class name. It is already in spark namespace.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47431][SQL][COLLATION] Add session level default Collation [spark]

Posted by "dbatomic (via GitHub)" <gi...@apache.org>.

dbatomic commented on code in PR #45592:
URL: https://github.com/apache/spark/pull/45592#discussion_r1535377436


##########
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationFactory.java:
##########
@@ -68,23 +68,30 @@ public static class Collation {
      * Binary collation implies that UTF8Strings are considered equal only if they are
      * byte for byte equal. All accent or case-insensitive collations are considered non-binary.
      */
-    public final boolean isBinaryCollation;
+    public final boolean supportsBinaryEquality;
+    /**
+     * Binary collation implies that UTF8Strings are compared on byte level.

Review Comment:
   Can you extend a bit comment description of these two fields?
   
   E.g.:
   1) Update comment for supportsBinaryEquality to reflect name change
   2) Update comment in supportsBinaryOrdering.
   3) Say that supportsBinaryEq means that it's safe to call binaryEquals in UTF8String, which is more performant.
   4) Say that supportsBinaryOrdering means that its safe to call binaryCompare.
   5) Say that supportsBinaryOrdering implies supportsBinaryEquality (we can add an assert in constructor for this).



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47431][SQL][COLLATION] Add session level default Collation [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.

cloud-fan commented on code in PR #45592:
URL: https://github.com/apache/spark/pull/45592#discussion_r1537387309


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/StringTypeUtils.scala:
##########
@@ -0,0 +1,54 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.util
+
+import org.apache.spark.{SparkContext, SparkEnv}
+import org.apache.spark.internal.Logging
+import org.apache.spark.sql.internal.SqlApiConf
+
+object StringTypeUtils extends Logging {

Review Comment:
   We can make `SqlApiConf.get.defaultCollation` to return the collation id, and move the implementation there.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47431][SQL][COLLATION] Add session level default Collation [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.

cloud-fan commented on code in PR #45592:
URL: https://github.com/apache/spark/pull/45592#discussion_r1538476186


##########
sql/api/src/main/scala/org/apache/spark/sql/catalyst/parser/DataTypeAstBuilder.scala:
##########
@@ -74,7 +74,8 @@ class DataTypeAstBuilder extends SqlBaseParserBaseVisitor[AnyRef] {
       case (TIMESTAMP_LTZ, Nil) => TimestampType
       case (STRING, Nil) =>
         typeCtx.children.asScala.toSeq match {
-          case Seq(_) => StringType
+          case Seq(_) =>
+            SqlApiConf.get.defaultStringType

Review Comment:
   ```suggestion
             case Seq(_) => SqlApiConf.get.defaultStringType
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47431][SQL][COLLATION] Add session level default Collation [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.

cloud-fan commented on code in PR #45592:
URL: https://github.com/apache/spark/pull/45592#discussion_r1537889609


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala:
##########
@@ -4999,7 +4999,12 @@ class SQLConf extends Serializable with Logging with SqlApiConf {
 
   def collationEnabled: Boolean = getConf(COLLATION_ENABLED)
 
-  def defaultCollation: String = getConf(DEFAULT_COLLATION)
+  override def defaultStringType: StringType = {
+    if (getConf(DEFAULT_COLLATION) == "UTF8_BINARY") {
+      StringType
+    }
+    else StringType(CollationFactory.collationNameToId(getConf(DEFAULT_COLLATION)))
+  }

Review Comment:
   ```suggestion
       } else {
         StringType(CollationFactory.collationNameToId(getConf(DEFAULT_COLLATION)))
       }
     }
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47431][SQL][COLLATION] Add session level default Collation [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.

cloud-fan commented on code in PR #45592:
URL: https://github.com/apache/spark/pull/45592#discussion_r1538483160


##########
sql/core/src/test/scala/org/apache/spark/sql/CollationSuite.scala:
##########
@@ -638,7 +641,38 @@ class CollationSuite extends DatasourceV2SQLBase with AdaptiveSparkPlanHelper {
       parameters = Map(
         "fieldName" -> "c2",
         "expressionStr" -> "UCASE(struct1.a)",
-        "reason" -> "generation expression cannot contain non-default collated string type"))
+        "reason" ->
+          "generation expression cannot contain non-binary orderable collated string type"))
+  }
+
+  test("SPARK-47431: Default collation set to UNICODE, literal test") {
+    withSQLConf(SqlApiConf.DEFAULT_COLLATION -> "UNICODE") {
+      checkAnswer(sql(s"SELECT collation('aa')"), Seq(Row("UNICODE")))
+    }
+  }
+
+  test("SPARK-47431: Default collation set to UNICODE, column type test") {
+    withSQLConf(SqlApiConf.DEFAULT_COLLATION -> "UNICODE") {
+      sql(
+        s"""
+           |CREATE TABLE t(c1 STRING)
+           |USING PARQUET
+           |""".stripMargin)
+      sql("INSERT INTO t VALUES ('a')")
+      checkAnswer(sql("SELECT collation(c1) FROM t"), Seq(Row("UNICODE")))
+    }
+  }
+
+  test("SPARK-47431: Create table with UTF8_BINARY, make sure collation persists on read") {
+    sql(
+      s"""
+         |CREATE TABLE t(c1 STRING)

Review Comment:
   wrap with `withTable` please.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47431][SQL][COLLATION] Add session level default Collation [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.

cloud-fan commented on code in PR #45592:
URL: https://github.com/apache/spark/pull/45592#discussion_r1538482045


##########
sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/ColumnType.scala:
##########
@@ -493,7 +493,7 @@ private[columnar] trait DirectCopyColumnType[JvmType] extends ColumnType[JvmType
 }
 
 private[columnar] object STRING
-  extends NativeColumnType(PhysicalStringType(CollationFactory.DEFAULT_COLLATION_ID), 8)
+  extends NativeColumnType(PhysicalStringType(SqlApiConf.get.defaultStringType.collationId), 8)

Review Comment:
   Do we need to change this? It looks this object does not care about collation



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47431][SQL][COLLATION] Add session level default Collation [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.

cloud-fan commented on code in PR #45592:
URL: https://github.com/apache/spark/pull/45592#discussion_r1538478109


##########
sql/api/src/main/scala/org/apache/spark/sql/types/StringType.scala:
##########
@@ -29,25 +29,32 @@ import org.apache.spark.sql.catalyst.util.CollationFactory
 @Stable
 class StringType private(val collationId: Int) extends AtomicType with Serializable {
   /**
-   * Returns whether assigned collation is the default spark collation (UTF8_BINARY).
+   * Support for Binary Equality implies that strings are considered equal only if
+   * they are byte for byte equal. E.g. all accent or case-insensitive collations are considered
+   * non-binary. If this field is true, byte level operations can be used against this datatype
+   * (e.g. for equality and hashing).
    */
-  def isDefaultCollation: Boolean = collationId == CollationFactory.DEFAULT_COLLATION_ID
+  def supportsBinaryEquality: Boolean =
+    CollationFactory.fetchCollation(collationId).supportsBinaryEquality
+  def isUTF8BinaryLcaseCollation: Boolean =
+    collationId == CollationFactory.UTF8_BINARY_LCASE_COLLATION_ID
 
   /**
-   * Binary collation implies that strings are considered equal only if they are
-   * byte for byte equal. E.g. all accent or case-insensitive collations are considered non-binary.
-   * If this field is true, byte level operations can be used against this datatype (e.g. for
-   * equality and hashing).
+   * Support for Binary Ordering implies that strings are considered equal only

Review Comment:
   shall we mention ordering?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47431][SQL][COLLATION] Add session level default Collation [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.

cloud-fan commented on code in PR #45592:
URL: https://github.com/apache/spark/pull/45592#discussion_r1538480741


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala:
##########
@@ -4991,6 +4999,14 @@ class SQLConf extends Serializable with Logging with SqlApiConf {
 
   def collationEnabled: Boolean = getConf(COLLATION_ENABLED)
 
+  override def defaultStringType: StringType = {
+    if (getConf(DEFAULT_COLLATION) == "UTF8_BINARY") {

Review Comment:
   ```suggestion
       if (getConf(DEFAULT_COLLATION).toUpperCase == "UTF8_BINARY") {
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47431][SQL][COLLATION] Add session level default Collation [spark]

Posted by "dbatomic (via GitHub)" <gi...@apache.org>.

dbatomic commented on code in PR #45592:
URL: https://github.com/apache/spark/pull/45592#discussion_r1530435638


##########
sql/api/src/main/scala/org/apache/spark/sql/types/StringType.scala:
##########
@@ -31,7 +32,8 @@ class StringType private(val collationId: Int) extends AtomicType with Serializa
   /**
    * Returns whether assigned collation is the default spark collation (UTF8_BINARY).
    */
-  def isDefaultCollation: Boolean = collationId == CollationFactory.DEFAULT_COLLATION_ID
+  def isDefaultCollation: Boolean =

Review Comment:
   My fear is that this may be perf sensitive place. can you keep this as simple int compare? `collationNameToId` is dictionary search against string field...



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47431][SQL][COLLATION] Add session level default Collation [spark]

Posted by "mihailom-db (via GitHub)" <gi...@apache.org>.

mihailom-db commented on code in PR #45592:
URL: https://github.com/apache/spark/pull/45592#discussion_r1532017826


##########
sql/api/src/main/scala/org/apache/spark/sql/types/StringType.scala:
##########
@@ -41,12 +37,22 @@ class StringType private(val collationId: Int) extends AtomicType with Serializa
    */
   def isBinaryCollation: Boolean = CollationFactory.fetchCollation(collationId).isBinaryCollation
 
+  /**
+   * Spark internal collation implies that strings are considered equal only if they are
+   * byte for byte equal. E.g. all accent or case-insensitive collations are considered non-binary.
+   * Also their comparison does not require ICU library calls, as ordering follows
+   * spark internal implementation. If this field is true, byte level operations can be
+   * used against this datatype (e.g. for equality, hashing and sorting).
+   */
+  def isUTF8BinaryCollation: Boolean =

Review Comment:
   This is collation that we do not use ICU library for any comparison (equals, hash, compare). All other binary collations need to call ICU for compare.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47431][SQL][COLLATION] Add session level default Collation [spark]

Posted by "dbatomic (via GitHub)" <gi...@apache.org>.

dbatomic commented on code in PR #45592:
URL: https://github.com/apache/spark/pull/45592#discussion_r1530431684


##########
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationFactory.java:
##########
@@ -109,7 +109,7 @@ public Collation(
   private static final Collation[] collationTable = new Collation[4];
   private static final HashMap<String, Integer> collationNameToIdMap = new HashMap<>();
 
-  public static final int DEFAULT_COLLATION_ID = 0;
+  public static final int SPARK_INTERNAL_COLLATION_ID = 0;

Review Comment:
   Can you call this simply `UTF8_BINARY_COLLATION_ID`? I think adding any other adjective will only bring confusion (what does `INTERNAL` mean?).
   
   Also, please rename `LOWERCASE_COLLATION_ID`. It should be `UTF_BINARY_LCASE_COLLATION_ID`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47431][SQL][COLLATION] Add session level default Collation [spark]

Posted by "dbatomic (via GitHub)" <gi...@apache.org>.

dbatomic commented on code in PR #45592:
URL: https://github.com/apache/spark/pull/45592#discussion_r1535362379


##########
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationFactory.java:
##########
@@ -68,23 +68,30 @@ public static class Collation {
      * Binary collation implies that UTF8Strings are considered equal only if they are
      * byte for byte equal. All accent or case-insensitive collations are considered non-binary.
      */
-    public final boolean isBinaryCollation;
+    public final boolean supportsBinaryEquality;
+    /**
+     * Binary collation implies that UTF8Strings are compared on byte level.
+     * All accent or case-insensitive collations are considered non-binary.
+     */
+    public final boolean supportsBinaryOrdering;
 
     public Collation(
         String collationName,
         Collator collator,
         Comparator<UTF8String> comparator,
         String version,
         ToLongFunction<UTF8String> hashFunction,
-        boolean isBinaryCollation) {
+        boolean supportsBinaryEquality,

Review Comment:
   Thank you for doing this, it finally starts to look like a proper API :)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47431][SQL][COLLATION] Add session level default Collation [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.

cloud-fan commented on code in PR #45592:
URL: https://github.com/apache/spark/pull/45592#discussion_r1537195738


##########
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationFactory.java:
##########
@@ -65,26 +65,39 @@ public static class Collation {
     public final BiFunction<UTF8String, UTF8String, Boolean> equalsFunction;
 
     /**
-     * Binary collation implies that UTF8Strings are considered equal only if they are
-     * byte for byte equal. All accent or case-insensitive collations are considered non-binary.
+     * Support for Binary Equality implies that it is possible to check equality on
+     * byte by byte level. This allows for the usage of binaryEquals call on UTF8Strings
+     * which is more performant than calls to external ICU library.
      */
-    public final boolean isBinaryCollation;
+    public final boolean supportsBinaryEquality;
+    /**
+     * Support for Binary Ordering implies that it is possible to check equality and compare on
+     * byte by byte level. This allows for the usage of binaryEquals and binaryCompare calls on
+     * UTF8Strings which is more performant than calls to external ICU library. Support for
+     * Binary Ordering implies support for Binary Equality.

Review Comment:
   shall we add an assert to enforce this in the constructor?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47431][SQL][COLLATION] Add session level default Collation [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.

cloud-fan commented on code in PR #45592:
URL: https://github.com/apache/spark/pull/45592#discussion_r1537195051


##########
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationFactory.java:
##########
@@ -65,26 +65,39 @@ public static class Collation {
     public final BiFunction<UTF8String, UTF8String, Boolean> equalsFunction;
 
     /**
-     * Binary collation implies that UTF8Strings are considered equal only if they are
-     * byte for byte equal. All accent or case-insensitive collations are considered non-binary.
+     * Support for Binary Equality implies that it is possible to check equality on
+     * byte by byte level. This allows for the usage of binaryEquals call on UTF8Strings
+     * which is more performant than calls to external ICU library.
      */
-    public final boolean isBinaryCollation;
+    public final boolean supportsBinaryEquality;
+    /**
+     * Support for Binary Ordering implies that it is possible to check equality and compare on

Review Comment:
   ```suggestion
        * Support for Binary Ordering implies that it is possible to check ordering and compare on
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47431][SQL][COLLATION] Add session level default Collation [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.

cloud-fan commented on code in PR #45592:
URL: https://github.com/apache/spark/pull/45592#discussion_r1537205409


##########
sql/core/src/test/scala/org/apache/spark/sql/CollationSuite.scala:
##########
@@ -638,7 +641,33 @@ class CollationSuite extends DatasourceV2SQLBase with AdaptiveSparkPlanHelper {
       parameters = Map(
         "fieldName" -> "c2",
         "expressionStr" -> "UCASE(struct1.a)",
-        "reason" -> "generation expression cannot contain non-default collated string type"))
+        "reason" ->
+          "generation expression cannot contain non-binary orderable collated string type"))
+  }
+
+  test("SPARK-47431: Default collation set to UNICODE, literal test") {
+    withSQLConf(SqlApiConf.DEFAULT_COLLATION -> "UNICODE") {
+      checkAnswer(sql(s"SELECT collation('aa')"), Seq(Row("UNICODE")))
+    }
+  }
+
+  test("SPARK-47431: Check if config changes value") {

Review Comment:
   We don't need to test this if we don't cache the conf value.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47431][SQL][COLLATION] Add session level default Collation [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.

cloud-fan commented on code in PR #45592:
URL: https://github.com/apache/spark/pull/45592#discussion_r1538483034


##########
sql/core/src/test/scala/org/apache/spark/sql/CollationSuite.scala:
##########
@@ -638,7 +641,38 @@ class CollationSuite extends DatasourceV2SQLBase with AdaptiveSparkPlanHelper {
       parameters = Map(
         "fieldName" -> "c2",
         "expressionStr" -> "UCASE(struct1.a)",
-        "reason" -> "generation expression cannot contain non-default collated string type"))
+        "reason" ->
+          "generation expression cannot contain non-binary orderable collated string type"))
+  }
+
+  test("SPARK-47431: Default collation set to UNICODE, literal test") {
+    withSQLConf(SqlApiConf.DEFAULT_COLLATION -> "UNICODE") {
+      checkAnswer(sql(s"SELECT collation('aa')"), Seq(Row("UNICODE")))
+    }
+  }
+
+  test("SPARK-47431: Default collation set to UNICODE, column type test") {
+    withSQLConf(SqlApiConf.DEFAULT_COLLATION -> "UNICODE") {
+      sql(
+        s"""
+           |CREATE TABLE t(c1 STRING)
+           |USING PARQUET
+           |""".stripMargin)
+      sql("INSERT INTO t VALUES ('a')")
+      checkAnswer(sql("SELECT collation(c1) FROM t"), Seq(Row("UNICODE")))
+    }
+  }
+
+  test("SPARK-47431: Create table with UTF8_BINARY, make sure collation persists on read") {
+    sql(
+      s"""
+         |CREATE TABLE t(c1 STRING)
+         |USING PARQUET
+         |""".stripMargin)
+    sql("INSERT INTO t VALUES ('a')")
+    checkAnswer(sql("SELECT collation(c1) FROM t"), Seq(Row("UTF8_BINARY")))
+    sql("SET spark.sql.session.collation.default=UNICODE")

Review Comment:
   use `withSQLConf` to set conf in test cases



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47431][SQL] Add session level default Collation [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.

cloud-fan commented on code in PR #45592:
URL: https://github.com/apache/spark/pull/45592#discussion_r1538819666


##########
sql/core/src/test/scala/org/apache/spark/sql/CollationSuite.scala:
##########
@@ -638,7 +641,44 @@ class CollationSuite extends DatasourceV2SQLBase with AdaptiveSparkPlanHelper {
       parameters = Map(
         "fieldName" -> "c2",
         "expressionStr" -> "UCASE(struct1.a)",
-        "reason" -> "generation expression cannot contain non-default collated string type"))
+        "reason" ->
+          "generation expression cannot contain non-binary orderable collated string type"))
+  }
+
+  test("SPARK-47431: Default collation set to UNICODE, literal test") {
+    withSQLConf(SqlApiConf.DEFAULT_COLLATION -> "UNICODE") {
+      checkAnswer(sql(s"SELECT collation('aa')"), Seq(Row("UNICODE")))
+    }
+  }
+
+  test("SPARK-47431: Default collation set to UNICODE, column type test") {
+    withSQLConf(SqlApiConf.DEFAULT_COLLATION -> "UNICODE") {
+      sql(
+        s"""
+           |CREATE TABLE t(c1 STRING)
+           |USING PARQUET
+           |""".stripMargin)
+      sql("INSERT INTO t VALUES ('a')")
+      checkAnswer(sql("SELECT collation(c1) FROM t"), Seq(Row("UNICODE")))
+    }
+  }
+
+  test("SPARK-47431: Create table with UTF8_BINARY, make sure collation persists on read") {
+    val tableName = "t"
+    withTable(tableName) {
+      withSQLConf(SqlApiConf.DEFAULT_COLLATION -> "UTF8_BINARY") {
+        sql(
+          s"""
+             |CREATE TABLE $tableName(c1 STRING)
+             |USING PARQUET
+             |""".stripMargin)
+        sql(s"INSERT INTO $tableName VALUES ('a')")
+        checkAnswer(sql(s"SELECT collation(c1) FROM $tableName"), Seq(Row("UTF8_BINARY")))
+        checkAnswer(sql("SET spark.sql.session.collation.default=UNICODE"),

Review Comment:
   we should have two `withSQLConf` blocks, one for CREATE TABLE, one for the final SELECT



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47431][SQL][COLLATION] Add session level default Collation [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.

cloud-fan commented on code in PR #45592:
URL: https://github.com/apache/spark/pull/45592#discussion_r1534043872


##########
sql/api/src/main/scala/org/apache/spark/sql/catalyst/util/SparkStringTypeUtils.scala:
##########
@@ -0,0 +1,53 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.util
+
+import org.apache.spark.sql.internal.SqlApiConf
+
+trait SparkStringTypeUtils {
+  /**
+   * Check if default collation is up-to-date with a config.
+   */
+  private def refreshDefaultCollation(): Unit = {
+    if (SqlApiConf.get.defaultCollation != DEFAULT_COLLATION) {
+      DEFAULT_COLLATION = SqlApiConf.get.defaultCollation
+      DEFAULT_COLLATION_ID = CollationFactory.collationNameToId(DEFAULT_COLLATION)

Review Comment:
   to enforce it's only called at the driver side, we can assert `SparkEnv.get.executorId == SparkContext.DRIVER_IDENTIFIER`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47431][SQL][COLLATION] Add session level default Collation [spark]

Posted by "stefankandic (via GitHub)" <gi...@apache.org>.

stefankandic commented on code in PR #45592:
URL: https://github.com/apache/spark/pull/45592#discussion_r1530424813


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/literals.scala:
##########
@@ -69,10 +69,41 @@ object Literal {
     case f: Float => Literal(f, FloatType)
     case b: Byte => Literal(b, ByteType)
     case s: Short => Literal(s, ShortType)
-    case s: String => Literal(UTF8String.fromString(s), StringType)
-    case s: UTF8String => Literal(s, StringType)
-    case c: Char => Literal(UTF8String.fromString(c.toString), StringType)
-    case ac: Array[Char] => Literal(UTF8String.fromString(String.valueOf(ac)), StringType)
+    case s: String =>
+      if (CollationFactory.collationNameToId(SQLConf.get.defaultCollation)
+        == CollationFactory.SPARK_INTERNAL_COLLATION_ID) {
+        Literal(UTF8String.fromString(s), StringType)
+      }
+      else {
+        Literal(UTF8String.fromString(s),
+          StringType(CollationFactory.collationNameToId(SQLConf.get.defaultCollation)))
+      }
+    case s: UTF8String =>
+      if (CollationFactory.collationNameToId(SQLConf.get.defaultCollation)
+        == CollationFactory.SPARK_INTERNAL_COLLATION_ID) {
+        Literal(s, StringType)
+      }
+      else {
+        Literal(s, StringType(CollationFactory.collationNameToId(SQLConf.get.defaultCollation)))
+      }
+    case c: Char =>
+      if (CollationFactory.collationNameToId(SQLConf.get.defaultCollation)
+        == CollationFactory.SPARK_INTERNAL_COLLATION_ID) {
+        Literal(UTF8String.fromString(c.toString), StringType)
+      }
+      else {
+        Literal(UTF8String.fromString(c.toString),
+          StringType(CollationFactory.collationNameToId(SQLConf.get.defaultCollation)))
+      }
+    case ac: Array[Char] =>
+      if (CollationFactory.collationNameToId(SQLConf.get.defaultCollation)
+        == CollationFactory.SPARK_INTERNAL_COLLATION_ID) {
+        Literal(UTF8String.fromString(String.valueOf(ac)), StringType)

Review Comment:
   can't you use `String.valueOf` for others as well? AFAIK it accepts `Object` as well.
   
   if not at least try to extract some of this functionality into a common method or something because it seems repetitive for these 4 cases



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47431][SQL][COLLATION] Add session level default Collation [spark]

Posted by "dbatomic (via GitHub)" <gi...@apache.org>.

dbatomic commented on code in PR #45592:
URL: https://github.com/apache/spark/pull/45592#discussion_r1530443120


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/literals.scala:
##########
@@ -69,10 +69,41 @@ object Literal {
     case f: Float => Literal(f, FloatType)
     case b: Byte => Literal(b, ByteType)
     case s: Short => Literal(s, ShortType)
-    case s: String => Literal(UTF8String.fromString(s), StringType)
-    case s: UTF8String => Literal(s, StringType)
-    case c: Char => Literal(UTF8String.fromString(c.toString), StringType)
-    case ac: Array[Char] => Literal(UTF8String.fromString(String.valueOf(ac)), StringType)
+    case s: String =>

Review Comment:
   Do you need `== CollationFactory.SPARK_INTERNAL_COLLATION_ID` check?
   Shouldn't `StringType(CollationFactory.collationNameToId(SQLConf.get.defaultCollation)` work even for collationId = 0?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47431][SQL][COLLATION] Add session level default Collation [spark]

Posted by "mihailom-db (via GitHub)" <gi...@apache.org>.

mihailom-db commented on code in PR #45592:
URL: https://github.com/apache/spark/pull/45592#discussion_r1533545441


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala:
##########
@@ -757,6 +757,13 @@ object SQLConf {
       .booleanConf
       .createWithDefault(Utils.isTesting)
 
+  val DEFAULT_COLLATION =
+    buildConf(SqlApiConfHelper.DEFAULT_COLLATION)
+      .doc("Default collation when considering casting rules.")

Review Comment:
   Improved it. Need to add a few changes to some expressions that return StringType ti support this version of doc.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47431][SQL][COLLATION] Add session level default Collation [spark]

Posted by "dbatomic (via GitHub)" <gi...@apache.org>.

dbatomic commented on code in PR #45592:
URL: https://github.com/apache/spark/pull/45592#discussion_r1532180679


##########
sql/api/src/main/scala/org/apache/spark/sql/types/StringType.scala:
##########
@@ -41,12 +37,22 @@ class StringType private(val collationId: Int) extends AtomicType with Serializa
    */
   def isBinaryCollation: Boolean = CollationFactory.fetchCollation(collationId).isBinaryCollation
 
+  /**
+   * Spark internal collation implies that strings are considered equal only if they are
+   * byte for byte equal. E.g. all accent or case-insensitive collations are considered non-binary.
+   * Also their comparison does not require ICU library calls, as ordering follows
+   * spark internal implementation. If this field is true, byte level operations can be
+   * used against this datatype (e.g. for equality, hashing and sorting).
+   */
+  def isUTF8BinaryCollation: Boolean =

Review Comment:
   Please update comment. This is not `Spark internal collation`.
   
   In short:
   `isBinaryCollation` - ability to do hashing and binary equality.
   `isUTF8BinaryCollation` - above + binary ordering.
   
   Maybe we should rename these into:
   `supportsBinaryEquality` and `supportsBinaryOrdering`?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47431][SQL][COLLATION] Add session level default Collation [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.

cloud-fan commented on code in PR #45592:
URL: https://github.com/apache/spark/pull/45592#discussion_r1537196013


##########
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationFactory.java:
##########
@@ -65,26 +65,39 @@ public static class Collation {
     public final BiFunction<UTF8String, UTF8String, Boolean> equalsFunction;
 
     /**
-     * Binary collation implies that UTF8Strings are considered equal only if they are
-     * byte for byte equal. All accent or case-insensitive collations are considered non-binary.
+     * Support for Binary Equality implies that it is possible to check equality on
+     * byte by byte level. This allows for the usage of binaryEquals call on UTF8Strings
+     * which is more performant than calls to external ICU library.
      */
-    public final boolean isBinaryCollation;
+    public final boolean supportsBinaryEquality;
+    /**
+     * Support for Binary Ordering implies that it is possible to check equality and compare on
+     * byte by byte level. This allows for the usage of binaryEquals and binaryCompare calls on
+     * UTF8Strings which is more performant than calls to external ICU library. Support for
+     * Binary Ordering implies support for Binary Equality.

Review Comment:
   oh we already did.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47431][SQL][COLLATION] Add session level default Collation [spark]

Posted by "dbatomic (via GitHub)" <gi...@apache.org>.

dbatomic commented on code in PR #45592:
URL: https://github.com/apache/spark/pull/45592#discussion_r1532175896


##########
sql/api/src/main/scala/org/apache/spark/sql/catalyst/util/SparkStringTypeUtils.scala:
##########
@@ -0,0 +1,53 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.util
+
+import org.apache.spark.sql.internal.SqlApiConf
+
+trait SparkStringTypeUtils {
+  /**
+   * Check if default collation is up-to-date with a config.
+   */
+  private def refreshDefaultCollation(): Unit = {
+    if (SqlApiConf.get.defaultCollation != DEFAULT_COLLATION) {
+      DEFAULT_COLLATION = SqlApiConf.get.defaultCollation
+      DEFAULT_COLLATION_ID = CollationFactory.collationNameToId(DEFAULT_COLLATION)

Review Comment:
   I would just like to make sure that we never end up calling this on execution side. E.g. if we end up making hash map lookup per row we are dead. Could we use `AnalysisHelper.inAnalyzer` flag to make sure that this field is only accessed during analysis?
   
   This is a healthy thing to do for Delta side as well - Delta should never rely on session level collation (e.g. for constraints/computed columns) so it should be good if we can actually enforce this.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47431][SQL][COLLATION] Add session level default Collation [spark]

Posted by "mihailom-db (via GitHub)" <gi...@apache.org>.

mihailom-db commented on code in PR #45592:
URL: https://github.com/apache/spark/pull/45592#discussion_r1530459134


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/literals.scala:
##########
@@ -69,10 +69,41 @@ object Literal {
     case f: Float => Literal(f, FloatType)
     case b: Byte => Literal(b, ByteType)
     case s: Short => Literal(s, ShortType)
-    case s: String => Literal(UTF8String.fromString(s), StringType)
-    case s: UTF8String => Literal(s, StringType)
-    case c: Char => Literal(UTF8String.fromString(c.toString), StringType)
-    case ac: Array[Char] => Literal(UTF8String.fromString(String.valueOf(ac)), StringType)
+    case s: String =>

Review Comment:
   It will work, but it will break backwards compatibility until we implement Casting. All rules only match object StringType for now, but not the class as well.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47431][SQL][COLLATION] Add session level default Collation [spark]

Posted by "mihailom-db (via GitHub)" <gi...@apache.org>.

mihailom-db commented on code in PR #45592:
URL: https://github.com/apache/spark/pull/45592#discussion_r1533417781


##########
sql/api/src/main/scala/org/apache/spark/sql/types/StringType.scala:
##########
@@ -41,12 +37,22 @@ class StringType private(val collationId: Int) extends AtomicType with Serializa
    */
   def isBinaryCollation: Boolean = CollationFactory.fetchCollation(collationId).isBinaryCollation
 
+  /**
+   * Spark internal collation implies that strings are considered equal only if they are
+   * byte for byte equal. E.g. all accent or case-insensitive collations are considered non-binary.
+   * Also their comparison does not require ICU library calls, as ordering follows
+   * spark internal implementation. If this field is true, byte level operations can be
+   * used against this datatype (e.g. for equality, hashing and sorting).
+   */
+  def isUTF8BinaryCollation: Boolean =

Review Comment:
   I am not sure this naming is the best. Are we sure that other orderings in ICU are not binary implemented as well?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47431][SQL][COLLATION] Add session level default Collation [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.

cloud-fan commented on code in PR #45592:
URL: https://github.com/apache/spark/pull/45592#discussion_r1537862661


##########
sql/core/src/test/scala/org/apache/spark/sql/CollationSuite.scala:
##########
@@ -638,7 +641,33 @@ class CollationSuite extends DatasourceV2SQLBase with AdaptiveSparkPlanHelper {
       parameters = Map(
         "fieldName" -> "c2",
         "expressionStr" -> "UCASE(struct1.a)",
-        "reason" -> "generation expression cannot contain non-default collated string type"))
+        "reason" ->
+          "generation expression cannot contain non-binary orderable collated string type"))
+  }
+
+  test("SPARK-47431: Default collation set to UNICODE, literal test") {
+    withSQLConf(SqlApiConf.DEFAULT_COLLATION -> "UNICODE") {
+      checkAnswer(sql(s"SELECT collation('aa')"), Seq(Row("UNICODE")))
+    }
+  }
+
+  test("SPARK-47431: Check if config changes value") {
+    sql("SET spark.sql.session.collation.default=UNICODE")
+    checkAnswer(sql(s"SELECT collation('aa')"), Seq(Row("UNICODE")))
+    sql("SET spark.sql.session.collation.default=UTF8_BINARY")
+    checkAnswer(sql(s"SELECT collation('aa')"), Seq(Row("UTF8_BINARY")))
+  }
+
+  test("SPARK-47431: Default collation set to UNICODE, column type test") {
+    withSQLConf(SqlApiConf.DEFAULT_COLLATION -> "UNICODE") {
+      sql(
+        s"""
+           |CREATE TABLE t(c1 STRING)

Review Comment:
   oh, can we add a test that:
   1. we create a table with default collation being UTF8_BINARY
   2. we change the default collation setting and read the table back. It should still be UTF8_BINARY



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47431][SQL][COLLATION] Add session level default Collation [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.

cloud-fan commented on code in PR #45592:
URL: https://github.com/apache/spark/pull/45592#discussion_r1537202766


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/StringTypeUtils.scala:
##########
@@ -0,0 +1,54 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.util
+
+import org.apache.spark.{SparkContext, SparkEnv}
+import org.apache.spark.internal.Logging
+import org.apache.spark.sql.internal.SqlApiConf
+
+object StringTypeUtils extends Logging {

Review Comment:
   Do we still need this? My preference is to avoid premature perf optimization until we have numbers to prove we need to improve the perf.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47431][SQL][COLLATION] Add session level default Collation [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.

cloud-fan commented on code in PR #45592:
URL: https://github.com/apache/spark/pull/45592#discussion_r1537205070


##########
sql/core/src/test/scala/org/apache/spark/sql/CollationSuite.scala:
##########
@@ -638,7 +641,33 @@ class CollationSuite extends DatasourceV2SQLBase with AdaptiveSparkPlanHelper {
       parameters = Map(
         "fieldName" -> "c2",
         "expressionStr" -> "UCASE(struct1.a)",
-        "reason" -> "generation expression cannot contain non-default collated string type"))
+        "reason" ->
+          "generation expression cannot contain non-binary orderable collated string type"))
+  }
+
+  test("SPARK-47431: Default collation set to UNICODE, literal test") {
+    withSQLConf(SqlApiConf.DEFAULT_COLLATION -> "UNICODE") {
+      checkAnswer(sql(s"SELECT collation('aa')"), Seq(Row("UNICODE")))
+    }
+  }
+
+  test("SPARK-47431: Check if config changes value") {
+    sql("SET spark.sql.session.collation.default=UNICODE")
+    checkAnswer(sql(s"SELECT collation('aa')"), Seq(Row("UNICODE")))
+    sql("SET spark.sql.session.collation.default=UTF8_BINARY")
+    checkAnswer(sql(s"SELECT collation('aa')"), Seq(Row("UTF8_BINARY")))
+  }
+
+  test("SPARK-47431: Default collation set to UNICODE, column type test") {
+    withSQLConf(SqlApiConf.DEFAULT_COLLATION -> "UNICODE") {
+      sql(
+        s"""
+           |CREATE TABLE t(c1 STRING)

Review Comment:
   I'm wondering how do we make this work. Where do we update the code to respect the default collation when reading parquet?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47431][SQL][COLLATION] Add session level default Collation [spark]

Posted by "mihailom-db (via GitHub)" <gi...@apache.org>.

mihailom-db commented on code in PR #45592:
URL: https://github.com/apache/spark/pull/45592#discussion_r1537205763


##########
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationFactory.java:
##########
@@ -65,26 +65,39 @@ public static class Collation {
     public final BiFunction<UTF8String, UTF8String, Boolean> equalsFunction;
 
     /**
-     * Binary collation implies that UTF8Strings are considered equal only if they are
-     * byte for byte equal. All accent or case-insensitive collations are considered non-binary.
+     * Support for Binary Equality implies that it is possible to check equality on
+     * byte by byte level. This allows for the usage of binaryEquals call on UTF8Strings
+     * which is more performant than calls to external ICU library.
      */
-    public final boolean isBinaryCollation;
+    public final boolean supportsBinaryEquality;
+    /**
+     * Support for Binary Ordering implies that it is possible to check equality and compare on

Review Comment:
   I wanted to say that this type of collation needs both equality and ordering to be possible. Should I maybe change the wording to equality and ordering.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47431][SQL][COLLATION] Add session level default Collation [spark]

Posted by "mihailom-db (via GitHub)" <gi...@apache.org>.

mihailom-db commented on code in PR #45592:
URL: https://github.com/apache/spark/pull/45592#discussion_r1537243129


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/StringTypeUtils.scala:
##########
@@ -0,0 +1,54 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.util
+
+import org.apache.spark.{SparkContext, SparkEnv}
+import org.apache.spark.internal.Logging
+import org.apache.spark.sql.internal.SqlApiConf
+
+object StringTypeUtils extends Logging {

Review Comment:
   When we use collation internally we need collationId, but the config is a collationName. I am just not sure where to include this conversion then. We also want to make sure the conversion is not called on execution side.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47431][SQL][COLLATION] Add session level default Collation [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.

cloud-fan commented on code in PR #45592:
URL: https://github.com/apache/spark/pull/45592#discussion_r1532022227


##########
sql/api/src/main/scala/org/apache/spark/sql/catalyst/util/SparkStringTypeUtils.scala:
##########
@@ -0,0 +1,53 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.util
+
+import org.apache.spark.sql.internal.SqlApiConf
+
+trait SparkStringTypeUtils {
+  /**
+   * Check if default collation is up-to-date with a config.
+   */
+  private def refreshDefaultCollation(): Unit = {
+    if (SqlApiConf.get.defaultCollation != DEFAULT_COLLATION) {
+      DEFAULT_COLLATION = SqlApiConf.get.defaultCollation
+      DEFAULT_COLLATION_ID = CollationFactory.collationNameToId(DEFAULT_COLLATION)

Review Comment:
   This is just a hash map lookup, is it really worth the complexity to build a cache for the default collation id?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47431][SQL][COLLATION] Add session level default Collation [spark]

Posted by "dbatomic (via GitHub)" <gi...@apache.org>.

dbatomic commented on code in PR #45592:
URL: https://github.com/apache/spark/pull/45592#discussion_r1531933608


##########
sql/api/src/main/scala/org/apache/spark/sql/types/StringType.scala:
##########
@@ -41,12 +37,22 @@ class StringType private(val collationId: Int) extends AtomicType with Serializa
    */
   def isBinaryCollation: Boolean = CollationFactory.fetchCollation(collationId).isBinaryCollation
 
+  /**
+   * Spark internal collation implies that strings are considered equal only if they are
+   * byte for byte equal. E.g. all accent or case-insensitive collations are considered non-binary.
+   * Also their comparison does not require ICU library calls, as ordering follows
+   * spark internal implementation. If this field is true, byte level operations can be
+   * used against this datatype (e.g. for equality, hashing and sorting).
+   */
+  def isUTF8BinaryCollation: Boolean =
+    collationId == CollationFactory.UTF8_BINARY_COLLATION_ID
+
   /**
    * Type name that is shown to the customer.
    * If this is an UTF8_BINARY collation output is `string` due to backwards compatibility.
    */
   override def typeName: String =
-    if (isDefaultCollation) "string"
+    if (isDefaultCollation(collationId)) "string"

Review Comment:
   I think that this is not ok. We should keep calling `UTF8_BINARY` as `string` regardless of default collation.
   Typename should remain the same regardless of current collation. Especially since AFAIK we can serialize this typeName.
   
   This should be `isUTF8BinaryCollation` check.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47431][SQL][COLLATION] Add session level default Collation [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.

cloud-fan commented on code in PR #45592:
URL: https://github.com/apache/spark/pull/45592#discussion_r1537387707


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/StringTypeUtils.scala:
##########
@@ -0,0 +1,54 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.util
+
+import org.apache.spark.{SparkContext, SparkEnv}
+import org.apache.spark.internal.Logging
+import org.apache.spark.sql.internal.SqlApiConf
+
+object StringTypeUtils extends Logging {

Review Comment:
   or even further, add `SqlApiConf.get.defaultStringType` which returns `StringType`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47431][SQL][COLLATION] Add session level default Collation [spark]

Posted by "dbatomic (via GitHub)" <gi...@apache.org>.

dbatomic commented on code in PR #45592:
URL: https://github.com/apache/spark/pull/45592#discussion_r1537396548


##########
sql/core/src/test/scala/org/apache/spark/sql/CollationSuite.scala:
##########
@@ -638,7 +641,33 @@ class CollationSuite extends DatasourceV2SQLBase with AdaptiveSparkPlanHelper {
       parameters = Map(
         "fieldName" -> "c2",
         "expressionStr" -> "UCASE(struct1.a)",
-        "reason" -> "generation expression cannot contain non-default collated string type"))
+        "reason" ->
+          "generation expression cannot contain non-binary orderable collated string type"))
+  }
+
+  test("SPARK-47431: Default collation set to UNICODE, literal test") {
+    withSQLConf(SqlApiConf.DEFAULT_COLLATION -> "UNICODE") {
+      checkAnswer(sql(s"SELECT collation('aa')"), Seq(Row("UNICODE")))
+    }
+  }
+
+  test("SPARK-47431: Check if config changes value") {
+    sql("SET spark.sql.session.collation.default=UNICODE")
+    checkAnswer(sql(s"SELECT collation('aa')"), Seq(Row("UNICODE")))
+    sql("SET spark.sql.session.collation.default=UTF8_BINARY")
+    checkAnswer(sql(s"SELECT collation('aa')"), Seq(Row("UTF8_BINARY")))
+  }
+
+  test("SPARK-47431: Default collation set to UNICODE, column type test") {
+    withSQLConf(SqlApiConf.DEFAULT_COLLATION -> "UNICODE") {
+      sql(
+        s"""
+           |CREATE TABLE t(c1 STRING)

Review Comment:
   AFAIK in table metadata we will push explicit collation - `COLLATE UNICODE`. The rules are following:
   1) We take default session level collation if no collation is specified.
   2) If collation is UTF8_BINARY (being it implicit through default collation or explicit) we write `string` in metadata.
   3) If collation is anything else we write `string collate collation_name` in metadata.
   
   When reading we don't care about session default collation. We will just read whatever is in metadata.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47431][SQL][COLLATION] Add session level default Collation [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.

cloud-fan commented on code in PR #45592:
URL: https://github.com/apache/spark/pull/45592#discussion_r1537860683


##########
sql/core/src/test/scala/org/apache/spark/sql/CollationSuite.scala:
##########
@@ -638,7 +641,33 @@ class CollationSuite extends DatasourceV2SQLBase with AdaptiveSparkPlanHelper {
       parameters = Map(
         "fieldName" -> "c2",
         "expressionStr" -> "UCASE(struct1.a)",
-        "reason" -> "generation expression cannot contain non-default collated string type"))
+        "reason" ->
+          "generation expression cannot contain non-binary orderable collated string type"))
+  }
+
+  test("SPARK-47431: Default collation set to UNICODE, literal test") {
+    withSQLConf(SqlApiConf.DEFAULT_COLLATION -> "UNICODE") {
+      checkAnswer(sql(s"SELECT collation('aa')"), Seq(Row("UNICODE")))
+    }
+  }
+
+  test("SPARK-47431: Check if config changes value") {
+    sql("SET spark.sql.session.collation.default=UNICODE")
+    checkAnswer(sql(s"SELECT collation('aa')"), Seq(Row("UNICODE")))
+    sql("SET spark.sql.session.collation.default=UTF8_BINARY")
+    checkAnswer(sql(s"SELECT collation('aa')"), Seq(Row("UTF8_BINARY")))
+  }
+
+  test("SPARK-47431: Default collation set to UNICODE, column type test") {
+    withSQLConf(SqlApiConf.DEFAULT_COLLATION -> "UNICODE") {
+      sql(
+        s"""
+           |CREATE TABLE t(c1 STRING)

Review Comment:
   oh, it's because we store table schema as JSON string in the metastore, and parse it back later. The data type parsing does respect the default collation setting.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47431][SQL][COLLATION] Add session level default Collation [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.

cloud-fan commented on code in PR #45592:
URL: https://github.com/apache/spark/pull/45592#discussion_r1538479084


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/literals.scala:
##########
@@ -69,10 +69,16 @@ object Literal {
     case f: Float => Literal(f, FloatType)
     case b: Byte => Literal(b, ByteType)
     case s: Short => Literal(s, ShortType)
-    case s: String => Literal(UTF8String.fromString(s), StringType)
-    case s: UTF8String => Literal(s, StringType)
-    case c: Char => Literal(UTF8String.fromString(c.toString), StringType)
-    case ac: Array[Char] => Literal(UTF8String.fromString(String.valueOf(ac)), StringType)
+    case s: String =>
+        Literal(UTF8String.fromString(s), SqlApiConf.get.defaultStringType)
+    case s: UTF8String =>
+      Literal(s, SqlApiConf.get.defaultStringType)
+    case c: Char =>
+      Literal(UTF8String.fromString(c.toString),
+        SqlApiConf.get.defaultStringType)

Review Comment:
   ```suggestion
         Literal(UTF8String.fromString(c.toString), SqlApiConf.get.defaultStringType)
   ```



##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/literals.scala:
##########
@@ -69,10 +69,16 @@ object Literal {
     case f: Float => Literal(f, FloatType)
     case b: Byte => Literal(b, ByteType)
     case s: Short => Literal(s, ShortType)
-    case s: String => Literal(UTF8String.fromString(s), StringType)
-    case s: UTF8String => Literal(s, StringType)
-    case c: Char => Literal(UTF8String.fromString(c.toString), StringType)
-    case ac: Array[Char] => Literal(UTF8String.fromString(String.valueOf(ac)), StringType)
+    case s: String =>
+        Literal(UTF8String.fromString(s), SqlApiConf.get.defaultStringType)
+    case s: UTF8String =>
+      Literal(s, SqlApiConf.get.defaultStringType)
+    case c: Char =>
+      Literal(UTF8String.fromString(c.toString),
+        SqlApiConf.get.defaultStringType)
+    case ac: Array[Char] =>
+        Literal(UTF8String.fromString(String.valueOf(ac)),
+          SqlApiConf.get.defaultStringType)

Review Comment:
   ```suggestion
           Literal(UTF8String.fromString(String.valueOf(ac)), SqlApiConf.get.defaultStringType)
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47431][SQL][COLLATION] Add session level default Collation [spark]

Posted by "mihailom-db (via GitHub)" <gi...@apache.org>.

mihailom-db commented on code in PR #45592:
URL: https://github.com/apache/spark/pull/45592#discussion_r1535719135


##########
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationFactory.java:
##########
@@ -68,23 +68,30 @@ public static class Collation {
      * Binary collation implies that UTF8Strings are considered equal only if they are
      * byte for byte equal. All accent or case-insensitive collations are considered non-binary.
      */
-    public final boolean isBinaryCollation;
+    public final boolean supportsBinaryEquality;
+    /**
+     * Binary collation implies that UTF8Strings are compared on byte level.

Review Comment:
   Should be improved now 😄 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47431][SQL][COLLATION] Add session level default Collation [spark]

Posted by "stefankandic (via GitHub)" <gi...@apache.org>.

stefankandic commented on code in PR #45592:
URL: https://github.com/apache/spark/pull/45592#discussion_r1530431037


##########
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationFactory.java:
##########
@@ -109,7 +109,7 @@ public Collation(
   private static final Collation[] collationTable = new Collation[4];
   private static final HashMap<String, Integer> collationNameToIdMap = new HashMap<>();
 
-  public static final int DEFAULT_COLLATION_ID = 0;
+  public static final int SPARK_INTERNAL_COLLATION_ID = 0;

Review Comment:
   is the SPARK part of the name really necessary?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47431][SQL][COLLATION] Add session level default Collation [spark]

Posted by "stefankandic (via GitHub)" <gi...@apache.org>.

stefankandic commented on code in PR #45592:
URL: https://github.com/apache/spark/pull/45592#discussion_r1530424813


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/literals.scala:
##########
@@ -69,10 +69,41 @@ object Literal {
     case f: Float => Literal(f, FloatType)
     case b: Byte => Literal(b, ByteType)
     case s: Short => Literal(s, ShortType)
-    case s: String => Literal(UTF8String.fromString(s), StringType)
-    case s: UTF8String => Literal(s, StringType)
-    case c: Char => Literal(UTF8String.fromString(c.toString), StringType)
-    case ac: Array[Char] => Literal(UTF8String.fromString(String.valueOf(ac)), StringType)
+    case s: String =>
+      if (CollationFactory.collationNameToId(SQLConf.get.defaultCollation)
+        == CollationFactory.SPARK_INTERNAL_COLLATION_ID) {
+        Literal(UTF8String.fromString(s), StringType)
+      }
+      else {
+        Literal(UTF8String.fromString(s),
+          StringType(CollationFactory.collationNameToId(SQLConf.get.defaultCollation)))
+      }
+    case s: UTF8String =>
+      if (CollationFactory.collationNameToId(SQLConf.get.defaultCollation)
+        == CollationFactory.SPARK_INTERNAL_COLLATION_ID) {
+        Literal(s, StringType)
+      }
+      else {
+        Literal(s, StringType(CollationFactory.collationNameToId(SQLConf.get.defaultCollation)))
+      }
+    case c: Char =>
+      if (CollationFactory.collationNameToId(SQLConf.get.defaultCollation)
+        == CollationFactory.SPARK_INTERNAL_COLLATION_ID) {
+        Literal(UTF8String.fromString(c.toString), StringType)
+      }
+      else {
+        Literal(UTF8String.fromString(c.toString),
+          StringType(CollationFactory.collationNameToId(SQLConf.get.defaultCollation)))
+      }
+    case ac: Array[Char] =>
+      if (CollationFactory.collationNameToId(SQLConf.get.defaultCollation)
+        == CollationFactory.SPARK_INTERNAL_COLLATION_ID) {
+        Literal(UTF8String.fromString(String.valueOf(ac)), StringType)

Review Comment:
   can't you use `String.valueOf` for others as well? AFAIK it can accept `Object`.
   
   if not at least try to extract some of this functionality into a common method or something because it seems repetitive for these 4 cases



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47431][SQL][COLLATION] Add session level default Collation [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.

cloud-fan commented on code in PR #45592:
URL: https://github.com/apache/spark/pull/45592#discussion_r1532017712


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala:
##########
@@ -757,6 +757,13 @@ object SQLConf {
       .booleanConf
       .createWithDefault(Utils.isTesting)
 
+  val DEFAULT_COLLATION =
+    buildConf(SqlApiConfHelper.DEFAULT_COLLATION)
+      .doc("Default collation when considering casting rules.")

Review Comment:
   This is too vague. I have no idea what it means. Looking at the code changes of this PR, `CREATE TABLE t(c STRING)` is also affected?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47431][SQL][COLLATION] Add session level default Collation [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.

cloud-fan commented on code in PR #45592:
URL: https://github.com/apache/spark/pull/45592#discussion_r1537198080


##########
sql/api/src/main/scala/org/apache/spark/sql/catalyst/parser/DataTypeAstBuilder.scala:
##########
@@ -74,7 +74,8 @@ class DataTypeAstBuilder extends SqlBaseParserBaseVisitor[AnyRef] {
       case (TIMESTAMP_LTZ, Nil) => TimestampType
       case (STRING, Nil) =>
         typeCtx.children.asScala.toSeq match {
-          case Seq(_) => StringType
+          case Seq(_) =>
+            StringType(CollationFactory.collationNameToId(SqlApiConf.get.defaultCollation))

Review Comment:
   a safer approach is to still return the object `StringType` if the default collation is `UTF8_BINARY_COLLATION_ID`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47431][SQL] Add session level default Collation [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.

cloud-fan commented on PR #45592:
URL: https://github.com/apache/spark/pull/45592#issuecomment-2020337359

   thanks, merging to master!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47431][SQL] Add session level default Collation [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.

cloud-fan commented on code in PR #45592:
URL: https://github.com/apache/spark/pull/45592#discussion_r1538818654


##########
sql/core/src/test/scala/org/apache/spark/sql/CollationSuite.scala:
##########
@@ -638,7 +641,44 @@ class CollationSuite extends DatasourceV2SQLBase with AdaptiveSparkPlanHelper {
       parameters = Map(
         "fieldName" -> "c2",
         "expressionStr" -> "UCASE(struct1.a)",
-        "reason" -> "generation expression cannot contain non-default collated string type"))
+        "reason" ->
+          "generation expression cannot contain non-binary orderable collated string type"))
+  }
+
+  test("SPARK-47431: Default collation set to UNICODE, literal test") {
+    withSQLConf(SqlApiConf.DEFAULT_COLLATION -> "UNICODE") {
+      checkAnswer(sql(s"SELECT collation('aa')"), Seq(Row("UNICODE")))
+    }
+  }
+
+  test("SPARK-47431: Default collation set to UNICODE, column type test") {
+    withSQLConf(SqlApiConf.DEFAULT_COLLATION -> "UNICODE") {
+      sql(
+        s"""
+           |CREATE TABLE t(c1 STRING)
+           |USING PARQUET
+           |""".stripMargin)
+      sql("INSERT INTO t VALUES ('a')")
+      checkAnswer(sql("SELECT collation(c1) FROM t"), Seq(Row("UNICODE")))
+    }
+  }
+
+  test("SPARK-47431: Create table with UTF8_BINARY, make sure collation persists on read") {
+    val tableName = "t"

Review Comment:
   it's ok to hardcode "t" in the tests



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47431][SQL] Add session level default Collation [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.

cloud-fan commented on code in PR #45592:
URL: https://github.com/apache/spark/pull/45592#discussion_r1538818293


##########
sql/core/src/test/scala/org/apache/spark/sql/CollationSuite.scala:
##########
@@ -638,7 +641,44 @@ class CollationSuite extends DatasourceV2SQLBase with AdaptiveSparkPlanHelper {
       parameters = Map(
         "fieldName" -> "c2",
         "expressionStr" -> "UCASE(struct1.a)",
-        "reason" -> "generation expression cannot contain non-default collated string type"))
+        "reason" ->
+          "generation expression cannot contain non-binary orderable collated string type"))
+  }
+
+  test("SPARK-47431: Default collation set to UNICODE, literal test") {
+    withSQLConf(SqlApiConf.DEFAULT_COLLATION -> "UNICODE") {
+      checkAnswer(sql(s"SELECT collation('aa')"), Seq(Row("UNICODE")))
+    }
+  }
+
+  test("SPARK-47431: Default collation set to UNICODE, column type test") {
+    withSQLConf(SqlApiConf.DEFAULT_COLLATION -> "UNICODE") {
+      sql(
+        s"""
+           |CREATE TABLE t(c1 STRING)

Review Comment:
   nit: this CREATE TABLE statement is short and can be put in one line.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47431][SQL][COLLATION] Add session level default Collation [spark]

Posted by "stefankandic (via GitHub)" <gi...@apache.org>.

stefankandic commented on code in PR #45592:
URL: https://github.com/apache/spark/pull/45592#discussion_r1530424813


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/literals.scala:
##########
@@ -69,10 +69,41 @@ object Literal {
     case f: Float => Literal(f, FloatType)
     case b: Byte => Literal(b, ByteType)
     case s: Short => Literal(s, ShortType)
-    case s: String => Literal(UTF8String.fromString(s), StringType)
-    case s: UTF8String => Literal(s, StringType)
-    case c: Char => Literal(UTF8String.fromString(c.toString), StringType)
-    case ac: Array[Char] => Literal(UTF8String.fromString(String.valueOf(ac)), StringType)
+    case s: String =>
+      if (CollationFactory.collationNameToId(SQLConf.get.defaultCollation)
+        == CollationFactory.SPARK_INTERNAL_COLLATION_ID) {
+        Literal(UTF8String.fromString(s), StringType)
+      }
+      else {
+        Literal(UTF8String.fromString(s),
+          StringType(CollationFactory.collationNameToId(SQLConf.get.defaultCollation)))
+      }
+    case s: UTF8String =>
+      if (CollationFactory.collationNameToId(SQLConf.get.defaultCollation)
+        == CollationFactory.SPARK_INTERNAL_COLLATION_ID) {
+        Literal(s, StringType)
+      }
+      else {
+        Literal(s, StringType(CollationFactory.collationNameToId(SQLConf.get.defaultCollation)))
+      }
+    case c: Char =>
+      if (CollationFactory.collationNameToId(SQLConf.get.defaultCollation)
+        == CollationFactory.SPARK_INTERNAL_COLLATION_ID) {
+        Literal(UTF8String.fromString(c.toString), StringType)
+      }
+      else {
+        Literal(UTF8String.fromString(c.toString),
+          StringType(CollationFactory.collationNameToId(SQLConf.get.defaultCollation)))
+      }
+    case ac: Array[Char] =>
+      if (CollationFactory.collationNameToId(SQLConf.get.defaultCollation)
+        == CollationFactory.SPARK_INTERNAL_COLLATION_ID) {
+        Literal(UTF8String.fromString(String.valueOf(ac)), StringType)

Review Comment:
   can't you use `String.valueOf` for others as well? AFAIK it can accept `Object`.
   
   or even just `obj.ToString()` - that might be even better
   
   if not at least try to extract some of this functionality into a common method or something because it seems repetitive for these 4 cases



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47431][SQL][COLLATION] Add session level default Collation [spark]

Posted by "dbatomic (via GitHub)" <gi...@apache.org>.

dbatomic commented on code in PR #45592:
URL: https://github.com/apache/spark/pull/45592#discussion_r1530447723


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/literals.scala:
##########
@@ -69,10 +69,41 @@ object Literal {
     case f: Float => Literal(f, FloatType)
     case b: Byte => Literal(b, ByteType)
     case s: Short => Literal(s, ShortType)
-    case s: String => Literal(UTF8String.fromString(s), StringType)
-    case s: UTF8String => Literal(s, StringType)
-    case c: Char => Literal(UTF8String.fromString(c.toString), StringType)
-    case ac: Array[Char] => Literal(UTF8String.fromString(String.valueOf(ac)), StringType)
+    case s: String =>
+      if (CollationFactory.collationNameToId(SQLConf.get.defaultCollation)
+        == CollationFactory.SPARK_INTERNAL_COLLATION_ID) {
+        Literal(UTF8String.fromString(s), StringType)
+      }
+      else {
+        Literal(UTF8String.fromString(s),
+          StringType(CollationFactory.collationNameToId(SQLConf.get.defaultCollation)))
+      }
+    case s: UTF8String =>
+      if (CollationFactory.collationNameToId(SQLConf.get.defaultCollation)
+        == CollationFactory.SPARK_INTERNAL_COLLATION_ID) {
+        Literal(s, StringType)
+      }
+      else {
+        Literal(s, StringType(CollationFactory.collationNameToId(SQLConf.get.defaultCollation)))

Review Comment:
   IMO we should keep interallly `defaultCollation` as an integer but expose it to customer as a string (e.g. on SET defaultCollation customer should see collation name).
   
   That should also make this code a bit nicer - just `Literal(s, StringType(SQLConf.get.defaultCollationId))`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47431][SQL][COLLATION] Add session level default Collation [spark]

Posted by "dbatomic (via GitHub)" <gi...@apache.org>.

dbatomic commented on code in PR #45592:
URL: https://github.com/apache/spark/pull/45592#discussion_r1530477319


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/literals.scala:
##########
@@ -69,10 +69,41 @@ object Literal {
     case f: Float => Literal(f, FloatType)
     case b: Byte => Literal(b, ByteType)
     case s: Short => Literal(s, ShortType)
-    case s: String => Literal(UTF8String.fromString(s), StringType)
-    case s: UTF8String => Literal(s, StringType)
-    case c: Char => Literal(UTF8String.fromString(c.toString), StringType)
-    case ac: Array[Char] => Literal(UTF8String.fromString(String.valueOf(ac)), StringType)
+    case s: String =>

Review Comment:
   I still don't get it? `StringType(0)` should match against `StringType`. I just tried this:
   
   ```
       val s = StringType(0)
       s match {
         case StringType => assert(true)
         case _ => assert(false)
       }
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47431][SQL][COLLATION] Add session level default Collation [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.

cloud-fan commented on code in PR #45592:
URL: https://github.com/apache/spark/pull/45592#discussion_r1532014547


##########
sql/api/src/main/scala/org/apache/spark/sql/types/StringType.scala:
##########
@@ -41,12 +37,22 @@ class StringType private(val collationId: Int) extends AtomicType with Serializa
    */
   def isBinaryCollation: Boolean = CollationFactory.fetchCollation(collationId).isBinaryCollation
 
+  /**
+   * Spark internal collation implies that strings are considered equal only if they are
+   * byte for byte equal. E.g. all accent or case-insensitive collations are considered non-binary.
+   * Also their comparison does not require ICU library calls, as ordering follows
+   * spark internal implementation. If this field is true, byte level operations can be
+   * used against this datatype (e.g. for equality, hashing and sorting).
+   */
+  def isUTF8BinaryCollation: Boolean =

Review Comment:
   so this is a special case of binary collation?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47431][SQL][COLLATION] Add session level default Collation [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.

cloud-fan commented on code in PR #45592:
URL: https://github.com/apache/spark/pull/45592#discussion_r1537227941


##########
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationFactory.java:
##########
@@ -65,26 +65,39 @@ public static class Collation {
     public final BiFunction<UTF8String, UTF8String, Boolean> equalsFunction;
 
     /**
-     * Binary collation implies that UTF8Strings are considered equal only if they are
-     * byte for byte equal. All accent or case-insensitive collations are considered non-binary.
+     * Support for Binary Equality implies that it is possible to check equality on
+     * byte by byte level. This allows for the usage of binaryEquals call on UTF8Strings
+     * which is more performant than calls to external ICU library.
      */
-    public final boolean isBinaryCollation;
+    public final boolean supportsBinaryEquality;
+    /**
+     * Support for Binary Ordering implies that it is possible to check equality and compare on

Review Comment:
   oh, then `... equality and ordering ...` sounds better.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47431][SQL] Add session level default Collation [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.

cloud-fan commented on code in PR #45592:
URL: https://github.com/apache/spark/pull/45592#discussion_r1538817589


##########
sql/core/src/test/scala/org/apache/spark/sql/CollationSuite.scala:
##########
@@ -638,7 +641,44 @@ class CollationSuite extends DatasourceV2SQLBase with AdaptiveSparkPlanHelper {
       parameters = Map(
         "fieldName" -> "c2",
         "expressionStr" -> "UCASE(struct1.a)",
-        "reason" -> "generation expression cannot contain non-default collated string type"))
+        "reason" ->
+          "generation expression cannot contain non-binary orderable collated string type"))
+  }
+
+  test("SPARK-47431: Default collation set to UNICODE, literal test") {
+    withSQLConf(SqlApiConf.DEFAULT_COLLATION -> "UNICODE") {
+      checkAnswer(sql(s"SELECT collation('aa')"), Seq(Row("UNICODE")))
+    }
+  }
+
+  test("SPARK-47431: Default collation set to UNICODE, column type test") {
+    withSQLConf(SqlApiConf.DEFAULT_COLLATION -> "UNICODE") {
+      sql(
+        s"""
+           |CREATE TABLE t(c1 STRING)

Review Comment:
   `withTable` please



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47431][SQL][COLLATION] Add session level default Collation [spark]

Posted by "mihailom-db (via GitHub)" <gi...@apache.org>.

mihailom-db commented on code in PR #45592:
URL: https://github.com/apache/spark/pull/45592#discussion_r1535685109


##########
sql/api/src/main/scala/org/apache/spark/sql/catalyst/util/SparkStringTypeUtils.scala:
##########
@@ -0,0 +1,53 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.util
+
+import org.apache.spark.sql.internal.SqlApiConf
+
+trait SparkStringTypeUtils {
+  /**
+   * Check if default collation is up-to-date with a config.
+   */
+  private def refreshDefaultCollation(): Unit = {
+    if (SqlApiConf.get.defaultCollation != DEFAULT_COLLATION) {
+      DEFAULT_COLLATION = SqlApiConf.get.defaultCollation
+      DEFAULT_COLLATION_ID = CollationFactory.collationNameToId(DEFAULT_COLLATION)

Review Comment:
   Could you take a look at how I added this check, as a lot of tests are failing for reason SparkEnv.get == null



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47431][SQL][COLLATION] Add session level default Collation [spark]

Posted by "mihailom-db (via GitHub)" <gi...@apache.org>.

mihailom-db commented on PR #45592:
URL: https://github.com/apache/spark/pull/45592#issuecomment-2016499111

   @dbatomic @stefankandic Could you take a look at the failing tests. I am not sure if this is going to be a problem for storage. What I am concerned with is if this change, to return StringType(0) instead of StringType for when default is UTF8_BINARY, might pose problems to backwards compatibility of storage. Apart from that, PR seems to work fine. Also as a followup I will create a ticket to make sure expressions return default stringtype if there were no string inputs in the parameter list of the function call/ expression.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47431][SQL][COLLATION] Add session level default Collation [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.

cloud-fan commented on code in PR #45592:
URL: https://github.com/apache/spark/pull/45592#discussion_r1537200476


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/literals.scala:
##########
@@ -69,10 +69,16 @@ object Literal {
     case f: Float => Literal(f, FloatType)
     case b: Byte => Literal(b, ByteType)
     case s: Short => Literal(s, ShortType)
-    case s: String => Literal(UTF8String.fromString(s), StringType)
-    case s: UTF8String => Literal(s, StringType)
-    case c: Char => Literal(UTF8String.fromString(c.toString), StringType)
-    case ac: Array[Char] => Literal(UTF8String.fromString(String.valueOf(ac)), StringType)
+    case s: String =>
+        Literal(UTF8String.fromString(s), StringType(StringTypeUtils.getDefaultCollationId))

Review Comment:
   ditto



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47431][SQL] Add session level default Collation [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.

cloud-fan closed pull request #45592: [SPARK-47431][SQL] Add session level default Collation
URL: https://github.com/apache/spark/pull/45592


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47431][SQL][COLLATION] Add session level default Collation [spark]

Posted by "dbatomic (via GitHub)" <gi...@apache.org>.

dbatomic commented on code in PR #45592:
URL: https://github.com/apache/spark/pull/45592#discussion_r1535254272


##########
sql/api/src/main/scala/org/apache/spark/sql/types/StringType.scala:
##########
@@ -41,12 +37,22 @@ class StringType private(val collationId: Int) extends AtomicType with Serializa
    */
   def isBinaryCollation: Boolean = CollationFactory.fetchCollation(collationId).isBinaryCollation
 
+  /**
+   * Spark internal collation implies that strings are considered equal only if they are
+   * byte for byte equal. E.g. all accent or case-insensitive collations are considered non-binary.
+   * Also their comparison does not require ICU library calls, as ordering follows
+   * spark internal implementation. If this field is true, byte level operations can be
+   * used against this datatype (e.g. for equality, hashing and sorting).
+   */
+  def isUTF8BinaryCollation: Boolean =

Review Comment:
   Maybe this should be the naming:
   1) Keep `isUTF8Binary` (or explicitly check id). This is mostly used for naming - e.g. typeName in StringType needs to check this. If that is the only place we don't need to expose it in CollationFactory.
   2) Add `supportsBinaryEquality` - check whether equality can be done on binary level for this collation.
   3) Add `supportBinaryOrdering` - check whether sort can be done on binary level.
   
   Probably isUTF8Binary and supportsBinaryOrdering will be true only for UTF8_BINARY, since I don't think that any other ICU driven library supports binary sort. But in future, user defined collations may support it.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org