You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by "dbatomic (via GitHub)" <gi...@apache.org> on 2024/01/31 10:11:07 UTC

[PR] [SPARK-46833][SQL] Collations - Introducing CollationFactory which provides comparison and hashing rules for supported collations [spark]

dbatomic opened a new pull request, #44968:
URL: https://github.com/apache/spark/pull/44968

   <!--
   Thanks for sending a pull request!  Here are some tips for you:
     1. If this is your first time, please read our contributor guidelines: https://spark.apache.org/contributing.html
     2. Ensure you have added or run the appropriate tests for your PR: https://spark.apache.org/developer-tools.html
     3. If the PR is unfinished, add '[WIP]' in your PR title, e.g., '[WIP][SPARK-XXXX] Your PR title ...'.
     4. Be sure to keep the PR description updated to reflect all changes.
     5. Please write your PR title to summarize what this PR proposes.
     6. If possible, provide a concise example to reproduce the issue for a faster review.
     7. If you want to add a new configuration, please read the guideline first for naming configurations in
        'core/src/main/scala/org/apache/spark/internal/config/ConfigEntry.scala'.
     8. If you want to add or modify an error type or message, please read the guideline first in
        'common/utils/src/main/resources/error/README.md'.
   -->
   
   ### What changes were proposed in this pull request?
   <!--
   Please clarify what changes you are proposing. The purpose of this section is to outline the changes and how this PR fixes the issue. 
   If possible, please consider writing useful notes for better and faster reviews in your PR. See the examples below.
     1. If you refactor some codes with changing classes, showing the class hierarchy will help reviewers.
     2. If you fix some SQL features, you can provide some references of other DBMSes.
     3. If there is design documentation, please add the link.
     4. If there is a discussion in the mailing list, please add the link.
   -->
   
   This PR introduces CollationFactory singleton class which provides all collation aware methods, for given collation, that can be invoked against UTF8Strings.
   
   For higher level overview of Collation track please take a look at the umbrella [JIRA](https://issues.apache.org/jira/browse/SPARK-46830).
   
   At this point, CollationFactory is still not exposed, besides tests. The connection between UTF8String and CollationFactory is coming in the next PR.
   
   ### Why are the changes needed?
   <!--
   Please clarify why the changes are needed. For instance,
     1. If you propose a new API, clarify the use case for a new API.
     2. If you fix a bug, you can clarify why it is a bug.
   -->
   
   Please refer to umbrella JIRA ticket for collation effort.
   
   
   ### Does this PR introduce _any_ user-facing change?
   <!--
   Note that it means *any* user-facing change including all aspects such as the documentation fix.
   If yes, please clarify the previous behavior and the change this PR proposes - provide the console output, description and/or an example to show the behavior difference if possible.
   If possible, please also clarify if this is a user-facing change compared to the released Spark versions or within the unreleased branches such as master.
   If no, write 'No'.
   -->
   
   At this point No, this is just prep for user facing changes.
   
   ### How was this patch tested?
   <!--
   If tests were added, say they were added here. Please make sure to add some test cases that check the changes thoroughly including negative and positive cases if possible.
   If it was tested in a way different from regular unit tests, please clarify how you tested step by step, ideally copy and paste-able, so that other reviewers can test and check, and descendants can verify in the future.
   If tests were not added, please describe why they were not added and/or why it was difficult to add.
   If benchmark tests were added, please run the benchmarks in GitHub Actions for the consistent environment, and the instructions could accord to: https://spark.apache.org/developer-tools.html#github-workflow-benchmarks.
   -->
   
   Unit tests for CollationFactory come with this PR. Tests are basic sanity tests, proper testing will come as we get E2E features.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   <!--
   If generative AI tooling has been used in the process of authoring this patch, please include the
   phrase: 'Generated-by: ' followed by the name of the tool and its version.
   If no, write 'No'.
   Please refer to the [ASF Generative Tooling Guidance](https://www.apache.org/legal/generative-tooling.html) for details.
   -->
   
   No.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46833][SQL] Collations - Introducing CollationFactory which provides comparison and hashing rules for supported collations [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan closed pull request #44968: [SPARK-46833][SQL] Collations - Introducing CollationFactory which provides comparison and hashing rules for supported collations
URL: https://github.com/apache/spark/pull/44968


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46833][SQL] Collations - Introducing CollationFactory which provides comparison and hashing rules for supported collations [spark]

Posted by "MaxGekk (via GitHub)" <gi...@apache.org>.
MaxGekk commented on code in PR #44968:
URL: https://github.com/apache/spark/pull/44968#discussion_r1476493722


##########
common/unsafe/src/test/scala/org/apache/spark/unsafe/types/CollationFactorySuite.scala:
##########
@@ -0,0 +1,114 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.unsafe.types
+
+import org.apache.spark.SparkException
+// scalastyle:off
+import org.scalatest.funsuite.AnyFunSuite
+import org.scalatest.matchers.must.Matchers
+
+import org.apache.spark.sql.catalyst.util.CollationFactory._
+import org.apache.spark.unsafe.types.UTF8String.{fromString => toUTF8}
+
+class CollationFactorySuite extends AnyFunSuite with Matchers { // scalastyle:ignore funsuite
+  test("collationId stability") {
+    val ucsBasic = fetchCollation(0)
+    assert(ucsBasic.collationName == "UCS_BASIC")
+    assert(ucsBasic.isBinaryCollation)
+
+    val ucsBasicLcase = fetchCollation(1)
+    assert(ucsBasicLcase.collationName == "UCS_BASIC_LCASE")
+    assert(!ucsBasicLcase.isBinaryCollation)
+
+    val unicode = fetchCollation(2)
+    assert(unicode.collationName == "UNICODE")
+    assert(unicode.isBinaryCollation);
+
+    val unicodeCi = fetchCollation(3)
+    assert(unicodeCi.collationName == "UNICODE_CI")
+    assert(!unicodeCi.isBinaryCollation)
+  }
+
+  test("fetch invalid collation name") {
+    val error = intercept[SparkException] {
+      fetchCollation("UCS_BASIS")
+    }
+
+    assert(error.getMessage.contains(
+      "The value UCS_BASIS does not represent a correct collation name."))
+    assert(error.getMessage.contains("Suggested valid collation name: [UCS_BASIC]"))

Review Comment:
   I see. In any case, let's make this test independent from the error message format:
   ```scala
     test("fetch invalid collation name") {
       import scala.jdk.CollectionConverters._
   
       val e = intercept[SparkException] {
         fetchCollation("UCS_BASIS")
       }
   
       assert(e.getErrorClass === "COLLATION_INVALID_NAME")
       assert(e.getMessageParameters.asScala ===
         Map("proposal" -> "UCS_BASIC", "collationName" -> "UCS_BASIS"))
     }
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46833][SQL] Collations - Introducing CollationFactory which provides comparison and hashing rules for supported collations [spark]

Posted by "dbatomic (via GitHub)" <gi...@apache.org>.
dbatomic commented on code in PR #44968:
URL: https://github.com/apache/spark/pull/44968#discussion_r1475849461


##########
common/utils/src/main/resources/error/error-classes.json:
##########
@@ -451,6 +451,12 @@
     ],
     "sqlState" : "42704"
   },
+  "COLLATION_INVALID_NAME" : {
+    "message" : [
+      "The value <collationName> does not represent correct collation name."
+    ],
+    "sqlState" : "22018"

Review Comment:
   Yeah, this probably is not the right error state. Waiting for your feedback on the appropriate one.
   
   I just checked PGSql and it returns `"42704 - undefined_object` Should we do the same?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46833][SQL] Collations - Introducing CollationFactory which provides comparison and hashing rules for supported collations [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #44968:
URL: https://github.com/apache/spark/pull/44968#discussion_r1475679694


##########
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationFactory.java:
##########
@@ -0,0 +1,171 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.sql.catalyst.util;
+
+import java.util.*;
+import java.util.function.BiFunction;
+import java.util.function.Function;
+
+import com.ibm.icu.util.ULocale;
+import com.ibm.icu.text.Collator;
+
+import org.apache.spark.SparkIllegalArgumentException;
+import org.apache.spark.unsafe.types.UTF8String;
+
+/**
+ * Static entry point for collation aware string functions.
+ * Provides functionality to the UTF8String object which respects defined collation settings.
+ */
+public final class CollationFactory {
+  /**
+   * Entry encapsulating all information about a collation.
+   */
+  public static class CollationInfo {
+    public final String collationName;
+    public final Collator collator;
+    public final Comparator<UTF8String> comparator;
+
+    /**
+     * Version of the collation. This is the version of the ICU library used to create the collator.
+     * For non-ICU collations (e.g. UTF8 Binary) the version is set to "1.0".
+     */
+    public final String version;
+
+    /**
+     * Collation sensitive hash function. Output for two UTF8Strings will be the same if they are
+     * equal according to the collation.
+     */
+    public final Function<UTF8String, Integer> hashFunction;
+
+    /**
+     * Potentially faster way than using comparator to compare two UTF8Strings for equality.
+     * Falls back to binary comparison if the collation is binary.
+     */
+    public final BiFunction<UTF8String, UTF8String, Boolean> equalsFunction;
+
+    /**
+     * Binary collation implies that UTF8Strings are considered equal only if they are
+     * byte for byte equal. All accent or case-insensitive collations are considered non-binary.
+     */
+    public final boolean isBinaryCollation;
+
+    public CollationInfo(
+      String collationName,
+      Collator collator,
+      Comparator<UTF8String> comparator,
+      String version,
+      Function<UTF8String, Integer> hashFunction,
+      boolean isBinaryCollation) {
+      this.collationName = collationName;
+      this.collator = collator;
+      this.comparator = comparator;
+      this.version = version;
+      this.hashFunction = hashFunction;
+      this.isBinaryCollation = isBinaryCollation;
+
+      if (isBinaryCollation) {
+        this.equalsFunction = UTF8String::binaryEquals;
+      } else {
+        this.equalsFunction = (s1, s2) -> this.comparator.compare(s1, s2) == 0;
+      }
+    }
+
+    /**
+     * Constructor with comparators that are inherited from the given collator.
+     */
+    public CollationInfo(
+      String collationName, Collator collator, String version, boolean isBinaryCollation) {
+      this(
+        collationName,
+        collator,
+        (s1, s2) -> collator.compare(s1.toString(), s2.toString()),
+        version,
+        s -> collator.getCollationKey(s.toString()).hashCode(),
+        isBinaryCollation);
+    }
+  }
+
+  private final CollationInfo[] collatorTable;
+  private final Hashtable<String, Integer> collationNameToIdMap = new Hashtable<>();
+
+  private CollationFactory() {
+    collatorTable = new CollationInfo[4];
+
+    // Binary comparison. This is the default collation.
+    // No custom comparators will be used for this collation.
+    // Instead, we rely on byte for byte comparison.
+    collatorTable[0] = new CollationInfo(
+      "UCS_BASIC",
+      null,
+      UTF8String::binaryCompare,
+      "1.0",
+      UTF8String::binaryHash,
+      true);
+
+    // Case-insensitive UTF8 binary collation.
+    // TODO: Do in place comparisons instead of creating new strings.
+    collatorTable[1] = new CollationInfo(
+      "UCS_BASIC_LCASE",
+      null,
+      (s1, s2) -> s1.toLowerCase().binaryCompare(s2.toLowerCase()), "1.0",
+      (s) -> s.toLowerCase().binaryHash(),
+      false);
+
+    // UNICODE case sensitive comparison (ROOT locale, in ICU).
+    collatorTable[2] = new CollationInfo(
+      "UNICODE", Collator.getInstance(ULocale.ROOT), "153.120.0.0", true);

Review Comment:
   I mean, how do you get `"153.120.0.0"`?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46833][SQL] Collations - Introducing CollationFactory which provides comparison and hashing rules for supported collations [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #44968:
URL: https://github.com/apache/spark/pull/44968#discussion_r1475681379


##########
common/unsafe/src/test/scala/org/apache/spark/unsafe/types/CollationFactorySuite.scala:
##########
@@ -0,0 +1,110 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.unsafe.types
+
+import org.apache.spark.SparkException
+// scalastyle:off
+import org.scalatest.funsuite.AnyFunSuite
+import org.scalatest.matchers.must.Matchers
+
+import org.apache.spark.sql.catalyst.util.CollationFactory.getInstance
+import org.apache.spark.unsafe.types.UTF8String.{fromString => toUTF8}
+
+class CollationFactorySuite extends AnyFunSuite with Matchers { // scalastyle:ignore funsuite

Review Comment:
   we should extend `SparkFunSuite` instead of `AnyFunSuite`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46833][SQL] Collations - Introducing CollationFactory which provides comparison and hashing rules for supported collations [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #44968:
URL: https://github.com/apache/spark/pull/44968#discussion_r1474537918


##########
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationFactory.java:
##########
@@ -0,0 +1,171 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.sql.catalyst.util;
+
+import java.util.*;
+import java.util.function.BiFunction;
+import java.util.function.Function;
+
+import com.ibm.icu.util.ULocale;
+import com.ibm.icu.text.Collator;
+
+import org.apache.spark.SparkIllegalArgumentException;
+import org.apache.spark.unsafe.types.UTF8String;
+
+/**
+ * Static entry point for collation aware string functions.
+ * Provides functionality to the UTF8String object which respects defined collation settings.
+ */
+public final class CollationFactory {
+  /**
+   * Entry encapsulating all information about a collation.
+   */
+  public static class CollationInfo {
+    public final String collationName;
+    public final Collator collator;
+    public final Comparator<UTF8String> comparator;
+
+    /**
+     * Version of the collation. This is the version of the ICU library used to create the collator.
+     * For non-ICU collations (e.g. UTF8 Binary) the version is set to "1.0".
+     */
+    public final String version;
+
+    /**
+     * Collation sensitive hash function. Output for two UTF8Strings will be the same if they are
+     * equal according to the collation.
+     */
+    public final Function<UTF8String, Integer> hashFunction;
+
+    /**
+     * Potentially faster way than using comparator to compare two UTF8Strings for equality.
+     * Falls back to binary comparison if the collation is binary.
+     */
+    public final BiFunction<UTF8String, UTF8String, Boolean> equalsFunction;
+
+    /**
+     * Binary collation implies that UTF8Strings are considered equal only if they are
+     * byte for byte equal. All accent or case-insensitive collations are considered non-binary.
+     */
+    public final boolean isBinaryCollation;
+
+    public CollationInfo(
+      String collationName,

Review Comment:
   nit: 4 spaces indentation



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46833][SQL] Collations - Introducing CollationFactory which provides comparison and hashing rules for supported collations [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #44968:
URL: https://github.com/apache/spark/pull/44968#discussion_r1474555750


##########
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationFactory.java:
##########
@@ -0,0 +1,171 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.sql.catalyst.util;
+
+import java.util.*;
+import java.util.function.BiFunction;
+import java.util.function.Function;
+
+import com.ibm.icu.util.ULocale;
+import com.ibm.icu.text.Collator;
+
+import org.apache.spark.SparkIllegalArgumentException;
+import org.apache.spark.unsafe.types.UTF8String;
+
+/**
+ * Static entry point for collation aware string functions.
+ * Provides functionality to the UTF8String object which respects defined collation settings.
+ */
+public final class CollationFactory {
+  /**
+   * Entry encapsulating all information about a collation.
+   */
+  public static class CollationInfo {
+    public final String collationName;
+    public final Collator collator;
+    public final Comparator<UTF8String> comparator;
+
+    /**
+     * Version of the collation. This is the version of the ICU library used to create the collator.
+     * For non-ICU collations (e.g. UTF8 Binary) the version is set to "1.0".
+     */
+    public final String version;
+
+    /**
+     * Collation sensitive hash function. Output for two UTF8Strings will be the same if they are
+     * equal according to the collation.
+     */
+    public final Function<UTF8String, Integer> hashFunction;
+
+    /**
+     * Potentially faster way than using comparator to compare two UTF8Strings for equality.
+     * Falls back to binary comparison if the collation is binary.
+     */
+    public final BiFunction<UTF8String, UTF8String, Boolean> equalsFunction;
+
+    /**
+     * Binary collation implies that UTF8Strings are considered equal only if they are
+     * byte for byte equal. All accent or case-insensitive collations are considered non-binary.
+     */
+    public final boolean isBinaryCollation;
+
+    public CollationInfo(
+      String collationName,
+      Collator collator,
+      Comparator<UTF8String> comparator,
+      String version,
+      Function<UTF8String, Integer> hashFunction,
+      boolean isBinaryCollation) {
+      this.collationName = collationName;
+      this.collator = collator;
+      this.comparator = comparator;
+      this.version = version;
+      this.hashFunction = hashFunction;
+      this.isBinaryCollation = isBinaryCollation;
+
+      if (isBinaryCollation) {
+        this.equalsFunction = UTF8String::binaryEquals;
+      } else {
+        this.equalsFunction = (s1, s2) -> this.comparator.compare(s1, s2) == 0;
+      }
+    }
+
+    /**
+     * Constructor with comparators that are inherited from the given collator.
+     */
+    public CollationInfo(
+      String collationName, Collator collator, String version, boolean isBinaryCollation) {
+      this(
+        collationName,
+        collator,
+        (s1, s2) -> collator.compare(s1.toString(), s2.toString()),
+        version,
+        s -> collator.getCollationKey(s.toString()).hashCode(),
+        isBinaryCollation);
+    }
+  }
+
+  private final CollationInfo[] collatorTable;
+  private final Hashtable<String, Integer> collationNameToIdMap = new Hashtable<>();
+
+  private CollationFactory() {
+    collatorTable = new CollationInfo[4];
+
+    // Binary comparison. This is the default collation.
+    // No custom comparators will be used for this collation.
+    // Instead, we rely on byte for byte comparison.
+    collatorTable[0] = new CollationInfo(
+      "UCS_BASIC",
+      null,
+      UTF8String::binaryCompare,
+      "1.0",
+      UTF8String::binaryHash,
+      true);
+
+    // Case-insensitive UTF8 binary collation.
+    // TODO: Do in place comparisons instead of creating new strings.
+    collatorTable[1] = new CollationInfo(
+      "UCS_BASIC_LCASE",
+      null,
+      (s1, s2) -> s1.toLowerCase().binaryCompare(s2.toLowerCase()), "1.0",
+      (s) -> s.toLowerCase().binaryHash(),
+      false);
+
+    // UNICODE case sensitive comparison (ROOT locale, in ICU).
+    collatorTable[2] = new CollationInfo(
+      "UNICODE", Collator.getInstance(ULocale.ROOT), "153.120.0.0", true);
+    collatorTable[2].collator.setStrength(Collator.TERTIARY);
+
+
+    // UNICODE case-insensitive comparison (ROOT locale, in ICU + Secondary strength).
+    collatorTable[3] = new CollationInfo(
+            "UNICODE_CI", Collator.getInstance(ULocale.ROOT), "153.120.0.0", false);
+    collatorTable[3].collator.setStrength(Collator.SECONDARY);
+
+    for (int i = 0; i < collatorTable.length; i++) {
+      this.collationNameToIdMap.put(collatorTable[i].collationName, i);
+    }
+  }
+
+  private static final CollationFactory instance = new CollationFactory();
+
+  /**
+   * Returns the collation id for the given collation name.
+   */
+  public int collationNameToId(String collationName) {
+    String normalizedName = collationName.toUpperCase();
+    if (collationNameToIdMap.containsKey(normalizedName)) {
+      return collationNameToIdMap.get(normalizedName);
+    } else {
+      throw SparkIllegalArgumentException.apply(

Review Comment:
   It's fine to throw `SparkException`. We have error class now and we don't need different exception classes for different error categories. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46833][SQL] Collations - Introducing CollationFactory which provides comparison and hashing rules for supported collations [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #44968:
URL: https://github.com/apache/spark/pull/44968#discussion_r1474560878


##########
common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java:
##########
@@ -1388,28 +1388,43 @@ public UTF8String copy() {
     return fromBytes(bytes);
   }
 
+  /**
+   * Collation aware comparison between two UTF8 strings.

Review Comment:
   Can we delay this change? I have not made my mind yet and I still feel awkward putting a "flag" in `UTF8String` to control the behavior of its APIs. It's probably better to keep UTF8String a pure data class and we create different comparators/hashers w.r.t. the collation.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46833][SQL] Collations - Introducing CollationFactory which provides comparison and hashing rules for supported collations [spark]

Posted by "dbatomic (via GitHub)" <gi...@apache.org>.
dbatomic commented on code in PR #44968:
URL: https://github.com/apache/spark/pull/44968#discussion_r1474700335


##########
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationFactory.java:
##########
@@ -0,0 +1,171 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.sql.catalyst.util;
+
+import java.util.*;
+import java.util.function.BiFunction;
+import java.util.function.Function;
+
+import com.ibm.icu.util.ULocale;
+import com.ibm.icu.text.Collator;
+
+import org.apache.spark.SparkIllegalArgumentException;
+import org.apache.spark.unsafe.types.UTF8String;
+
+/**
+ * Static entry point for collation aware string functions.
+ * Provides functionality to the UTF8String object which respects defined collation settings.
+ */
+public final class CollationFactory {
+  /**
+   * Entry encapsulating all information about a collation.
+   */
+  public static class CollationInfo {
+    public final String collationName;
+    public final Collator collator;
+    public final Comparator<UTF8String> comparator;
+
+    /**
+     * Version of the collation. This is the version of the ICU library used to create the collator.
+     * For non-ICU collations (e.g. UTF8 Binary) the version is set to "1.0".
+     */
+    public final String version;
+
+    /**
+     * Collation sensitive hash function. Output for two UTF8Strings will be the same if they are
+     * equal according to the collation.
+     */
+    public final Function<UTF8String, Integer> hashFunction;
+
+    /**
+     * Potentially faster way than using comparator to compare two UTF8Strings for equality.
+     * Falls back to binary comparison if the collation is binary.
+     */
+    public final BiFunction<UTF8String, UTF8String, Boolean> equalsFunction;
+
+    /**
+     * Binary collation implies that UTF8Strings are considered equal only if they are
+     * byte for byte equal. All accent or case-insensitive collations are considered non-binary.
+     */
+    public final boolean isBinaryCollation;
+
+    public CollationInfo(
+      String collationName,
+      Collator collator,
+      Comparator<UTF8String> comparator,
+      String version,
+      Function<UTF8String, Integer> hashFunction,
+      boolean isBinaryCollation) {
+      this.collationName = collationName;
+      this.collator = collator;
+      this.comparator = comparator;
+      this.version = version;
+      this.hashFunction = hashFunction;
+      this.isBinaryCollation = isBinaryCollation;
+
+      if (isBinaryCollation) {
+        this.equalsFunction = UTF8String::binaryEquals;

Review Comment:
   With latest changes that keep UTF8String intact this conversation is no longer relevant. I removed binaryEquals from this PR.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46833][SQL] Collations - Introducing CollationFactory which provides comparison and hashing rules for supported collations [spark]

Posted by "dbatomic (via GitHub)" <gi...@apache.org>.
dbatomic commented on code in PR #44968:
URL: https://github.com/apache/spark/pull/44968#discussion_r1477002292


##########
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationFactory.java:
##########
@@ -0,0 +1,175 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.sql.catalyst.util;
+
+import java.util.*;
+import java.util.function.BiFunction;
+import java.util.function.ToLongFunction;
+
+import com.ibm.icu.util.ULocale;
+import com.ibm.icu.text.Collator;
+
+import org.apache.spark.SparkException;
+import org.apache.spark.unsafe.types.UTF8String;
+
+/**
+ * Static entry point for collation aware string functions.
+ * Provides functionality to the UTF8String object which respects defined collation settings.
+ */
+public final class CollationFactory {
+  /**
+   * Entry encapsulating all information about a collation.
+   */
+  public static class Collation {
+    public final String collationName;
+    public final Collator collator;
+    public final Comparator<UTF8String> comparator;
+
+    /**
+     * Version of the collation. This is the version of the ICU library Collator.
+     * For non-ICU collations (e.g. UTF8 Binary) the version is set to "1.0".
+     * When using ICU Collator this version is exposed through collator.getVersion().
+     * Whenever the collation is updated, the version should be updated as well or kept
+     * for backwards compatibility.
+     */
+    public final String version;
+
+    /**
+     * Collation sensitive hash function. Output for two UTF8Strings will be the same if they are
+     * equal according to the collation.
+     */
+    public final ToLongFunction<UTF8String> hashFunction;
+
+    /**
+     * Potentially faster way than using comparator to compare two UTF8Strings for equality.
+     * Falls back to binary comparison if the collation is binary.
+     */
+    public final BiFunction<UTF8String, UTF8String, Boolean> equalsFunction;
+
+    /**
+     * Binary collation implies that UTF8Strings are considered equal only if they are
+     * byte for byte equal. All accent or case-insensitive collations are considered non-binary.
+     */
+    public final boolean isBinaryCollation;
+
+    public Collation(
+        String collationName,
+        Collator collator,
+        Comparator<UTF8String> comparator,
+        String version,
+        ToLongFunction<UTF8String> hashFunction,
+        boolean isBinaryCollation) {
+      this.collationName = collationName;
+      this.collator = collator;
+      this.comparator = comparator;
+      this.version = version;
+      this.hashFunction = hashFunction;
+      this.isBinaryCollation = isBinaryCollation;
+
+      if (isBinaryCollation) {
+        this.equalsFunction = UTF8String::equals;
+      } else {
+        this.equalsFunction = (s1, s2) -> this.comparator.compare(s1, s2) == 0;
+      }
+    }
+
+    /**
+     * Constructor with comparators that are inherited from the given collator.
+     */
+    public Collation(
+        String collationName, Collator collator, String version, boolean isBinaryCollation) {
+      this(
+        collationName,
+        collator,
+        (s1, s2) -> collator.compare(s1.toString(), s2.toString()),
+        version,
+        s -> (long)collator.getCollationKey(s.toString()).hashCode(),
+        isBinaryCollation);
+    }
+  }
+
+  private static final Collation[] collationTable = new Collation[4];
+  private static final HashMap<String, Integer> collationNameToIdMap = new HashMap<>();
+
+  static {
+    // Binary comparison. This is the default collation.
+    // No custom comparators will be used for this collation.
+    // Instead, we rely on byte for byte comparison.
+    collationTable[0] = new Collation(
+      "UCS_BASIC",
+      null,
+      UTF8String::compareTo,
+      "1.0",
+      s -> (long)s.hashCode(),
+      true);
+
+    // Case-insensitive UTF8 binary collation.
+    // TODO: Do in place comparisons instead of creating new strings.
+    collationTable[1] = new Collation(
+      "UCS_BASIC_LCASE",
+      null,
+      Comparator.comparing(UTF8String::toLowerCase),
+      "1.0",
+      (s) -> (long)s.toLowerCase().hashCode(),
+      false);
+
+    // UNICODE case sensitive comparison (ROOT locale, in ICU).
+    collationTable[2] = new Collation(
+      "UNICODE", Collator.getInstance(ULocale.ROOT), "153.120.0.0", true);
+    collationTable[2].collator.setStrength(Collator.TERTIARY);
+
+
+    // UNICODE case-insensitive comparison (ROOT locale, in ICU + Secondary strength).
+    collationTable[3] = new Collation(
+            "UNICODE_CI", Collator.getInstance(ULocale.ROOT), "153.120.0.0", false);
+    collationTable[3].collator.setStrength(Collator.SECONDARY);
+
+    for (int i = 0; i < collationTable.length; i++) {
+      collationNameToIdMap.put(collationTable[i].collationName, i);
+    }
+  }
+
+  /**
+   * Returns the collation id for the given collation name.
+   */
+  public static int collationNameToId(String collationName) throws SparkException {
+    String normalizedName = collationName.toUpperCase();
+    if (collationNameToIdMap.containsKey(normalizedName)) {
+      return collationNameToIdMap.get(normalizedName);
+    } else {
+      Collation suggestion = Collections.min(List.of(collationTable), Comparator.comparingInt(
+        c -> UTF8String.fromString(c.collationName).levenshteinDistance(
+          UTF8String.fromString(normalizedName))));
+
+      Map<String, String> params = new HashMap<>();
+      params.put("collationName", collationName);
+      params.put("proposal", suggestion.collationName);
+
+      throw new SparkException(
+       "COLLATION_INVALID_NAME", SparkException.constructMessageParams(params), null);

Review Comment:
   done



##########
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationFactory.java:
##########
@@ -0,0 +1,175 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.sql.catalyst.util;
+
+import java.util.*;
+import java.util.function.BiFunction;
+import java.util.function.ToLongFunction;
+
+import com.ibm.icu.util.ULocale;
+import com.ibm.icu.text.Collator;
+
+import org.apache.spark.SparkException;
+import org.apache.spark.unsafe.types.UTF8String;
+
+/**
+ * Static entry point for collation aware string functions.
+ * Provides functionality to the UTF8String object which respects defined collation settings.
+ */
+public final class CollationFactory {
+  /**
+   * Entry encapsulating all information about a collation.
+   */
+  public static class Collation {
+    public final String collationName;
+    public final Collator collator;
+    public final Comparator<UTF8String> comparator;
+
+    /**
+     * Version of the collation. This is the version of the ICU library Collator.
+     * For non-ICU collations (e.g. UTF8 Binary) the version is set to "1.0".
+     * When using ICU Collator this version is exposed through collator.getVersion().
+     * Whenever the collation is updated, the version should be updated as well or kept
+     * for backwards compatibility.
+     */
+    public final String version;
+
+    /**
+     * Collation sensitive hash function. Output for two UTF8Strings will be the same if they are
+     * equal according to the collation.
+     */
+    public final ToLongFunction<UTF8String> hashFunction;
+
+    /**
+     * Potentially faster way than using comparator to compare two UTF8Strings for equality.
+     * Falls back to binary comparison if the collation is binary.
+     */
+    public final BiFunction<UTF8String, UTF8String, Boolean> equalsFunction;
+
+    /**
+     * Binary collation implies that UTF8Strings are considered equal only if they are
+     * byte for byte equal. All accent or case-insensitive collations are considered non-binary.
+     */
+    public final boolean isBinaryCollation;
+
+    public Collation(
+        String collationName,
+        Collator collator,
+        Comparator<UTF8String> comparator,
+        String version,
+        ToLongFunction<UTF8String> hashFunction,
+        boolean isBinaryCollation) {
+      this.collationName = collationName;
+      this.collator = collator;
+      this.comparator = comparator;
+      this.version = version;
+      this.hashFunction = hashFunction;
+      this.isBinaryCollation = isBinaryCollation;
+
+      if (isBinaryCollation) {
+        this.equalsFunction = UTF8String::equals;
+      } else {
+        this.equalsFunction = (s1, s2) -> this.comparator.compare(s1, s2) == 0;
+      }
+    }
+
+    /**
+     * Constructor with comparators that are inherited from the given collator.
+     */
+    public Collation(
+        String collationName, Collator collator, String version, boolean isBinaryCollation) {
+      this(
+        collationName,
+        collator,
+        (s1, s2) -> collator.compare(s1.toString(), s2.toString()),
+        version,
+        s -> (long)collator.getCollationKey(s.toString()).hashCode(),
+        isBinaryCollation);
+    }
+  }
+
+  private static final Collation[] collationTable = new Collation[4];
+  private static final HashMap<String, Integer> collationNameToIdMap = new HashMap<>();
+
+  static {
+    // Binary comparison. This is the default collation.
+    // No custom comparators will be used for this collation.
+    // Instead, we rely on byte for byte comparison.
+    collationTable[0] = new Collation(
+      "UCS_BASIC",
+      null,
+      UTF8String::compareTo,
+      "1.0",
+      s -> (long)s.hashCode(),
+      true);
+
+    // Case-insensitive UTF8 binary collation.
+    // TODO: Do in place comparisons instead of creating new strings.
+    collationTable[1] = new Collation(
+      "UCS_BASIC_LCASE",
+      null,
+      Comparator.comparing(UTF8String::toLowerCase),
+      "1.0",
+      (s) -> (long)s.toLowerCase().hashCode(),
+      false);
+
+    // UNICODE case sensitive comparison (ROOT locale, in ICU).
+    collationTable[2] = new Collation(
+      "UNICODE", Collator.getInstance(ULocale.ROOT), "153.120.0.0", true);
+    collationTable[2].collator.setStrength(Collator.TERTIARY);
+
+
+    // UNICODE case-insensitive comparison (ROOT locale, in ICU + Secondary strength).
+    collationTable[3] = new Collation(
+            "UNICODE_CI", Collator.getInstance(ULocale.ROOT), "153.120.0.0", false);

Review Comment:
   done



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46833][SQL] Collations - Introducing CollationFactory which provides comparison and hashing rules for supported collations [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #44968:
URL: https://github.com/apache/spark/pull/44968#discussion_r1474537234


##########
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationFactory.java:
##########
@@ -0,0 +1,171 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.sql.catalyst.util;
+
+import java.util.*;
+import java.util.function.BiFunction;
+import java.util.function.Function;
+
+import com.ibm.icu.util.ULocale;
+import com.ibm.icu.text.Collator;
+
+import org.apache.spark.SparkIllegalArgumentException;
+import org.apache.spark.unsafe.types.UTF8String;
+
+/**
+ * Static entry point for collation aware string functions.
+ * Provides functionality to the UTF8String object which respects defined collation settings.
+ */
+public final class CollationFactory {
+  /**
+   * Entry encapsulating all information about a collation.
+   */
+  public static class CollationInfo {
+    public final String collationName;
+    public final Collator collator;
+    public final Comparator<UTF8String> comparator;
+
+    /**
+     * Version of the collation. This is the version of the ICU library used to create the collator.
+     * For non-ICU collations (e.g. UTF8 Binary) the version is set to "1.0".
+     */
+    public final String version;
+
+    /**
+     * Collation sensitive hash function. Output for two UTF8Strings will be the same if they are
+     * equal according to the collation.
+     */
+    public final Function<UTF8String, Integer> hashFunction;
+
+    /**
+     * Potentially faster way than using comparator to compare two UTF8Strings for equality.
+     * Falls back to binary comparison if the collation is binary.
+     */
+    public final BiFunction<UTF8String, UTF8String, Boolean> equalsFunction;
+
+    /**
+     * Binary collation implies that UTF8Strings are considered equal only if they are
+     * byte for byte equal. All accent or case-insensitive collations are considered non-binary.
+     */
+    public final boolean isBinaryCollation;
+
+    public CollationInfo(
+      String collationName,
+      Collator collator,
+      Comparator<UTF8String> comparator,
+      String version,
+      Function<UTF8String, Integer> hashFunction,
+      boolean isBinaryCollation) {
+      this.collationName = collationName;
+      this.collator = collator;
+      this.comparator = comparator;
+      this.version = version;
+      this.hashFunction = hashFunction;
+      this.isBinaryCollation = isBinaryCollation;
+
+      if (isBinaryCollation) {
+        this.equalsFunction = UTF8String::binaryEquals;

Review Comment:
   shouldn't it be `UTF8String::equals`?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46833][SQL] Collations - Introducing CollationFactory which provides comparison and hashing rules for supported collations [spark]

Posted by "MaxGekk (via GitHub)" <gi...@apache.org>.
MaxGekk commented on code in PR #44968:
URL: https://github.com/apache/spark/pull/44968#discussion_r1475711921


##########
common/utils/src/main/resources/error/error-classes.json:
##########
@@ -451,6 +451,12 @@
     ],
     "sqlState" : "42704"
   },
+  "COLLATION_INVALID_NAME" : {
+    "message" : [
+      "The value <collationName> does not represent correct collation name."
+    ],
+    "sqlState" : "22018"

Review Comment:
   Could you elaborate a little bit why did you chose the state:
   ```json
       "22018": {
           "description": "invalid character value for cast",
           "origin": "SQL/Foundation",
           "standard": "Y",
           "usedBy": ["SQL/Foundation", "PostgreSQL", "DB2", "Redshift", "Oracle", "SQL Server"]
       },
   ```



##########
common/utils/src/main/resources/error/error-classes.json:
##########
@@ -451,6 +451,12 @@
     ],
     "sqlState" : "42704"
   },
+  "COLLATION_INVALID_NAME" : {
+    "message" : [
+      "The value <collationName> does not represent correct collation name."

Review Comment:
   Could we propose some correct collation names that are similar to `collationName`? See for example:
   https://github.com/apache/spark/blob/a3432428e760fc16610cfe3380d3bdea7654f75d/sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryCompilationErrors.scala#L109
   or
   https://github.com/apache/spark/blob/a3432428e760fc16610cfe3380d3bdea7654f75d/sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryCompilationErrors.scala#L319



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46833][SQL] Collations - Introducing CollationFactory which provides comparison and hashing rules for supported collations [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #44968:
URL: https://github.com/apache/spark/pull/44968#discussion_r1474531863


##########
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationFactory.java:
##########
@@ -0,0 +1,171 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.sql.catalyst.util;
+
+import java.util.*;
+import java.util.function.BiFunction;
+import java.util.function.Function;
+
+import com.ibm.icu.util.ULocale;
+import com.ibm.icu.text.Collator;
+
+import org.apache.spark.SparkIllegalArgumentException;
+import org.apache.spark.unsafe.types.UTF8String;
+
+/**
+ * Static entry point for collation aware string functions.
+ * Provides functionality to the UTF8String object which respects defined collation settings.
+ */
+public final class CollationFactory {
+  /**
+   * Entry encapsulating all information about a collation.
+   */
+  public static class CollationInfo {

Review Comment:
   +1



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46833][SQL] Collations - Introducing CollationFactory which provides comparison and hashing rules for supported collations [spark]

Posted by "dbatomic (via GitHub)" <gi...@apache.org>.
dbatomic commented on code in PR #44968:
URL: https://github.com/apache/spark/pull/44968#discussion_r1474700785


##########
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationFactory.java:
##########
@@ -0,0 +1,171 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.sql.catalyst.util;
+
+import java.util.*;
+import java.util.function.BiFunction;
+import java.util.function.Function;
+
+import com.ibm.icu.util.ULocale;
+import com.ibm.icu.text.Collator;
+
+import org.apache.spark.SparkIllegalArgumentException;
+import org.apache.spark.unsafe.types.UTF8String;
+
+/**
+ * Static entry point for collation aware string functions.
+ * Provides functionality to the UTF8String object which respects defined collation settings.
+ */
+public final class CollationFactory {
+  /**
+   * Entry encapsulating all information about a collation.
+   */
+  public static class CollationInfo {
+    public final String collationName;
+    public final Collator collator;
+    public final Comparator<UTF8String> comparator;
+
+    /**
+     * Version of the collation. This is the version of the ICU library used to create the collator.
+     * For non-ICU collations (e.g. UTF8 Binary) the version is set to "1.0".
+     */
+    public final String version;
+
+    /**
+     * Collation sensitive hash function. Output for two UTF8Strings will be the same if they are
+     * equal according to the collation.
+     */
+    public final Function<UTF8String, Integer> hashFunction;
+
+    /**
+     * Potentially faster way than using comparator to compare two UTF8Strings for equality.
+     * Falls back to binary comparison if the collation is binary.
+     */
+    public final BiFunction<UTF8String, UTF8String, Boolean> equalsFunction;
+
+    /**
+     * Binary collation implies that UTF8Strings are considered equal only if they are
+     * byte for byte equal. All accent or case-insensitive collations are considered non-binary.
+     */
+    public final boolean isBinaryCollation;
+
+    public CollationInfo(
+      String collationName,
+      Collator collator,
+      Comparator<UTF8String> comparator,
+      String version,
+      Function<UTF8String, Integer> hashFunction,
+      boolean isBinaryCollation) {
+      this.collationName = collationName;
+      this.collator = collator;
+      this.comparator = comparator;
+      this.version = version;
+      this.hashFunction = hashFunction;
+      this.isBinaryCollation = isBinaryCollation;
+
+      if (isBinaryCollation) {
+        this.equalsFunction = UTF8String::binaryEquals;
+      } else {
+        this.equalsFunction = (s1, s2) -> this.comparator.compare(s1, s2) == 0;
+      }
+    }
+
+    /**
+     * Constructor with comparators that are inherited from the given collator.
+     */
+    public CollationInfo(
+      String collationName, Collator collator, String version, boolean isBinaryCollation) {
+      this(
+        collationName,
+        collator,
+        (s1, s2) -> collator.compare(s1.toString(), s2.toString()),
+        version,
+        s -> collator.getCollationKey(s.toString()).hashCode(),
+        isBinaryCollation);
+    }
+  }
+
+  private final CollationInfo[] collatorTable;
+  private final Hashtable<String, Integer> collationNameToIdMap = new Hashtable<>();
+
+  private CollationFactory() {
+    collatorTable = new CollationInfo[4];
+
+    // Binary comparison. This is the default collation.
+    // No custom comparators will be used for this collation.
+    // Instead, we rely on byte for byte comparison.
+    collatorTable[0] = new CollationInfo(
+      "UCS_BASIC",
+      null,
+      UTF8String::binaryCompare,
+      "1.0",
+      UTF8String::binaryHash,
+      true);
+
+    // Case-insensitive UTF8 binary collation.
+    // TODO: Do in place comparisons instead of creating new strings.
+    collatorTable[1] = new CollationInfo(
+      "UCS_BASIC_LCASE",
+      null,
+      (s1, s2) -> s1.toLowerCase().binaryCompare(s2.toLowerCase()), "1.0",
+      (s) -> s.toLowerCase().binaryHash(),
+      false);
+
+    // UNICODE case sensitive comparison (ROOT locale, in ICU).
+    collatorTable[2] = new CollationInfo(
+      "UNICODE", Collator.getInstance(ULocale.ROOT), "153.120.0.0", true);
+    collatorTable[2].collator.setStrength(Collator.TERTIARY);
+
+
+    // UNICODE case-insensitive comparison (ROOT locale, in ICU + Secondary strength).
+    collatorTable[3] = new CollationInfo(
+            "UNICODE_CI", Collator.getInstance(ULocale.ROOT), "153.120.0.0", false);
+    collatorTable[3].collator.setStrength(Collator.SECONDARY);
+
+    for (int i = 0; i < collatorTable.length; i++) {
+      this.collationNameToIdMap.put(collatorTable[i].collationName, i);
+    }
+  }
+
+  private static final CollationFactory instance = new CollationFactory();
+
+  /**
+   * Returns the collation id for the given collation name.
+   */
+  public int collationNameToId(String collationName) {
+    String normalizedName = collationName.toUpperCase();
+    if (collationNameToIdMap.containsKey(normalizedName)) {
+      return collationNameToIdMap.get(normalizedName);
+    } else {
+      throw SparkIllegalArgumentException.apply(

Review Comment:
   done.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46833][SQL] Collations - Introducing CollationFactory which provides comparison and hashing rules for supported collations [spark]

Posted by "dbatomic (via GitHub)" <gi...@apache.org>.
dbatomic commented on code in PR #44968:
URL: https://github.com/apache/spark/pull/44968#discussion_r1474696087


##########
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationFactory.java:
##########
@@ -0,0 +1,171 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.sql.catalyst.util;
+
+import java.util.*;
+import java.util.function.BiFunction;
+import java.util.function.Function;
+
+import com.ibm.icu.util.ULocale;
+import com.ibm.icu.text.Collator;
+
+import org.apache.spark.SparkIllegalArgumentException;
+import org.apache.spark.unsafe.types.UTF8String;
+
+/**
+ * Static entry point for collation aware string functions.
+ * Provides functionality to the UTF8String object which respects defined collation settings.
+ */
+public final class CollationFactory {
+  /**
+   * Entry encapsulating all information about a collation.
+   */
+  public static class CollationInfo {
+    public final String collationName;
+    public final Collator collator;
+    public final Comparator<UTF8String> comparator;
+
+    /**
+     * Version of the collation. This is the version of the ICU library used to create the collator.
+     * For non-ICU collations (e.g. UTF8 Binary) the version is set to "1.0".
+     */
+    public final String version;
+
+    /**
+     * Collation sensitive hash function. Output for two UTF8Strings will be the same if they are
+     * equal according to the collation.
+     */
+    public final Function<UTF8String, Integer> hashFunction;
+
+    /**
+     * Potentially faster way than using comparator to compare two UTF8Strings for equality.
+     * Falls back to binary comparison if the collation is binary.
+     */
+    public final BiFunction<UTF8String, UTF8String, Boolean> equalsFunction;
+
+    /**
+     * Binary collation implies that UTF8Strings are considered equal only if they are
+     * byte for byte equal. All accent or case-insensitive collations are considered non-binary.
+     */
+    public final boolean isBinaryCollation;
+
+    public CollationInfo(
+      String collationName,

Review Comment:
   not sure if I am going blind? I see only 2 :)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46833][SQL] Collations - Introducing CollationFactory which provides comparison and hashing rules for supported collations [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #44968:
URL: https://github.com/apache/spark/pull/44968#discussion_r1475972161


##########
common/utils/src/main/resources/error/error-classes.json:
##########
@@ -451,6 +451,12 @@
     ],
     "sqlState" : "42704"
   },
+  "COLLATION_INVALID_NAME" : {
+    "message" : [
+      "The value <collationName> does not represent correct collation name."
+    ],
+    "sqlState" : "22018"

Review Comment:
   SGTM



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46833][SQL] Collations - Introducing CollationFactory which provides comparison and hashing rules for supported collations [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #44968:
URL: https://github.com/apache/spark/pull/44968#discussion_r1475679195


##########
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationFactory.java:
##########
@@ -0,0 +1,171 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.sql.catalyst.util;
+
+import java.util.*;
+import java.util.function.BiFunction;
+import java.util.function.Function;
+
+import com.ibm.icu.util.ULocale;
+import com.ibm.icu.text.Collator;
+
+import org.apache.spark.SparkIllegalArgumentException;
+import org.apache.spark.unsafe.types.UTF8String;
+
+/**
+ * Static entry point for collation aware string functions.
+ * Provides functionality to the UTF8String object which respects defined collation settings.
+ */
+public final class CollationFactory {
+  /**
+   * Entry encapsulating all information about a collation.
+   */
+  public static class CollationInfo {
+    public final String collationName;
+    public final Collator collator;
+    public final Comparator<UTF8String> comparator;
+
+    /**
+     * Version of the collation. This is the version of the ICU library used to create the collator.
+     * For non-ICU collations (e.g. UTF8 Binary) the version is set to "1.0".
+     */
+    public final String version;
+
+    /**
+     * Collation sensitive hash function. Output for two UTF8Strings will be the same if they are
+     * equal according to the collation.
+     */
+    public final Function<UTF8String, Integer> hashFunction;
+
+    /**
+     * Potentially faster way than using comparator to compare two UTF8Strings for equality.
+     * Falls back to binary comparison if the collation is binary.
+     */
+    public final BiFunction<UTF8String, UTF8String, Boolean> equalsFunction;
+
+    /**
+     * Binary collation implies that UTF8Strings are considered equal only if they are
+     * byte for byte equal. All accent or case-insensitive collations are considered non-binary.
+     */
+    public final boolean isBinaryCollation;
+
+    public CollationInfo(
+      String collationName,

Review Comment:
   we should use 4 spaces indentation, but not a big deal as the style check doesn't enforce it.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46833][SQL] Collations - Introducing CollationFactory which provides comparison and hashing rules for supported collations [spark]

Posted by "dbatomic (via GitHub)" <gi...@apache.org>.
dbatomic commented on code in PR #44968:
URL: https://github.com/apache/spark/pull/44968#discussion_r1475848143


##########
common/utils/src/main/resources/error/error-classes.json:
##########
@@ -451,6 +451,12 @@
     ],
     "sqlState" : "42704"
   },
+  "COLLATION_INVALID_NAME" : {
+    "message" : [
+      "The value <collationName> does not represent correct collation name."

Review Comment:
   Done. Please take a look.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46833][SQL] Collations - Introducing CollationFactory which provides comparison and hashing rules for supported collations [spark]

Posted by "dbatomic (via GitHub)" <gi...@apache.org>.
dbatomic commented on code in PR #44968:
URL: https://github.com/apache/spark/pull/44968#discussion_r1474568842


##########
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationFactory.java:
##########
@@ -0,0 +1,171 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.sql.catalyst.util;
+
+import java.util.*;
+import java.util.function.BiFunction;
+import java.util.function.Function;
+
+import com.ibm.icu.util.ULocale;
+import com.ibm.icu.text.Collator;
+
+import org.apache.spark.SparkIllegalArgumentException;
+import org.apache.spark.unsafe.types.UTF8String;
+
+/**
+ * Static entry point for collation aware string functions.
+ * Provides functionality to the UTF8String object which respects defined collation settings.
+ */
+public final class CollationFactory {
+  /**
+   * Entry encapsulating all information about a collation.
+   */
+  public static class CollationInfo {
+    public final String collationName;
+    public final Collator collator;
+    public final Comparator<UTF8String> comparator;
+
+    /**
+     * Version of the collation. This is the version of the ICU library used to create the collator.
+     * For non-ICU collations (e.g. UTF8 Binary) the version is set to "1.0".
+     */
+    public final String version;
+
+    /**
+     * Collation sensitive hash function. Output for two UTF8Strings will be the same if they are
+     * equal according to the collation.
+     */
+    public final Function<UTF8String, Integer> hashFunction;
+
+    /**
+     * Potentially faster way than using comparator to compare two UTF8Strings for equality.
+     * Falls back to binary comparison if the collation is binary.
+     */
+    public final BiFunction<UTF8String, UTF8String, Boolean> equalsFunction;
+
+    /**
+     * Binary collation implies that UTF8Strings are considered equal only if they are
+     * byte for byte equal. All accent or case-insensitive collations are considered non-binary.
+     */
+    public final boolean isBinaryCollation;
+
+    public CollationInfo(
+      String collationName,
+      Collator collator,
+      Comparator<UTF8String> comparator,
+      String version,
+      Function<UTF8String, Integer> hashFunction,
+      boolean isBinaryCollation) {
+      this.collationName = collationName;
+      this.collator = collator;
+      this.comparator = comparator;
+      this.version = version;
+      this.hashFunction = hashFunction;
+      this.isBinaryCollation = isBinaryCollation;
+
+      if (isBinaryCollation) {
+        this.equalsFunction = UTF8String::binaryEquals;

Review Comment:
   No. Idea is following:
   1) Caller calls `UTF8String::equals`, as it does now (same goes for `compareTo`).
   2) `UTF8String` asks `CollationFactory` for appropriate equality/compare callback.
   3) `CollationFactory` returns `UTF8String::binaryEquals` if this is a binary collation. Otherwise it returns collation ICU provided equality check.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46833][SQL] Collations - Introducing CollationFactory which provides comparison and hashing rules for supported collations [spark]

Posted by "dbatomic (via GitHub)" <gi...@apache.org>.
dbatomic commented on code in PR #44968:
URL: https://github.com/apache/spark/pull/44968#discussion_r1474690556


##########
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationFactory.java:
##########
@@ -0,0 +1,171 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.sql.catalyst.util;
+
+import java.util.*;
+import java.util.function.BiFunction;
+import java.util.function.Function;
+
+import com.ibm.icu.util.ULocale;
+import com.ibm.icu.text.Collator;
+
+import org.apache.spark.SparkIllegalArgumentException;
+import org.apache.spark.unsafe.types.UTF8String;
+
+/**
+ * Static entry point for collation aware string functions.
+ * Provides functionality to the UTF8String object which respects defined collation settings.
+ */
+public final class CollationFactory {
+  /**
+   * Entry encapsulating all information about a collation.
+   */
+  public static class CollationInfo {
+    public final String collationName;
+    public final Collator collator;
+    public final Comparator<UTF8String> comparator;
+
+    /**
+     * Version of the collation. This is the version of the ICU library used to create the collator.
+     * For non-ICU collations (e.g. UTF8 Binary) the version is set to "1.0".
+     */
+    public final String version;
+
+    /**
+     * Collation sensitive hash function. Output for two UTF8Strings will be the same if they are
+     * equal according to the collation.
+     */
+    public final Function<UTF8String, Integer> hashFunction;

Review Comment:
   Sure. At this point `UTF8String` provides 32bit hash, but yeah, we can extend this to `long`. Nice trick with `ToLongFunction`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46833][SQL] Collations - Introducing CollationFactory which provides comparison and hashing rules for supported collations [spark]

Posted by "dbatomic (via GitHub)" <gi...@apache.org>.
dbatomic commented on code in PR #44968:
URL: https://github.com/apache/spark/pull/44968#discussion_r1474710569


##########
common/unsafe/src/test/java/org/apache/spark/unsafe/types/CollationFactorySuite.java:
##########
@@ -0,0 +1,135 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.unsafe.types;
+
+import org.apache.spark.SparkIllegalArgumentException;
+import org.junit.jupiter.api.Test;
+
+import static org.junit.jupiter.api.Assertions.*;
+import static org.apache.spark.sql.catalyst.util.CollationFactory.*;
+import static org.apache.spark.unsafe.types.UTF8String.*;
+
+import java.util.Arrays;
+
+public class CollationFactorySuite {
+    @Test
+    public void collationIdStability() {
+      CollationInfo collationInfo = getInstance().fetchCollationInfo(0);
+      assertEquals(collationInfo.collationName, "UCS_BASIC");
+      assertTrue(collationInfo.isBinaryCollation);
+
+      collationInfo = getInstance().fetchCollationInfo(1);
+      assertEquals(collationInfo.collationName, "UCS_BASIC_LCASE");
+      assertFalse(collationInfo.isBinaryCollation);
+
+      collationInfo = getInstance().fetchCollationInfo(2);
+      assertEquals(collationInfo.collationName, "UNICODE");
+      assertTrue(collationInfo.isBinaryCollation);
+
+      collationInfo = getInstance().fetchCollationInfo(3);
+      assertEquals(collationInfo.collationName, "UNICODE_CI");
+      assertFalse(collationInfo.isBinaryCollation);
+    }
+
+    @Test
+    public void collationFetchInvalidName() {
+      assertThrows(

Review Comment:
   Option is to also either move `SparkFunSuite` to common, or move the tests to `core`. I see that `UTF8StringPropertyCheckSuite` suffers from the same issue.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46833][SQL] Collations - Introducing CollationFactory which provides comparison and hashing rules for supported collations [spark]

Posted by "dbatomic (via GitHub)" <gi...@apache.org>.
dbatomic commented on code in PR #44968:
URL: https://github.com/apache/spark/pull/44968#discussion_r1475754908


##########
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationFactory.java:
##########
@@ -0,0 +1,171 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.sql.catalyst.util;
+
+import java.util.*;
+import java.util.function.BiFunction;
+import java.util.function.Function;
+
+import com.ibm.icu.util.ULocale;
+import com.ibm.icu.text.Collator;
+
+import org.apache.spark.SparkIllegalArgumentException;
+import org.apache.spark.unsafe.types.UTF8String;
+
+/**
+ * Static entry point for collation aware string functions.
+ * Provides functionality to the UTF8String object which respects defined collation settings.
+ */
+public final class CollationFactory {
+  /**
+   * Entry encapsulating all information about a collation.
+   */
+  public static class CollationInfo {
+    public final String collationName;
+    public final Collator collator;
+    public final Comparator<UTF8String> comparator;
+
+    /**
+     * Version of the collation. This is the version of the ICU library used to create the collator.
+     * For non-ICU collations (e.g. UTF8 Binary) the version is set to "1.0".
+     */
+    public final String version;
+
+    /**
+     * Collation sensitive hash function. Output for two UTF8Strings will be the same if they are
+     * equal according to the collation.
+     */
+    public final Function<UTF8String, Integer> hashFunction;
+
+    /**
+     * Potentially faster way than using comparator to compare two UTF8Strings for equality.
+     * Falls back to binary comparison if the collation is binary.
+     */
+    public final BiFunction<UTF8String, UTF8String, Boolean> equalsFunction;
+
+    /**
+     * Binary collation implies that UTF8Strings are considered equal only if they are
+     * byte for byte equal. All accent or case-insensitive collations are considered non-binary.
+     */
+    public final boolean isBinaryCollation;
+
+    public CollationInfo(
+      String collationName,
+      Collator collator,
+      Comparator<UTF8String> comparator,
+      String version,
+      Function<UTF8String, Integer> hashFunction,
+      boolean isBinaryCollation) {
+      this.collationName = collationName;
+      this.collator = collator;
+      this.comparator = comparator;
+      this.version = version;
+      this.hashFunction = hashFunction;
+      this.isBinaryCollation = isBinaryCollation;
+
+      if (isBinaryCollation) {
+        this.equalsFunction = UTF8String::binaryEquals;
+      } else {
+        this.equalsFunction = (s1, s2) -> this.comparator.compare(s1, s2) == 0;
+      }
+    }
+
+    /**
+     * Constructor with comparators that are inherited from the given collator.
+     */
+    public CollationInfo(
+      String collationName, Collator collator, String version, boolean isBinaryCollation) {
+      this(
+        collationName,
+        collator,
+        (s1, s2) -> collator.compare(s1.toString(), s2.toString()),
+        version,
+        s -> collator.getCollationKey(s.toString()).hashCode(),
+        isBinaryCollation);
+    }
+  }
+
+  private final CollationInfo[] collatorTable;
+  private final Hashtable<String, Integer> collationNameToIdMap = new Hashtable<>();
+
+  private CollationFactory() {
+    collatorTable = new CollationInfo[4];
+
+    // Binary comparison. This is the default collation.
+    // No custom comparators will be used for this collation.
+    // Instead, we rely on byte for byte comparison.
+    collatorTable[0] = new CollationInfo(
+      "UCS_BASIC",
+      null,
+      UTF8String::binaryCompare,
+      "1.0",
+      UTF8String::binaryHash,
+      true);
+
+    // Case-insensitive UTF8 binary collation.
+    // TODO: Do in place comparisons instead of creating new strings.
+    collatorTable[1] = new CollationInfo(
+      "UCS_BASIC_LCASE",
+      null,
+      (s1, s2) -> s1.toLowerCase().binaryCompare(s2.toLowerCase()), "1.0",
+      (s) -> s.toLowerCase().binaryHash(),
+      false);
+
+    // UNICODE case sensitive comparison (ROOT locale, in ICU).
+    collatorTable[2] = new CollationInfo(
+      "UNICODE", Collator.getInstance(ULocale.ROOT), "153.120.0.0", true);

Review Comment:
   It represents a hard coded version of collator used for this collation. e.g. `Collator.getInstance(ULocale.GERMAN).getVersion` will return "153.120.0.0" with current version of ICU. I added additional comment in the `version` field header.
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46833][SQL] Collations - Introducing CollationFactory which provides comparison and hashing rules for supported collations [spark]

Posted by "dbatomic (via GitHub)" <gi...@apache.org>.
dbatomic commented on code in PR #44968:
URL: https://github.com/apache/spark/pull/44968#discussion_r1475849461


##########
common/utils/src/main/resources/error/error-classes.json:
##########
@@ -451,6 +451,12 @@
     ],
     "sqlState" : "42704"
   },
+  "COLLATION_INVALID_NAME" : {
+    "message" : [
+      "The value <collationName> does not represent correct collation name."
+    ],
+    "sqlState" : "22018"

Review Comment:
   Yeah, this probably is not the right error state. Waiting for your feedback on the appropriate one.
   
   I just checked PGSql and it returns `"description": "An undefined column or parameter name was detected.",` Should we do the same?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46833][SQL] Collations - Introducing CollationFactory which provides comparison and hashing rules for supported collations [spark]

Posted by "dbatomic (via GitHub)" <gi...@apache.org>.
dbatomic commented on code in PR #44968:
URL: https://github.com/apache/spark/pull/44968#discussion_r1476531030


##########
common/unsafe/src/test/scala/org/apache/spark/unsafe/types/CollationFactorySuite.scala:
##########
@@ -0,0 +1,114 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.unsafe.types
+
+import org.apache.spark.SparkException
+// scalastyle:off
+import org.scalatest.funsuite.AnyFunSuite
+import org.scalatest.matchers.must.Matchers
+
+import org.apache.spark.sql.catalyst.util.CollationFactory._
+import org.apache.spark.unsafe.types.UTF8String.{fromString => toUTF8}
+
+class CollationFactorySuite extends AnyFunSuite with Matchers { // scalastyle:ignore funsuite
+  test("collationId stability") {
+    val ucsBasic = fetchCollation(0)
+    assert(ucsBasic.collationName == "UCS_BASIC")
+    assert(ucsBasic.isBinaryCollation)
+
+    val ucsBasicLcase = fetchCollation(1)
+    assert(ucsBasicLcase.collationName == "UCS_BASIC_LCASE")
+    assert(!ucsBasicLcase.isBinaryCollation)
+
+    val unicode = fetchCollation(2)
+    assert(unicode.collationName == "UNICODE")
+    assert(unicode.isBinaryCollation);
+
+    val unicodeCi = fetchCollation(3)
+    assert(unicodeCi.collationName == "UNICODE_CI")
+    assert(!unicodeCi.isBinaryCollation)
+  }
+
+  test("fetch invalid collation name") {
+    val error = intercept[SparkException] {
+      fetchCollation("UCS_BASIS")
+    }
+
+    assert(error.getMessage.contains(
+      "The value UCS_BASIS does not represent a correct collation name."))
+    assert(error.getMessage.contains("Suggested valid collation name: [UCS_BASIC]"))

Review Comment:
   done.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46833][SQL] Collations - Introducing CollationFactory which provides comparison and hashing rules for supported collations [spark]

Posted by "stefankandic (via GitHub)" <gi...@apache.org>.
stefankandic commented on code in PR #44968:
URL: https://github.com/apache/spark/pull/44968#discussion_r1472738441


##########
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationFactory.java:
##########
@@ -0,0 +1,171 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.sql.catalyst.util;
+
+import java.util.*;
+import java.util.function.BiFunction;
+import java.util.function.Function;
+
+import com.ibm.icu.util.ULocale;
+import com.ibm.icu.text.Collator;
+
+import org.apache.spark.SparkIllegalArgumentException;
+import org.apache.spark.unsafe.types.UTF8String;
+
+/**
+ * Static entry point for collation aware string functions.
+ * Provides functionality to the UTF8String object which respects defined collation settings.
+ */
+public final class CollationFactory {
+  /**
+   * Entry encapsulating all information about a collation.
+   */
+  public static class CollationInfo {

Review Comment:
   nit: have you maybe considered naming this class just `Collation`?
   
   At least to me suffix `info` suggests that the class only holds some metadata about collation but doesn't actually perform them. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46833][SQL] Collations - Introducing CollationFactory which provides comparison and hashing rules for supported collations [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #44968:
URL: https://github.com/apache/spark/pull/44968#discussion_r1475680925


##########
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationFactory.java:
##########
@@ -0,0 +1,171 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.sql.catalyst.util;
+
+import java.util.*;
+import java.util.function.BiFunction;
+import java.util.function.Function;
+
+import com.ibm.icu.util.ULocale;
+import com.ibm.icu.text.Collator;
+
+import org.apache.spark.SparkIllegalArgumentException;
+import org.apache.spark.unsafe.types.UTF8String;
+
+/**
+ * Static entry point for collation aware string functions.
+ * Provides functionality to the UTF8String object which respects defined collation settings.
+ */
+public final class CollationFactory {
+  /**
+   * Entry encapsulating all information about a collation.
+   */
+  public static class CollationInfo {
+    public final String collationName;
+    public final Collator collator;
+    public final Comparator<UTF8String> comparator;
+
+    /**
+     * Version of the collation. This is the version of the ICU library used to create the collator.
+     * For non-ICU collations (e.g. UTF8 Binary) the version is set to "1.0".
+     */
+    public final String version;
+
+    /**
+     * Collation sensitive hash function. Output for two UTF8Strings will be the same if they are
+     * equal according to the collation.
+     */
+    public final Function<UTF8String, Integer> hashFunction;
+
+    /**
+     * Potentially faster way than using comparator to compare two UTF8Strings for equality.
+     * Falls back to binary comparison if the collation is binary.
+     */
+    public final BiFunction<UTF8String, UTF8String, Boolean> equalsFunction;
+
+    /**
+     * Binary collation implies that UTF8Strings are considered equal only if they are
+     * byte for byte equal. All accent or case-insensitive collations are considered non-binary.
+     */
+    public final boolean isBinaryCollation;
+
+    public CollationInfo(
+      String collationName,
+      Collator collator,
+      Comparator<UTF8String> comparator,
+      String version,
+      Function<UTF8String, Integer> hashFunction,
+      boolean isBinaryCollation) {
+      this.collationName = collationName;
+      this.collator = collator;
+      this.comparator = comparator;
+      this.version = version;
+      this.hashFunction = hashFunction;
+      this.isBinaryCollation = isBinaryCollation;
+
+      if (isBinaryCollation) {
+        this.equalsFunction = UTF8String::binaryEquals;
+      } else {
+        this.equalsFunction = (s1, s2) -> this.comparator.compare(s1, s2) == 0;
+      }
+    }
+
+    /**
+     * Constructor with comparators that are inherited from the given collator.
+     */
+    public CollationInfo(
+      String collationName, Collator collator, String version, boolean isBinaryCollation) {
+      this(
+        collationName,
+        collator,
+        (s1, s2) -> collator.compare(s1.toString(), s2.toString()),
+        version,
+        s -> collator.getCollationKey(s.toString()).hashCode(),
+        isBinaryCollation);
+    }
+  }
+
+  private final CollationInfo[] collatorTable;
+  private final Hashtable<String, Integer> collationNameToIdMap = new Hashtable<>();
+
+  private CollationFactory() {
+    collatorTable = new CollationInfo[4];
+
+    // Binary comparison. This is the default collation.
+    // No custom comparators will be used for this collation.
+    // Instead, we rely on byte for byte comparison.
+    collatorTable[0] = new CollationInfo(
+      "UCS_BASIC",
+      null,
+      UTF8String::binaryCompare,
+      "1.0",
+      UTF8String::binaryHash,
+      true);
+
+    // Case-insensitive UTF8 binary collation.
+    // TODO: Do in place comparisons instead of creating new strings.
+    collatorTable[1] = new CollationInfo(
+      "UCS_BASIC_LCASE",
+      null,
+      (s1, s2) -> s1.toLowerCase().binaryCompare(s2.toLowerCase()), "1.0",
+      (s) -> s.toLowerCase().binaryHash(),
+      false);
+
+    // UNICODE case sensitive comparison (ROOT locale, in ICU).
+    collatorTable[2] = new CollationInfo(
+      "UNICODE", Collator.getInstance(ULocale.ROOT), "153.120.0.0", true);
+    collatorTable[2].collator.setStrength(Collator.TERTIARY);
+
+
+    // UNICODE case-insensitive comparison (ROOT locale, in ICU + Secondary strength).
+    collatorTable[3] = new CollationInfo(
+            "UNICODE_CI", Collator.getInstance(ULocale.ROOT), "153.120.0.0", false);
+    collatorTable[3].collator.setStrength(Collator.SECONDARY);
+
+    for (int i = 0; i < collatorTable.length; i++) {
+      this.collationNameToIdMap.put(collatorTable[i].collationName, i);
+    }
+  }
+
+  private static final CollationFactory instance = new CollationFactory();
+
+  /**
+   * Returns the collation id for the given collation name.
+   */
+  public int collationNameToId(String collationName) {
+    String normalizedName = collationName.toUpperCase();
+    if (collationNameToIdMap.containsKey(normalizedName)) {
+      return collationNameToIdMap.get(normalizedName);
+    } else {
+      throw SparkIllegalArgumentException.apply(
+        "COLLATION_INVALID_NAME", Collections.singletonMap("collationName", collationName));
+    }
+  }
+
+  public CollationInfo fetchCollationInfo(int collationId) {
+    return collatorTable[collationId];
+  }
+
+  public CollationInfo fetchCollationInfo(String collationName) {
+    int collationId = collationNameToId(collationName);
+    return collatorTable[collationId];
+  }
+
+  public static CollationFactory getInstance() {

Review Comment:
   In Java, it's more common to make all methods static so that you don't need to get an instance of it.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46833][SQL] Collations - Introducing CollationFactory which provides comparison and hashing rules for supported collations [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #44968:
URL: https://github.com/apache/spark/pull/44968#discussion_r1475681690


##########
common/utils/src/main/scala/org/apache/spark/SparkException.scala:
##########
@@ -121,6 +121,16 @@ object SparkException {
       throw new SparkIllegalArgumentException(errorClass, messageParameters)
     }
   }
+
+  /**
+   * Utility method to construct message params from Java Map.
+   * @param messageParameters The Java Map.
+   * @return Scala collection that can be passed to SparkException constructor.
+   */
+  def constructMessageParams(
+    messageParameters: java.util.Map[String, String]): Map[String, String] = {

Review Comment:
   ditto, 4 spaces indentation for parameters



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46833][SQL] Collations - Introducing CollationFactory which provides comparison and hashing rules for supported collations [spark]

Posted by "MaxGekk (via GitHub)" <gi...@apache.org>.
MaxGekk commented on code in PR #44968:
URL: https://github.com/apache/spark/pull/44968#discussion_r1476032744


##########
common/utils/src/main/resources/error/error-classes.json:
##########
@@ -451,6 +451,12 @@
     ],
     "sqlState" : "42704"
   },
+  "COLLATION_INVALID_NAME" : {
+    "message" : [
+      "The value <collationName> does not represent correct collation name."
+    ],
+    "sqlState" : "22018"

Review Comment:
   > I just checked PGSql and it returns "42704 - undefined_object Should we do the same?
   
   The `42` SQLSTATE class seems reasonable: `"42": "Syntax Error or Access Rule Violation",`. The `42704` itself is a little bit generic but might be suitable. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46833][SQL] Collations - Introducing CollationFactory which provides comparison and hashing rules for supported collations [spark]

Posted by "dbatomic (via GitHub)" <gi...@apache.org>.
dbatomic commented on code in PR #44968:
URL: https://github.com/apache/spark/pull/44968#discussion_r1476405059


##########
common/unsafe/src/test/scala/org/apache/spark/unsafe/types/CollationFactorySuite.scala:
##########
@@ -0,0 +1,114 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.unsafe.types
+
+import org.apache.spark.SparkException
+// scalastyle:off
+import org.scalatest.funsuite.AnyFunSuite
+import org.scalatest.matchers.must.Matchers
+
+import org.apache.spark.sql.catalyst.util.CollationFactory._
+import org.apache.spark.unsafe.types.UTF8String.{fromString => toUTF8}
+
+class CollationFactorySuite extends AnyFunSuite with Matchers { // scalastyle:ignore funsuite
+  test("collationId stability") {
+    val ucsBasic = fetchCollation(0)
+    assert(ucsBasic.collationName == "UCS_BASIC")
+    assert(ucsBasic.isBinaryCollation)
+
+    val ucsBasicLcase = fetchCollation(1)
+    assert(ucsBasicLcase.collationName == "UCS_BASIC_LCASE")
+    assert(!ucsBasicLcase.isBinaryCollation)
+
+    val unicode = fetchCollation(2)
+    assert(unicode.collationName == "UNICODE")
+    assert(unicode.isBinaryCollation);
+
+    val unicodeCi = fetchCollation(3)
+    assert(unicodeCi.collationName == "UNICODE_CI")
+    assert(!unicodeCi.isBinaryCollation)
+  }
+
+  test("fetch invalid collation name") {
+    val error = intercept[SparkException] {
+      fetchCollation("UCS_BASIS")
+    }
+
+    assert(error.getMessage.contains(
+      "The value UCS_BASIS does not represent a correct collation name."))
+    assert(error.getMessage.contains("Suggested valid collation name: [UCS_BASIC]"))

Review Comment:
   This suite is in `common` so I can't reference `SparkFunSuite` which is in `core`. I already had the same discussion with Wenchen in another comment thread. These tests are derived from pure `AnyFunSuite` (same as other scala tests under `common`.
   
   Once we add `CollationSuite` in core catalyst and open surface area for collations I will add the proper `checkError` tests there. That should take care of other languages as well.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46833][SQL] Collations - Introducing CollationFactory which provides comparison and hashing rules for supported collations [spark]

Posted by "dbatomic (via GitHub)" <gi...@apache.org>.
dbatomic commented on code in PR #44968:
URL: https://github.com/apache/spark/pull/44968#discussion_r1472598191


##########
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationFactory.java:
##########
@@ -0,0 +1,171 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.sql.catalyst.util;
+
+import java.util.*;
+import java.util.function.BiFunction;
+import java.util.function.Function;
+
+import com.ibm.icu.util.ULocale;
+import com.ibm.icu.text.Collator;
+
+import org.apache.spark.SparkIllegalArgumentException;
+import org.apache.spark.unsafe.types.UTF8String;
+
+/**
+ * Static entry point for collation aware string functions.
+ * Provides functionality to the UTF8String object which respects defined collation settings.
+ */
+public final class CollationFactory {
+  /**
+   * Entry encapsulating all information about a collation.
+   */
+  public static class CollationInfo {
+    public final String collationName;
+    public final Collator collator;
+    public final Comparator<UTF8String> comparator;
+
+    /**
+     * Version of the collation. This is the version of the ICU library used to create the collator.
+     * For non-ICU collations (e.g. UTF8 Binary) the version is set to "1.0".
+     */
+    public final String version;
+
+    /**
+     * Collation sensitive hash function. Output for two UTF8Strings will be the same if they are
+     * equal according to the collation.
+     */
+    public final Function<UTF8String, Integer> hashFunction;
+
+    /**
+     * Potentially faster way than using comparator to compare two UTF8Strings for equality.
+     * Falls back to binary comparison if the collation is binary.
+     */
+    public final BiFunction<UTF8String, UTF8String, Boolean> equalsFunction;
+
+    /**
+     * Binary collation implies that UTF8Strings are considered equal only if they are
+     * byte for byte equal. All accent or case-insensitive collations are considered non-binary.
+     */
+    public final boolean isBinaryCollation;
+
+    public CollationInfo(
+      String collationName,
+      Collator collator,
+      Comparator<UTF8String> comparator,
+      String version,
+      Function<UTF8String, Integer> hashFunction,
+      boolean isBinaryCollation) {
+      this.collationName = collationName;
+      this.collator = collator;
+      this.comparator = comparator;
+      this.version = version;
+      this.hashFunction = hashFunction;
+      this.isBinaryCollation = isBinaryCollation;
+
+      if (isBinaryCollation) {
+        this.equalsFunction = UTF8String::binaryEquals;
+      } else {
+        this.equalsFunction = (s1, s2) -> this.comparator.compare(s1, s2) == 0;
+      }
+    }
+
+    /**
+     * Constructor with comparators that are inherited from the given collator.
+     */
+    public CollationInfo(
+      String collationName, Collator collator, String version, boolean isBinaryCollation) {
+      this(
+        collationName,
+        collator,
+        (s1, s2) -> collator.compare(s1.toString(), s2.toString()),
+        version,
+        s -> collator.getCollationKey(s.toString()).hashCode(),
+        isBinaryCollation);
+    }
+  }
+
+  private final CollationInfo[] collatorTable;
+  private final Hashtable<String, Integer> collationNameToIdMap = new Hashtable<>();
+
+  private CollationFactory() {
+    collatorTable = new CollationInfo[4];
+
+    // Binary comparison. This is the default collation.
+    // No custom comparators will be used for this collation.
+    // Instead, we rely on byte for byte comparison.
+    collatorTable[0] = new CollationInfo(
+      "UCS_BASIC",
+      null,
+      UTF8String::binaryCompare,
+      "1.0",
+      UTF8String::binaryHash,
+      true);
+
+    // Case-insensitive UTF8 binary collation.
+    // TODO: Do in place comparisons instead of creating new strings.
+    collatorTable[1] = new CollationInfo(
+      "UCS_BASIC_LCASE",
+      null,
+      (s1, s2) -> s1.toLowerCase().binaryCompare(s2.toLowerCase()), "1.0",
+      (s) -> s.toLowerCase().binaryHash(),
+      false);
+
+    // UNICODE case sensitive comparison (ROOT locale, in ICU).
+    collatorTable[2] = new CollationInfo(
+      "UNICODE", Collator.getInstance(ULocale.ROOT), "153.120.0.0", true);
+    collatorTable[2].collator.setStrength(Collator.TERTIARY);
+
+
+    // UNICODE case-insensitive comparison (ROOT locale, in ICU + Secondary strength).
+    collatorTable[3] = new CollationInfo(
+            "UNICODE_CI", Collator.getInstance(ULocale.ROOT), "153.120.0.0", false);
+    collatorTable[3].collator.setStrength(Collator.SECONDARY);
+
+    for (int i = 0; i < collatorTable.length; i++) {
+      this.collationNameToIdMap.put(collatorTable[i].collationName, i);
+    }
+  }
+
+  private static final CollationFactory instance = new CollationFactory();
+
+  /**
+   * Returns the collation id for the given collation name.
+   */
+  public int collationNameToId(String collationName) {
+    String normalizedName = collationName.toUpperCase();
+    if (collationNameToIdMap.containsKey(normalizedName)) {
+      return collationNameToIdMap.get(normalizedName);
+    } else {
+      throw SparkIllegalArgumentException.apply(
+        "COLLATION_INVALID_NAME", Collections.singletonMap("collationName", collationName));
+    }
+  }
+
+  public CollationInfo fetchCollationInfo(int collationId) {
+    return collatorTable[collationId];
+  }
+
+  public CollationInfo fetchCollationInfo(String collationName) {
+    int collationId = collationNameToId(collationName);
+    return collatorTable[collationId];
+  }
+
+  public static CollationFactory getInstance() {

Review Comment:
   @cloud-fan and @MaxGekk - looking for feedback here on proper design. `CollationFactory` is supposed to be readonly per JVM object. Let me know if singleton pattern is the proper approach here or this object should be nested into some other global object (e.g. `SparkEnv`?)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46833][SQL] Collations - Introducing CollationFactory which provides comparison and hashing rules for supported collations [spark]

Posted by "dbatomic (via GitHub)" <gi...@apache.org>.
dbatomic commented on code in PR #44968:
URL: https://github.com/apache/spark/pull/44968#discussion_r1474661370


##########
common/unsafe/src/test/java/org/apache/spark/unsafe/types/CollationFactorySuite.java:
##########
@@ -0,0 +1,135 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.unsafe.types;
+
+import org.apache.spark.SparkIllegalArgumentException;
+import org.junit.jupiter.api.Test;
+
+import static org.junit.jupiter.api.Assertions.*;
+import static org.apache.spark.sql.catalyst.util.CollationFactory.*;
+import static org.apache.spark.unsafe.types.UTF8String.*;
+
+import java.util.Arrays;
+
+public class CollationFactorySuite {
+    @Test
+    public void collationIdStability() {
+      CollationInfo collationInfo = getInstance().fetchCollationInfo(0);
+      assertEquals(collationInfo.collationName, "UCS_BASIC");
+      assertTrue(collationInfo.isBinaryCollation);
+
+      collationInfo = getInstance().fetchCollationInfo(1);
+      assertEquals(collationInfo.collationName, "UCS_BASIC_LCASE");
+      assertFalse(collationInfo.isBinaryCollation);
+
+      collationInfo = getInstance().fetchCollationInfo(2);
+      assertEquals(collationInfo.collationName, "UNICODE");
+      assertTrue(collationInfo.isBinaryCollation);
+
+      collationInfo = getInstance().fetchCollationInfo(3);
+      assertEquals(collationInfo.collationName, "UNICODE_CI");
+      assertFalse(collationInfo.isBinaryCollation);
+    }
+
+    @Test
+    public void collationFetchInvalidName() {
+      assertThrows(

Review Comment:
   I rewrote it to scala. The issue is that `SparkFunSuite` is in `SparkCore` while SparkCollation probably should go in `common`. For now I am using pure `AnyFunSuite` until further advice.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46833][SQL] Collations - Introducing CollationFactory which provides comparison and hashing rules for supported collations [spark]

Posted by "dbatomic (via GitHub)" <gi...@apache.org>.
dbatomic commented on code in PR #44968:
URL: https://github.com/apache/spark/pull/44968#discussion_r1475744499


##########
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationFactory.java:
##########
@@ -0,0 +1,171 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.sql.catalyst.util;
+
+import java.util.*;
+import java.util.function.BiFunction;
+import java.util.function.Function;
+
+import com.ibm.icu.util.ULocale;
+import com.ibm.icu.text.Collator;
+
+import org.apache.spark.SparkIllegalArgumentException;
+import org.apache.spark.unsafe.types.UTF8String;
+
+/**
+ * Static entry point for collation aware string functions.
+ * Provides functionality to the UTF8String object which respects defined collation settings.
+ */
+public final class CollationFactory {
+  /**
+   * Entry encapsulating all information about a collation.
+   */
+  public static class CollationInfo {
+    public final String collationName;
+    public final Collator collator;
+    public final Comparator<UTF8String> comparator;
+
+    /**
+     * Version of the collation. This is the version of the ICU library used to create the collator.
+     * For non-ICU collations (e.g. UTF8 Binary) the version is set to "1.0".
+     */
+    public final String version;
+
+    /**
+     * Collation sensitive hash function. Output for two UTF8Strings will be the same if they are
+     * equal according to the collation.
+     */
+    public final Function<UTF8String, Integer> hashFunction;
+
+    /**
+     * Potentially faster way than using comparator to compare two UTF8Strings for equality.
+     * Falls back to binary comparison if the collation is binary.
+     */
+    public final BiFunction<UTF8String, UTF8String, Boolean> equalsFunction;
+
+    /**
+     * Binary collation implies that UTF8Strings are considered equal only if they are
+     * byte for byte equal. All accent or case-insensitive collations are considered non-binary.
+     */
+    public final boolean isBinaryCollation;
+
+    public CollationInfo(
+      String collationName,

Review Comment:
   Ah, so you are saying that for arguments, when they go to new line, we should use 4 spaces?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46833][SQL] Collations - Introducing CollationFactory which provides comparison and hashing rules for supported collations [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #44968:
URL: https://github.com/apache/spark/pull/44968#discussion_r1474549369


##########
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationFactory.java:
##########
@@ -0,0 +1,171 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.sql.catalyst.util;
+
+import java.util.*;
+import java.util.function.BiFunction;
+import java.util.function.Function;
+
+import com.ibm.icu.util.ULocale;
+import com.ibm.icu.text.Collator;
+
+import org.apache.spark.SparkIllegalArgumentException;
+import org.apache.spark.unsafe.types.UTF8String;
+
+/**
+ * Static entry point for collation aware string functions.
+ * Provides functionality to the UTF8String object which respects defined collation settings.
+ */
+public final class CollationFactory {
+  /**
+   * Entry encapsulating all information about a collation.
+   */
+  public static class CollationInfo {
+    public final String collationName;
+    public final Collator collator;
+    public final Comparator<UTF8String> comparator;
+
+    /**
+     * Version of the collation. This is the version of the ICU library used to create the collator.
+     * For non-ICU collations (e.g. UTF8 Binary) the version is set to "1.0".
+     */
+    public final String version;
+
+    /**
+     * Collation sensitive hash function. Output for two UTF8Strings will be the same if they are
+     * equal according to the collation.
+     */
+    public final Function<UTF8String, Integer> hashFunction;
+
+    /**
+     * Potentially faster way than using comparator to compare two UTF8Strings for equality.
+     * Falls back to binary comparison if the collation is binary.
+     */
+    public final BiFunction<UTF8String, UTF8String, Boolean> equalsFunction;
+
+    /**
+     * Binary collation implies that UTF8Strings are considered equal only if they are
+     * byte for byte equal. All accent or case-insensitive collations are considered non-binary.
+     */
+    public final boolean isBinaryCollation;
+
+    public CollationInfo(
+      String collationName,
+      Collator collator,
+      Comparator<UTF8String> comparator,
+      String version,
+      Function<UTF8String, Integer> hashFunction,
+      boolean isBinaryCollation) {
+      this.collationName = collationName;
+      this.collator = collator;
+      this.comparator = comparator;
+      this.version = version;
+      this.hashFunction = hashFunction;
+      this.isBinaryCollation = isBinaryCollation;
+
+      if (isBinaryCollation) {
+        this.equalsFunction = UTF8String::binaryEquals;
+      } else {
+        this.equalsFunction = (s1, s2) -> this.comparator.compare(s1, s2) == 0;
+      }
+    }
+
+    /**
+     * Constructor with comparators that are inherited from the given collator.
+     */
+    public CollationInfo(
+      String collationName, Collator collator, String version, boolean isBinaryCollation) {
+      this(
+        collationName,
+        collator,
+        (s1, s2) -> collator.compare(s1.toString(), s2.toString()),
+        version,
+        s -> collator.getCollationKey(s.toString()).hashCode(),
+        isBinaryCollation);
+    }
+  }
+
+  private final CollationInfo[] collatorTable;
+  private final Hashtable<String, Integer> collationNameToIdMap = new Hashtable<>();

Review Comment:
   It's more common to use `HashMap` in java.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46833][SQL] Collations - Introducing CollationFactory which provides comparison and hashing rules for supported collations [spark]

Posted by "MaxGekk (via GitHub)" <gi...@apache.org>.
MaxGekk commented on code in PR #44968:
URL: https://github.com/apache/spark/pull/44968#discussion_r1475711921


##########
common/utils/src/main/resources/error/error-classes.json:
##########
@@ -451,6 +451,12 @@
     ],
     "sqlState" : "42704"
   },
+  "COLLATION_INVALID_NAME" : {
+    "message" : [
+      "The value <collationName> does not represent correct collation name."
+    ],
+    "sqlState" : "22018"

Review Comment:
   Could you elaborate a little bit why did you choose the state:
   ```json
       "22018": {
           "description": "invalid character value for cast",
           "origin": "SQL/Foundation",
           "standard": "Y",
           "usedBy": ["SQL/Foundation", "PostgreSQL", "DB2", "Redshift", "Oracle", "SQL Server"]
       },
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46833][SQL] Collations - Introducing CollationFactory which provides comparison and hashing rules for supported collations [spark]

Posted by "MaxGekk (via GitHub)" <gi...@apache.org>.
MaxGekk commented on code in PR #44968:
URL: https://github.com/apache/spark/pull/44968#discussion_r1476032744


##########
common/utils/src/main/resources/error/error-classes.json:
##########
@@ -451,6 +451,12 @@
     ],
     "sqlState" : "42704"
   },
+  "COLLATION_INVALID_NAME" : {
+    "message" : [
+      "The value <collationName> does not represent correct collation name."
+    ],
+    "sqlState" : "22018"

Review Comment:
   > I just checked PGSql and it returns "42704 - undefined_object Should we do the same?
   
   The `42` SQLSTATE class seems reasonable: `"42": "Syntax Error or Access Rule Violation",`. The `42704` itself is a little bit generic but might be suitable. 
   
   @dbatomic Let's change 22018 to 42704.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46833][SQL] Collations - Introducing CollationFactory which provides comparison and hashing rules for supported collations [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #44968:
URL: https://github.com/apache/spark/pull/44968#discussion_r1474543494


##########
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationFactory.java:
##########
@@ -0,0 +1,171 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.sql.catalyst.util;
+
+import java.util.*;
+import java.util.function.BiFunction;
+import java.util.function.Function;
+
+import com.ibm.icu.util.ULocale;
+import com.ibm.icu.text.Collator;
+
+import org.apache.spark.SparkIllegalArgumentException;
+import org.apache.spark.unsafe.types.UTF8String;
+
+/**
+ * Static entry point for collation aware string functions.
+ * Provides functionality to the UTF8String object which respects defined collation settings.
+ */
+public final class CollationFactory {
+  /**
+   * Entry encapsulating all information about a collation.
+   */
+  public static class CollationInfo {
+    public final String collationName;
+    public final Collator collator;
+    public final Comparator<UTF8String> comparator;
+
+    /**
+     * Version of the collation. This is the version of the ICU library used to create the collator.
+     * For non-ICU collations (e.g. UTF8 Binary) the version is set to "1.0".
+     */
+    public final String version;
+
+    /**
+     * Collation sensitive hash function. Output for two UTF8Strings will be the same if they are
+     * equal according to the collation.
+     */
+    public final Function<UTF8String, Integer> hashFunction;
+
+    /**
+     * Potentially faster way than using comparator to compare two UTF8Strings for equality.
+     * Falls back to binary comparison if the collation is binary.
+     */
+    public final BiFunction<UTF8String, UTF8String, Boolean> equalsFunction;
+
+    /**
+     * Binary collation implies that UTF8Strings are considered equal only if they are
+     * byte for byte equal. All accent or case-insensitive collations are considered non-binary.
+     */
+    public final boolean isBinaryCollation;
+
+    public CollationInfo(
+      String collationName,
+      Collator collator,
+      Comparator<UTF8String> comparator,
+      String version,
+      Function<UTF8String, Integer> hashFunction,
+      boolean isBinaryCollation) {
+      this.collationName = collationName;
+      this.collator = collator;
+      this.comparator = comparator;
+      this.version = version;
+      this.hashFunction = hashFunction;
+      this.isBinaryCollation = isBinaryCollation;
+
+      if (isBinaryCollation) {
+        this.equalsFunction = UTF8String::binaryEquals;
+      } else {
+        this.equalsFunction = (s1, s2) -> this.comparator.compare(s1, s2) == 0;
+      }
+    }
+
+    /**
+     * Constructor with comparators that are inherited from the given collator.
+     */
+    public CollationInfo(
+      String collationName, Collator collator, String version, boolean isBinaryCollation) {
+      this(
+        collationName,
+        collator,
+        (s1, s2) -> collator.compare(s1.toString(), s2.toString()),
+        version,
+        s -> collator.getCollationKey(s.toString()).hashCode(),
+        isBinaryCollation);
+    }
+  }
+
+  private final CollationInfo[] collatorTable;
+  private final Hashtable<String, Integer> collationNameToIdMap = new Hashtable<>();
+
+  private CollationFactory() {
+    collatorTable = new CollationInfo[4];
+
+    // Binary comparison. This is the default collation.
+    // No custom comparators will be used for this collation.
+    // Instead, we rely on byte for byte comparison.
+    collatorTable[0] = new CollationInfo(
+      "UCS_BASIC",
+      null,
+      UTF8String::binaryCompare,
+      "1.0",
+      UTF8String::binaryHash,
+      true);
+
+    // Case-insensitive UTF8 binary collation.
+    // TODO: Do in place comparisons instead of creating new strings.
+    collatorTable[1] = new CollationInfo(
+      "UCS_BASIC_LCASE",
+      null,
+      (s1, s2) -> s1.toLowerCase().binaryCompare(s2.toLowerCase()), "1.0",
+      (s) -> s.toLowerCase().binaryHash(),
+      false);
+
+    // UNICODE case sensitive comparison (ROOT locale, in ICU).
+    collatorTable[2] = new CollationInfo(
+      "UNICODE", Collator.getInstance(ULocale.ROOT), "153.120.0.0", true);

Review Comment:
   how do we get the version number?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46833][SQL] Collations - Introducing CollationFactory which provides comparison and hashing rules for supported collations [spark]

Posted by "dbatomic (via GitHub)" <gi...@apache.org>.
dbatomic commented on code in PR #44968:
URL: https://github.com/apache/spark/pull/44968#discussion_r1475847921


##########
common/unsafe/src/test/scala/org/apache/spark/unsafe/types/CollationFactorySuite.scala:
##########
@@ -0,0 +1,110 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.unsafe.types
+
+import org.apache.spark.SparkException
+// scalastyle:off
+import org.scalatest.funsuite.AnyFunSuite
+import org.scalatest.matchers.must.Matchers
+
+import org.apache.spark.sql.catalyst.util.CollationFactory.getInstance
+import org.apache.spark.unsafe.types.UTF8String.{fromString => toUTF8}
+
+class CollationFactorySuite extends AnyFunSuite with Matchers { // scalastyle:ignore funsuite

Review Comment:
   Talked offline about this. We are staying with `AnyFunSuite` since this is in `common` and we `SparkFunSuite` here.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46833][SQL] Collations - Introducing CollationFactory which provides comparison and hashing rules for supported collations [spark]

Posted by "dbatomic (via GitHub)" <gi...@apache.org>.
dbatomic commented on code in PR #44968:
URL: https://github.com/apache/spark/pull/44968#discussion_r1474593790


##########
common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java:
##########
@@ -1388,28 +1388,43 @@ public UTF8String copy() {
     return fromBytes(bytes);
   }
 
+  /**
+   * Collation aware comparison between two UTF8 strings.

Review Comment:
   Sure, I will remove this part from PR.
   
   Also, if we decide to keep UTF8String as pure data class, I think that we should also remove `Comparable<UTF8String>` interface from it's declaration, given that these methods will no longer be valid.
   
   Also, note that there are other functions that will have to find a new place (e.g. `contains`, `matchAt`, `startsWith`, `endsWith`, `findInSet`, `indexOf`, `find`, `rfind`, `split`, `replace`) since all of them should work with both data and collation.
   
   i.e. pretty much anything that represents a relation between two or more UTF8String objects can't live anymore in UTF8String class if we don't push collationInfo into UTF8String.
   
   Anyhow, let's talk about this tomorrow. For now I will remove any changes to the UTF8String.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46833][SQL] Collations - Introducing CollationFactory which provides comparison and hashing rules for supported collations [spark]

Posted by "dbatomic (via GitHub)" <gi...@apache.org>.
dbatomic commented on code in PR #44968:
URL: https://github.com/apache/spark/pull/44968#discussion_r1474699466


##########
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationFactory.java:
##########
@@ -0,0 +1,171 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.sql.catalyst.util;
+
+import java.util.*;
+import java.util.function.BiFunction;
+import java.util.function.Function;
+
+import com.ibm.icu.util.ULocale;
+import com.ibm.icu.text.Collator;
+
+import org.apache.spark.SparkIllegalArgumentException;
+import org.apache.spark.unsafe.types.UTF8String;
+
+/**
+ * Static entry point for collation aware string functions.
+ * Provides functionality to the UTF8String object which respects defined collation settings.
+ */
+public final class CollationFactory {
+  /**
+   * Entry encapsulating all information about a collation.
+   */
+  public static class CollationInfo {

Review Comment:
   Done.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46833][SQL] Collations - Introducing CollationFactory which provides comparison and hashing rules for supported collations [spark]

Posted by "dbatomic (via GitHub)" <gi...@apache.org>.
dbatomic commented on code in PR #44968:
URL: https://github.com/apache/spark/pull/44968#discussion_r1475740536


##########
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationFactory.java:
##########
@@ -0,0 +1,171 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.sql.catalyst.util;
+
+import java.util.*;
+import java.util.function.BiFunction;
+import java.util.function.Function;
+
+import com.ibm.icu.util.ULocale;
+import com.ibm.icu.text.Collator;
+
+import org.apache.spark.SparkIllegalArgumentException;
+import org.apache.spark.unsafe.types.UTF8String;
+
+/**
+ * Static entry point for collation aware string functions.
+ * Provides functionality to the UTF8String object which respects defined collation settings.
+ */
+public final class CollationFactory {
+  /**
+   * Entry encapsulating all information about a collation.
+   */
+  public static class CollationInfo {
+    public final String collationName;
+    public final Collator collator;
+    public final Comparator<UTF8String> comparator;
+
+    /**
+     * Version of the collation. This is the version of the ICU library used to create the collator.
+     * For non-ICU collations (e.g. UTF8 Binary) the version is set to "1.0".
+     */
+    public final String version;
+
+    /**
+     * Collation sensitive hash function. Output for two UTF8Strings will be the same if they are
+     * equal according to the collation.
+     */
+    public final Function<UTF8String, Integer> hashFunction;
+
+    /**
+     * Potentially faster way than using comparator to compare two UTF8Strings for equality.
+     * Falls back to binary comparison if the collation is binary.
+     */
+    public final BiFunction<UTF8String, UTF8String, Boolean> equalsFunction;
+
+    /**
+     * Binary collation implies that UTF8Strings are considered equal only if they are
+     * byte for byte equal. All accent or case-insensitive collations are considered non-binary.
+     */
+    public final boolean isBinaryCollation;
+
+    public CollationInfo(
+      String collationName,
+      Collator collator,
+      Comparator<UTF8String> comparator,
+      String version,
+      Function<UTF8String, Integer> hashFunction,
+      boolean isBinaryCollation) {
+      this.collationName = collationName;
+      this.collator = collator;
+      this.comparator = comparator;
+      this.version = version;
+      this.hashFunction = hashFunction;
+      this.isBinaryCollation = isBinaryCollation;
+
+      if (isBinaryCollation) {
+        this.equalsFunction = UTF8String::binaryEquals;
+      } else {
+        this.equalsFunction = (s1, s2) -> this.comparator.compare(s1, s2) == 0;
+      }
+    }
+
+    /**
+     * Constructor with comparators that are inherited from the given collator.
+     */
+    public CollationInfo(
+      String collationName, Collator collator, String version, boolean isBinaryCollation) {
+      this(
+        collationName,
+        collator,
+        (s1, s2) -> collator.compare(s1.toString(), s2.toString()),
+        version,
+        s -> collator.getCollationKey(s.toString()).hashCode(),
+        isBinaryCollation);
+    }
+  }
+
+  private final CollationInfo[] collatorTable;
+  private final Hashtable<String, Integer> collationNameToIdMap = new Hashtable<>();
+
+  private CollationFactory() {
+    collatorTable = new CollationInfo[4];
+
+    // Binary comparison. This is the default collation.
+    // No custom comparators will be used for this collation.
+    // Instead, we rely on byte for byte comparison.
+    collatorTable[0] = new CollationInfo(
+      "UCS_BASIC",
+      null,
+      UTF8String::binaryCompare,
+      "1.0",
+      UTF8String::binaryHash,
+      true);
+
+    // Case-insensitive UTF8 binary collation.
+    // TODO: Do in place comparisons instead of creating new strings.
+    collatorTable[1] = new CollationInfo(
+      "UCS_BASIC_LCASE",
+      null,
+      (s1, s2) -> s1.toLowerCase().binaryCompare(s2.toLowerCase()), "1.0",
+      (s) -> s.toLowerCase().binaryHash(),
+      false);
+
+    // UNICODE case sensitive comparison (ROOT locale, in ICU).
+    collatorTable[2] = new CollationInfo(
+      "UNICODE", Collator.getInstance(ULocale.ROOT), "153.120.0.0", true);
+    collatorTable[2].collator.setStrength(Collator.TERTIARY);
+
+
+    // UNICODE case-insensitive comparison (ROOT locale, in ICU + Secondary strength).
+    collatorTable[3] = new CollationInfo(
+            "UNICODE_CI", Collator.getInstance(ULocale.ROOT), "153.120.0.0", false);
+    collatorTable[3].collator.setStrength(Collator.SECONDARY);
+
+    for (int i = 0; i < collatorTable.length; i++) {
+      this.collationNameToIdMap.put(collatorTable[i].collationName, i);
+    }
+  }
+
+  private static final CollationFactory instance = new CollationFactory();
+
+  /**
+   * Returns the collation id for the given collation name.
+   */
+  public int collationNameToId(String collationName) {
+    String normalizedName = collationName.toUpperCase();
+    if (collationNameToIdMap.containsKey(normalizedName)) {
+      return collationNameToIdMap.get(normalizedName);
+    } else {
+      throw SparkIllegalArgumentException.apply(
+        "COLLATION_INVALID_NAME", Collections.singletonMap("collationName", collationName));
+    }
+  }
+
+  public CollationInfo fetchCollationInfo(int collationId) {
+    return collatorTable[collationId];
+  }
+
+  public CollationInfo fetchCollationInfo(String collationName) {
+    int collationId = collationNameToId(collationName);
+    return collatorTable[collationId];
+  }
+
+  public static CollationFactory getInstance() {

Review Comment:
   Ok, going with all static approach.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46833][SQL] Collations - Introducing CollationFactory which provides comparison and hashing rules for supported collations [spark]

Posted by "dbatomic (via GitHub)" <gi...@apache.org>.
dbatomic commented on code in PR #44968:
URL: https://github.com/apache/spark/pull/44968#discussion_r1476413709


##########
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationFactory.java:
##########
@@ -0,0 +1,174 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.sql.catalyst.util;
+
+import java.util.*;
+import java.util.function.BiFunction;
+import java.util.function.ToLongFunction;
+
+import com.ibm.icu.util.ULocale;
+import com.ibm.icu.text.Collator;
+
+import org.apache.spark.SparkException;
+import org.apache.spark.unsafe.types.UTF8String;
+
+/**
+ * Static entry point for collation aware string functions.
+ * Provides functionality to the UTF8String object which respects defined collation settings.
+ */
+public final class CollationFactory {
+  /**
+   * Entry encapsulating all information about a collation.
+   */
+  public static class Collation {
+    public final String collationName;
+    public final Collator collator;
+    public final Comparator<UTF8String> comparator;
+
+    /**
+     * Version of the collation. This is the version of the ICU library Collator.
+     * For non-ICU collations (e.g. UTF8 Binary) the version is set to "1.0".
+     * When using ICU Collator this version is exposed through collator.getVersion().
+     * Whenever the collation is updated, the version should be updated as well or kept
+     * for backwards compatibility.
+     */
+    public final String version;
+
+    /**
+     * Collation sensitive hash function. Output for two UTF8Strings will be the same if they are
+     * equal according to the collation.
+     */
+    public final ToLongFunction<UTF8String> hashFunction;
+
+    /**
+     * Potentially faster way than using comparator to compare two UTF8Strings for equality.
+     * Falls back to binary comparison if the collation is binary.
+     */
+    public final BiFunction<UTF8String, UTF8String, Boolean> equalsFunction;
+
+    /**
+     * Binary collation implies that UTF8Strings are considered equal only if they are
+     * byte for byte equal. All accent or case-insensitive collations are considered non-binary.
+     */
+    public final boolean isBinaryCollation;
+
+    public Collation(
+        String collationName,
+        Collator collator,
+        Comparator<UTF8String> comparator,
+        String version,
+        ToLongFunction<UTF8String> hashFunction,
+        boolean isBinaryCollation) {
+      this.collationName = collationName;
+      this.collator = collator;
+      this.comparator = comparator;
+      this.version = version;
+      this.hashFunction = hashFunction;
+      this.isBinaryCollation = isBinaryCollation;
+
+      if (isBinaryCollation) {
+        this.equalsFunction = UTF8String::equals;
+      } else {
+        this.equalsFunction = (s1, s2) -> this.comparator.compare(s1, s2) == 0;
+      }
+    }
+
+    /**
+     * Constructor with comparators that are inherited from the given collator.
+     */
+    public Collation(
+        String collationName, Collator collator, String version, boolean isBinaryCollation) {
+      this(
+        collationName,
+        collator,
+        (s1, s2) -> collator.compare(s1.toString(), s2.toString()),
+        version,
+        s -> (long)collator.getCollationKey(s.toString()).hashCode(),
+        isBinaryCollation);
+    }
+  }
+
+  private static final Collation[] collationTable = new Collation[4];
+  private static final HashMap<String, Integer> collationNameToIdMap = new HashMap<>();
+
+  static {
+    // Binary comparison. This is the default collation.
+    // No custom comparators will be used for this collation.
+    // Instead, we rely on byte for byte comparison.
+    collationTable[0] = new Collation(
+      "UCS_BASIC",
+      null,
+      UTF8String::compareTo,
+      "1.0",
+      s -> (long)s.hashCode(),
+      true);
+
+    // Case-insensitive UTF8 binary collation.
+    // TODO: Do in place comparisons instead of creating new strings.
+    collationTable[1] = new Collation(
+      "UCS_BASIC_LCASE",
+      null,
+            Comparator.comparing(UTF8String::toLowerCase), "1.0",

Review Comment:
   ups, missed this one, thanks.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46833][SQL] Collations - Introducing CollationFactory which provides comparison and hashing rules for supported collations [spark]

Posted by "MaxGekk (via GitHub)" <gi...@apache.org>.
MaxGekk commented on code in PR #44968:
URL: https://github.com/apache/spark/pull/44968#discussion_r1476983522


##########
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationFactory.java:
##########
@@ -0,0 +1,175 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.sql.catalyst.util;
+
+import java.util.*;
+import java.util.function.BiFunction;
+import java.util.function.ToLongFunction;
+
+import com.ibm.icu.util.ULocale;
+import com.ibm.icu.text.Collator;
+
+import org.apache.spark.SparkException;
+import org.apache.spark.unsafe.types.UTF8String;
+
+/**
+ * Static entry point for collation aware string functions.
+ * Provides functionality to the UTF8String object which respects defined collation settings.
+ */
+public final class CollationFactory {
+  /**
+   * Entry encapsulating all information about a collation.
+   */
+  public static class Collation {
+    public final String collationName;
+    public final Collator collator;
+    public final Comparator<UTF8String> comparator;
+
+    /**
+     * Version of the collation. This is the version of the ICU library Collator.
+     * For non-ICU collations (e.g. UTF8 Binary) the version is set to "1.0".
+     * When using ICU Collator this version is exposed through collator.getVersion().
+     * Whenever the collation is updated, the version should be updated as well or kept
+     * for backwards compatibility.
+     */
+    public final String version;
+
+    /**
+     * Collation sensitive hash function. Output for two UTF8Strings will be the same if they are
+     * equal according to the collation.
+     */
+    public final ToLongFunction<UTF8String> hashFunction;
+
+    /**
+     * Potentially faster way than using comparator to compare two UTF8Strings for equality.
+     * Falls back to binary comparison if the collation is binary.
+     */
+    public final BiFunction<UTF8String, UTF8String, Boolean> equalsFunction;
+
+    /**
+     * Binary collation implies that UTF8Strings are considered equal only if they are
+     * byte for byte equal. All accent or case-insensitive collations are considered non-binary.
+     */
+    public final boolean isBinaryCollation;
+
+    public Collation(
+        String collationName,
+        Collator collator,
+        Comparator<UTF8String> comparator,
+        String version,
+        ToLongFunction<UTF8String> hashFunction,
+        boolean isBinaryCollation) {
+      this.collationName = collationName;
+      this.collator = collator;
+      this.comparator = comparator;
+      this.version = version;
+      this.hashFunction = hashFunction;
+      this.isBinaryCollation = isBinaryCollation;
+
+      if (isBinaryCollation) {
+        this.equalsFunction = UTF8String::equals;
+      } else {
+        this.equalsFunction = (s1, s2) -> this.comparator.compare(s1, s2) == 0;
+      }
+    }
+
+    /**
+     * Constructor with comparators that are inherited from the given collator.
+     */
+    public Collation(
+        String collationName, Collator collator, String version, boolean isBinaryCollation) {
+      this(
+        collationName,
+        collator,
+        (s1, s2) -> collator.compare(s1.toString(), s2.toString()),
+        version,
+        s -> (long)collator.getCollationKey(s.toString()).hashCode(),
+        isBinaryCollation);
+    }
+  }
+
+  private static final Collation[] collationTable = new Collation[4];
+  private static final HashMap<String, Integer> collationNameToIdMap = new HashMap<>();
+
+  static {
+    // Binary comparison. This is the default collation.
+    // No custom comparators will be used for this collation.
+    // Instead, we rely on byte for byte comparison.
+    collationTable[0] = new Collation(
+      "UCS_BASIC",
+      null,
+      UTF8String::compareTo,
+      "1.0",
+      s -> (long)s.hashCode(),
+      true);
+
+    // Case-insensitive UTF8 binary collation.
+    // TODO: Do in place comparisons instead of creating new strings.
+    collationTable[1] = new Collation(
+      "UCS_BASIC_LCASE",
+      null,
+      Comparator.comparing(UTF8String::toLowerCase),
+      "1.0",
+      (s) -> (long)s.toLowerCase().hashCode(),
+      false);
+
+    // UNICODE case sensitive comparison (ROOT locale, in ICU).
+    collationTable[2] = new Collation(
+      "UNICODE", Collator.getInstance(ULocale.ROOT), "153.120.0.0", true);
+    collationTable[2].collator.setStrength(Collator.TERTIARY);
+
+
+    // UNICODE case-insensitive comparison (ROOT locale, in ICU + Secondary strength).
+    collationTable[3] = new Collation(
+            "UNICODE_CI", Collator.getInstance(ULocale.ROOT), "153.120.0.0", false);

Review Comment:
   Please, fix indentation.



##########
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationFactory.java:
##########
@@ -0,0 +1,175 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.sql.catalyst.util;
+
+import java.util.*;
+import java.util.function.BiFunction;
+import java.util.function.ToLongFunction;
+
+import com.ibm.icu.util.ULocale;
+import com.ibm.icu.text.Collator;
+
+import org.apache.spark.SparkException;
+import org.apache.spark.unsafe.types.UTF8String;
+
+/**
+ * Static entry point for collation aware string functions.
+ * Provides functionality to the UTF8String object which respects defined collation settings.
+ */
+public final class CollationFactory {
+  /**
+   * Entry encapsulating all information about a collation.
+   */
+  public static class Collation {
+    public final String collationName;
+    public final Collator collator;
+    public final Comparator<UTF8String> comparator;
+
+    /**
+     * Version of the collation. This is the version of the ICU library Collator.
+     * For non-ICU collations (e.g. UTF8 Binary) the version is set to "1.0".
+     * When using ICU Collator this version is exposed through collator.getVersion().
+     * Whenever the collation is updated, the version should be updated as well or kept
+     * for backwards compatibility.
+     */
+    public final String version;
+
+    /**
+     * Collation sensitive hash function. Output for two UTF8Strings will be the same if they are
+     * equal according to the collation.
+     */
+    public final ToLongFunction<UTF8String> hashFunction;
+
+    /**
+     * Potentially faster way than using comparator to compare two UTF8Strings for equality.
+     * Falls back to binary comparison if the collation is binary.
+     */
+    public final BiFunction<UTF8String, UTF8String, Boolean> equalsFunction;
+
+    /**
+     * Binary collation implies that UTF8Strings are considered equal only if they are
+     * byte for byte equal. All accent or case-insensitive collations are considered non-binary.
+     */
+    public final boolean isBinaryCollation;
+
+    public Collation(
+        String collationName,
+        Collator collator,
+        Comparator<UTF8String> comparator,
+        String version,
+        ToLongFunction<UTF8String> hashFunction,
+        boolean isBinaryCollation) {
+      this.collationName = collationName;
+      this.collator = collator;
+      this.comparator = comparator;
+      this.version = version;
+      this.hashFunction = hashFunction;
+      this.isBinaryCollation = isBinaryCollation;
+
+      if (isBinaryCollation) {
+        this.equalsFunction = UTF8String::equals;
+      } else {
+        this.equalsFunction = (s1, s2) -> this.comparator.compare(s1, s2) == 0;
+      }
+    }
+
+    /**
+     * Constructor with comparators that are inherited from the given collator.
+     */
+    public Collation(
+        String collationName, Collator collator, String version, boolean isBinaryCollation) {
+      this(
+        collationName,
+        collator,
+        (s1, s2) -> collator.compare(s1.toString(), s2.toString()),
+        version,
+        s -> (long)collator.getCollationKey(s.toString()).hashCode(),
+        isBinaryCollation);
+    }
+  }
+
+  private static final Collation[] collationTable = new Collation[4];
+  private static final HashMap<String, Integer> collationNameToIdMap = new HashMap<>();
+
+  static {
+    // Binary comparison. This is the default collation.
+    // No custom comparators will be used for this collation.
+    // Instead, we rely on byte for byte comparison.
+    collationTable[0] = new Collation(
+      "UCS_BASIC",
+      null,
+      UTF8String::compareTo,
+      "1.0",
+      s -> (long)s.hashCode(),
+      true);
+
+    // Case-insensitive UTF8 binary collation.
+    // TODO: Do in place comparisons instead of creating new strings.
+    collationTable[1] = new Collation(
+      "UCS_BASIC_LCASE",
+      null,
+      Comparator.comparing(UTF8String::toLowerCase),
+      "1.0",
+      (s) -> (long)s.toLowerCase().hashCode(),
+      false);
+
+    // UNICODE case sensitive comparison (ROOT locale, in ICU).
+    collationTable[2] = new Collation(
+      "UNICODE", Collator.getInstance(ULocale.ROOT), "153.120.0.0", true);
+    collationTable[2].collator.setStrength(Collator.TERTIARY);
+
+
+    // UNICODE case-insensitive comparison (ROOT locale, in ICU + Secondary strength).
+    collationTable[3] = new Collation(
+            "UNICODE_CI", Collator.getInstance(ULocale.ROOT), "153.120.0.0", false);
+    collationTable[3].collator.setStrength(Collator.SECONDARY);
+
+    for (int i = 0; i < collationTable.length; i++) {
+      collationNameToIdMap.put(collationTable[i].collationName, i);
+    }
+  }
+
+  /**
+   * Returns the collation id for the given collation name.
+   */
+  public static int collationNameToId(String collationName) throws SparkException {
+    String normalizedName = collationName.toUpperCase();
+    if (collationNameToIdMap.containsKey(normalizedName)) {
+      return collationNameToIdMap.get(normalizedName);
+    } else {
+      Collation suggestion = Collections.min(List.of(collationTable), Comparator.comparingInt(
+        c -> UTF8String.fromString(c.collationName).levenshteinDistance(
+          UTF8String.fromString(normalizedName))));
+
+      Map<String, String> params = new HashMap<>();
+      params.put("collationName", collationName);
+      params.put("proposal", suggestion.collationName);
+
+      throw new SparkException(
+       "COLLATION_INVALID_NAME", SparkException.constructMessageParams(params), null);

Review Comment:
   ```suggestion
           "COLLATION_INVALID_NAME", SparkException.constructMessageParams(params), null);
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46833][SQL] Collations - Introducing CollationFactory which provides comparison and hashing rules for supported collations [spark]

Posted by "MaxGekk (via GitHub)" <gi...@apache.org>.
MaxGekk commented on code in PR #44968:
URL: https://github.com/apache/spark/pull/44968#discussion_r1476285238


##########
common/unsafe/src/test/scala/org/apache/spark/unsafe/types/CollationFactorySuite.scala:
##########
@@ -0,0 +1,114 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.unsafe.types
+
+import org.apache.spark.SparkException
+// scalastyle:off
+import org.scalatest.funsuite.AnyFunSuite
+import org.scalatest.matchers.must.Matchers
+
+import org.apache.spark.sql.catalyst.util.CollationFactory._
+import org.apache.spark.unsafe.types.UTF8String.{fromString => toUTF8}
+
+class CollationFactorySuite extends AnyFunSuite with Matchers { // scalastyle:ignore funsuite
+  test("collationId stability") {
+    val ucsBasic = fetchCollation(0)
+    assert(ucsBasic.collationName == "UCS_BASIC")
+    assert(ucsBasic.isBinaryCollation)
+
+    val ucsBasicLcase = fetchCollation(1)
+    assert(ucsBasicLcase.collationName == "UCS_BASIC_LCASE")
+    assert(!ucsBasicLcase.isBinaryCollation)
+
+    val unicode = fetchCollation(2)
+    assert(unicode.collationName == "UNICODE")
+    assert(unicode.isBinaryCollation);
+
+    val unicodeCi = fetchCollation(3)
+    assert(unicodeCi.collationName == "UNICODE_CI")
+    assert(!unicodeCi.isBinaryCollation)
+  }
+
+  test("fetch invalid collation name") {
+    val error = intercept[SparkException] {
+      fetchCollation("UCS_BASIS")
+    }
+
+    assert(error.getMessage.contains(
+      "The value UCS_BASIS does not represent a correct collation name."))
+    assert(error.getMessage.contains("Suggested valid collation name: [UCS_BASIC]"))

Review Comment:
   Could you use `checkError`, please. This will allow to translate error messages to other languages in the future without modifying Spark's internal tests.



##########
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationFactory.java:
##########
@@ -0,0 +1,174 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.sql.catalyst.util;
+
+import java.util.*;
+import java.util.function.BiFunction;
+import java.util.function.ToLongFunction;
+
+import com.ibm.icu.util.ULocale;
+import com.ibm.icu.text.Collator;
+
+import org.apache.spark.SparkException;
+import org.apache.spark.unsafe.types.UTF8String;
+
+/**
+ * Static entry point for collation aware string functions.
+ * Provides functionality to the UTF8String object which respects defined collation settings.
+ */
+public final class CollationFactory {
+  /**
+   * Entry encapsulating all information about a collation.
+   */
+  public static class Collation {
+    public final String collationName;
+    public final Collator collator;
+    public final Comparator<UTF8String> comparator;
+
+    /**
+     * Version of the collation. This is the version of the ICU library Collator.
+     * For non-ICU collations (e.g. UTF8 Binary) the version is set to "1.0".
+     * When using ICU Collator this version is exposed through collator.getVersion().
+     * Whenever the collation is updated, the version should be updated as well or kept
+     * for backwards compatibility.
+     */
+    public final String version;
+
+    /**
+     * Collation sensitive hash function. Output for two UTF8Strings will be the same if they are
+     * equal according to the collation.
+     */
+    public final ToLongFunction<UTF8String> hashFunction;
+
+    /**
+     * Potentially faster way than using comparator to compare two UTF8Strings for equality.
+     * Falls back to binary comparison if the collation is binary.
+     */
+    public final BiFunction<UTF8String, UTF8String, Boolean> equalsFunction;
+
+    /**
+     * Binary collation implies that UTF8Strings are considered equal only if they are
+     * byte for byte equal. All accent or case-insensitive collations are considered non-binary.
+     */
+    public final boolean isBinaryCollation;
+
+    public Collation(
+        String collationName,
+        Collator collator,
+        Comparator<UTF8String> comparator,
+        String version,
+        ToLongFunction<UTF8String> hashFunction,
+        boolean isBinaryCollation) {
+      this.collationName = collationName;
+      this.collator = collator;
+      this.comparator = comparator;
+      this.version = version;
+      this.hashFunction = hashFunction;
+      this.isBinaryCollation = isBinaryCollation;
+
+      if (isBinaryCollation) {
+        this.equalsFunction = UTF8String::equals;
+      } else {
+        this.equalsFunction = (s1, s2) -> this.comparator.compare(s1, s2) == 0;
+      }
+    }
+
+    /**
+     * Constructor with comparators that are inherited from the given collator.
+     */
+    public Collation(
+        String collationName, Collator collator, String version, boolean isBinaryCollation) {
+      this(
+        collationName,
+        collator,
+        (s1, s2) -> collator.compare(s1.toString(), s2.toString()),
+        version,
+        s -> (long)collator.getCollationKey(s.toString()).hashCode(),
+        isBinaryCollation);
+    }
+  }
+
+  private static final Collation[] collationTable = new Collation[4];
+  private static final HashMap<String, Integer> collationNameToIdMap = new HashMap<>();
+
+  static {
+    // Binary comparison. This is the default collation.
+    // No custom comparators will be used for this collation.
+    // Instead, we rely on byte for byte comparison.
+    collationTable[0] = new Collation(
+      "UCS_BASIC",
+      null,
+      UTF8String::compareTo,
+      "1.0",
+      s -> (long)s.hashCode(),
+      true);
+
+    // Case-insensitive UTF8 binary collation.
+    // TODO: Do in place comparisons instead of creating new strings.
+    collationTable[1] = new Collation(
+      "UCS_BASIC_LCASE",
+      null,
+            Comparator.comparing(UTF8String::toLowerCase), "1.0",

Review Comment:
   Could you fix indentation here.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46833][SQL] Collations - Introducing CollationFactory which provides comparison and hashing rules for supported collations [spark]

Posted by "dbatomic (via GitHub)" <gi...@apache.org>.
dbatomic commented on code in PR #44968:
URL: https://github.com/apache/spark/pull/44968#discussion_r1476105579


##########
common/utils/src/main/resources/error/error-classes.json:
##########
@@ -451,6 +451,12 @@
     ],
     "sqlState" : "42704"
   },
+  "COLLATION_INVALID_NAME" : {
+    "message" : [
+      "The value <collationName> does not represent correct collation name."
+    ],
+    "sqlState" : "22018"

Review Comment:
   done



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46833][SQL] Collations - Introducing CollationFactory which provides comparison and hashing rules for supported collations [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on PR #44968:
URL: https://github.com/apache/spark/pull/44968#issuecomment-1927370292

   thanks, merging to master!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46833][SQL] Collations - Introducing CollationFactory which provides comparison and hashing rules for supported collations [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #44968:
URL: https://github.com/apache/spark/pull/44968#discussion_r1474553613


##########
common/unsafe/src/test/java/org/apache/spark/unsafe/types/CollationFactorySuite.java:
##########
@@ -0,0 +1,135 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.unsafe.types;
+
+import org.apache.spark.SparkIllegalArgumentException;
+import org.junit.jupiter.api.Test;
+
+import static org.junit.jupiter.api.Assertions.*;
+import static org.apache.spark.sql.catalyst.util.CollationFactory.*;
+import static org.apache.spark.unsafe.types.UTF8String.*;
+
+import java.util.Arrays;
+
+public class CollationFactorySuite {
+    @Test
+    public void collationIdStability() {
+      CollationInfo collationInfo = getInstance().fetchCollationInfo(0);
+      assertEquals(collationInfo.collationName, "UCS_BASIC");
+      assertTrue(collationInfo.isBinaryCollation);
+
+      collationInfo = getInstance().fetchCollationInfo(1);
+      assertEquals(collationInfo.collationName, "UCS_BASIC_LCASE");
+      assertFalse(collationInfo.isBinaryCollation);
+
+      collationInfo = getInstance().fetchCollationInfo(2);
+      assertEquals(collationInfo.collationName, "UNICODE");
+      assertTrue(collationInfo.isBinaryCollation);
+
+      collationInfo = getInstance().fetchCollationInfo(3);
+      assertEquals(collationInfo.collationName, "UNICODE_CI");
+      assertFalse(collationInfo.isBinaryCollation);
+    }
+
+    @Test
+    public void collationFetchInvalidName() {
+      assertThrows(

Review Comment:
   I'd suggest we write test in Scala, so that we can extend `SparkFunSuite` and use convenient test utils like `checkError`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46833][SQL] Collations - Introducing CollationFactory which provides comparison and hashing rules for supported collations [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #44968:
URL: https://github.com/apache/spark/pull/44968#discussion_r1474533443


##########
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationFactory.java:
##########
@@ -0,0 +1,171 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.sql.catalyst.util;
+
+import java.util.*;
+import java.util.function.BiFunction;
+import java.util.function.Function;
+
+import com.ibm.icu.util.ULocale;
+import com.ibm.icu.text.Collator;
+
+import org.apache.spark.SparkIllegalArgumentException;
+import org.apache.spark.unsafe.types.UTF8String;
+
+/**
+ * Static entry point for collation aware string functions.
+ * Provides functionality to the UTF8String object which respects defined collation settings.
+ */
+public final class CollationFactory {
+  /**
+   * Entry encapsulating all information about a collation.
+   */
+  public static class CollationInfo {
+    public final String collationName;
+    public final Collator collator;
+    public final Comparator<UTF8String> comparator;
+
+    /**
+     * Version of the collation. This is the version of the ICU library used to create the collator.
+     * For non-ICU collations (e.g. UTF8 Binary) the version is set to "1.0".
+     */
+    public final String version;
+
+    /**
+     * Collation sensitive hash function. Output for two UTF8Strings will be the same if they are
+     * equal according to the collation.
+     */
+    public final Function<UTF8String, Integer> hashFunction;

Review Comment:
   should this hash function return long? And for performance, we can use `ToLongFunction[UTF8String]` to avoid java boixing.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46833][SQL] Collations - Introducing CollationFactory which provides comparison and hashing rules for supported collations [spark]

Posted by "stefankandic (via GitHub)" <gi...@apache.org>.
stefankandic commented on code in PR #44968:
URL: https://github.com/apache/spark/pull/44968#discussion_r1474830118


##########
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationFactory.java:
##########
@@ -0,0 +1,171 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.sql.catalyst.util;
+
+import java.util.*;
+import java.util.function.BiFunction;
+import java.util.function.ToLongFunction;
+
+import com.ibm.icu.util.ULocale;
+import com.ibm.icu.text.Collator;
+
+import org.apache.spark.SparkException;
+import org.apache.spark.unsafe.types.UTF8String;
+
+/**
+ * Static entry point for collation aware string functions.
+ * Provides functionality to the UTF8String object which respects defined collation settings.
+ */
+public final class CollationFactory {
+  /**
+   * Entry encapsulating all information about a collation.
+   */
+  public static class Collation {
+    public final String collationName;
+    public final Collator collator;
+    public final Comparator<UTF8String> comparator;
+
+    /**
+     * Version of the collation. This is the version of the ICU library used to create the collator.
+     * For non-ICU collations (e.g. UTF8 Binary) the version is set to "1.0".
+     */
+    public final String version;
+
+    /**
+     * Collation sensitive hash function. Output for two UTF8Strings will be the same if they are
+     * equal according to the collation.
+     */
+    public final ToLongFunction<UTF8String> hashFunction;
+
+    /**
+     * Potentially faster way than using comparator to compare two UTF8Strings for equality.
+     * Falls back to binary comparison if the collation is binary.
+     */
+    public final BiFunction<UTF8String, UTF8String, Boolean> equalsFunction;
+
+    /**
+     * Binary collation implies that UTF8Strings are considered equal only if they are
+     * byte for byte equal. All accent or case-insensitive collations are considered non-binary.
+     */
+    public final boolean isBinaryCollation;
+
+    public Collation(
+      String collationName,
+      Collator collator,
+      Comparator<UTF8String> comparator,
+      String version,
+      ToLongFunction<UTF8String> hashFunction,
+      boolean isBinaryCollation) {
+      this.collationName = collationName;
+      this.collator = collator;
+      this.comparator = comparator;
+      this.version = version;
+      this.hashFunction = hashFunction;
+      this.isBinaryCollation = isBinaryCollation;
+
+      if (isBinaryCollation) {
+        this.equalsFunction = UTF8String::equals;
+      } else {
+        this.equalsFunction = (s1, s2) -> this.comparator.compare(s1, s2) == 0;
+      }
+    }
+
+    /**
+     * Constructor with comparators that are inherited from the given collator.
+     */
+    public Collation(
+      String collationName, Collator collator, String version, boolean isBinaryCollation) {
+      this(
+        collationName,
+        collator,
+        (s1, s2) -> collator.compare(s1.toString(), s2.toString()),
+        version,
+        s -> (long)collator.getCollationKey(s.toString()).hashCode(),
+        isBinaryCollation);
+    }
+  }
+
+  private final Collation[] collatorTable;
+  private final HashMap<String, Integer> collationNameToIdMap = new HashMap<>();
+
+  private CollationFactory() {
+    collatorTable = new Collation[4];
+
+    // Binary comparison. This is the default collation.
+    // No custom comparators will be used for this collation.
+    // Instead, we rely on byte for byte comparison.
+    collatorTable[0] = new Collation(
+      "UCS_BASIC",
+      null,
+      UTF8String::compareTo,
+      "1.0",
+      s -> (long)s.hashCode(),
+      true);
+
+    // Case-insensitive UTF8 binary collation.
+    // TODO: Do in place comparisons instead of creating new strings.
+    collatorTable[1] = new Collation(
+      "UCS_BASIC_LCASE",
+      null,
+            Comparator.comparing(UTF8String::toLowerCase), "1.0",
+      (s) -> (long)s.toLowerCase().hashCode(),
+      false);
+
+    // UNICODE case sensitive comparison (ROOT locale, in ICU).
+    collatorTable[2] = new Collation(
+      "UNICODE", Collator.getInstance(ULocale.ROOT), "153.120.0.0", true);
+    collatorTable[2].collator.setStrength(Collator.TERTIARY);
+
+
+    // UNICODE case-insensitive comparison (ROOT locale, in ICU + Secondary strength).
+    collatorTable[3] = new Collation(
+            "UNICODE_CI", Collator.getInstance(ULocale.ROOT), "153.120.0.0", false);
+    collatorTable[3].collator.setStrength(Collator.SECONDARY);
+
+    for (int i = 0; i < collatorTable.length; i++) {
+      this.collationNameToIdMap.put(collatorTable[i].collationName, i);
+    }
+  }
+
+  private static final CollationFactory instance = new CollationFactory();
+
+  /**
+   * Returns the collation id for the given collation name.
+   */
+  public int collationNameToId(String collationName) throws SparkException {
+    String normalizedName = collationName.toUpperCase();
+    if (collationNameToIdMap.containsKey(normalizedName)) {
+      return collationNameToIdMap.get(normalizedName);
+    } else {
+      throw new SparkException(
+        "COLLATION_INVALID_NAME", Collections.singletonMap("collationName", collationName));
+    }
+  }
+
+  public Collation fetchCollationInfo(int collationId) {

Review Comment:
   maybe refactor the names of these two methods as well if to match the new class name



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46833][SQL] Collations - Introducing CollationFactory which provides comparison and hashing rules for supported collations [spark]

Posted by "dbatomic (via GitHub)" <gi...@apache.org>.
dbatomic commented on code in PR #44968:
URL: https://github.com/apache/spark/pull/44968#discussion_r1474888223


##########
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationFactory.java:
##########
@@ -0,0 +1,171 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.sql.catalyst.util;
+
+import java.util.*;
+import java.util.function.BiFunction;
+import java.util.function.ToLongFunction;
+
+import com.ibm.icu.util.ULocale;
+import com.ibm.icu.text.Collator;
+
+import org.apache.spark.SparkException;
+import org.apache.spark.unsafe.types.UTF8String;
+
+/**
+ * Static entry point for collation aware string functions.
+ * Provides functionality to the UTF8String object which respects defined collation settings.
+ */
+public final class CollationFactory {
+  /**
+   * Entry encapsulating all information about a collation.
+   */
+  public static class Collation {
+    public final String collationName;
+    public final Collator collator;
+    public final Comparator<UTF8String> comparator;
+
+    /**
+     * Version of the collation. This is the version of the ICU library used to create the collator.
+     * For non-ICU collations (e.g. UTF8 Binary) the version is set to "1.0".
+     */
+    public final String version;
+
+    /**
+     * Collation sensitive hash function. Output for two UTF8Strings will be the same if they are
+     * equal according to the collation.
+     */
+    public final ToLongFunction<UTF8String> hashFunction;
+
+    /**
+     * Potentially faster way than using comparator to compare two UTF8Strings for equality.
+     * Falls back to binary comparison if the collation is binary.
+     */
+    public final BiFunction<UTF8String, UTF8String, Boolean> equalsFunction;
+
+    /**
+     * Binary collation implies that UTF8Strings are considered equal only if they are
+     * byte for byte equal. All accent or case-insensitive collations are considered non-binary.
+     */
+    public final boolean isBinaryCollation;
+
+    public Collation(
+      String collationName,
+      Collator collator,
+      Comparator<UTF8String> comparator,
+      String version,
+      ToLongFunction<UTF8String> hashFunction,
+      boolean isBinaryCollation) {
+      this.collationName = collationName;
+      this.collator = collator;
+      this.comparator = comparator;
+      this.version = version;
+      this.hashFunction = hashFunction;
+      this.isBinaryCollation = isBinaryCollation;
+
+      if (isBinaryCollation) {
+        this.equalsFunction = UTF8String::equals;
+      } else {
+        this.equalsFunction = (s1, s2) -> this.comparator.compare(s1, s2) == 0;
+      }
+    }
+
+    /**
+     * Constructor with comparators that are inherited from the given collator.
+     */
+    public Collation(
+      String collationName, Collator collator, String version, boolean isBinaryCollation) {
+      this(
+        collationName,
+        collator,
+        (s1, s2) -> collator.compare(s1.toString(), s2.toString()),
+        version,
+        s -> (long)collator.getCollationKey(s.toString()).hashCode(),
+        isBinaryCollation);
+    }
+  }
+
+  private final Collation[] collatorTable;
+  private final HashMap<String, Integer> collationNameToIdMap = new HashMap<>();
+
+  private CollationFactory() {
+    collatorTable = new Collation[4];
+
+    // Binary comparison. This is the default collation.
+    // No custom comparators will be used for this collation.
+    // Instead, we rely on byte for byte comparison.
+    collatorTable[0] = new Collation(
+      "UCS_BASIC",
+      null,
+      UTF8String::compareTo,
+      "1.0",
+      s -> (long)s.hashCode(),
+      true);
+
+    // Case-insensitive UTF8 binary collation.
+    // TODO: Do in place comparisons instead of creating new strings.
+    collatorTable[1] = new Collation(
+      "UCS_BASIC_LCASE",
+      null,
+            Comparator.comparing(UTF8String::toLowerCase), "1.0",
+      (s) -> (long)s.toLowerCase().hashCode(),
+      false);
+
+    // UNICODE case sensitive comparison (ROOT locale, in ICU).
+    collatorTable[2] = new Collation(
+      "UNICODE", Collator.getInstance(ULocale.ROOT), "153.120.0.0", true);
+    collatorTable[2].collator.setStrength(Collator.TERTIARY);
+
+
+    // UNICODE case-insensitive comparison (ROOT locale, in ICU + Secondary strength).
+    collatorTable[3] = new Collation(
+            "UNICODE_CI", Collator.getInstance(ULocale.ROOT), "153.120.0.0", false);
+    collatorTable[3].collator.setStrength(Collator.SECONDARY);
+
+    for (int i = 0; i < collatorTable.length; i++) {
+      this.collationNameToIdMap.put(collatorTable[i].collationName, i);
+    }
+  }
+
+  private static final CollationFactory instance = new CollationFactory();
+
+  /**
+   * Returns the collation id for the given collation name.
+   */
+  public int collationNameToId(String collationName) throws SparkException {
+    String normalizedName = collationName.toUpperCase();
+    if (collationNameToIdMap.containsKey(normalizedName)) {
+      return collationNameToIdMap.get(normalizedName);
+    } else {
+      throw new SparkException(
+        "COLLATION_INVALID_NAME", Collections.singletonMap("collationName", collationName));
+    }
+  }
+
+  public Collation fetchCollationInfo(int collationId) {

Review Comment:
   right, thanks.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46833][SQL] Collations - Introducing CollationFactory which provides comparison and hashing rules for supported collations [spark]

Posted by "dbatomic (via GitHub)" <gi...@apache.org>.
dbatomic commented on code in PR #44968:
URL: https://github.com/apache/spark/pull/44968#discussion_r1474677628


##########
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationFactory.java:
##########
@@ -0,0 +1,171 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.sql.catalyst.util;
+
+import java.util.*;
+import java.util.function.BiFunction;
+import java.util.function.Function;
+
+import com.ibm.icu.util.ULocale;
+import com.ibm.icu.text.Collator;
+
+import org.apache.spark.SparkIllegalArgumentException;
+import org.apache.spark.unsafe.types.UTF8String;
+
+/**
+ * Static entry point for collation aware string functions.
+ * Provides functionality to the UTF8String object which respects defined collation settings.
+ */
+public final class CollationFactory {
+  /**
+   * Entry encapsulating all information about a collation.
+   */
+  public static class CollationInfo {
+    public final String collationName;
+    public final Collator collator;
+    public final Comparator<UTF8String> comparator;
+
+    /**
+     * Version of the collation. This is the version of the ICU library used to create the collator.
+     * For non-ICU collations (e.g. UTF8 Binary) the version is set to "1.0".
+     */
+    public final String version;
+
+    /**
+     * Collation sensitive hash function. Output for two UTF8Strings will be the same if they are
+     * equal according to the collation.
+     */
+    public final Function<UTF8String, Integer> hashFunction;
+
+    /**
+     * Potentially faster way than using comparator to compare two UTF8Strings for equality.
+     * Falls back to binary comparison if the collation is binary.
+     */
+    public final BiFunction<UTF8String, UTF8String, Boolean> equalsFunction;
+
+    /**
+     * Binary collation implies that UTF8Strings are considered equal only if they are
+     * byte for byte equal. All accent or case-insensitive collations are considered non-binary.
+     */
+    public final boolean isBinaryCollation;
+
+    public CollationInfo(
+      String collationName,
+      Collator collator,
+      Comparator<UTF8String> comparator,
+      String version,
+      Function<UTF8String, Integer> hashFunction,
+      boolean isBinaryCollation) {
+      this.collationName = collationName;
+      this.collator = collator;
+      this.comparator = comparator;
+      this.version = version;
+      this.hashFunction = hashFunction;
+      this.isBinaryCollation = isBinaryCollation;
+
+      if (isBinaryCollation) {
+        this.equalsFunction = UTF8String::binaryEquals;
+      } else {
+        this.equalsFunction = (s1, s2) -> this.comparator.compare(s1, s2) == 0;
+      }
+    }
+
+    /**
+     * Constructor with comparators that are inherited from the given collator.
+     */
+    public CollationInfo(
+      String collationName, Collator collator, String version, boolean isBinaryCollation) {
+      this(
+        collationName,
+        collator,
+        (s1, s2) -> collator.compare(s1.toString(), s2.toString()),
+        version,
+        s -> collator.getCollationKey(s.toString()).hashCode(),
+        isBinaryCollation);
+    }
+  }
+
+  private final CollationInfo[] collatorTable;
+  private final Hashtable<String, Integer> collationNameToIdMap = new Hashtable<>();
+
+  private CollationFactory() {
+    collatorTable = new CollationInfo[4];
+
+    // Binary comparison. This is the default collation.
+    // No custom comparators will be used for this collation.
+    // Instead, we rely on byte for byte comparison.
+    collatorTable[0] = new CollationInfo(
+      "UCS_BASIC",
+      null,
+      UTF8String::binaryCompare,
+      "1.0",
+      UTF8String::binaryHash,
+      true);
+
+    // Case-insensitive UTF8 binary collation.
+    // TODO: Do in place comparisons instead of creating new strings.
+    collatorTable[1] = new CollationInfo(
+      "UCS_BASIC_LCASE",
+      null,
+      (s1, s2) -> s1.toLowerCase().binaryCompare(s2.toLowerCase()), "1.0",
+      (s) -> s.toLowerCase().binaryHash(),
+      false);
+
+    // UNICODE case sensitive comparison (ROOT locale, in ICU).
+    collatorTable[2] = new CollationInfo(
+      "UNICODE", Collator.getInstance(ULocale.ROOT), "153.120.0.0", true);

Review Comment:
   There is a comment in Collation class:
   ```
       /**
        * Version of the collation. This is the version of the ICU library used to create the collator.
        * For non-ICU collations (e.g. UTF8 Binary) the version is set to "1.0".
        */
       public final String version;
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org