You are viewing a plain text version of this content. The canonical link for it is here.

Posted to reviews@spark.apache.org by "dbatomic (via GitHub)" <gi...@apache.org> on 2023/12/29 12:30:09 UTC

[PR] Collations proof of concept [spark]

dbatomic opened a new pull request, #44537:
URL: https://github.com/apache/spark/pull/44537

# Rough POC for collations in Spark

## High level changes
- Collation suite that test currently supported features (start with this file).
- Global, singleton `CollatorFactory`. For given collation name or for given comparator id (cached representation of collation) it provides collation aware comparator that can be used by `UTF8String`. (proper design here is todo, it probably shouldn't be a singleton).
- `UTF8String` is extended with single integer that specifies collation. We could be even more aggressive and pack this integer into a short, or even a byte. Id represents cached comparator id that can be fetched from `CollatorFactory`.
- `UTF8String` respects this id for equality checks and compares.
- New type called `CollatedStringType` with physical type `PhysicalCollatedStringType` that at the end maps back to `UTF8String`. Basic support for this new type across code base.
- Support for aggregates, given that they currently rely on pure byte for byte comparison for group building.
- Support for merge join (hash based joins are TODO).
- POC uses java's default collator. TODO is to switch to ICU most likely. Collator changes are scoped to single file, so it shouldn't be hard to replace java's collator with ICU.

## Supported features at this point:
- `collate` expression -> input string is casted to `CollatedStringType` with given collation.
- Collation rules are java collator based. Caller provides locale and strength (primary, secondary, tertiary). E.g. `collate(input, 'sr-primary')` will collate input with Serbian locale that ignores both casing and accents. Secondary will ignore casing but respect accents and tertiary will respect both.
- `collation` expression -> returns collation name of given input.
- Support for basic operators (filters, aggregate, joins, views, inline tables etc.).

Proper testing (and creating real test strategy is TBD).

TBD is parquet and delta support, different collation levels (column level, table level, database level) and much more extensive testing of other features.

Suggestion for reviewers of this POC is to start with `CollationSuite` and newly tests in `UTF8StringSuite` to get the gist of the changes in this PR.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] Collations proof of concept [spark]

Posted by "dbatomic (via GitHub)" <gi...@apache.org>.

dbatomic commented on code in PR #44537:
URL: https://github.com/apache/spark/pull/44537#discussion_r1441563199


##########
sql/core/src/main/scala/org/apache/spark/sql/execution/dynamicpruning/PartitionPruning.scala:
##########
@@ -207,14 +208,20 @@ object PartitionPruning extends Rule[LogicalPlan] with PredicateHelper with Join
     }
   }
 
+  private def hasNoCollations(plan: LogicalPlan): Boolean = {
+    // TODO: Should be more selective here.
+    // Don't fully understand this code path.
+    !plan.expressions.exists(e => e.dataType.isInstanceOf[CollatedStringType])
+  }
+
   /**
    * To be able to prune partitions on a join key, the filtering side needs to
    * meet the following requirements:
    *   (1) it can not be a stream
    *   (2) it needs to contain a selective predicate used for filtering
    */
   private def hasPartitionPruningFilter(plan: LogicalPlan): Boolean = {
-    !plan.isStreaming && hasSelectivePredicate(plan)
+    !plan.isStreaming && hasSelectivePredicate(plan) && hasNoCollations(plan)

Review Comment:
   Yep, idea is to disable pruning altogether. At this point pruning part is still not stable and I also don't have any test coverage for pruning. You are right, need to take care of other "prunings" as well.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] Collations proof of concept [spark]

Posted by "milastdbx (via GitHub)" <gi...@apache.org>.

milastdbx commented on code in PR #44537:
URL: https://github.com/apache/spark/pull/44537#discussion_r1441518135


##########
sql/core/src/main/scala/org/apache/spark/sql/execution/dynamicpruning/PartitionPruning.scala:
##########
@@ -207,14 +208,20 @@ object PartitionPruning extends Rule[LogicalPlan] with PredicateHelper with Join
     }
   }
 
+  private def hasNoCollations(plan: LogicalPlan): Boolean = {
+    // TODO: Should be more selective here.
+    // Don't fully understand this code path.
+    !plan.expressions.exists(e => e.dataType.isInstanceOf[CollatedStringType])
+  }
+
   /**
    * To be able to prune partitions on a join key, the filtering side needs to
    * meet the following requirements:
    *   (1) it can not be a stream
    *   (2) it needs to contain a selective predicate used for filtering
    */
   private def hasPartitionPruningFilter(plan: LogicalPlan): Boolean = {
-    !plan.isStreaming && hasSelectivePredicate(plan)
+    !plan.isStreaming && hasSelectivePredicate(plan) && hasNoCollations(plan)

Review Comment:
   Was your intention to disable pruning altogether ?
   
   This pruning is just for dynamic pruning.
   
   There is also pruning happening during file listing in both static analysis and scan phy nodes



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] Collations proof of concept [spark]

Posted by "stefankandic (via GitHub)" <gi...@apache.org>.

stefankandic commented on code in PR #44537:
URL: https://github.com/apache/spark/pull/44537#discussion_r1457160277


##########
common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java:
##########
@@ -1401,10 +1513,17 @@ public int compare(final UTF8String other) {
   @Override
   public boolean equals(final Object other) {
     if (other instanceof UTF8String o) {
-      if (numBytes != o.numBytes) {
-        return false;
+      if (comparatorId == 0)

Review Comment:
   same as for the above comment



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] Collations proof of concept [spark]

Posted by "stefankandic (via GitHub)" <gi...@apache.org>.

stefankandic commented on code in PR #44537:
URL: https://github.com/apache/spark/pull/44537#discussion_r1457159311


##########
common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java:
##########
@@ -1390,8 +1497,13 @@ public UTF8String copy() {
 
   @Override
   public int compareTo(@Nonnull final UTF8String other) {
-    return ByteArray.compareBinary(
-        base, offset, numBytes, other.base, other.offset, other.numBytes);
+    if (comparatorId == 0) {

Review Comment:
   what if the `other` has nonzero comparatorId, should we do something like `-1 * other.compareTo(this)`?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] Collations proof of concept [spark]

Posted by "stefankandic (via GitHub)" <gi...@apache.org>.

stefankandic commented on code in PR #44537:
URL: https://github.com/apache/spark/pull/44537#discussion_r1466438453


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/ToStringBase.scala:
##########
@@ -256,11 +258,11 @@ trait ToStringBase { self: UnaryExpression with TimeZoneAwareExpression =>
       // as strings. Otherwise, the casting is using `BigDecimal.toString` which may use scientific
       // notation if an exponent is needed.
       case _: DecimalType if useDecimalPlainString =>
-        (c, evPrim) => code"$evPrim = UTF8String.fromString($c.toPlainString());"
-      case StringType =>
+        (c, evPrim) => code"$evPrim = UTF8String.fromString($c.toPlainString(), $collationId);"
+      case _: StringType =>
         (c, evPrim) => code"$evPrim = $c;"

Review Comment:
   this needs collationId as well, right?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] Collations proof of concept [spark]

Posted by "stefankandic (via GitHub)" <gi...@apache.org>.

stefankandic commented on code in PR #44537:
URL: https://github.com/apache/spark/pull/44537#discussion_r1470942382


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/ToStringBase.scala:
##########
@@ -231,11 +232,12 @@ trait ToStringBase { self: UnaryExpression with TimeZoneAwareExpression =>
              |$evPrim = $buffer.build();
            """.stripMargin
         }
-      case pudt: PythonUserDefinedType => castToStringCode(pudt.sqlType, ctx)
+      case pudt: PythonUserDefinedType =>
+        castToStringCode(pudt.sqlType, ctx, StringType.DEFAULT_COLLATION_ID)

Review Comment:
   why not use the specified `collationId`?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org