
Posted to reviews@spark.apache.org by "GideonPotok (via GitHub)" <gi...@apache.org> on 2024/03/10 22:20:26 UTC

[PR] [WIP][SPARK 46840] Add sql.execution.benchmark.CollationBenchmark.scala Scaffolding [spark]

GideonPotok opened a new pull request, #45453:
URL: https://github.com/apache/spark/pull/45453

### What changes were proposed in this pull request?

https://issues.apache.org/jira/browse/SPARK-46840

[Collation Support in Spark.docx](https://github.com/apache/spark/files/14551958/Collation.Support.in.Spark.docx)

### Why are the changes needed?

Work is underway to introduce the concept of collation into Spark. A benchmarking suite is needed so that engineers can measure and address the performance impact.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

`build/sbt "sql/Test/runMain org.apache.spark.sql.execution.benchmark.CollationBenchmark"`

### Was this patch authored or co-authored using generative AI tooling?

No.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-46840][SQL][COLLATION] New Benchmarking Suite [spark]

Posted by "GideonPotok (via GitHub)" <gi...@apache.org>.

GideonPotok commented on code in PR #45453:
URL: https://github.com/apache/spark/pull/45453#discussion_r1532591928


##########
sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/CollationBenchmark.scala:
##########
@@ -0,0 +1,119 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.sql.execution.benchmark
+
+import scala.util.Random
+
+import org.apache.spark.benchmark.Benchmark
+import org.apache.spark.sql.{functions, Column}
+import org.apache.spark.sql.catalyst.util.CollationFactory
+import org.apache.spark.sql.functions._
+import org.apache.spark.unsafe.types.UTF8String
+
+/**
+ * Benchmark to measure performance for joins. To run this benchmark:
+ * {{{
+ *   1. without sbt:
+ *      bin/spark-submit --class <this class>
+ *        --jars <spark core test jar>,<spark catalyst test jar> <spark sql test jar>
+ *   2. build/sbt "sql/Test/runMain org.apache.spark.sql.execution.benchmark.CollationBenchmark"
+ *   3. generate result:
+ *      SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/Test/runMain <this class>"
+ *      Results will be written to "benchmarks/JoinBenchmark-results.txt".
+ * }}}
+ */
+
+object CollationBenchmark extends SqlBasedBenchmark {
+  private val collationTypes = Seq("UTF8_BINARY", "UTF8_BINARY_LCASE", "UNICODE", "UNICODE_CI")
+
+  def generateUTF8Strings(n: Int): Seq[UTF8String] = {
+    // Generate n UTF8Strings
+    Seq("ABC", "aBC", "abc", "DEF", "def", "GHI", "ghi", "JKL", "jkl",
+      "MNO", "mno", "PQR", "pqr", "STU", "stu", "VWX", "vwx", "YZ").map(UTF8String.fromString) ++
+    (18 to n).map(i => UTF8String.fromString(Random.nextString(i % 25))).sortBy(_.hashCode())
+  }
+
+  def benchmarkUTFString(collationTypes: Seq[String], utf8Strings: Seq[UTF8String]): Unit = {
+    val benchmark = collationTypes.foldLeft(
+      new Benchmark(s"collation unit benchmarks", utf8Strings.size, output = output)) {
+      (b, collationType) =>
+        b.addCase(s"equalsFunction - $collationType") { _ =>
+          val collation = CollationFactory.fetchCollation(collationType)
+          utf8Strings.slice(0, 20).foreach(s1 =>
+            utf8Strings.filter(s =>
+            collation.equalsFunction(s, s1).booleanValue()
+            )
+          )
+        }
+        b.addCase(s"collator.compare - $collationType") { _ =>
+          val collation = CollationFactory.fetchCollation(collationType)
+          utf8Strings.slice(0, 20).foreach(s1 =>
+            utf8Strings.sortBy(s =>
+              collation.comparator.compare(s, s1)
+            )
+          )
+        }
+        b.addCase(s"hashFunction - $collationType") { _ =>
+          val collation = CollationFactory.fetchCollation(collationType)
+          utf8Strings.map(s => collation.hashFunction.applyAsLong(s))
+        }
+        b
+    }
+    benchmark.run()
+  }
+
+  def collationBenchmarkFilterEqual(

Review Comment:
   Do you mind if we keep them? I would tidy the code up if so. 
   
   You can make the call about whether to omit them in this initial PR, but before you decide, you should first check out the result file -- the E2E tests are already pretty useful and revealing. 




Re: [PR] [SPARK-46840][SQL][COLLATION] New Benchmarking Suite [spark]

Posted by "dbatomic (via GitHub)" <gi...@apache.org>.

dbatomic commented on code in PR #45453:
URL: https://github.com/apache/spark/pull/45453#discussion_r1534718344


##########
sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/CollationBenchmark.scala:
##########
@@ -0,0 +1,119 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.sql.execution.benchmark
+
+import scala.util.Random
+
+import org.apache.spark.benchmark.Benchmark
+import org.apache.spark.sql.{functions, Column}
+import org.apache.spark.sql.catalyst.util.CollationFactory
+import org.apache.spark.sql.functions._
+import org.apache.spark.unsafe.types.UTF8String
+
+/**
+ * Benchmark to measure performance for joins. To run this benchmark:
+ * {{{
+ *   1. without sbt:
+ *      bin/spark-submit --class <this class>
+ *        --jars <spark core test jar>,<spark catalyst test jar> <spark sql test jar>
+ *   2. build/sbt "sql/Test/runMain org.apache.spark.sql.execution.benchmark.CollationBenchmark"
+ *   3. generate result:
+ *      SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/Test/runMain <this class>"
+ *      Results will be written to "benchmarks/JoinBenchmark-results.txt".
+ * }}}
+ */
+
+object CollationBenchmark extends SqlBasedBenchmark {
+  private val collationTypes = Seq("UTF8_BINARY", "UTF8_BINARY_LCASE", "UNICODE", "UNICODE_CI")
+
+  def generateUTF8Strings(n: Int): Seq[UTF8String] = {
+    // Generate n UTF8Strings
+    Seq("ABC", "aBC", "abc", "DEF", "def", "GHI", "ghi", "JKL", "jkl",
+      "MNO", "mno", "PQR", "pqr", "STU", "stu", "VWX", "vwx", "YZ").map(UTF8String.fromString) ++
+    (18 to n).map(i => UTF8String.fromString(Random.nextString(i % 25))).sortBy(_.hashCode())
+  }
+
+  def benchmarkUTFString(collationTypes: Seq[String], utf8Strings: Seq[UTF8String]): Unit = {
+    val benchmark = collationTypes.foldLeft(
+      new Benchmark(s"collation unit benchmarks", utf8Strings.size, output = output)) {
+      (b, collationType) =>
+        b.addCase(s"equalsFunction - $collationType") { _ =>
+          val collation = CollationFactory.fetchCollation(collationType)
+          utf8Strings.slice(0, 20).foreach(s1 =>
+            utf8Strings.filter(s =>
+            collation.equalsFunction(s, s1).booleanValue()
+            )
+          )
+        }
+        b.addCase(s"collator.compare - $collationType") { _ =>
+          val collation = CollationFactory.fetchCollation(collationType)
+          utf8Strings.slice(0, 20).foreach(s1 =>
+            utf8Strings.sortBy(s =>
+              collation.comparator.compare(s, s1)
+            )
+          )
+        }
+        b.addCase(s"hashFunction - $collationType") { _ =>
+          val collation = CollationFactory.fetchCollation(collationType)
+          utf8Strings.map(s => collation.hashFunction.applyAsLong(s))
+        }
+        b
+    }
+    benchmark.run()
+  }
+
+  def collationBenchmarkFilterEqual(

Review Comment:
   It's not blocking the approval from my side.
   
   I left a couple of additional comments. After that it should be green from my side.




Re: [PR] [SPARK-46840][SQL][TESTS] Add `CollationBenchmark` [spark]

Posted by "GideonPotok (via GitHub)" <gi...@apache.org>.

GideonPotok commented on code in PR #45453:
URL: https://github.com/apache/spark/pull/45453#discussion_r1537987211


##########
sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/CollationBenchmark.scala:
##########
@@ -0,0 +1,117 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.sql.execution.benchmark
+
+import org.apache.spark.benchmark.Benchmark
+import org.apache.spark.sql.DataFrame
+import org.apache.spark.sql.catalyst.util.CollationFactory
+import org.apache.spark.sql.functions._
+import org.apache.spark.unsafe.types.UTF8String
+
+/**
+ * Benchmark to measure performance for joins. To run this benchmark:

Review Comment:
   @cloud-fan how about this?
   ```suggestion
    * Benchmark to measure performance for operations on strings with the various collation types. To run this benchmark:
   ```






Re: [PR] [WIP][SPARK 46840] Add sql.execution.benchmark.CollationBenchmark.scala Scaffolding [spark]

Posted by "MaxGekk (via GitHub)" <gi...@apache.org>.

MaxGekk commented on PR #45453:
URL: https://github.com/apache/spark/pull/45453#issuecomment-1988594133

   @dbatomic @stefankandic Could you review this PR, please?



Re: [PR] [SPARK-46840][SQL][TESTS] Add `CollationBenchmark` [spark]

Posted by "GideonPotok (via GitHub)" <gi...@apache.org>.

GideonPotok commented on code in PR #45453:
URL: https://github.com/apache/spark/pull/45453#discussion_r1537987211


##########
sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/CollationBenchmark.scala:
##########
@@ -0,0 +1,117 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.sql.execution.benchmark
+
+import org.apache.spark.benchmark.Benchmark
+import org.apache.spark.sql.DataFrame
+import org.apache.spark.sql.catalyst.util.CollationFactory
+import org.apache.spark.sql.functions._
+import org.apache.spark.unsafe.types.UTF8String
+
+/**
+ * Benchmark to measure performance for joins. To run this benchmark:

Review Comment:
   ```suggestion
    * Benchmark to measure performance for collations. To run this benchmark:
   ```




Re: [PR] [SPARK-46840][SQL][COLLATION] New Benchmarking Suite [spark]

Posted by "dbatomic (via GitHub)" <gi...@apache.org>.

dbatomic commented on code in PR #45453:
URL: https://github.com/apache/spark/pull/45453#discussion_r1531890997


##########
sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/CollationBenchmark.scala:
##########
@@ -0,0 +1,119 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.sql.execution.benchmark
+
+import scala.util.Random
+
+import org.apache.spark.benchmark.Benchmark
+import org.apache.spark.sql.{functions, Column}
+import org.apache.spark.sql.catalyst.util.CollationFactory
+import org.apache.spark.sql.functions._
+import org.apache.spark.unsafe.types.UTF8String
+
+/**
+ * Benchmark to measure performance for joins. To run this benchmark:
+ * {{{
+ *   1. without sbt:
+ *      bin/spark-submit --class <this class>
+ *        --jars <spark core test jar>,<spark catalyst test jar> <spark sql test jar>
+ *   2. build/sbt "sql/Test/runMain org.apache.spark.sql.execution.benchmark.CollationBenchmark"
+ *   3. generate result:
+ *      SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/Test/runMain <this class>"
+ *      Results will be written to "benchmarks/JoinBenchmark-results.txt".
+ * }}}
+ */
+
+object CollationBenchmark extends SqlBasedBenchmark {
+  private val collationTypes = Seq("UTF8_BINARY", "UTF8_BINARY_LCASE", "UNICODE", "UNICODE_CI")
+
+  def generateUTF8Strings(n: Int): Seq[UTF8String] = {
+    // Generate n UTF8Strings
+    Seq("ABC", "aBC", "abc", "DEF", "def", "GHI", "ghi", "JKL", "jkl",
+      "MNO", "mno", "PQR", "pqr", "STU", "stu", "VWX", "vwx", "YZ").map(UTF8String.fromString) ++
+    (18 to n).map(i => UTF8String.fromString(Random.nextString(i % 25))).sortBy(_.hashCode())
+  }
+
+  def benchmarkUTFString(collationTypes: Seq[String], utf8Strings: Seq[UTF8String]): Unit = {
+    val benchmark = collationTypes.foldLeft(
+      new Benchmark(s"collation unit benchmarks", utf8Strings.size, output = output)) {
+      (b, collationType) =>
+        b.addCase(s"equalsFunction - $collationType") { _ =>
+          val collation = CollationFactory.fetchCollation(collationType)
+          utf8Strings.slice(0, 20).foreach(s1 =>

Review Comment:
   What is the point of slicing here? I am not sure whether `slice` allocates a new collection. Please make sure that the benchmark loop does nothing beyond the core string operation that we are trying to benchmark.
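
To illustrate the concern, here is a minimal sketch in plain Scala with no Spark dependencies. `SliceHoisting`, `probesFor`, and `countMatches` are hypothetical names, and plain `String` equality stands in for `collation.equalsFunction`. On strict collections such as `Vector` or `List`, `slice` returns a new collection object, so taking it once, before the measured code runs, keeps that cost out of the timing. Iterating with `foreach` instead of `filter` also avoids building a result collection per probe.

```scala
object SliceHoisting {
  // Take the probe subset once, before any timed code runs.
  def probesFor(all: Seq[String], n: Int): Seq[String] = all.slice(0, n)

  // The part that would sit inside the timed closure: comparisons only,
  // with no intermediate collections allocated.
  def countMatches(probes: Seq[String], all: Seq[String]): Long = {
    var matches = 0L
    probes.foreach { p =>
      all.foreach { s =>
        // Stand-in (assumption) for collation.equalsFunction(s, p).
        if (s == p) matches += 1
      }
    }
    matches
  }

  def main(args: Array[String]): Unit = {
    val all = Vector("ABC", "abc", "DEF", "abc")
    val probes = probesFor(all, 2) // hoisted out of the measured loop
    println(countMatches(probes, all))
  }
}
```

Returning the match count gives the loop an observable result, which may also make it harder for the JIT to discard the comparisons as dead code.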




Re: [PR] [SPARK-46840][SQL][TESTS] Add `CollationBenchmark` [spark]

Posted by "GideonPotok (via GitHub)" <gi...@apache.org>.

GideonPotok commented on code in PR #45453:
URL: https://github.com/apache/spark/pull/45453#discussion_r1539275734


##########
sql/core/benchmarks/CollationBenchmark-results.txt:
##########
@@ -0,0 +1,26 @@
+OpenJDK 64-Bit Server VM 17.0.10+7-LTS on Linux 6.5.0-1016-azure
+AMD EPYC 7763 64-Core Processor
+filter df column with collation:                     Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
+-----------------------------------------------------------------------------------------------------------------------------------
+filter df column with collation - UNICODE_CI                   403            463          39          0.0    20147470.0       1.0X
+filter df column with collation - UNICODE                      187            223          37          0.0     9339586.0       2.2X
+filter df column with collation - UTF8_BINARY_LCASE            426            434           7          0.0    21300903.4       0.9X
+filter df column with collation - UTF8_BINARY                  188            199           5          0.0     9403169.1       2.1X
+
+OpenJDK 64-Bit Server VM 17.0.10+7-LTS on Linux 6.5.0-1016-azure
+AMD EPYC 7763 64-Core Processor
+collation unit benchmarks:                Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
+------------------------------------------------------------------------------------------------------------------------
+equalsFunction - UTF8_BINARY                          0              0           0         10.4          96.6       1.0X

Review Comment:
   I am making the change to the `numIters` param of `addCase` now.




Re: [PR] [SPARK-46840][SQL][COLLATION] New Benchmarking Suite [spark]

Posted by "dbatomic (via GitHub)" <gi...@apache.org>.

dbatomic commented on code in PR #45453:
URL: https://github.com/apache/spark/pull/45453#discussion_r1536791565


##########
sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/CollationBenchmark.scala:
##########
@@ -0,0 +1,112 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.sql.execution.benchmark
+
+import org.apache.spark.benchmark.Benchmark
+import org.apache.spark.sql.DataFrame
+import org.apache.spark.sql.catalyst.util.CollationFactory
+import org.apache.spark.sql.functions._
+import org.apache.spark.unsafe.types.UTF8String
+
+/**
+ * Benchmark to measure performance for joins. To run this benchmark:
+ * {{{
+ *   1. without sbt:
+ *      bin/spark-submit --class <this class>
+ *        --jars <spark core test jar>,<spark catalyst test jar> <spark sql test jar>
+ *   2. build/sbt "sql/Test/runMain org.apache.spark.sql.execution.benchmark.CollationBenchmark"
+ *   3. generate result:
+ *      SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/Test/runMain <this class>"
+ *      Results will be written to "benchmarks/JoinBenchmark-results.txt".
+ * }}}
+ */
+
+object CollationBenchmark extends SqlBasedBenchmark {
+  private val collationTypes = Seq("UTF8_BINARY", "UTF8_BINARY_LCASE", "UNICODE", "UNICODE_CI")
+
+  def generateUTF8Strings(n: Int): Seq[UTF8String] = {
+    // Generate n UTF8Strings
+    Seq("ABC", "aBC", "abc", "DEF", "def", "GHI", "ghi", "JKL", "jkl",
+      "MNO", "mno", "PQR", "pqr", "STU", "stu", "VWX", "vwx", "YZ").map(UTF8String.fromString) ++
+    (18 to n).map(i => UTF8String.fromString(i.toOctalString))
+  }
+
+  def benchmarkUTFString(collationTypes: Seq[String], utf8Strings: Seq[UTF8String]): Unit = {
+    val sublistStrings = utf8Strings
+    val benchmark = collationTypes.foldLeft(
+      new Benchmark(s"collation unit benchmarks", utf8Strings.size, output = output)) {
+      (b, collationType) =>
+        val collation = CollationFactory.fetchCollation(collationType)
+        b.addCase(s"equalsFunction - $collationType") { _ =>
+          sublistStrings.foreach(s1 =>
+            utf8Strings.foreach(s =>
+            collation.equalsFunction(s, s1).booleanValue()
+            )
+          )
+        }
+        b.addCase(s"collator.compare - $collationType") { _ =>
+          sublistStrings.foreach(s1 =>
+            utf8Strings.foreach(s =>
+              collation.comparator.compare(s, s1)
+            )
+          )
+        }
+        b.addCase(s"hashFunction - $collationType") { _ =>
+          sublistStrings.foreach(_ =>
+            utf8Strings.foreach(s =>
+              collation.hashFunction.applyAsLong(s)
+            )
+          )
+        }
+        b
+    }
+    benchmark.run()
+  }
+
+  def df1: DataFrame = {
+    val d = spark.createDataFrame(Seq(
+      ("ABC", "ABC"), ("aBC", "abc"), ("abc", "ABC"), ("DEF", "DEF"), ("def", "DEF"),
+      ("GHI", "GHI"), ("ghi", "IG"), ("JKL", "NOP"), ("jkl", "LKJ"), ("mnO", "MNO"),
+      ("hello", "hola"))
+    ).toDF("s1", "s2")
+    d
+  }
+
+  def collationBenchmarkFilterEqual(collationTypes: Seq[String]): Unit = {
+    val benchmark = collationTypes.foldLeft(
+      new Benchmark(s"filter df column with collation", 11, output = output)) {
+      (b, collationType) =>
+        b.addCase(s"filter df column with collation - $collationType") { _ =>
+          val df = df1.selectExpr(
+            s"collate(s2, '$collationType') as k2_$collationType",
+            s"collate(s1, '$collationType') as k1_$collationType")
+
+          df.where(col(s"k1_$collationType") === col(s"k2_$collationType"))
+                .queryExecution.executedPlan.executeCollect()
+
+        }
+        b
+    }
+    benchmark.run()
+  }
+
+
+  override def runBenchmarkSuite(mainArgs: Array[String]): Unit = {
+    collationBenchmarkFilterEqual(collationTypes.reverse)

Review Comment:
   Why call `collationTypes.reverse` for `collationBenchmarkFilterEqual` but not for `benchmarkUTFString`?
   
   We will probably add new benchmarks in the future, so let's establish a clear format here.
   1) All benchmarks should run against a source that is provided in a consistent way. Here `collationBenchmarkFilterEqual` generates its own data, while `benchmarkUTFString` accepts data produced by `generateUTF8Strings`. I don't understand the reason for this.
   
   2) Nit: Naming should be consistent. E.g. they should be `benchmarkFilterEqual` and `benchmarkUTFString`.
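
One possible shape for such a format, sketched in plain Scala under stated assumptions: `ConsistentBenchmarkShape` and `generateStrings` are hypothetical, `String` stands in for `UTF8String`, and each benchmark method returns only its case labels instead of running a real Spark `Benchmark`. The point is purely structural: every benchmark takes the same two arguments in the same order, and the suite entry point builds the input once.

```scala
object ConsistentBenchmarkShape {
  // Single shared input generator (hypothetical stand-in for generateUTF8Strings).
  def generateStrings(n: Int): Seq[String] =
    Seq("ABC", "aBC", "abc", "DEF", "def").take(n) ++
      (5 until n).map(i => Integer.toOctalString(i))

  // Each benchmark accepts (collationTypes, input); none creates its own data.
  // The bodies are placeholders that only produce the case names.
  def benchmarkFilterEqual(collationTypes: Seq[String], input: Seq[String]): Seq[String] =
    collationTypes.map(c => s"filterEqual - $c (${input.size} rows)")

  def benchmarkUTFString(collationTypes: Seq[String], input: Seq[String]): Seq[String] =
    collationTypes.map(c => s"equalsFunction - $c (${input.size} rows)")

  // The suite builds the input once and passes it to every benchmark,
  // using the same collation order throughout.
  def runBenchmarkSuite(): Seq[String] = {
    val collationTypes = Seq("UTF8_BINARY", "UTF8_BINARY_LCASE", "UNICODE", "UNICODE_CI")
    val input = generateStrings(8)
    benchmarkFilterEqual(collationTypes, input) ++ benchmarkUTFString(collationTypes, input)
  }

  def main(args: Array[String]): Unit = runBenchmarkSuite().foreach(println)
}
```

With this layout, adding a benchmark means adding one more method with the same signature and one more call in `runBenchmarkSuite`.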




Re: [PR] [WIP][SPARK 46840] Add sql.execution.benchmark.CollationBenchmark.scala Scaffolding [spark]

Posted by "dbatomic (via GitHub)" <gi...@apache.org>.

dbatomic commented on PR #45453:
URL: https://github.com/apache/spark/pull/45453#issuecomment-1988798774

   @GideonPotok - I think a better approach to benchmarking the collation track is to start with the basics, e.g. unit benchmarks against `CollationFactory` + `UTF8String`. For example, what is the perf difference for a simple filter, without the rest of the Spark stack, between UTF8_BINARY, UTF8_BINARY_LCASE, UNICODE and UNICODE_CI? After the filter we can do the same for `hashFunction`. You should be able to just generate a bunch of `UTF8String`s and run them through the `comparator`/`hashFunction` of the collations in `CollationFactory`.
   
   That way the benchmarking will be actionable. Starting immediately with joins is too high up the stack, and I think we will not be able to do much with the results.
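
A rough sketch of what a unit-level comparison could look like, written in plain Scala so that it runs without Spark. It is not the `CollationFactory` API: as an assumption for illustration, `String` equality stands in for a binary collation and a JDK `java.text.Collator` at `SECONDARY` strength stands in for a case-insensitive one. `CollationUnitSketch`, `countEqualPairs`, and `timeNs` are hypothetical names.

```scala
import java.text.Collator
import java.util.Locale

object CollationUnitSketch {
  // Stand-in for a binary collation: exact string equality.
  val binaryEquals: (String, String) => Boolean = (a, b) => a == b

  // Stand-in for a case-insensitive collation. SECONDARY strength ignores
  // case differences, which the JDK collator treats as tertiary.
  val caseInsensitive: Collator = {
    val c = Collator.getInstance(Locale.ROOT)
    c.setStrength(Collator.SECONDARY)
    c
  }
  val collatorEquals: (String, String) => Boolean =
    (a, b) => caseInsensitive.compare(a, b) == 0

  // Core operation under test: compare every string against every other.
  def countEqualPairs(strings: Seq[String], eq: (String, String) => Boolean): Long = {
    var n = 0L
    strings.foreach(a => strings.foreach(b => if (eq(a, b)) n += 1))
    n
  }

  // Single-shot timing; a real harness would add warm-up and repeated runs.
  def timeNs(body: => Long): (Long, Long) = {
    val start = System.nanoTime()
    val result = body
    (result, System.nanoTime() - start)
  }

  def main(args: Array[String]): Unit = {
    val strings = Vector("ABC", "aBC", "abc", "DEF", "def")
    val (binaryCount, binaryNs) = timeNs(countEqualPairs(strings, binaryEquals))
    val (ciCount, ciNs) = timeNs(countEqualPairs(strings, collatorEquals))
    println(s"binary: $binaryCount equal pairs in $binaryNs ns")
    println(s"case-insensitive: $ciCount equal pairs in $ciNs ns")
  }
}
```

Because nothing but the comparison function changes between the two runs, any timing difference can be attributed to the comparison itself, which is what makes this kind of measurement actionable.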



Re: [PR] [SPARK-46840][SQL][TESTS] Add `CollationBenchmark` [spark]

Posted by "GideonPotok (via GitHub)" <gi...@apache.org>.

GideonPotok commented on code in PR #45453:
URL: https://github.com/apache/spark/pull/45453#discussion_r1537987211


##########
sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/CollationBenchmark.scala:
##########
@@ -0,0 +1,117 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.sql.execution.benchmark
+
+import org.apache.spark.benchmark.Benchmark
+import org.apache.spark.sql.DataFrame
+import org.apache.spark.sql.catalyst.util.CollationFactory
+import org.apache.spark.sql.functions._
+import org.apache.spark.unsafe.types.UTF8String
+
+/**
+ * Benchmark to measure performance for joins. To run this benchmark:

Review Comment:
   ```suggestion
    * Benchmark to measure performance for Collations. To run this benchmark:
   ```




Re: [PR] [SPARK-46840][SQL][TESTS] Add `CollationBenchmark` [spark]

Posted by "dbatomic (via GitHub)" <gi...@apache.org>.

dbatomic commented on code in PR #45453:
URL: https://github.com/apache/spark/pull/45453#discussion_r1537386952


##########
sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/CollationBenchmark.scala:
##########
@@ -0,0 +1,118 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.sql.execution.benchmark
+
+import org.apache.spark.benchmark.Benchmark
+import org.apache.spark.sql.DataFrame
+import org.apache.spark.sql.catalyst.util.CollationFactory
+import org.apache.spark.sql.functions._
+import org.apache.spark.unsafe.types.UTF8String
+
+/**
+ * Benchmark to measure performance for joins. To run this benchmark:
+ * {{{
+ *   1. without sbt:
+ *      bin/spark-submit --class <this class>
+ *        --jars <spark core test jar>,<spark catalyst test jar> <spark sql test jar>
+ *   2. build/sbt "sql/Test/runMain org.apache.spark.sql.execution.benchmark.CollationBenchmark"
+ *   3. generate result:
+ *      SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/Test/runMain <this class>"
+ *      Results will be written to "benchmarks/JoinBenchmark-results.txt".

Review Comment:
   Can you update the destination file in the comment? It still points to "benchmarks/JoinBenchmark-results.txt".




Re: [PR] [SPARK-46840][SQL][TESTS] Add `CollationBenchmark` [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.

cloud-fan commented on code in PR #45453:
URL: https://github.com/apache/spark/pull/45453#discussion_r1537858501


##########
sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/CollationBenchmark.scala:
##########
@@ -0,0 +1,117 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.sql.execution.benchmark
+
+import org.apache.spark.benchmark.Benchmark
+import org.apache.spark.sql.DataFrame
+import org.apache.spark.sql.catalyst.util.CollationFactory
+import org.apache.spark.sql.functions._
+import org.apache.spark.unsafe.types.UTF8String
+
+/**
+ * Benchmark to measure performance for joins. To run this benchmark:
+ * {{{
+ *   1. without sbt:
+ *      bin/spark-submit --class <this class>
+ *        --jars <spark core test jar>,<spark catalyst test jar> <spark sql test jar>
+ *   2. build/sbt "sql/Test/runMain org.apache.spark.sql.execution.benchmark.CollationBenchmark"
+ *   3. generate result:
+ *      SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/Test/runMain <this class>"
+ *      Results will be written to "benchmarks/CollationBenchmark-results.txt".
+ * }}}
+ */
+
+object CollationBenchmark extends SqlBasedBenchmark {
+  private val collationTypes = Seq("UTF8_BINARY_LCASE", "UNICODE", "UTF8_BINARY", "UNICODE_CI")
+
+  def generateSeqInput(n: Long): Seq[UTF8String] = {
+    val input = Seq("ABC", "ABC", "aBC", "aBC", "abc", "abc", "DEF", "DEF", "def",
+      "def", "GHI", "ghi",
+      "JKL", "jkl", "MNO", "mno", "PQR", "pqr", "STU", "stu", "VWX", "vwx", "YZ",
+      "ABC", "ABC", "aBC", "aBC", "abc", "abc", "DEF", "DEF", "def", "def", "GHI", "ghi",
+      "JKL", "jkl", "MNO", "mno", "PQR", "pqr", "STU", "stu", "VWX", "vwx", "YZ")
+      .map(UTF8String.fromString)
+    val inputLong: Seq[UTF8String] = (0L until n).map(i => input(i.toInt % input.size))
+    inputLong
+  }
+
+  private def getDataFrame(strings: Seq[String]): DataFrame = {
+    val asPairs = strings.sliding(2, 1).toSeq.map {
+      case Seq(s1, s2) => (s1, s2)
+    }
+    val d = spark.createDataFrame(asPairs).toDF("s1", "s2")
+    d
+  }
+
+  private def generateDataframeInput(l: Long): DataFrame = {
+    getDataFrame(generateSeqInput(l).map(_.toString))
+  }
+
+  def benchmarkUTFString(collationTypes: Seq[String], utf8Strings: Seq[UTF8String]): Unit = {
+    val sublistStrings = utf8Strings
+    val benchmark = collationTypes.foldLeft(
+      new Benchmark(s"collation unit benchmarks", utf8Strings.size, output = output)) {
+      (b, collationType) =>
+        val collation = CollationFactory.fetchCollation(collationType)
+        b.addCase(s"equalsFunction - $collationType") { _ =>
+          sublistStrings.foreach(s1 =>
+            utf8Strings.foreach(s =>
+              collation.equalsFunction(s, s1).booleanValue()
+            )
+          )
+        }
+        b.addCase(s"collator.compare - $collationType") { _ =>
+          sublistStrings.foreach(s1 =>
+            utf8Strings.foreach(s =>
+              collation.comparator.compare(s, s1)
+            )
+          )
+        }
+        b.addCase(s"hashFunction - $collationType") { _ =>
+          sublistStrings.foreach(_ =>
+            utf8Strings.foreach(s =>
+              collation.hashFunction.applyAsLong(s)
+            )
+          )
+        }
+        b
+    }
+    benchmark.run()
+  }
+
+  def benchmarkFilterEqual(collationTypes: Seq[String],
+                           dfUncollated: DataFrame): Unit = {
+    val benchmark = collationTypes.foldLeft(
+      new Benchmark(s"filter df column with collation", dfUncollated.count(), output = output)) {
+      (b, collationType) =>
+        val dfCollated = dfUncollated.selectExpr(
+          s"collate(s2, '$collationType') as k2_$collationType",
+          s"collate(s1, '$collationType') as k1_$collationType")
+        b.addCase(s"filter df column with collation - $collationType") { _ =>
+          dfCollated.where(col(s"k1_$collationType") === col(s"k2_$collationType"))
+            .queryExecution.executedPlan.executeCollect()
+        }
+        b
+    }
+    benchmark.run()
+  }
+
+  override def runBenchmarkSuite(mainArgs: Array[String]): Unit = {
+    benchmarkFilterEqual(collationTypes, generateDataframeInput(10000L))
+    benchmarkUTFString(collationTypes, generateSeqInput(10000L))

Review Comment:
   Can you find another example where we run two benchmarks in one benchmark file?




Re: [PR] [SPARK-46840][SQL][TESTS] Add `CollationBenchmark` [spark]

Posted by "GideonPotok (via GitHub)" <gi...@apache.org>.

GideonPotok commented on code in PR #45453:
URL: https://github.com/apache/spark/pull/45453#discussion_r1541127406


##########
sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/CollationBenchmark.scala:
##########
@@ -0,0 +1,165 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.sql.execution.benchmark
+
+import scala.concurrent.duration._
+
+import org.apache.spark.benchmark.Benchmark
+import org.apache.spark.sql.DataFrame
+import org.apache.spark.sql.catalyst.util.CollationFactory
+import org.apache.spark.sql.functions._
+import org.apache.spark.unsafe.types.UTF8String
+
+/**
+ * Benchmark to measure performance for comparisons between collated strings. To run this benchmark:
+ * {{{
+ *   1. without sbt:
+ *      bin/spark-submit --class <this class>
+ *        --jars <spark core test jar>,<spark catalyst test jar> <spark sql test jar>
+ *   2. build/sbt "sql/Test/runMain org.apache.spark.sql.execution.benchmark.CollationBenchmark"
+ *   3. generate result:
+ *      SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/Test/runMain <this class>"
+ *      Results will be written to "benchmarks/CollationBenchmark-results.txt".
+ * }}}
+ */
+
+object CollationBenchmark extends SqlBasedBenchmark {
+  private val collationTypes = Seq("UTF8_BINARY_LCASE", "UNICODE", "UTF8_BINARY", "UNICODE_CI")
+
+  def generateSeqInput(n: Long): Seq[UTF8String] = {
+    val input = Seq("ABC", "ABC", "aBC", "aBC", "abc", "abc", "DEF", "DEF", "def", "def",
+      "GHI", "ghi", "JKL", "jkl", "MNO", "mno", "PQR", "pqr", "STU", "stu", "VWX", "vwx",
+      "ABC", "ABC", "aBC", "aBC", "abc", "abc", "DEF", "DEF", "def", "def", "GHI", "ghi",
+      "JKL", "jkl", "MNO", "mno", "PQR", "pqr", "STU", "stu", "VWX", "vwx", "YZ")
+      .map(UTF8String.fromString)
+    val inputLong: Seq[UTF8String] = (0L until n).map(i => input(i.toInt % input.size))
+    inputLong
+  }
+
+  private def getDataFrame(strings: Seq[String]): DataFrame = {
+    val asPairs = strings.sliding(2, 1).toSeq.map {
+      case Seq(s1, s2) => (s1, s2)
+    }
+    val d = spark.createDataFrame(asPairs).toDF("s1", "s2")
+    d
+  }
+
+  private def generateDataframeInput(l: Long): DataFrame = {
+    getDataFrame(generateSeqInput(l).map(_.toString))
+  }
+
+  def benchmarkUTFStringEquals(collationTypes: Seq[String], utf8Strings: Seq[UTF8String]): Unit = {
+    val sublistStrings = utf8Strings
+
+    val benchmark = new Benchmark(
+      "collation unit benchmarks - equalsFunction",
+      utf8Strings.size * 10,
+      warmupTime = 4.seconds,
+      output = output)
+    collationTypes.foreach(collationType => {
+      val collation = CollationFactory.fetchCollation(collationType)
+      benchmark.addCase(s"$collationType") { _ =>
+        sublistStrings.foreach(s1 =>
+          utf8Strings.foreach(s =>
+            (0 to 10).foreach(_ =>
+              collation.equalsFunction(s, s1).booleanValue())
+          )
+        )
+      }
+    }
+    )
+    benchmark.run()
+  }
+  def benchmarkUTFStringCompare(collationTypes: Seq[String], utf8Strings: Seq[UTF8String]): Unit = {
+    val sublistStrings = utf8Strings
+
+    val benchmark = new Benchmark(
+      "collation unit benchmarks - compareFunction",
+      utf8Strings.size * 10,
+      warmupTime = 4.seconds,
+      output = output)
+    collationTypes.foreach(collationType => {
+      val collation = CollationFactory.fetchCollation(collationType)
+      benchmark.addCase(s"$collationType") { _ =>
+        sublistStrings.foreach(s1 =>
+          utf8Strings.foreach(s =>
+            (0 to 10).foreach(_ =>
+              collation.comparator.compare(s, s1)
+            )
+          )
+        )
+      }
+    }
+    )
+    benchmark.run()
+  }
+
+  def benchmarkUTFStringHashFunction(
+      collationTypes: Seq[String],
+      utf8Strings: Seq[UTF8String]): Unit = {
+    val sublistStrings = utf8Strings
+
+    val benchmark = new Benchmark(
+      "collation unit benchmarks - hashFunction",
+      utf8Strings.size * 10,
+      warmupTime = 4.seconds,
+      output = output)
+    collationTypes.foreach(collationType => {
+      val collation = CollationFactory.fetchCollation(collationType)
+      benchmark.addCase(s"$collationType") { _ =>
+        sublistStrings.foreach(_ =>
+          utf8Strings.foreach(s =>
+            (0 to 10).foreach(_ =>
+              collation.hashFunction.applyAsLong(s)
+            )
+          )
+        )
+      }
+    }
+    )
+    benchmark.run()
+  }
+
+  def benchmarkFilterEqual(
+      collationTypes: Seq[String],
+      dfUncollated: DataFrame): Unit = {
+    val benchmark =
+      new Benchmark(
+        "filter df column with collation",
+        dfUncollated.count(),
+        warmupTime = 4.seconds,
+        output = output)
+    collationTypes.foreach(collationType => {
+      val dfCollated = dfUncollated.selectExpr(
+        s"collate(s2, '$collationType') as k2_$collationType",
+        s"collate(s1, '$collationType') as k1_$collationType")
+      benchmark.addCase(s"filter df column with collation - $collationType") { _ =>
+        dfCollated.where(col(s"k1_$collationType") === col(s"k2_$collationType"))

Review Comment:
   Are you asking why we need this when `benchmarkUTFStringEquals` already does that?




Re: [PR] [SPARK-46840][SQL][TESTS] Add `CollationBenchmark` [spark]

Posted by "GideonPotok (via GitHub)" <gi...@apache.org>.

GideonPotok commented on PR #45453:
URL: https://github.com/apache/spark/pull/45453#issuecomment-2027383540

   @MaxGekk or @cloud-fan, can you merge this please?



Re: [PR] [SPARK-46840][SQL][COLLATION] New Benchmarking Suite [spark]

Posted by "dbatomic (via GitHub)" <gi...@apache.org>.

dbatomic commented on code in PR #45453:
URL: https://github.com/apache/spark/pull/45453#discussion_r1531892031


##########
sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/CollationBenchmark.scala:
##########
@@ -0,0 +1,119 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.sql.execution.benchmark
+
+import scala.util.Random
+
+import org.apache.spark.benchmark.Benchmark
+import org.apache.spark.sql.{functions, Column}
+import org.apache.spark.sql.catalyst.util.CollationFactory
+import org.apache.spark.sql.functions._
+import org.apache.spark.unsafe.types.UTF8String
+
+/**
+ * Benchmark to measure performance for joins. To run this benchmark:
+ * {{{
+ *   1. without sbt:
+ *      bin/spark-submit --class <this class>
+ *        --jars <spark core test jar>,<spark catalyst test jar> <spark sql test jar>
+ *   2. build/sbt "sql/Test/runMain org.apache.spark.sql.execution.benchmark.CollationBenchmark"
+ *   3. generate result:
+ *      SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/Test/runMain <this class>"
+ *      Results will be written to "benchmarks/JoinBenchmark-results.txt".
+ * }}}
+ */
+
+object CollationBenchmark extends SqlBasedBenchmark {
+  private val collationTypes = Seq("UTF8_BINARY", "UTF8_BINARY_LCASE", "UNICODE", "UNICODE_CI")
+
+  def generateUTF8Strings(n: Int): Seq[UTF8String] = {
+    // Generate n UTF8Strings
+    Seq("ABC", "aBC", "abc", "DEF", "def", "GHI", "ghi", "JKL", "jkl",
+      "MNO", "mno", "PQR", "pqr", "STU", "stu", "VWX", "vwx", "YZ").map(UTF8String.fromString) ++
+    (18 to n).map(i => UTF8String.fromString(Random.nextString(i % 25))).sortBy(_.hashCode())
+  }
+
+  def benchmarkUTFString(collationTypes: Seq[String], utf8Strings: Seq[UTF8String]): Unit = {
+    val benchmark = collationTypes.foldLeft(
+      new Benchmark(s"collation unit benchmarks", utf8Strings.size, output = output)) {
+      (b, collationType) =>
+        b.addCase(s"equalsFunction - $collationType") { _ =>
+          val collation = CollationFactory.fetchCollation(collationType)
+          utf8Strings.slice(0, 20).foreach(s1 =>
+            utf8Strings.filter(s =>
+            collation.equalsFunction(s, s1).booleanValue()
+            )
+          )
+        }
+        b.addCase(s"collator.compare - $collationType") { _ =>
+          val collation = CollationFactory.fetchCollation(collationType)
+          utf8Strings.slice(0, 20).foreach(s1 =>
+            utf8Strings.sortBy(s =>
+              collation.comparator.compare(s, s1)
+            )
+          )
+        }
+        b.addCase(s"hashFunction - $collationType") { _ =>
+          val collation = CollationFactory.fetchCollation(collationType)
+          utf8Strings.map(s => collation.hashFunction.applyAsLong(s))
+        }
+        b
+    }
+    benchmark.run()
+  }
+
+  def collationBenchmarkFilterEqual(

Review Comment:
   IMO, just having `benchmarkUTFString` would be a great starting point. We can work on E2E benchmarks later.




Re: [PR] [WIP][SPARK 46840] Add sql.execution.benchmark.CollationBenchmark.scala Scaffolding [spark]

Posted by "GideonPotok (via GitHub)" <gi...@apache.org>.

GideonPotok commented on PR #45453:
URL: https://github.com/apache/spark/pull/45453#issuecomment-2002555272

   @dbatomic @stefanbuk-db This PR is ready for your initial review.
   
   The benchmark is queued to run in GHA; I will upload the results to this branch once it finishes.
   
   Here are some local results:
   ```
   [info] 13:51:04.324 WARN org.apache.hadoop.util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
   [info] Running benchmark: filter df column with collation
   [info]   Running case: filter df column with collation - UNICODE_CI
   [info]   Stopped after 7 iterations, 2237 ms
   [info]   Running case: filter df column with collation - UNICODE
   [info]   Stopped after 26 iterations, 2040 ms
   [info]   Running case: filter df column with collation - UTF8_BINARY_LCASE
   [info]   Stopped after 9 iterations, 2148 ms
   [info]   Running case: filter df column with collation - UTF8_BINARY
   [info]   Stopped after 30 iterations, 2017 ms
   [info] OpenJDK 64-Bit Server VM 21.0.2+13-LTS on Mac OS X 14.4
   [info] Apple M3 Max
   [info] filter df column with collation:                     Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
   [info] -----------------------------------------------------------------------------------------------------------------------------------
   [info] filter df column with collation - UNICODE_CI                   303            320          14          0.0   303345750.0       1.0X
   [info] filter df column with collation - UNICODE                       67             78           7          0.0    67441125.0       4.5X
   [info] filter df column with collation - UTF8_BINARY_LCASE            196            239          36          0.0   196200250.0       1.5X
   [info] filter df column with collation - UTF8_BINARY                   61             67           4          0.0    61342750.0       4.9X
   [info] Running benchmark: filter collation types
   [info]   Running case: filter - UTF8_BINARY
   [info]   Stopped after 349209 iterations, 2000 ms
   [info]   Running case: hashFunction - UTF8_BINARY
   [info]   Stopped after 262778 iterations, 2000 ms
   [info]   Running case: filter - UTF8_BINARY_LCASE
   [info]   Stopped after 36348 iterations, 2000 ms
   [info]   Running case: hashFunction - UTF8_BINARY_LCASE
   [info]   Stopped after 67744 iterations, 2000 ms
   [info]   Running case: filter - UNICODE
   [info]   Stopped after 276488 iterations, 2000 ms
   [info]   Running case: hashFunction - UNICODE
   [info]   Stopped after 13285 iterations, 2000 ms
   [info]   Running case: filter - UNICODE_CI
   [info]   Stopped after 40592 iterations, 2000 ms
   [info]   Running case: hashFunction - UNICODE_CI
   [info]   Stopped after 13140 iterations, 2000 ms
   [info] OpenJDK 64-Bit Server VM 21.0.2+13-LTS on Mac OS X 14.4
   [info] Apple M3 Max
   [info] filter collation types:                   Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
   [info] ------------------------------------------------------------------------------------------------------------------------
   [info] filter - UTF8_BINARY                                  0              0           0        255.4           3.9       1.0X
   [info] hashFunction - UTF8_BINARY                            0              0           0        160.0           6.3       0.6X
   [info] filter - UTF8_BINARY_LCASE                            0              0           0         53.1          18.8       0.2X
   [info] hashFunction - UTF8_BINARY_LCASE                      0              0           0         52.2          19.2       0.2X
   [info] filter - UNICODE                                      0              0           0        184.6           5.4       0.7X
   [info] hashFunction - UNICODE                                0              0           0         12.9          77.3       0.1X
   [info] filter - UNICODE_CI                                   0              0           0         44.4          22.5       0.2X
   [info] hashFunction - UNICODE_CI                             0              0           0         13.8          72.2       0.1X
   ```
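   
   For reading the tables: a small sketch, assuming the `Relative` column is the first case's best time divided by each case's best time, so values above 1.0X are faster than the first case. `RelativeColumn` and `relative` are hypothetical names used only for illustration. In the first table `Per Row(ns)` equals `Best Time(ms)` expressed in nanoseconds, so it is used here as the unrounded best time.
   
   ```scala
   import java.util.Locale
   
   // Hypothetical helper for reading the tables; not part of Spark.
   object RelativeColumn {
     // Best times in nanoseconds for the first table, copied from Per Row(ns).
     val bestNs: Seq[(String, Double)] = Seq(
       "UNICODE_CI" -> 303345750.0,
       "UNICODE" -> 67441125.0,
       "UTF8_BINARY_LCASE" -> 196200250.0,
       "UTF8_BINARY" -> 61342750.0)
   
     // Speed of a case relative to the baseline (the first case listed).
     def relative(baselineNs: Double, caseNs: Double): String =
       String.format(Locale.ROOT, "%.1fX", Double.box(baselineNs / caseNs))
   
     def main(args: Array[String]): Unit = {
       val baseline = bestNs.head._2
       bestNs.foreach { case (name, ns) => println(s"$name ${relative(baseline, ns)}") }
     }
   }
   ```
   
   Under that assumption the ratios come out as 1.0X, 4.5X, 1.5X and 4.9X, matching the first table. Because UNICODE_CI happens to be listed first, every other collation is reported relative to the slowest case.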
   
   
   Let me know the next steps. Thanks.



Re: [PR] [SPARK-46840][SQL][COLLATION] New Benchmarking Suite [spark]

Posted by "dbatomic (via GitHub)" <gi...@apache.org>.

dbatomic commented on code in PR #45453:
URL: https://github.com/apache/spark/pull/45453#discussion_r1534716771


##########
sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/CollationBenchmark.scala:
##########
@@ -0,0 +1,121 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.sql.execution.benchmark
+
+import scala.util.Random
+
+import org.apache.spark.benchmark.Benchmark
+import org.apache.spark.sql.{DataFrame}
+import org.apache.spark.sql.catalyst.util.CollationFactory
+import org.apache.spark.sql.functions._
+import org.apache.spark.unsafe.types.UTF8String
+
+/**
+ * Benchmark to measure performance for joins. To run this benchmark:
+ * {{{
+ *   1. without sbt:
+ *      bin/spark-submit --class <this class>
+ *        --jars <spark core test jar>,<spark catalyst test jar> <spark sql test jar>
+ *   2. build/sbt "sql/Test/runMain org.apache.spark.sql.execution.benchmark.CollationBenchmark"
+ *   3. generate result:
+ *      SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/Test/runMain <this class>"
+ *      Results will be written to "benchmarks/JoinBenchmark-results.txt".
+ * }}}
+ */
+
+object CollationBenchmark extends SqlBasedBenchmark {
+  private val collationTypes = Seq("UTF8_BINARY", "UTF8_BINARY_LCASE", "UNICODE", "UNICODE_CI")
+
+  def generateUTF8Strings(n: Int): Seq[UTF8String] = {
+    // Generate n UTF8Strings
+    Seq("ABC", "aBC", "abc", "DEF", "def", "GHI", "ghi", "JKL", "jkl",
+      "MNO", "mno", "PQR", "pqr", "STU", "stu", "VWX", "vwx", "YZ").map(UTF8String.fromString) ++
+    (18 to n).map(i => UTF8String.fromString(Random.nextString(i % 25))).sortBy(_.hashCode())
+  }
+
+  def benchmarkUTFString(collationTypes: Seq[String], utf8Strings: Seq[UTF8String]): Unit = {
+    val sublistStrings = utf8Strings.slice(0, 200)
+    val benchmark = collationTypes.foldLeft(
+      new Benchmark(s"collation unit benchmarks", utf8Strings.size, output = output)) {
+      (b, collationType) =>
+        val collation = CollationFactory.fetchCollation(collationType)
+        b.addCase(s"equalsFunction - $collationType") { _ =>
+          sublistStrings.foreach(s1 =>
+            utf8Strings.foreach(s =>
+            collation.equalsFunction(s, s1).booleanValue()
+            )
+          )
+        }
+        b.addCase(s"collator.compare - $collationType") { _ =>
+          sublistStrings.foreach(s1 =>
+            utf8Strings.foreach(s =>
+              collation.comparator.compare(s, s1)
+            )
+          )
+        }
+        b.addCase(s"hashFunction - $collationType") { _ =>
+          sublistStrings.foreach(s1 =>
+            utf8Strings.foreach(s =>
+              collation.hashFunction.applyAsLong(s)
+            )
+          )
+        }
+        b
+    }
+    benchmark.run()
+  }
+
+  def df1: DataFrame = {
+    val d = spark.createDataFrame(Seq(
+      ("ABC", "ABC"), ("aBC", "abc"), ("abc", "ABC"), ("DEF", "DEF"), ("def", "DEF"),
+      ("GHI", "GHI"), ("ghi", "IG"), ("JKL", "NOP"), ("jkl", "LKJ"), ("mnO", "MNO"),
+      ("hello", "hola"))
+    ).toDF("s1", "s2")
+    d
+  }
+
+  def collationBenchmarkFilterEqual(
+      collationTypes: Seq[String],
+      utf8Strings: Seq[UTF8String]): Unit = {
+    val N = 2 << 20
+
+    val benchmark = collationTypes.foldLeft(
+      new Benchmark(s"filter df column with collation", utf8Strings.size, output = output)) {
+      (b, collationType) =>
+        b.addCase(s"filter df column with collation - $collationType") { _ =>
+          val df = df1.selectExpr(
+            s"collate(s2, '$collationType') as k2_$collationType",
+            s"collate(s1, '$collationType') as k1_$collationType")
+
+          (0 to 10).foreach(_ =>
+          df.where(col(s"k1_$collationType") === col(s"k2_$collationType"))
+                .queryExecution.executedPlan.executeCollect()
+          )
+        }
+        b
+    }
+    benchmark.run()
+  }
+
+  // How to benchmark "without the rest of the spark stack"?

Review Comment:
   I guess you can remove this comment? `benchmarkUTFString` is already doing the right thing.




Re: [PR] [SPARK-46840][SQL][COLLATION] New Benchmarking Suite [spark]

Posted by "dbatomic (via GitHub)" <gi...@apache.org>.

dbatomic commented on code in PR #45453:
URL: https://github.com/apache/spark/pull/45453#discussion_r1534716179


##########
sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/CollationBenchmark.scala:
##########
@@ -0,0 +1,121 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.sql.execution.benchmark
+
+import scala.util.Random
+
+import org.apache.spark.benchmark.Benchmark
+import org.apache.spark.sql.{DataFrame}
+import org.apache.spark.sql.catalyst.util.CollationFactory
+import org.apache.spark.sql.functions._
+import org.apache.spark.unsafe.types.UTF8String
+
+/**
+ * Benchmark to measure performance for joins. To run this benchmark:
+ * {{{
+ *   1. without sbt:
+ *      bin/spark-submit --class <this class>
+ *        --jars <spark core test jar>,<spark catalyst test jar> <spark sql test jar>
+ *   2. build/sbt "sql/Test/runMain org.apache.spark.sql.execution.benchmark.CollationBenchmark"
+ *   3. generate result:
+ *      SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/Test/runMain <this class>"
+ *      Results will be written to "benchmarks/JoinBenchmark-results.txt".
+ * }}}
+ */
+
+object CollationBenchmark extends SqlBasedBenchmark {
+  private val collationTypes = Seq("UTF8_BINARY", "UTF8_BINARY_LCASE", "UNICODE", "UNICODE_CI")
+
+  def generateUTF8Strings(n: Int): Seq[UTF8String] = {
+    // Generate n UTF8Strings
+    Seq("ABC", "aBC", "abc", "DEF", "def", "GHI", "ghi", "JKL", "jkl",

Review Comment:
   An interesting case would be longer strings, plus strings where the difference comes at the very end.
   
   But let's keep that as a follow-up. We have enough work to do to get decent results on this benchmark :)
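
A minimal sketch of the kind of input suggested above. `LongSuffixInput` and `generate` are hypothetical names, not part of the PR. The strings share a long common prefix and differ only in their last character, so a byte-wise comparison has to walk the whole prefix before it can decide.

```scala
// Hypothetical helper, not part of the PR: builds strings that share a long
// common prefix and differ only in their final character.
object LongSuffixInput {
  def generate(count: Int, prefixLength: Int): Seq[String] = {
    require(count >= 0 && prefixLength >= 0, "count and prefixLength must be non-negative")
    val prefix = "a" * prefixLength
    // Cycle through 'A'..'Z' and then 'a'..'z' so that case-insensitive
    // collations see both equal and unequal pairs.
    val suffixes = ('A' to 'Z') ++ ('a' to 'z')
    (0 until count).map(i => prefix + suffixes(i % suffixes.size))
  }
}
```

In the benchmark these could be wrapped with `.map(UTF8String.fromString)`, as the existing generator does, before being handed to `benchmarkUTFString`.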




Re: [PR] [SPARK-46840][SQL][TESTS] Add `CollationBenchmark` [spark]

Posted by "MaxGekk (via GitHub)" <gi...@apache.org>.

MaxGekk commented on code in PR #45453:
URL: https://github.com/apache/spark/pull/45453#discussion_r1538045707


##########
sql/core/benchmarks/CollationBenchmark-results.txt:
##########
@@ -0,0 +1,26 @@
+OpenJDK 64-Bit Server VM 17.0.10+7-LTS on Linux 6.5.0-1016-azure
+AMD EPYC 7763 64-Core Processor
+filter df column with collation:                     Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
+-----------------------------------------------------------------------------------------------------------------------------------
+filter df column with collation - UNICODE_CI                   403            463          39          0.0    20147470.0       1.0X
+filter df column with collation - UNICODE                      187            223          37          0.0     9339586.0       2.2X
+filter df column with collation - UTF8_BINARY_LCASE            426            434           7          0.0    21300903.4       0.9X
+filter df column with collation - UTF8_BINARY                  188            199           5          0.0     9403169.1       2.1X
+
+OpenJDK 64-Bit Server VM 17.0.10+7-LTS on Linux 6.5.0-1016-azure
+AMD EPYC 7763 64-Core Processor
+collation unit benchmarks:                Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
+------------------------------------------------------------------------------------------------------------------------
+equalsFunction - UTF8_BINARY                          0              0           0         10.4          96.6       1.0X

Review Comment:
   Could you increase the number of iterations so that the execution time is at least > 0?
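
A rough way to size the change, taking the 96.6 ns `Per Row(ns)` figure in the row above at face value: the sketch below computes how many per-row operations a case needs before its total time reaches a given number of milliseconds. `IterationEstimate` and `rowsNeeded` are hypothetical names used only for this arithmetic; they are not part of the PR or of Spark's `Benchmark` class.

```scala
// Hypothetical helper for back-of-the-envelope sizing, not part of the PR.
object IterationEstimate {
  // Smallest number of per-row operations whose total time reaches targetMs,
  // given a measured cost per row in nanoseconds (1 ms = 1e6 ns).
  def rowsNeeded(perRowNs: Double, targetMs: Double): Long =
    math.ceil(targetMs * 1e6 / perRowNs).toLong
}
```

With these numbers, `IterationEstimate.rowsNeeded(96.6, 1.0)` gives 10352, so roughly ten thousand `equalsFunction` calls per case would be needed for the `Best Time(ms)` column to show at least 1.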



##########
sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/CollationBenchmark.scala:
##########
@@ -0,0 +1,117 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.sql.execution.benchmark
+
+import org.apache.spark.benchmark.Benchmark
+import org.apache.spark.sql.DataFrame
+import org.apache.spark.sql.catalyst.util.CollationFactory
+import org.apache.spark.sql.functions._
+import org.apache.spark.unsafe.types.UTF8String
+
+/**
+ * Benchmark to measure performance for comparisons between collated strings. To run this benchmark:
+ * {{{
+ *   1. without sbt:
+ *      bin/spark-submit --class <this class>
+ *        --jars <spark core test jar>,<spark catalyst test jar> <spark sql test jar>
+ *   2. build/sbt "sql/Test/runMain org.apache.spark.sql.execution.benchmark.CollationBenchmark"
+ *   3. generate result:
+ *      SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/Test/runMain <this class>"
+ *      Results will be written to "benchmarks/CollationBenchmark-results.txt".
+ * }}}
+ */
+
+object CollationBenchmark extends SqlBasedBenchmark {
+  private val collationTypes = Seq("UTF8_BINARY_LCASE", "UNICODE", "UTF8_BINARY", "UNICODE_CI")
+
+  def generateSeqInput(n: Long): Seq[UTF8String] = {
+    val input = Seq("ABC", "ABC", "aBC", "aBC", "abc", "abc", "DEF", "DEF", "def",
+      "def", "GHI", "ghi",
+      "JKL", "jkl", "MNO", "mno", "PQR", "pqr", "STU", "stu", "VWX", "vwx", "YZ",
+      "ABC", "ABC", "aBC", "aBC", "abc", "abc", "DEF", "DEF", "def", "def", "GHI", "ghi",
+      "JKL", "jkl", "MNO", "mno", "PQR", "pqr", "STU", "stu", "VWX", "vwx", "YZ")
+      .map(UTF8String.fromString)
+    val inputLong: Seq[UTF8String] = (0L until n).map(i => input(i.toInt % input.size))
+    inputLong
+  }
+
+  private def getDataFrame(strings: Seq[String]): DataFrame = {
+    val asPairs = strings.sliding(2, 1).toSeq.map {
+      case Seq(s1, s2) => (s1, s2)
+    }
+    val d = spark.createDataFrame(asPairs).toDF("s1", "s2")
+    d
+  }
+
+  private def generateDataframeInput(l: Long): DataFrame = {
+    getDataFrame(generateSeqInput(l).map(_.toString))
+  }
+
+  def benchmarkUTFString(collationTypes: Seq[String], utf8Strings: Seq[UTF8String]): Unit = {
+    val sublistStrings = utf8Strings
+
+    val benchmark = new Benchmark("collation unit benchmarks", utf8Strings.size, output = output)
+    collationTypes.foreach(collationType => {
+      val collation = CollationFactory.fetchCollation(collationType)
+      benchmark.addCase(s"equalsFunction - $collationType") { _ =>
+        sublistStrings.foreach(s1 =>
+          utf8Strings.foreach(s =>
+            collation.equalsFunction(s, s1).booleanValue()
+          )
+        )
+      }
+      benchmark.addCase(s"collator.compare - $collationType") { _ =>
+        sublistStrings.foreach(s1 =>
+          utf8Strings.foreach(s =>
+            collation.comparator.compare(s, s1)
+          )
+        )
+      }
+      benchmark.addCase(s"hashFunction - $collationType") { _ =>
+        sublistStrings.foreach(_ =>
+          utf8Strings.foreach(s =>
+            collation.hashFunction.applyAsLong(s)
+          )
+        )
+      }
+    }
+    )
+    benchmark.run()
+  }
+
+  def benchmarkFilterEqual(collationTypes: Seq[String],
+                           dfUncollated: DataFrame): Unit = {
+    val benchmark =
+      new Benchmark("filter df column with collation", dfUncollated.count(), output = output)
+    collationTypes.foreach(collationType => {
+      val dfCollated = dfUncollated.selectExpr(
+        s"collate(s2, '$collationType') as k2_$collationType",
+        s"collate(s1, '$collationType') as k1_$collationType")
+      benchmark.addCase(s"filter df column with collation - $collationType") { _ =>
+        dfCollated.where(col(s"k1_$collationType") === col(s"k2_$collationType"))
+          .queryExecution.executedPlan.executeCollect()

Review Comment:
   Could you elaborate a little bit on why you use the `executeCollect()` action instead of writing to `noop`?
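
For reference, a minimal sketch of the `noop` alternative mentioned here, assuming Spark 3.0 or later, where the built-in `noop` data source is available. `NoopSink` is a hypothetical wrapper, not code from the PR. Writing to `noop` executes the whole query and discards the rows, whereas `executeCollect()` also brings the result rows back to the driver.

```scala
import org.apache.spark.sql.DataFrame

// Hypothetical wrapper, not part of the PR.
object NoopSink {
  // Runs the full plan for `df` and throws away every output row.
  def run(df: DataFrame): Unit =
    df.write.format("noop").mode("overwrite").save()
}
```

Inside the benchmark case, the filtered DataFrame from `dfCollated.where(...)` would be passed to `NoopSink.run` in place of the `.queryExecution.executedPlan.executeCollect()` call.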



##########
sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/CollationBenchmark.scala:
##########
@@ -0,0 +1,117 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.sql.execution.benchmark
+
+import org.apache.spark.benchmark.Benchmark
+import org.apache.spark.sql.DataFrame
+import org.apache.spark.sql.catalyst.util.CollationFactory
+import org.apache.spark.sql.functions._
+import org.apache.spark.unsafe.types.UTF8String
+
+/**
+ * Benchmark to measure performance for comparisons between collated strings. To run this benchmark:
+ * {{{
+ *   1. without sbt:
+ *      bin/spark-submit --class <this class>
+ *        --jars <spark core test jar>,<spark catalyst test jar> <spark sql test jar>
+ *   2. build/sbt "sql/Test/runMain org.apache.spark.sql.execution.benchmark.CollationBenchmark"
+ *   3. generate result:
+ *      SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/Test/runMain <this class>"
+ *      Results will be written to "benchmarks/CollationBenchmark-results.txt".
+ * }}}
+ */
+
+object CollationBenchmark extends SqlBasedBenchmark {
+  private val collationTypes = Seq("UTF8_BINARY_LCASE", "UNICODE", "UTF8_BINARY", "UNICODE_CI")
+
+  def generateSeqInput(n: Long): Seq[UTF8String] = {
+    val input = Seq("ABC", "ABC", "aBC", "aBC", "abc", "abc", "DEF", "DEF", "def",
+      "def", "GHI", "ghi",
+      "JKL", "jkl", "MNO", "mno", "PQR", "pqr", "STU", "stu", "VWX", "vwx", "YZ",
+      "ABC", "ABC", "aBC", "aBC", "abc", "abc", "DEF", "DEF", "def", "def", "GHI", "ghi",
+      "JKL", "jkl", "MNO", "mno", "PQR", "pqr", "STU", "stu", "VWX", "vwx", "YZ")
+      .map(UTF8String.fromString)
+    val inputLong: Seq[UTF8String] = (0L until n).map(i => input(i.toInt % input.size))
+    inputLong
+  }
+
+  private def getDataFrame(strings: Seq[String]): DataFrame = {
+    val asPairs = strings.sliding(2, 1).toSeq.map {
+      case Seq(s1, s2) => (s1, s2)
+    }
+    val d = spark.createDataFrame(asPairs).toDF("s1", "s2")
+    d
+  }
+
+  private def generateDataframeInput(l: Long): DataFrame = {
+    getDataFrame(generateSeqInput(l).map(_.toString))
+  }
+
+  def benchmarkUTFString(collationTypes: Seq[String], utf8Strings: Seq[UTF8String]): Unit = {
+    val sublistStrings = utf8Strings
+
+    val benchmark = new Benchmark("collation unit benchmarks", utf8Strings.size, output = output)
+    collationTypes.foreach(collationType => {
+      val collation = CollationFactory.fetchCollation(collationType)
+      benchmark.addCase(s"equalsFunction - $collationType") { _ =>
+        sublistStrings.foreach(s1 =>
+          utf8Strings.foreach(s =>
+            collation.equalsFunction(s, s1).booleanValue()
+          )
+        )
+      }
+      benchmark.addCase(s"collator.compare - $collationType") { _ =>
+        sublistStrings.foreach(s1 =>
+          utf8Strings.foreach(s =>
+            collation.comparator.compare(s, s1)
+          )
+        )
+      }
+      benchmark.addCase(s"hashFunction - $collationType") { _ =>
+        sublistStrings.foreach(_ =>
+          utf8Strings.foreach(s =>
+            collation.hashFunction.applyAsLong(s)
+          )
+        )
+      }
+    }
+    )
+    benchmark.run()
+  }
+
+  def benchmarkFilterEqual(collationTypes: Seq[String],
+                           dfUncollated: DataFrame): Unit = {

Review Comment:
   Fix the indentation; see https://github.com/databricks/scala-style-guide?tab=readme-ov-file#indent
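
As an illustration of the linked rule, as I understand it: when a method's parameters do not fit on one line, each parameter goes on its own line with a 4-space indent, and the body uses a 2-space indent. `IndentExample` and `joinWithSeparator` are hypothetical names chosen only to show the layout.

```scala
// Hypothetical example of the parameter layout, not code from the PR.
object IndentExample {
  def joinWithSeparator(
      left: String,
      right: String,
      separator: String): String = {
    // Body is indented 2 spaces relative to `def`.
    left + separator + right
  }
}
```

Applied to the hunk above, `collationTypes` and `dfUncollated` would each sit on their own line, indented 4 spaces relative to `def`, instead of being aligned with the opening parenthesis.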




Re: [PR] [SPARK-46840][SQL][TESTS] Add `CollationBenchmark` [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.

cloud-fan commented on code in PR #45453:
URL: https://github.com/apache/spark/pull/45453#discussion_r1541221151


##########
sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/CollationBenchmark.scala:
##########
@@ -0,0 +1,165 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.sql.execution.benchmark
+
+import scala.concurrent.duration._
+
+import org.apache.spark.benchmark.Benchmark
+import org.apache.spark.sql.DataFrame
+import org.apache.spark.sql.catalyst.util.CollationFactory
+import org.apache.spark.sql.functions._
+import org.apache.spark.unsafe.types.UTF8String
+
+/**
+ * Benchmark to measure performance for comparisons between collated strings. To run this benchmark:
+ * {{{
+ *   1. without sbt:
+ *      bin/spark-submit --class <this class>
+ *        --jars <spark core test jar>,<spark catalyst test jar> <spark sql test jar>
+ *   2. build/sbt "sql/Test/runMain org.apache.spark.sql.execution.benchmark.CollationBenchmark"
+ *   3. generate result:
+ *      SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/Test/runMain <this class>"
+ *      Results will be written to "benchmarks/CollationBenchmark-results.txt".
+ * }}}
+ */
+
+object CollationBenchmark extends SqlBasedBenchmark {
+  private val collationTypes = Seq("UTF8_BINARY_LCASE", "UNICODE", "UTF8_BINARY", "UNICODE_CI")
+
+  def generateSeqInput(n: Long): Seq[UTF8String] = {
+    val input = Seq("ABC", "ABC", "aBC", "aBC", "abc", "abc", "DEF", "DEF", "def", "def",
+      "GHI", "ghi", "JKL", "jkl", "MNO", "mno", "PQR", "pqr", "STU", "stu", "VWX", "vwx",
+      "ABC", "ABC", "aBC", "aBC", "abc", "abc", "DEF", "DEF", "def", "def", "GHI", "ghi",
+      "JKL", "jkl", "MNO", "mno", "PQR", "pqr", "STU", "stu", "VWX", "vwx", "YZ")
+      .map(UTF8String.fromString)
+    val inputLong: Seq[UTF8String] = (0L until n).map(i => input(i.toInt % input.size))
+    inputLong
+  }
+
+  private def getDataFrame(strings: Seq[String]): DataFrame = {
+    val asPairs = strings.sliding(2, 1).toSeq.map {
+      case Seq(s1, s2) => (s1, s2)
+    }
+    val d = spark.createDataFrame(asPairs).toDF("s1", "s2")
+    d
+  }
+
+  private def generateDataframeInput(l: Long): DataFrame = {
+    getDataFrame(generateSeqInput(l).map(_.toString))
+  }
+
+  def benchmarkUTFStringEquals(collationTypes: Seq[String], utf8Strings: Seq[UTF8String]): Unit = {
+    val sublistStrings = utf8Strings
+
+    val benchmark = new Benchmark(
+      "collation unit benchmarks - equalsFunction",
+      utf8Strings.size * 10,
+      warmupTime = 4.seconds,
+      output = output)
+    collationTypes.foreach(collationType => {
+      val collation = CollationFactory.fetchCollation(collationType)
+      benchmark.addCase(s"$collationType") { _ =>
+        sublistStrings.foreach(s1 =>
+          utf8Strings.foreach(s =>
+            (0 to 10).foreach(_ =>
+              collation.equalsFunction(s, s1).booleanValue())
+          )
+        )
+      }
+    }
+    )
+    benchmark.run()
+  }
+  def benchmarkUTFStringCompare(collationTypes: Seq[String], utf8Strings: Seq[UTF8String]): Unit = {
+    val sublistStrings = utf8Strings
+
+    val benchmark = new Benchmark(
+      "collation unit benchmarks - compareFunction",
+      utf8Strings.size * 10,
+      warmupTime = 4.seconds,
+      output = output)
+    collationTypes.foreach(collationType => {
+      val collation = CollationFactory.fetchCollation(collationType)
+      benchmark.addCase(s"$collationType") { _ =>
+        sublistStrings.foreach(s1 =>
+          utf8Strings.foreach(s =>
+            (0 to 10).foreach(_ =>
+              collation.comparator.compare(s, s1)
+            )
+          )
+        )
+      }
+    }
+    )
+    benchmark.run()
+  }
+
+  def benchmarkUTFStringHashFunction(
+      collationTypes: Seq[String],
+      utf8Strings: Seq[UTF8String]): Unit = {
+    val sublistStrings = utf8Strings
+
+    val benchmark = new Benchmark(
+      "collation unit benchmarks - hashFunction",
+      utf8Strings.size * 10,
+      warmupTime = 4.seconds,
+      output = output)
+    collationTypes.foreach(collationType => {
+      val collation = CollationFactory.fetchCollation(collationType)
+      benchmark.addCase(s"$collationType") { _ =>
+        sublistStrings.foreach(_ =>
+          utf8Strings.foreach(s =>
+            (0 to 10).foreach(_ =>
+              collation.hashFunction.applyAsLong(s)
+            )
+          )
+        )
+      }
+    }
+    )
+    benchmark.run()
+  }
+
+  def benchmarkFilterEqual(
+      collationTypes: Seq[String],
+      dfUncollated: DataFrame): Unit = {
+    val benchmark =
+      new Benchmark(
+        "filter df column with collation",
+        dfUncollated.count(),
+        warmupTime = 4.seconds,
+        output = output)
+    collationTypes.foreach(collationType => {
+      val dfCollated = dfUncollated.selectExpr(
+        s"collate(s2, '$collationType') as k2_$collationType",
+        s"collate(s1, '$collationType') as k1_$collationType")
+      benchmark.addCase(s"filter df column with collation - $collationType") { _ =>
+        dfCollated.where(col(s"k1_$collationType") === col(s"k2_$collationType"))

Review Comment:
   We have `HashBenchmark` which does not test end-to-end queries either. I think it's fair to trust the system is reasonable and will translate expression level performance to end-to-end query performance.




Re: [PR] [SPARK-46840][SQL][TESTS] Add `CollationBenchmark` [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.

cloud-fan closed pull request #45453: [SPARK-46840][SQL][TESTS]  Add `CollationBenchmark`
URL: https://github.com/apache/spark/pull/45453



Re: [PR] [SPARK-46840][SQL][TESTS] Add `CollationBenchmark` [spark]

Posted by "GideonPotok (via GitHub)" <gi...@apache.org>.

GideonPotok commented on code in PR #45453:
URL: https://github.com/apache/spark/pull/45453#discussion_r1538210182


##########
sql/core/benchmarks/CollationBenchmark-results.txt:
##########
@@ -0,0 +1,26 @@
+OpenJDK 64-Bit Server VM 17.0.10+7-LTS on Linux 6.5.0-1016-azure
+AMD EPYC 7763 64-Core Processor
+filter df column with collation:                     Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
+-----------------------------------------------------------------------------------------------------------------------------------
+filter df column with collation - UNICODE_CI                   403            463          39          0.0    20147470.0       1.0X
+filter df column with collation - UNICODE                      187            223          37          0.0     9339586.0       2.2X
+filter df column with collation - UTF8_BINARY_LCASE            426            434           7          0.0    21300903.4       0.9X
+filter df column with collation - UTF8_BINARY                  188            199           5          0.0     9403169.1       2.1X
+
+OpenJDK 64-Bit Server VM 17.0.10+7-LTS on Linux 6.5.0-1016-azure
+AMD EPYC 7763 64-Core Processor
+collation unit benchmarks:                Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
+------------------------------------------------------------------------------------------------------------------------
+equalsFunction - UTF8_BINARY                          0              0           0         10.4          96.6       1.0X

Review Comment:
   Sure, I am happy to make that change. I should warn you, though, that the e2e benchmarks and the unit benchmarks will then use different data. That is fine with me, but it was a previously raised concern, if I understood correctly. @dbatomic Heads up that the two benchmark suites will be using input data of different sizes.



##########
sql/core/benchmarks/CollationBenchmark-results.txt:
##########
@@ -0,0 +1,26 @@
+OpenJDK 64-Bit Server VM 17.0.10+7-LTS on Linux 6.5.0-1016-azure
+AMD EPYC 7763 64-Core Processor
+filter df column with collation:                     Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
+-----------------------------------------------------------------------------------------------------------------------------------
+filter df column with collation - UNICODE_CI                   403            463          39          0.0    20147470.0       1.0X
+filter df column with collation - UNICODE                      187            223          37          0.0     9339586.0       2.2X
+filter df column with collation - UTF8_BINARY_LCASE            426            434           7          0.0    21300903.4       0.9X
+filter df column with collation - UTF8_BINARY                  188            199           5          0.0     9403169.1       2.1X
+
+OpenJDK 64-Bit Server VM 17.0.10+7-LTS on Linux 6.5.0-1016-azure
+AMD EPYC 7763 64-Core Processor
+collation unit benchmarks:                Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
+------------------------------------------------------------------------------------------------------------------------
+equalsFunction - UTF8_BINARY                          0              0           0         10.4          96.6       1.0X

Review Comment:
   @MaxGekk Or why don't I instead just see if I can switch the unit used to nanoseconds?




Re: [PR] [SPARK-46840][SQL][COLLATION] New Benchmarking Suite [spark]

Posted by "GideonPotok (via GitHub)" <gi...@apache.org>.

GideonPotok commented on code in PR #45453:
URL: https://github.com/apache/spark/pull/45453#discussion_r1534987693


##########
sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/CollationBenchmark.scala:
##########
@@ -0,0 +1,119 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.sql.execution.benchmark
+
+import scala.util.Random
+
+import org.apache.spark.benchmark.Benchmark
+import org.apache.spark.sql.{functions, Column}
+import org.apache.spark.sql.catalyst.util.CollationFactory
+import org.apache.spark.sql.functions._
+import org.apache.spark.unsafe.types.UTF8String
+
+/**
+ * Benchmark to measure performance for joins. To run this benchmark:
+ * {{{
+ *   1. without sbt:
+ *      bin/spark-submit --class <this class>
+ *        --jars <spark core test jar>,<spark catalyst test jar> <spark sql test jar>
+ *   2. build/sbt "sql/Test/runMain org.apache.spark.sql.execution.benchmark.CollationBenchmark"
+ *   3. generate result:
+ *      SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/Test/runMain <this class>"
+ *      Results will be written to "benchmarks/JoinBenchmark-results.txt".
+ * }}}
+ */
+
+object CollationBenchmark extends SqlBasedBenchmark {
+  private val collationTypes = Seq("UTF8_BINARY", "UTF8_BINARY_LCASE", "UNICODE", "UNICODE_CI")
+
+  def generateUTF8Strings(n: Int): Seq[UTF8String] = {
+    // Generate n UTF8Strings
+    Seq("ABC", "aBC", "abc", "DEF", "def", "GHI", "ghi", "JKL", "jkl",
+      "MNO", "mno", "PQR", "pqr", "STU", "stu", "VWX", "vwx", "YZ").map(UTF8String.fromString) ++
+    (18 to n).map(i => UTF8String.fromString(Random.nextString(i % 25))).sortBy(_.hashCode())
+  }
+
+  def benchmarkUTFString(collationTypes: Seq[String], utf8Strings: Seq[UTF8String]): Unit = {
+    val benchmark = collationTypes.foldLeft(
+      new Benchmark(s"collation unit benchmarks", utf8Strings.size, output = output)) {
+      (b, collationType) =>
+        b.addCase(s"equalsFunction - $collationType") { _ =>
+          val collation = CollationFactory.fetchCollation(collationType)
+          utf8Strings.slice(0, 20).foreach(s1 =>
+            utf8Strings.filter(s =>
+            collation.equalsFunction(s, s1).booleanValue()
+            )
+          )
+        }
+        b.addCase(s"collator.compare - $collationType") { _ =>
+          val collation = CollationFactory.fetchCollation(collationType)
+          utf8Strings.slice(0, 20).foreach(s1 =>
+            utf8Strings.sortBy(s =>
+              collation.comparator.compare(s, s1)
+            )
+          )
+        }
+        b.addCase(s"hashFunction - $collationType") { _ =>
+          val collation = CollationFactory.fetchCollation(collationType)
+          utf8Strings.map(s => collation.hashFunction.applyAsLong(s))
+        }
+        b
+    }
+    benchmark.run()
+  }
+
+  def collationBenchmarkFilterEqual(

Review Comment:
   👍 




Re: [PR] [SPARK-46840][SQL][COLLATION] New Benchmarking Suite [spark]

Posted by "GideonPotok (via GitHub)" <gi...@apache.org>.

GideonPotok commented on code in PR #45453:
URL: https://github.com/apache/spark/pull/45453#discussion_r1532589981


##########
sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/CollationBenchmark.scala:
##########
@@ -0,0 +1,119 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.sql.execution.benchmark
+
+import scala.util.Random
+
+import org.apache.spark.benchmark.Benchmark
+import org.apache.spark.sql.{functions, Column}
+import org.apache.spark.sql.catalyst.util.CollationFactory
+import org.apache.spark.sql.functions._
+import org.apache.spark.unsafe.types.UTF8String
+
+/**
+ * Benchmark to measure performance for joins. To run this benchmark:
+ * {{{
+ *   1. without sbt:
+ *      bin/spark-submit --class <this class>
+ *        --jars <spark core test jar>,<spark catalyst test jar> <spark sql test jar>
+ *   2. build/sbt "sql/Test/runMain org.apache.spark.sql.execution.benchmark.CollationBenchmark"
+ *   3. generate result:
+ *      SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/Test/runMain <this class>"
+ *      Results will be written to "benchmarks/JoinBenchmark-results.txt".
+ * }}}
+ */
+
+object CollationBenchmark extends SqlBasedBenchmark {
+  private val collationTypes = Seq("UTF8_BINARY", "UTF8_BINARY_LCASE", "UNICODE", "UNICODE_CI")
+
+  def generateUTF8Strings(n: Int): Seq[UTF8String] = {
+    // Generate n UTF8Strings
+    Seq("ABC", "aBC", "abc", "DEF", "def", "GHI", "ghi", "JKL", "jkl",
+      "MNO", "mno", "PQR", "pqr", "STU", "stu", "VWX", "vwx", "YZ").map(UTF8String.fromString) ++
+    (18 to n).map(i => UTF8String.fromString(Random.nextString(i % 25))).sortBy(_.hashCode())
+  }
+
+  def benchmarkUTFString(collationTypes: Seq[String], utf8Strings: Seq[UTF8String]): Unit = {
+    val benchmark = collationTypes.foldLeft(
+      new Benchmark(s"collation unit benchmarks", utf8Strings.size, output = output)) {
+      (b, collationType) =>
+        b.addCase(s"equalsFunction - $collationType") { _ =>
+          val collation = CollationFactory.fetchCollation(collationType)
+          utf8Strings.slice(0, 20).foreach(s1 =>
+            utf8Strings.filter(s =>
+            collation.equalsFunction(s, s1).booleanValue()
+            )
+          )
+        }
+        b.addCase(s"collator.compare - $collationType") { _ =>
+          val collation = CollationFactory.fetchCollation(collationType)
+          utf8Strings.slice(0, 20).foreach(s1 =>
+            utf8Strings.sortBy(s =>
+              collation.comparator.compare(s, s1)
+            )
+          )
+        }
+        b.addCase(s"hashFunction - $collationType") { _ =>
+          val collation = CollationFactory.fetchCollation(collationType)
+          utf8Strings.map(s => collation.hashFunction.applyAsLong(s))
+        }
+        b
+    }
+    benchmark.run()
+  }
+
+  def collationBenchmarkFilterEqual(
+      collationTypes: Seq[String],
+      utf8Strings: Seq[UTF8String]): Unit = {
+    val N = 4 << 20
+    val benchmark = collationTypes.foldLeft(
+      new Benchmark(s"filter df column with collation", utf8Strings.size, output = output)) {
+      (b, collationType) =>
+        b.addCase(s"filter df column with collation - $collationType") { _ =>
+          val map: Map[String, Column] = utf8Strings.map(_.toString).zipWithIndex.map{
+            case (s, i) =>
+              (s"s${i.toString}", expr(s"collate('${s}', '$collationType')"))
+          }.toMap
+
+          val df = spark
+            .range(N)
+            .withColumn("id_s", expr("cast(id as string)"))
+            .selectExpr((Seq("id_s", "id") ++ collationTypes.map(t =>
+              s"collate(id_s, '$collationType') as k_$t")): _*)
+//            .withColumn("k_lower", expr("lower(id_s)"))
+//            .withColumn("k_upper", expr("upper(id_s)"))
+            .withColumn("s0",
+  try_element_at(array(utf8Strings.map(_.toString).map(lit): _*),
+  functions.try_add(lit(1), pmod(col("id"), lit(utf8Strings.size)).cast("int"))))
+            .withColumn("s0", expr(s"collate(s0, '$collationType')"))
+          df.where(col(s"k_$collationType") === col(s"s0"))
+                .queryExecution.executedPlan.executeCollect()
+            //          .write.mode("overwrite").format("noop").save()
+        }
+        b
+    }
+    benchmark.run()
+  }
+
+  // How to benchmark "without the rest of the spark stack"?
+
+  override def runBenchmarkSuite(mainArgs: Array[String]): Unit = {

Review Comment:
   My pleasure!  




Re: [PR] [SPARK-46840][SQL][COLLATION] New Benchmarking Suite [spark]

Posted by "dbatomic (via GitHub)" <gi...@apache.org>.

dbatomic commented on code in PR #45453:
URL: https://github.com/apache/spark/pull/45453#discussion_r1534713817


##########
sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/CollationBenchmark.scala:
##########
@@ -0,0 +1,121 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.sql.execution.benchmark
+
+import scala.util.Random
+
+import org.apache.spark.benchmark.Benchmark
+import org.apache.spark.sql.{DataFrame}
+import org.apache.spark.sql.catalyst.util.CollationFactory
+import org.apache.spark.sql.functions._
+import org.apache.spark.unsafe.types.UTF8String
+
+/**
+ * Benchmark to measure performance for joins. To run this benchmark:
+ * {{{
+ *   1. without sbt:
+ *      bin/spark-submit --class <this class>
+ *        --jars <spark core test jar>,<spark catalyst test jar> <spark sql test jar>
+ *   2. build/sbt "sql/Test/runMain org.apache.spark.sql.execution.benchmark.CollationBenchmark"
+ *   3. generate result:
+ *      SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/Test/runMain <this class>"
+ *      Results will be written to "benchmarks/JoinBenchmark-results.txt".
+ * }}}
+ */
+
+object CollationBenchmark extends SqlBasedBenchmark {
+  private val collationTypes = Seq("UTF8_BINARY", "UTF8_BINARY_LCASE", "UNICODE", "UNICODE_CI")
+
+  def generateUTF8Strings(n: Int): Seq[UTF8String] = {
+    // Generate n UTF8Strings
+    Seq("ABC", "aBC", "abc", "DEF", "def", "GHI", "ghi", "JKL", "jkl",
+      "MNO", "mno", "PQR", "pqr", "STU", "stu", "VWX", "vwx", "YZ").map(UTF8String.fromString) ++
+    (18 to n).map(i => UTF8String.fromString(Random.nextString(i % 25))).sortBy(_.hashCode())

Review Comment:
   Can we keep things deterministic for now? Randomness here doesn't bring much value at this point.
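
   A minimal sketch of one way to make the input deterministic, assuming a fixed seed is acceptable. `DeterministicInputs`, `generateStrings`, and the seed value are hypothetical names chosen for illustration, not code from this PR. The sketch returns plain `String`s to stay dependency-free; the real benchmark would still wrap each one with `UTF8String.fromString`.

   ```scala
   import scala.util.Random

   object DeterministicInputs {
     // Hypothetical helper: a fixed seed makes every run produce the same strings.
     def generateStrings(n: Int, seed: Long = 42L): Seq[String] = {
       val rng = new Random(seed)
       // A few fixed case variants, followed by seeded pseudo-random strings.
       val base = Seq("ABC", "aBC", "abc", "DEF", "def")
       // Lengths cycle through 1..24 so that no generated string is empty.
       val generated = (0 until math.max(0, n - base.size)).map { i =>
         rng.alphanumeric.take(1 + i % 24).mkString
       }
       (base ++ generated).take(n)
     }
   }
   ```

   Two calls with the same `n` and seed return the same sequence, so timings from different runs are measured over identical input.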




Re: [PR] [SPARK-46840][SQL][TESTS] Add `CollationBenchmark` [spark]

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.

dongjoon-hyun commented on PR #45453:
URL: https://github.com/apache/spark/pull/45453#issuecomment-2029972677

   Thank you for adding this.



Re: [PR] [SPARK-46840][SQL][TESTS] Add `CollationBenchmark` [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.

cloud-fan commented on PR #45453:
URL: https://github.com/apache/spark/pull/45453#issuecomment-2082859873

   reducing cardinality SGTM



Re: [PR] [SPARK-46840][SQL][TESTS] Add `CollationBenchmark` [spark]

Posted by "dbatomic (via GitHub)" <gi...@apache.org>.

dbatomic commented on PR #45453:
URL: https://github.com/apache/spark/pull/45453#issuecomment-2030064654

   Thanks @GideonPotok. Btw, we are also working on tightening perf, so these 100x should soon move to 2-3x :).



Re: [PR] [WIP][SPARK 46840] Add sql.execution.benchmark.CollationBenchmark.scala Scaffolding [spark]

Posted by "GideonPotok (via GitHub)" <gi...@apache.org>.

GideonPotok commented on code in PR #45453:
URL: https://github.com/apache/spark/pull/45453#discussion_r1527531246


##########
build/sbt:
##########
@@ -36,7 +36,7 @@ fi
 declare -r noshare_opts="-Dsbt.global.base=project/.sbtboot -Dsbt.boot.directory=project/.boot -Dsbt.ivy.home=project/.ivy"
 declare -r sbt_opts_file=".sbtopts"
 declare -r etc_sbt_opts_file="/etc/sbt/sbtopts"
-declare -r default_sbt_opts="-Xss64m"
+declare -r default_sbt_opts="-Xss1024m -Xms1024m -Xmx1024m -XX:ReservedCodeCacheSize=128m -XX:MaxMetaspaceSize=256m"

Review Comment:
   ```suggestion
   declare -r default_sbt_opts="-Xss64m"
   ```



##########
sql/core/src/test/scala/org/apache/spark/sql/CollationSuite.scala:
##########
@@ -17,7 +17,7 @@
 
 package org.apache.spark.sql
 
-import scala.collection.immutable.Seq
+// import scala.collection.immutable.Seq

Review Comment:
   ```suggestion
   import scala.collection.immutable.Seq
   ```




Re: [PR] [SPARK-46840][SQL][COLLATION] New Benchmarking Suite [spark]

Posted by "dbatomic (via GitHub)" <gi...@apache.org>.

dbatomic commented on code in PR #45453:
URL: https://github.com/apache/spark/pull/45453#discussion_r1534714245


##########
sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/CollationBenchmark.scala:
##########
@@ -0,0 +1,121 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.sql.execution.benchmark
+
+import scala.util.Random
+
+import org.apache.spark.benchmark.Benchmark
+import org.apache.spark.sql.{DataFrame}
+import org.apache.spark.sql.catalyst.util.CollationFactory
+import org.apache.spark.sql.functions._
+import org.apache.spark.unsafe.types.UTF8String
+
+/**
+ * Benchmark to measure performance for joins. To run this benchmark:
+ * {{{
+ *   1. without sbt:
+ *      bin/spark-submit --class <this class>
+ *        --jars <spark core test jar>,<spark catalyst test jar> <spark sql test jar>
+ *   2. build/sbt "sql/Test/runMain org.apache.spark.sql.execution.benchmark.CollationBenchmark"
+ *   3. generate result:
+ *      SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/Test/runMain <this class>"
+ *      Results will be written to "benchmarks/JoinBenchmark-results.txt".
+ * }}}
+ */
+
+object CollationBenchmark extends SqlBasedBenchmark {
+  private val collationTypes = Seq("UTF8_BINARY", "UTF8_BINARY_LCASE", "UNICODE", "UNICODE_CI")
+
+  def generateUTF8Strings(n: Int): Seq[UTF8String] = {
+    // Generate n UTF8Strings
+    Seq("ABC", "aBC", "abc", "DEF", "def", "GHI", "ghi", "JKL", "jkl",
+      "MNO", "mno", "PQR", "pqr", "STU", "stu", "VWX", "vwx", "YZ").map(UTF8String.fromString) ++
+    (18 to n).map(i => UTF8String.fromString(Random.nextString(i % 25))).sortBy(_.hashCode())
+  }
+
+  def benchmarkUTFString(collationTypes: Seq[String], utf8Strings: Seq[UTF8String]): Unit = {
+    val sublistStrings = utf8Strings.slice(0, 200)

Review Comment:
   Why the extra slice? Can the benchmark just work on its input?




Re: [PR] [SPARK-46840][SQL][COLLATION] New Benchmarking Suite [spark]

Posted by "GideonPotok (via GitHub)" <gi...@apache.org>.

GideonPotok commented on code in PR #45453:
URL: https://github.com/apache/spark/pull/45453#discussion_r1534696860


##########
sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/CollationBenchmark.scala:
##########
@@ -0,0 +1,119 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.sql.execution.benchmark
+
+import scala.util.Random
+
+import org.apache.spark.benchmark.Benchmark
+import org.apache.spark.sql.{functions, Column}
+import org.apache.spark.sql.catalyst.util.CollationFactory
+import org.apache.spark.sql.functions._
+import org.apache.spark.unsafe.types.UTF8String
+
+/**
+ * Benchmark to measure performance for joins. To run this benchmark:
+ * {{{
+ *   1. without sbt:
+ *      bin/spark-submit --class <this class>
+ *        --jars <spark core test jar>,<spark catalyst test jar> <spark sql test jar>
+ *   2. build/sbt "sql/Test/runMain org.apache.spark.sql.execution.benchmark.CollationBenchmark"
+ *   3. generate result:
+ *      SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/Test/runMain <this class>"
+ *      Results will be written to "benchmarks/JoinBenchmark-results.txt".
+ * }}}
+ */
+
+object CollationBenchmark extends SqlBasedBenchmark {
+  private val collationTypes = Seq("UTF8_BINARY", "UTF8_BINARY_LCASE", "UNICODE", "UNICODE_CI")
+
+  def generateUTF8Strings(n: Int): Seq[UTF8String] = {
+    // Generate n UTF8Strings
+    Seq("ABC", "aBC", "abc", "DEF", "def", "GHI", "ghi", "JKL", "jkl",
+      "MNO", "mno", "PQR", "pqr", "STU", "stu", "VWX", "vwx", "YZ").map(UTF8String.fromString) ++
+    (18 to n).map(i => UTF8String.fromString(Random.nextString(i % 25))).sortBy(_.hashCode())
+  }
+
+  def benchmarkUTFString(collationTypes: Seq[String], utf8Strings: Seq[UTF8String]): Unit = {
+    val benchmark = collationTypes.foldLeft(
+      new Benchmark(s"collation unit benchmarks", utf8Strings.size, output = output)) {
+      (b, collationType) =>
+        b.addCase(s"equalsFunction - $collationType") { _ =>
+          val collation = CollationFactory.fetchCollation(collationType)
+          utf8Strings.slice(0, 20).foreach(s1 =>
+            utf8Strings.filter(s =>
+            collation.equalsFunction(s, s1).booleanValue()
+            )
+          )
+        }
+        b.addCase(s"collator.compare - $collationType") { _ =>
+          val collation = CollationFactory.fetchCollation(collationType)
+          utf8Strings.slice(0, 20).foreach(s1 =>
+            utf8Strings.sortBy(s =>
+              collation.comparator.compare(s, s1)
+            )
+          )
+        }
+        b.addCase(s"hashFunction - $collationType") { _ =>
+          val collation = CollationFactory.fetchCollation(collationType)
+          utf8Strings.map(s => collation.hashFunction.applyAsLong(s))
+        }
+        b
+    }
+    benchmark.run()
+  }
+
+  def collationBenchmarkFilterEqual(

Review Comment:
   @dbatomic, is removing the e2e changes blocking approval, or just a nice-to-have? I think this PR is cleaned up and ready for another review. Let me know.




Re: [PR] [SPARK-46840][SQL][COLLATION] New Benchmarking Suite [spark]

Posted by "GideonPotok (via GitHub)" <gi...@apache.org>.

GideonPotok commented on code in PR #45453:
URL: https://github.com/apache/spark/pull/45453#discussion_r1534986955


##########
sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/CollationBenchmark.scala:
##########
@@ -0,0 +1,121 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.sql.execution.benchmark
+
+import scala.util.Random
+
+import org.apache.spark.benchmark.Benchmark
+import org.apache.spark.sql.{DataFrame}
+import org.apache.spark.sql.catalyst.util.CollationFactory
+import org.apache.spark.sql.functions._
+import org.apache.spark.unsafe.types.UTF8String
+
+/**
+ * Benchmark to measure performance for joins. To run this benchmark:
+ * {{{
+ *   1. without sbt:
+ *      bin/spark-submit --class <this class>
+ *        --jars <spark core test jar>,<spark catalyst test jar> <spark sql test jar>
+ *   2. build/sbt "sql/Test/runMain org.apache.spark.sql.execution.benchmark.CollationBenchmark"
+ *   3. generate result:
+ *      SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/Test/runMain <this class>"
+ *      Results will be written to "benchmarks/JoinBenchmark-results.txt".
+ * }}}
+ */
+
+object CollationBenchmark extends SqlBasedBenchmark {
+  private val collationTypes = Seq("UTF8_BINARY", "UTF8_BINARY_LCASE", "UNICODE", "UNICODE_CI")
+
+  def generateUTF8Strings(n: Int): Seq[UTF8String] = {
+    // Generate n UTF8Strings
+    Seq("ABC", "aBC", "abc", "DEF", "def", "GHI", "ghi", "JKL", "jkl",
+      "MNO", "mno", "PQR", "pqr", "STU", "stu", "VWX", "vwx", "YZ").map(UTF8String.fromString) ++
+    (18 to n).map(i => UTF8String.fromString(Random.nextString(i % 25))).sortBy(_.hashCode())

Review Comment:
   Done
   



##########
sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/CollationBenchmark.scala:
##########
@@ -0,0 +1,121 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.sql.execution.benchmark
+
+import scala.util.Random
+
+import org.apache.spark.benchmark.Benchmark
+import org.apache.spark.sql.{DataFrame}
+import org.apache.spark.sql.catalyst.util.CollationFactory
+import org.apache.spark.sql.functions._
+import org.apache.spark.unsafe.types.UTF8String
+
+/**
+ * Benchmark to measure performance for joins. To run this benchmark:
+ * {{{
+ *   1. without sbt:
+ *      bin/spark-submit --class <this class>
+ *        --jars <spark core test jar>,<spark catalyst test jar> <spark sql test jar>
+ *   2. build/sbt "sql/Test/runMain org.apache.spark.sql.execution.benchmark.CollationBenchmark"
+ *   3. generate result:
+ *      SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/Test/runMain <this class>"
+ *      Results will be written to "benchmarks/JoinBenchmark-results.txt".
+ * }}}
+ */
+
+object CollationBenchmark extends SqlBasedBenchmark {
+  private val collationTypes = Seq("UTF8_BINARY", "UTF8_BINARY_LCASE", "UNICODE", "UNICODE_CI")
+
+  def generateUTF8Strings(n: Int): Seq[UTF8String] = {
+    // Generate n UTF8Strings
+    Seq("ABC", "aBC", "abc", "DEF", "def", "GHI", "ghi", "JKL", "jkl",
+      "MNO", "mno", "PQR", "pqr", "STU", "stu", "VWX", "vwx", "YZ").map(UTF8String.fromString) ++
+    (18 to n).map(i => UTF8String.fromString(Random.nextString(i % 25))).sortBy(_.hashCode())
+  }
+
+  def benchmarkUTFString(collationTypes: Seq[String], utf8Strings: Seq[UTF8String]): Unit = {
+    val sublistStrings = utf8Strings.slice(0, 200)

Review Comment:
   Done




Re: [PR] [SPARK-46840][SQL][COLLATION] New Benchmarking Suite [spark]

Posted by "dbatomic (via GitHub)" <gi...@apache.org>.

dbatomic commented on code in PR #45453:
URL: https://github.com/apache/spark/pull/45453#discussion_r1531887870


##########
sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/CollationBenchmark.scala:
##########
@@ -0,0 +1,119 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.sql.execution.benchmark
+
+import scala.util.Random
+
+import org.apache.spark.benchmark.Benchmark
+import org.apache.spark.sql.{functions, Column}
+import org.apache.spark.sql.catalyst.util.CollationFactory
+import org.apache.spark.sql.functions._
+import org.apache.spark.unsafe.types.UTF8String
+
+/**
+ * Benchmark to measure performance for joins. To run this benchmark:
+ * {{{
+ *   1. without sbt:
+ *      bin/spark-submit --class <this class>
+ *        --jars <spark core test jar>,<spark catalyst test jar> <spark sql test jar>
+ *   2. build/sbt "sql/Test/runMain org.apache.spark.sql.execution.benchmark.CollationBenchmark"
+ *   3. generate result:
+ *      SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/Test/runMain <this class>"
+ *      Results will be written to "benchmarks/JoinBenchmark-results.txt".
+ * }}}
+ */
+
+object CollationBenchmark extends SqlBasedBenchmark {
+  private val collationTypes = Seq("UTF8_BINARY", "UTF8_BINARY_LCASE", "UNICODE", "UNICODE_CI")
+
+  def generateUTF8Strings(n: Int): Seq[UTF8String] = {
+    // Generate n UTF8Strings
+    Seq("ABC", "aBC", "abc", "DEF", "def", "GHI", "ghi", "JKL", "jkl",
+      "MNO", "mno", "PQR", "pqr", "STU", "stu", "VWX", "vwx", "YZ").map(UTF8String.fromString) ++
+    (18 to n).map(i => UTF8String.fromString(Random.nextString(i % 25))).sortBy(_.hashCode())
+  }
+
+  def benchmarkUTFString(collationTypes: Seq[String], utf8Strings: Seq[UTF8String]): Unit = {
+    val benchmark = collationTypes.foldLeft(
+      new Benchmark(s"collation unit benchmarks", utf8Strings.size, output = output)) {
+      (b, collationType) =>
+        b.addCase(s"equalsFunction - $collationType") { _ =>
+          val collation = CollationFactory.fetchCollation(collationType)

Review Comment:
   A note before all of my comments: I am new to this benchmarking framework.
   
   The collation fetch should be outside of the benchmarking loop. We don't want to measure the time taken to look up the collation object by name in the map; we only care about the equality function.
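
   A rough sketch of what hoisting the fetch could look like, reusing the `CollationFactory.fetchCollation` and `Benchmark.addCase` calls from the diff above. `HoistedFetchSketch` and `addEqualsCase` are hypothetical names for illustration only, and the sketch assumes the Spark test jars are on the classpath.

   ```scala
   import org.apache.spark.benchmark.Benchmark
   import org.apache.spark.sql.catalyst.util.CollationFactory
   import org.apache.spark.unsafe.types.UTF8String

   object HoistedFetchSketch {
     // Hypothetical helper: registers one equality case on an existing Benchmark.
     def addEqualsCase(
         b: Benchmark,
         collationType: String,
         inputs: Seq[UTF8String]): Benchmark = {
       // Resolved once, outside the timed closure, so the lookup is not measured.
       val collation = CollationFactory.fetchCollation(collationType)
       b.addCase(s"equalsFunction - $collationType") { _ =>
         // Only the pairwise equality checks run inside the timed region.
         inputs.foreach { s1 =>
           inputs.foreach(s => collation.equalsFunction.apply(s, s1).booleanValue())
         }
       }
       b
     }
   }
   ```

   The same pattern would apply to the `comparator` and `hashFunction` cases.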




Re: [PR] [SPARK-46840][SQL][TESTS] Add `CollationBenchmark` [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.

cloud-fan commented on code in PR #45453:
URL: https://github.com/apache/spark/pull/45453#discussion_r1539342857


##########
sql/core/benchmarks/CollationBenchmark-results.txt:
##########
@@ -0,0 +1,26 @@
+OpenJDK 64-Bit Server VM 17.0.10+7-LTS on Linux 6.5.0-1016-azure
+AMD EPYC 7763 64-Core Processor
+filter df column with collation:                     Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
+-----------------------------------------------------------------------------------------------------------------------------------
+filter df column with collation - UNICODE_CI                   403            463          39          0.0    20147470.0       1.0X
+filter df column with collation - UNICODE                      187            223          37          0.0     9339586.0       2.2X
+filter df column with collation - UTF8_BINARY_LCASE            426            434           7          0.0    21300903.4       0.9X
+filter df column with collation - UTF8_BINARY                  188            199           5          0.0     9403169.1       2.1X
+
+OpenJDK 64-Bit Server VM 17.0.10+7-LTS on Linux 6.5.0-1016-azure
+AMD EPYC 7763 64-Core Processor
+collation unit benchmarks:                Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
+------------------------------------------------------------------------------------------------------------------------
+equalsFunction - UTF8_BINARY                          0              0           0         10.4          96.6       1.0X
+collator.compare - UTF8_BINARY                        4              4           0          0.3        3717.9       0.0X
+hashFunction - UTF8_BINARY                            0              0           0         42.0          23.8       4.1X
+equalsFunction - UTF8_BINARY_LCASE                    5              5           0          0.2        5014.7       0.0X

Review Comment:
   The problem here is that it does not make sense to compare these cases against each other, for example `hashFunction` against `equalsFunction`. We should create more `Benchmark` instances, and each should test the same operation across the different collations.
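   
   To make the concern concrete: the `Relative` column in the tables above appears consistent with dividing the first case's best time by each case's best time. The sketch below reproduces that arithmetic for the "filter df column with collation" table, where every row is the same operation and the ratio is meaningful. `RelativeColumn` and its members are hypothetical names, and the formula is inferred from the numbers shown here rather than taken from the framework's source:
   
   ```scala
   object RelativeColumn {
     // Best times (ms) copied from the "filter df column with collation" table above.
     val filterBestMs: Seq[(String, Double)] = Seq(
       "UNICODE_CI" -> 403.0,
       "UNICODE" -> 187.0,
       "UTF8_BINARY_LCASE" -> 426.0,
       "UTF8_BINARY" -> 188.0
     )
   
     // Speed relative to the first case, rounded to one decimal place.
     def relative(cases: Seq[(String, Double)]): Seq[(String, Double)] = {
       val baseline = cases.head._2
       cases.map { case (name, best) =>
         name -> math.round(baseline / best * 10) / 10.0
       }
     }
   }
   ```
   
   Applied to `filterBestMs`, this gives 1.0, 2.2, 0.9 and 2.1, matching the table. In the "collation unit benchmarks" table the same ratio puts `hashFunction` (23.8 ns per row) at 4.1X of `equalsFunction` (96.6 ns per row), which compares two different operations and says nothing about the cost of a collation.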




Re: [PR] [SPARK-46840][SQL][TESTS] Add `CollationBenchmark` [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.

cloud-fan commented on code in PR #45453:
URL: https://github.com/apache/spark/pull/45453#discussion_r1540259864


##########
sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/CollationBenchmark.scala:
##########
@@ -0,0 +1,165 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.sql.execution.benchmark
+
+import scala.concurrent.duration._
+
+import org.apache.spark.benchmark.Benchmark
+import org.apache.spark.sql.DataFrame
+import org.apache.spark.sql.catalyst.util.CollationFactory
+import org.apache.spark.sql.functions._
+import org.apache.spark.unsafe.types.UTF8String
+
+/**
+ * Benchmark to measure performance for comparisons between collated strings. To run this benchmark:
+ * {{{
+ *   1. without sbt:
+ *      bin/spark-submit --class <this class>
+ *        --jars <spark core test jar>,<spark catalyst test jar> <spark sql test jar>
+ *   2. build/sbt "sql/Test/runMain org.apache.spark.sql.execution.benchmark.CollationBenchmark"
+ *   3. generate result:
+ *      SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/Test/runMain <this class>"
+ *      Results will be written to "benchmarks/CollationBenchmark-results.txt".
+ * }}}
+ */
+
+object CollationBenchmark extends SqlBasedBenchmark {
+  private val collationTypes = Seq("UTF8_BINARY_LCASE", "UNICODE", "UTF8_BINARY", "UNICODE_CI")
+
+  def generateSeqInput(n: Long): Seq[UTF8String] = {
+    val input = Seq("ABC", "ABC", "aBC", "aBC", "abc", "abc", "DEF", "DEF", "def", "def",
+      "GHI", "ghi", "JKL", "jkl", "MNO", "mno", "PQR", "pqr", "STU", "stu", "VWX", "vwx",
+      "ABC", "ABC", "aBC", "aBC", "abc", "abc", "DEF", "DEF", "def", "def", "GHI", "ghi",
+      "JKL", "jkl", "MNO", "mno", "PQR", "pqr", "STU", "stu", "VWX", "vwx", "YZ")
+      .map(UTF8String.fromString)
+    val inputLong: Seq[UTF8String] = (0L until n).map(i => input(i.toInt % input.size))
+    inputLong
+  }
+
+  private def getDataFrame(strings: Seq[String]): DataFrame = {
+    val asPairs = strings.sliding(2, 1).toSeq.map {
+      case Seq(s1, s2) => (s1, s2)
+    }
+    val d = spark.createDataFrame(asPairs).toDF("s1", "s2")
+    d
+  }
+
+  private def generateDataframeInput(l: Long): DataFrame = {
+    getDataFrame(generateSeqInput(l).map(_.toString))
+  }
+
+  def benchmarkUTFStringEquals(collationTypes: Seq[String], utf8Strings: Seq[UTF8String]): Unit = {
+    val sublistStrings = utf8Strings
+
+    val benchmark = new Benchmark(
+      "collation unit benchmarks - equalsFunction",
+      utf8Strings.size * 10,
+      warmupTime = 4.seconds,
+      output = output)
+    collationTypes.foreach(collationType => {
+      val collation = CollationFactory.fetchCollation(collationType)
+      benchmark.addCase(s"$collationType") { _ =>
+        sublistStrings.foreach(s1 =>
+          utf8Strings.foreach(s =>
+            (0 to 10).foreach(_ =>
+              collation.equalsFunction(s, s1).booleanValue())
+          )
+        )
+      }
+    }
+    )
+    benchmark.run()
+  }
+  def benchmarkUTFStringCompare(collationTypes: Seq[String], utf8Strings: Seq[UTF8String]): Unit = {
+    val sublistStrings = utf8Strings
+
+    val benchmark = new Benchmark(
+      "collation unit benchmarks - compareFunction",
+      utf8Strings.size * 10,
+      warmupTime = 4.seconds,
+      output = output)
+    collationTypes.foreach(collationType => {
+      val collation = CollationFactory.fetchCollation(collationType)
+      benchmark.addCase(s"$collationType") { _ =>
+        sublistStrings.foreach(s1 =>
+          utf8Strings.foreach(s =>
+            (0 to 10).foreach(_ =>
+              collation.comparator.compare(s, s1)
+            )
+          )
+        )
+      }
+    }
+    )
+    benchmark.run()
+  }
+
+  def benchmarkUTFStringHashFunction(
+      collationTypes: Seq[String],
+      utf8Strings: Seq[UTF8String]): Unit = {
+    val sublistStrings = utf8Strings
+
+    val benchmark = new Benchmark(
+      "collation unit benchmarks - hashFunction",
+      utf8Strings.size * 10,
+      warmupTime = 4.seconds,
+      output = output)
+    collationTypes.foreach(collationType => {
+      val collation = CollationFactory.fetchCollation(collationType)
+      benchmark.addCase(s"$collationType") { _ =>
+        sublistStrings.foreach(_ =>
+          utf8Strings.foreach(s =>
+            (0 to 10).foreach(_ =>
+              collation.hashFunction.applyAsLong(s)
+            )
+          )
+        )
+      }
+    }
+    )
+    benchmark.run()
+  }
+
+  def benchmarkFilterEqual(
+      collationTypes: Seq[String],
+      dfUncollated: DataFrame): Unit = {
+    val benchmark =
+      new Benchmark(
+        "filter df column with collation",
+        dfUncollated.count(),
+        warmupTime = 4.seconds,
+        output = output)
+    collationTypes.foreach(collationType => {
+      val dfCollated = dfUncollated.selectExpr(
+        s"collate(s2, '$collationType') as k2_$collationType",
+        s"collate(s1, '$collationType') as k1_$collationType")
+      benchmark.addCase(s"filter df column with collation - $collationType") { _ =>
+        dfCollated.where(col(s"k1_$collationType") === col(s"k2_$collationType"))

Review Comment:
   Do we really need this benchmark? Isn't it just the `equalsFunction`?



##########
sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/CollationBenchmark.scala:
##########
@@ -0,0 +1,165 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.sql.execution.benchmark
+
+import scala.concurrent.duration._
+
+import org.apache.spark.benchmark.Benchmark
+import org.apache.spark.sql.DataFrame
+import org.apache.spark.sql.catalyst.util.CollationFactory
+import org.apache.spark.sql.functions._
+import org.apache.spark.unsafe.types.UTF8String
+
+/**
+ * Benchmark to measure performance for comparisons between collated strings. To run this benchmark:
+ * {{{
+ *   1. without sbt:
+ *      bin/spark-submit --class <this class>
+ *        --jars <spark core test jar>,<spark catalyst test jar> <spark sql test jar>
+ *   2. build/sbt "sql/Test/runMain org.apache.spark.sql.execution.benchmark.CollationBenchmark"
+ *   3. generate result:
+ *      SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/Test/runMain <this class>"
+ *      Results will be written to "benchmarks/CollationBenchmark-results.txt".
+ * }}}
+ */
+
+object CollationBenchmark extends SqlBasedBenchmark {
+  private val collationTypes = Seq("UTF8_BINARY_LCASE", "UNICODE", "UTF8_BINARY", "UNICODE_CI")
+
+  def generateSeqInput(n: Long): Seq[UTF8String] = {
+    val input = Seq("ABC", "ABC", "aBC", "aBC", "abc", "abc", "DEF", "DEF", "def", "def",
+      "GHI", "ghi", "JKL", "jkl", "MNO", "mno", "PQR", "pqr", "STU", "stu", "VWX", "vwx",
+      "ABC", "ABC", "aBC", "aBC", "abc", "abc", "DEF", "DEF", "def", "def", "GHI", "ghi",
+      "JKL", "jkl", "MNO", "mno", "PQR", "pqr", "STU", "stu", "VWX", "vwx", "YZ")
+      .map(UTF8String.fromString)
+    val inputLong: Seq[UTF8String] = (0L until n).map(i => input(i.toInt % input.size))
+    inputLong
+  }
+
+  private def getDataFrame(strings: Seq[String]): DataFrame = {
+    val asPairs = strings.sliding(2, 1).toSeq.map {
+      case Seq(s1, s2) => (s1, s2)
+    }
+    val d = spark.createDataFrame(asPairs).toDF("s1", "s2")
+    d
+  }
+
+  private def generateDataframeInput(l: Long): DataFrame = {
+    getDataFrame(generateSeqInput(l).map(_.toString))
+  }
+
+  def benchmarkUTFStringEquals(collationTypes: Seq[String], utf8Strings: Seq[UTF8String]): Unit = {
+    val sublistStrings = utf8Strings
+
+    val benchmark = new Benchmark(
+      "collation unit benchmarks - equalsFunction",
+      utf8Strings.size * 10,
+      warmupTime = 4.seconds,
+      output = output)
+    collationTypes.foreach(collationType => {
+      val collation = CollationFactory.fetchCollation(collationType)
+      benchmark.addCase(s"$collationType") { _ =>
+        sublistStrings.foreach(s1 =>
+          utf8Strings.foreach(s =>
+            (0 to 10).foreach(_ =>
+              collation.equalsFunction(s, s1).booleanValue())
+          )
+        )
+      }
+    }
+    )
+    benchmark.run()
+  }
+  def benchmarkUTFStringCompare(collationTypes: Seq[String], utf8Strings: Seq[UTF8String]): Unit = {

Review Comment:
   ```suggestion
   
     def benchmarkUTFStringCompare(collationTypes: Seq[String], utf8Strings: Seq[UTF8String]): Unit = {
   ```




Re: [PR] [WIP][SPARK 46840] Add sql.execution.benchmark.CollationBenchmark.scala Scaffolding [spark]

Posted by "GideonPotok (via GitHub)" <gi...@apache.org>.

GideonPotok commented on PR #45453:
URL: https://github.com/apache/spark/pull/45453#issuecomment-2002522863

   @dbatomic I found that the best way to run the benchmarks is with GHA, so you can disregard the above point about performance when running them locally.
   
   GHA seems to be failing on the unrelated `org.apache.spark.sql.execution.benchmark.AggregateBenchmark`. I will ask around about why that is. https://github.com/GideonPotok/spark/actions/runs/8316439459
   ![image](https://github.com/apache/spark/assets/31429832/15e258d5-7f36-4605-b3eb-b95504b28d12)
   
   But please let me know whether the code itself looks like it is on the right track.
   



Re: [PR] [SPARK-46840][SQL][COLLATION] New Benchmarking Suite [spark]

Posted by "dbatomic (via GitHub)" <gi...@apache.org>.

dbatomic commented on code in PR #45453:
URL: https://github.com/apache/spark/pull/45453#discussion_r1534709781


##########
sql/core/benchmarks/CollationBenchmark-jdk21-results.txt:
##########
@@ -0,0 +1,26 @@
+OpenJDK 64-Bit Server VM 21.0.2+13-LTS on Linux 6.5.0-1016-azure
+AMD EPYC 7763 64-Core Processor
+filter df column with collation:                     Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
+-----------------------------------------------------------------------------------------------------------------------------------
+filter df column with collation - UNICODE_CI                    40             65          25          0.0     2001199.6       1.0X
+filter df column with collation - UNICODE                       19             28           7          0.0      958487.5       2.1X
+filter df column with collation - UTF8_BINARY_LCASE             15             18           4          0.0      773536.9       2.6X
+filter df column with collation - UTF8_BINARY                   14             16           4          0.0      683145.3       2.9X
+
+OpenJDK 64-Bit Server VM 21.0.2+13-LTS on Linux 6.5.0-1016-azure
+AMD EPYC 7763 64-Core Processor
+collation unit benchmarks:                Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
+------------------------------------------------------------------------------------------------------------------------
+equalsFunction - UTF8_BINARY                          1              1           0          1.9         520.0       1.0X
+collator.compare - UTF8_BINARY                        1              1           0          1.1         922.8       0.6X
+hashFunction - UTF8_BINARY                            3              3           0          0.3        3149.9       0.2X
+equalsFunction - UTF8_BINARY_LCASE                   77             79           5          0.0       77352.5       0.0X

Review Comment:
   If I am reading this correctly, UTF8_BINARY is ~100x faster than anything else, which is expected given the extra allocation and memory copy that the other collations incur. We have some work to do to get the other collations to within a 2-3x factor of UTF8_BINARY.
   
   By the way, I also ran your benchmarks on my machine. Everything looks valid.
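   
   For scale, the two `equalsFunction` rows above give 77352.5 / 520.0, or roughly 149x per row. The sketch below illustrates the general kind of overhead being described, using plain JDK strings rather than Spark's `UTF8String`. It is an assumption about what "extra allocation + memory copy" refers to, not a description of Spark's implementation, and `LcaseCompare` and its methods are hypothetical names:
   
   ```scala
   import java.util.Locale
   
   object LcaseCompare {
     // Allocating approach: builds two new lowercase strings on every call.
     def equalsViaLowercase(a: String, b: String): Boolean =
       a.toLowerCase(Locale.ROOT) == b.toLowerCase(Locale.ROOT)
   
     // Folds ASCII uppercase letters to lowercase; other characters pass through.
     private def lowerAscii(c: Char): Char =
       if (c >= 'A' && c <= 'Z') (c + 32).toChar else c
   
     // Allocation-free approach (ASCII only): compares character by character in place.
     def equalsAsciiIgnoreCase(a: String, b: String): Boolean = {
       a.length == b.length && {
         var i = 0
         while (i < a.length && lowerAscii(a.charAt(i)) == lowerAscii(b.charAt(i))) {
           i += 1
         }
         i == a.length
       }
     }
   }
   ```
   
   Both methods agree on ASCII input such as `"aBC"` and `"ABC"`, but only the first one creates intermediate strings, which is the sort of cost a binary comparison avoids.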




Re: [PR] [SPARK-46840][SQL][TESTS] Add `CollationBenchmark` [spark]

Posted by "GideonPotok (via GitHub)" <gi...@apache.org>.

GideonPotok commented on code in PR #45453:
URL: https://github.com/apache/spark/pull/45453#discussion_r1538204774


##########
sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/CollationBenchmark.scala:
##########
@@ -0,0 +1,117 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.sql.execution.benchmark
+
+import org.apache.spark.benchmark.Benchmark
+import org.apache.spark.sql.DataFrame
+import org.apache.spark.sql.catalyst.util.CollationFactory
+import org.apache.spark.sql.functions._
+import org.apache.spark.unsafe.types.UTF8String
+
+/**
+ * Benchmark to measure performance for comparisons between collated strings. To run this benchmark:
+ * {{{
+ *   1. without sbt:
+ *      bin/spark-submit --class <this class>
+ *        --jars <spark core test jar>,<spark catalyst test jar> <spark sql test jar>
+ *   2. build/sbt "sql/Test/runMain org.apache.spark.sql.execution.benchmark.CollationBenchmark"
+ *   3. generate result:
+ *      SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/Test/runMain <this class>"
+ *      Results will be written to "benchmarks/CollationBenchmark-results.txt".
+ * }}}
+ */
+
+object CollationBenchmark extends SqlBasedBenchmark {
+  private val collationTypes = Seq("UTF8_BINARY_LCASE", "UNICODE", "UTF8_BINARY", "UNICODE_CI")
+
+  def generateSeqInput(n: Long): Seq[UTF8String] = {
+    val input = Seq("ABC", "ABC", "aBC", "aBC", "abc", "abc", "DEF", "DEF", "def",
+      "def", "GHI", "ghi",
+      "JKL", "jkl", "MNO", "mno", "PQR", "pqr", "STU", "stu", "VWX", "vwx", "YZ",
+      "ABC", "ABC", "aBC", "aBC", "abc", "abc", "DEF", "DEF", "def", "def", "GHI", "ghi",
+      "JKL", "jkl", "MNO", "mno", "PQR", "pqr", "STU", "stu", "VWX", "vwx", "YZ")
+      .map(UTF8String.fromString)
+    val inputLong: Seq[UTF8String] = (0L until n).map(i => input(i.toInt % input.size))
+    inputLong
+  }
+
+  private def getDataFrame(strings: Seq[String]): DataFrame = {
+    val asPairs = strings.sliding(2, 1).toSeq.map {
+      case Seq(s1, s2) => (s1, s2)
+    }
+    val d = spark.createDataFrame(asPairs).toDF("s1", "s2")
+    d
+  }
+
+  private def generateDataframeInput(l: Long): DataFrame = {
+    getDataFrame(generateSeqInput(l).map(_.toString))
+  }
+
+  def benchmarkUTFString(collationTypes: Seq[String], utf8Strings: Seq[UTF8String]): Unit = {
+    val sublistStrings = utf8Strings
+
+    val benchmark = new Benchmark("collation unit benchmarks", utf8Strings.size, output = output)
+    collationTypes.foreach(collationType => {
+      val collation = CollationFactory.fetchCollation(collationType)
+      benchmark.addCase(s"equalsFunction - $collationType") { _ =>
+        sublistStrings.foreach(s1 =>
+          utf8Strings.foreach(s =>
+            collation.equalsFunction(s, s1).booleanValue()
+          )
+        )
+      }
+      benchmark.addCase(s"collator.compare - $collationType") { _ =>
+        sublistStrings.foreach(s1 =>
+          utf8Strings.foreach(s =>
+            collation.comparator.compare(s, s1)
+          )
+        )
+      }
+      benchmark.addCase(s"hashFunction - $collationType") { _ =>
+        sublistStrings.foreach(_ =>
+          utf8Strings.foreach(s =>
+            collation.hashFunction.applyAsLong(s)
+          )
+        )
+      }
+    }
+    )
+    benchmark.run()
+  }
+
+  def benchmarkFilterEqual(collationTypes: Seq[String],
+                           dfUncollated: DataFrame): Unit = {
+    val benchmark =
+      new Benchmark("filter df column with collation", dfUncollated.count(), output = output)
+    collationTypes.foreach(collationType => {
+      val dfCollated = dfUncollated.selectExpr(
+        s"collate(s2, '$collationType') as k2_$collationType",
+        s"collate(s1, '$collationType') as k1_$collationType")
+      benchmark.addCase(s"filter df column with collation - $collationType") { _ =>
+        dfCollated.where(col(s"k1_$collationType") === col(s"k2_$collationType"))
+          .queryExecution.executedPlan.executeCollect()

Review Comment:
   The issue I encountered with `noop()` was that it would hang indefinitely during local execution (for any benchmark I ran), at least with the default JVM settings. Interestingly, modifying `.jvmopts` did not appear to have any effect on the observed local JVM properties. I ultimately decided to sidestep the issue and just use `executeCollect`.
   
   However, when I ran tests in GitHub Actions (GHA) today, `noop()` functioned correctly, which was a new discovery for me (though I still see the same hang on my local machine). I had not realized it worked on GHA because, by the time I had familiarized myself with the benchmarking process in GHA, I had already switched to `executeCollect`.
   
   I will switch back to `noop()`, as I am aware it is the preferred choice in the codebase. However, can you shed some light on why `noop` tends to be preferred? I would think that both approaches (calling `executeCollect`, and executing a no-op/in-memory write) are effectively identical.
   
   I am aware that, when benchmarking, a write is preferable to functions such as `count` or `show`, because Spark might optimize the query execution plan for those calls, potentially omitting the very transformation we intend to benchmark if it is not needed to produce the result. Does `executeCollect` carry a similar risk of bypassing essential transformations, as observed with `count`?
   
   For reference, here are the GHA test runs:
   - [GHA Test Run 1](https://github.com/GideonPotok/spark/actions/runs/8425725217)
   - [GHA Test Run 2](https://github.com/GideonPotok/spark/actions/runs/8425691894)
    




Re: [PR] [SPARK-46840][SQL][TESTS] Add `CollationBenchmark` [spark]

Posted by "GideonPotok (via GitHub)" <gi...@apache.org>.

GideonPotok commented on code in PR #45453:
URL: https://github.com/apache/spark/pull/45453#discussion_r1538204774


##########
sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/CollationBenchmark.scala:
##########
@@ -0,0 +1,117 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.sql.execution.benchmark
+
+import org.apache.spark.benchmark.Benchmark
+import org.apache.spark.sql.DataFrame
+import org.apache.spark.sql.catalyst.util.CollationFactory
+import org.apache.spark.sql.functions._
+import org.apache.spark.unsafe.types.UTF8String
+
+/**
+ * Benchmark to measure performance for comparisons between collated strings. To run this benchmark:
+ * {{{
+ *   1. without sbt:
+ *      bin/spark-submit --class <this class>
+ *        --jars <spark core test jar>,<spark catalyst test jar> <spark sql test jar>
+ *   2. build/sbt "sql/Test/runMain org.apache.spark.sql.execution.benchmark.CollationBenchmark"
+ *   3. generate result:
+ *      SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/Test/runMain <this class>"
+ *      Results will be written to "benchmarks/CollationBenchmark-results.txt".
+ * }}}
+ */
+
+object CollationBenchmark extends SqlBasedBenchmark {
+  private val collationTypes = Seq("UTF8_BINARY_LCASE", "UNICODE", "UTF8_BINARY", "UNICODE_CI")
+
+  def generateSeqInput(n: Long): Seq[UTF8String] = {
+    val input = Seq("ABC", "ABC", "aBC", "aBC", "abc", "abc", "DEF", "DEF", "def",
+      "def", "GHI", "ghi",
+      "JKL", "jkl", "MNO", "mno", "PQR", "pqr", "STU", "stu", "VWX", "vwx", "YZ",
+      "ABC", "ABC", "aBC", "aBC", "abc", "abc", "DEF", "DEF", "def", "def", "GHI", "ghi",
+      "JKL", "jkl", "MNO", "mno", "PQR", "pqr", "STU", "stu", "VWX", "vwx", "YZ")
+      .map(UTF8String.fromString)
+    val inputLong: Seq[UTF8String] = (0L until n).map(i => input(i.toInt % input.size))
+    inputLong
+  }
+
+  private def getDataFrame(strings: Seq[String]): DataFrame = {
+    val asPairs = strings.sliding(2, 1).toSeq.map {
+      case Seq(s1, s2) => (s1, s2)
+    }
+    val d = spark.createDataFrame(asPairs).toDF("s1", "s2")
+    d
+  }
+
+  private def generateDataframeInput(l: Long): DataFrame = {
+    getDataFrame(generateSeqInput(l).map(_.toString))
+  }
+
+  def benchmarkUTFString(collationTypes: Seq[String], utf8Strings: Seq[UTF8String]): Unit = {
+    val sublistStrings = utf8Strings
+
+    val benchmark = new Benchmark("collation unit benchmarks", utf8Strings.size, output = output)
+    collationTypes.foreach(collationType => {
+      val collation = CollationFactory.fetchCollation(collationType)
+      benchmark.addCase(s"equalsFunction - $collationType") { _ =>
+        sublistStrings.foreach(s1 =>
+          utf8Strings.foreach(s =>
+            collation.equalsFunction(s, s1).booleanValue()
+          )
+        )
+      }
+      benchmark.addCase(s"collator.compare - $collationType") { _ =>
+        sublistStrings.foreach(s1 =>
+          utf8Strings.foreach(s =>
+            collation.comparator.compare(s, s1)
+          )
+        )
+      }
+      benchmark.addCase(s"hashFunction - $collationType") { _ =>
+        sublistStrings.foreach(_ =>
+          utf8Strings.foreach(s =>
+            collation.hashFunction.applyAsLong(s)
+          )
+        )
+      }
+    }
+    )
+    benchmark.run()
+  }
+
+  def benchmarkFilterEqual(collationTypes: Seq[String],
+                           dfUncollated: DataFrame): Unit = {
+    val benchmark =
+      new Benchmark("filter df column with collation", dfUncollated.count(), output = output)
+    collationTypes.foreach(collationType => {
+      val dfCollated = dfUncollated.selectExpr(
+        s"collate(s2, '$collationType') as k2_$collationType",
+        s"collate(s1, '$collationType') as k1_$collationType")
+      benchmark.addCase(s"filter df column with collation - $collationType") { _ =>
+        dfCollated.where(col(s"k1_$collationType") === col(s"k2_$collationType"))
+          .queryExecution.executedPlan.executeCollect()

Review Comment:
   The issue I encountered with `noop()` was that it would hang indefinitely during local execution (for any benchmark I ran), at least with the default JVM settings. Interestingly, modifying `.jvmopts` did not appear to have any effect on the observed local JVM properties. I ultimately decided to sidestep the issue and just use `executeCollect`.
   
   However, `noop()` does function correctly in GHA. So let's discuss switching to `noop`, as I am aware it is the preferred choice in the codebase. Can you shed some light on why `noop` tends to be preferred? I would think that both approaches (calling `executeCollect`, and executing a no-op/in-memory write) are effectively identical.
   
   I am aware that, when benchmarking, a write is preferable to functions such as `count` or `show`, because Spark might optimize the query execution plan for those calls, potentially omitting the very transformation we intend to benchmark if it is not needed to produce the result. Does `executeCollect` carry a similar risk of bypassing essential transformations, as observed with `count`?
   
   For reference, here are the GHA test runs:
   - [GHA Test Run 1](https://github.com/GideonPotok/spark/actions/runs/8425725217)
   - [GHA Test Run 2](https://github.com/GideonPotok/spark/actions/runs/8425691894)
    




Re: [PR] [SPARK-46840][SQL][TESTS] Add `CollationBenchmark` [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.

cloud-fan commented on code in PR #45453:
URL: https://github.com/apache/spark/pull/45453#discussion_r1537852758


##########
sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/CollationBenchmark.scala:
##########
@@ -0,0 +1,117 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.sql.execution.benchmark
+
+import org.apache.spark.benchmark.Benchmark
+import org.apache.spark.sql.DataFrame
+import org.apache.spark.sql.catalyst.util.CollationFactory
+import org.apache.spark.sql.functions._
+import org.apache.spark.unsafe.types.UTF8String
+
+/**
+ * Benchmark to measure performance for joins. To run this benchmark:

Review Comment:
   It's not for joins; let's update the classdoc.




Re: [PR] [SPARK-46840][SQL][TESTS] Add `CollationBenchmark` [spark]

Posted by "GideonPotok (via GitHub)" <gi...@apache.org>.

GideonPotok commented on code in PR #45453:
URL: https://github.com/apache/spark/pull/45453#discussion_r1538211202


##########
sql/core/benchmarks/CollationBenchmark-results.txt:
##########
@@ -0,0 +1,26 @@
+OpenJDK 64-Bit Server VM 17.0.10+7-LTS on Linux 6.5.0-1016-azure
+AMD EPYC 7763 64-Core Processor
+filter df column with collation:                     Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
+-----------------------------------------------------------------------------------------------------------------------------------
+filter df column with collation - UNICODE_CI                   403            463          39          0.0    20147470.0       1.0X
+filter df column with collation - UNICODE                      187            223          37          0.0     9339586.0       2.2X
+filter df column with collation - UTF8_BINARY_LCASE            426            434           7          0.0    21300903.4       0.9X
+filter df column with collation - UTF8_BINARY                  188            199           5          0.0     9403169.1       2.1X
+
+OpenJDK 64-Bit Server VM 17.0.10+7-LTS on Linux 6.5.0-1016-azure
+AMD EPYC 7763 64-Core Processor
+collation unit benchmarks:                Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
+------------------------------------------------------------------------------------------------------------------------
+equalsFunction - UTF8_BINARY                          0              0           0         10.4          96.6       1.0X

Review Comment:
   @MaxGekk Alternatively, I could see whether I can switch the time unit to nanoseconds.
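
The `0` entries in the time columns above are presumably sub-millisecond runs reported in whole milliseconds. A minimal sketch of that effect in plain Scala, assuming nothing about Spark's `Benchmark` internals (`NanoTiming`, `timeNanos`, and `toWholeMillis` are hypothetical names used only for illustration):

```scala
object NanoTiming {
  /** Runs `body` once and returns the elapsed wall-clock time in nanoseconds. */
  def timeNanos(body: => Unit): Long = {
    val start = System.nanoTime()
    body
    System.nanoTime() - start
  }

  /** Converts to whole milliseconds; any run shorter than 1 ms becomes 0. */
  def toWholeMillis(nanos: Long): Long = nanos / 1000000L

  def main(args: Array[String]): Unit = {
    val elapsed = timeNanos { "ABC".equalsIgnoreCase("abc") }
    println(s"elapsed: $elapsed ns, as whole ms: ${toWholeMillis(elapsed)}")
  }
}
```

Reporting the raw nanosecond value keeps short cases distinguishable, whereas the millisecond view collapses them all to zero.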




Re: [PR] [SPARK-46840][SQL][TESTS] Add `CollationBenchmark` [spark]

Posted by "GideonPotok (via GitHub)" <gi...@apache.org>.

GideonPotok commented on code in PR #45453:
URL: https://github.com/apache/spark/pull/45453#discussion_r1539275734


##########
sql/core/benchmarks/CollationBenchmark-results.txt:
##########
@@ -0,0 +1,26 @@
+OpenJDK 64-Bit Server VM 17.0.10+7-LTS on Linux 6.5.0-1016-azure
+AMD EPYC 7763 64-Core Processor
+filter df column with collation:                     Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
+-----------------------------------------------------------------------------------------------------------------------------------
+filter df column with collation - UNICODE_CI                   403            463          39          0.0    20147470.0       1.0X
+filter df column with collation - UNICODE                      187            223          37          0.0     9339586.0       2.2X
+filter df column with collation - UTF8_BINARY_LCASE            426            434           7          0.0    21300903.4       0.9X
+filter df column with collation - UTF8_BINARY                  188            199           5          0.0     9403169.1       2.1X
+
+OpenJDK 64-Bit Server VM 17.0.10+7-LTS on Linux 6.5.0-1016-azure
+AMD EPYC 7763 64-Core Processor
+collation unit benchmarks:                Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
+------------------------------------------------------------------------------------------------------------------------
+equalsFunction - UTF8_BINARY                          0              0           0         10.4          96.6       1.0X

Review Comment:
   I am now making the change to the `numIters` param of `addCase`.
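
A per-case iteration count lets the harness own the repetition instead of a loop inside the benchmark body. The sketch below illustrates the idea with a hypothetical stand-in class, `TinyBench`; it is not Spark's `Benchmark`, and its `addCase` signature is an assumption made for this example only:

```scala
import scala.collection.mutable.ArrayBuffer

// Hypothetical stand-in for a benchmark harness; not Spark's Benchmark class.
final class TinyBench {
  private case class Case(name: String, numIters: Int, body: Int => Unit)
  private val cases = ArrayBuffer.empty[Case]

  /** Registers a case that the harness will run exactly `numIters` times. */
  def addCase(name: String, numIters: Int)(body: Int => Unit): Unit = {
    require(numIters > 0, "numIters must be positive")
    cases += Case(name, numIters, body)
  }

  /** Runs every case and returns the best (smallest) time per case, in ns. */
  def run(): Map[String, Long] =
    cases.map { c =>
      val best = (0 until c.numIters).map { i =>
        val start = System.nanoTime()
        c.body(i)
        System.nanoTime() - start
      }.min
      c.name -> best
    }.toMap
}
```

With the count held by the harness, the body contains only the operation under test. This also avoids off-by-one surprises in hand-written loops: in Scala, `(0 to 10)` is inclusive and yields 11 elements, while `(0 until 10)` yields 10.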




Re: [PR] [SPARK-46840][SQL][TESTS] Add `CollationBenchmark` [spark]

Posted by "GideonPotok (via GitHub)" <gi...@apache.org>.

GideonPotok commented on code in PR #45453:
URL: https://github.com/apache/spark/pull/45453#discussion_r1541176949


##########
sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/CollationBenchmark.scala:
##########
@@ -0,0 +1,165 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.sql.execution.benchmark
+
+import scala.concurrent.duration._
+
+import org.apache.spark.benchmark.Benchmark
+import org.apache.spark.sql.DataFrame
+import org.apache.spark.sql.catalyst.util.CollationFactory
+import org.apache.spark.sql.functions._
+import org.apache.spark.unsafe.types.UTF8String
+
+/**
+ * Benchmark to measure performance for comparisons between collated strings. To run this benchmark:
+ * {{{
+ *   1. without sbt:
+ *      bin/spark-submit --class <this class>
+ *        --jars <spark core test jar>,<spark catalyst test jar> <spark sql test jar>
+ *   2. build/sbt "sql/Test/runMain org.apache.spark.sql.execution.benchmark.CollationBenchmark"
+ *   3. generate result:
+ *      SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/Test/runMain <this class>"
+ *      Results will be written to "benchmarks/CollationBenchmark-results.txt".
+ * }}}
+ */
+
+object CollationBenchmark extends SqlBasedBenchmark {
+  private val collationTypes = Seq("UTF8_BINARY_LCASE", "UNICODE", "UTF8_BINARY", "UNICODE_CI")
+
+  def generateSeqInput(n: Long): Seq[UTF8String] = {
+    val input = Seq("ABC", "ABC", "aBC", "aBC", "abc", "abc", "DEF", "DEF", "def", "def",
+      "GHI", "ghi", "JKL", "jkl", "MNO", "mno", "PQR", "pqr", "STU", "stu", "VWX", "vwx",
+      "ABC", "ABC", "aBC", "aBC", "abc", "abc", "DEF", "DEF", "def", "def", "GHI", "ghi",
+      "JKL", "jkl", "MNO", "mno", "PQR", "pqr", "STU", "stu", "VWX", "vwx", "YZ")
+      .map(UTF8String.fromString)
+    val inputLong: Seq[UTF8String] = (0L until n).map(i => input(i.toInt % input.size))
+    inputLong
+  }
+
+  private def getDataFrame(strings: Seq[String]): DataFrame = {
+    val asPairs = strings.sliding(2, 1).toSeq.map {
+      case Seq(s1, s2) => (s1, s2)
+    }
+    val d = spark.createDataFrame(asPairs).toDF("s1", "s2")
+    d
+  }
+
+  private def generateDataframeInput(l: Long): DataFrame = {
+    getDataFrame(generateSeqInput(l).map(_.toString))
+  }
+
+  def benchmarkUTFStringEquals(collationTypes: Seq[String], utf8Strings: Seq[UTF8String]): Unit = {
+    val sublistStrings = utf8Strings
+
+    val benchmark = new Benchmark(
+      "collation unit benchmarks - equalsFunction",
+      utf8Strings.size * 10,
+      warmupTime = 4.seconds,
+      output = output)
+    collationTypes.foreach(collationType => {
+      val collation = CollationFactory.fetchCollation(collationType)
+      benchmark.addCase(s"$collationType") { _ =>
+        sublistStrings.foreach(s1 =>
+          utf8Strings.foreach(s =>
+            (0 to 10).foreach(_ =>
+              collation.equalsFunction(s, s1).booleanValue())
+          )
+        )
+      }
+    }
+    )
+    benchmark.run()
+  }
+  def benchmarkUTFStringCompare(collationTypes: Seq[String], utf8Strings: Seq[UTF8String]): Unit = {
+    val sublistStrings = utf8Strings
+
+    val benchmark = new Benchmark(
+      "collation unit benchmarks - compareFunction",
+      utf8Strings.size * 10,
+      warmupTime = 4.seconds,
+      output = output)
+    collationTypes.foreach(collationType => {
+      val collation = CollationFactory.fetchCollation(collationType)
+      benchmark.addCase(s"$collationType") { _ =>
+        sublistStrings.foreach(s1 =>
+          utf8Strings.foreach(s =>
+            (0 to 10).foreach(_ =>
+              collation.comparator.compare(s, s1)
+            )
+          )
+        )
+      }
+    }
+    )
+    benchmark.run()
+  }
+
+  def benchmarkUTFStringHashFunction(
+      collationTypes: Seq[String],
+      utf8Strings: Seq[UTF8String]): Unit = {
+    val sublistStrings = utf8Strings
+
+    val benchmark = new Benchmark(
+      "collation unit benchmarks - hashFunction",
+      utf8Strings.size * 10,
+      warmupTime = 4.seconds,
+      output = output)
+    collationTypes.foreach(collationType => {
+      val collation = CollationFactory.fetchCollation(collationType)
+      benchmark.addCase(s"$collationType") { _ =>
+        sublistStrings.foreach(_ =>
+          utf8Strings.foreach(s =>
+            (0 to 10).foreach(_ =>
+              collation.hashFunction.applyAsLong(s)
+            )
+          )
+        )
+      }
+    }
+    )
+    benchmark.run()
+  }
+
+  def benchmarkFilterEqual(
+      collationTypes: Seq[String],
+      dfUncollated: DataFrame): Unit = {
+    val benchmark =
+      new Benchmark(
+        "filter df column with collation",
+        dfUncollated.count(),
+        warmupTime = 4.seconds,
+        output = output)
+    collationTypes.foreach(collationType => {
+      val dfCollated = dfUncollated.selectExpr(
+        s"collate(s2, '$collationType') as k2_$collationType",
+        s"collate(s1, '$collationType') as k1_$collationType")
+      benchmark.addCase(s"filter df column with collation - $collationType") { _ =>
+        dfCollated.where(col(s"k1_$collationType") === col(s"k2_$collationType"))

Review Comment:
   It is a fair question. For one, it is a good proof of concept that the performance differences at the internal unit level do translate into actual performance regressions when real users run Spark, which shows that improving the performance is going to be worth our time.
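
For readers following the quoted code, here is a plain-Scala sketch of what the filter case computes: adjacent strings are paired with `sliding(2, 1)`, as in `getDataFrame`, and a row survives the filter when its two values are equal under the collation. The two predicates are assumed stand-ins (exact equality for a binary collation, equality after lowercasing for a case-insensitive one), not Spark's actual collation implementations, and `FilterSketch` is a hypothetical name:

```scala
import java.util.Locale

object FilterSketch {
  val input: Seq[String] = Seq("ABC", "ABC", "aBC", "abc", "DEF", "def")

  // Adjacent pairs, built the same way getDataFrame builds (s1, s2).
  val pairs: Seq[(String, String)] =
    input.sliding(2, 1).collect { case Seq(a, b) => (a, b) }.toSeq

  // Assumed stand-in for a binary collation: exact string equality.
  def binaryMatches: Int = pairs.count { case (a, b) => a == b }

  // Assumed stand-in for a case-insensitive collation: lowercase, then compare.
  def lcaseMatches: Int = pairs.count { case (a, b) =>
    a.toLowerCase(Locale.ROOT) == b.toLowerCase(Locale.ROOT)
  }
}
```

The case-insensitive predicate does extra work per row before comparing, which is the kind of unit-level cost an end-to-end filter benchmark would surface.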




Re: [PR] [SPARK-46840][SQL][TESTS] Add `CollationBenchmark` [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.

cloud-fan commented on PR #45453:
URL: https://github.com/apache/spark/pull/45453#issuecomment-2029846119

   thanks, merging to master!



Re: [PR] [SPARK-46840][SQL][COLLATION] New Benchmarking Suite [spark]

Posted by "GideonPotok (via GitHub)" <gi...@apache.org>.

GideonPotok commented on code in PR #45453:
URL: https://github.com/apache/spark/pull/45453#discussion_r1534696568


##########
sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/CollationBenchmark.scala:
##########
@@ -0,0 +1,119 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.sql.execution.benchmark
+
+import scala.util.Random
+
+import org.apache.spark.benchmark.Benchmark
+import org.apache.spark.sql.{functions, Column}
+import org.apache.spark.sql.catalyst.util.CollationFactory
+import org.apache.spark.sql.functions._
+import org.apache.spark.unsafe.types.UTF8String
+
+/**
+ * Benchmark to measure performance for joins. To run this benchmark:
+ * {{{
+ *   1. without sbt:
+ *      bin/spark-submit --class <this class>
+ *        --jars <spark core test jar>,<spark catalyst test jar> <spark sql test jar>
+ *   2. build/sbt "sql/Test/runMain org.apache.spark.sql.execution.benchmark.CollationBenchmark"
+ *   3. generate result:
+ *      SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/Test/runMain <this class>"
+ *      Results will be written to "benchmarks/JoinBenchmark-results.txt".
+ * }}}
+ */
+
+object CollationBenchmark extends SqlBasedBenchmark {
+  private val collationTypes = Seq("UTF8_BINARY", "UTF8_BINARY_LCASE", "UNICODE", "UNICODE_CI")
+
+  def generateUTF8Strings(n: Int): Seq[UTF8String] = {
+    // Generate n UTF8Strings
+    Seq("ABC", "aBC", "abc", "DEF", "def", "GHI", "ghi", "JKL", "jkl",
+      "MNO", "mno", "PQR", "pqr", "STU", "stu", "VWX", "vwx", "YZ").map(UTF8String.fromString) ++
+    (18 to n).map(i => UTF8String.fromString(Random.nextString(i % 25))).sortBy(_.hashCode())
+  }
+
+  def benchmarkUTFString(collationTypes: Seq[String], utf8Strings: Seq[UTF8String]): Unit = {
+    val benchmark = collationTypes.foldLeft(
+      new Benchmark(s"collation unit benchmarks", utf8Strings.size, output = output)) {
+      (b, collationType) =>
+        b.addCase(s"equalsFunction - $collationType") { _ =>
+          val collation = CollationFactory.fetchCollation(collationType)
+          utf8Strings.slice(0, 20).foreach(s1 =>
+            utf8Strings.filter(s =>
+            collation.equalsFunction(s, s1).booleanValue()
+            )
+          )
+        }
+        b.addCase(s"collator.compare - $collationType") { _ =>
+          val collation = CollationFactory.fetchCollation(collationType)
+          utf8Strings.slice(0, 20).foreach(s1 =>
+            utf8Strings.sortBy(s =>
+              collation.comparator.compare(s, s1)
+            )
+          )
+        }
+        b.addCase(s"hashFunction - $collationType") { _ =>
+          val collation = CollationFactory.fetchCollation(collationType)
+          utf8Strings.map(s => collation.hashFunction.applyAsLong(s))
+        }
+        b
+    }
+    benchmark.run()
+  }
+
+  def collationBenchmarkFilterEqual(

Review Comment:
   For example, check out the DataFrame filtering speedup for `UTF8_BINARY_LCASE` when going from Java 17 to 21!




Re: [PR] [SPARK-46840][SQL][TESTS] Add `CollationBenchmark` [spark]

Posted by "GideonPotok (via GitHub)" <gi...@apache.org>.

GideonPotok commented on code in PR #45453:
URL: https://github.com/apache/spark/pull/45453#discussion_r1541440974


##########
sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/CollationBenchmark.scala:
##########
@@ -0,0 +1,165 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.sql.execution.benchmark
+
+import scala.concurrent.duration._
+
+import org.apache.spark.benchmark.Benchmark
+import org.apache.spark.sql.DataFrame
+import org.apache.spark.sql.catalyst.util.CollationFactory
+import org.apache.spark.sql.functions._
+import org.apache.spark.unsafe.types.UTF8String
+
+/**
+ * Benchmark to measure performance for comparisons between collated strings. To run this benchmark:
+ * {{{
+ *   1. without sbt:
+ *      bin/spark-submit --class <this class>
+ *        --jars <spark core test jar>,<spark catalyst test jar> <spark sql test jar>
+ *   2. build/sbt "sql/Test/runMain org.apache.spark.sql.execution.benchmark.CollationBenchmark"
+ *   3. generate result:
+ *      SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/Test/runMain <this class>"
+ *      Results will be written to "benchmarks/CollationBenchmark-results.txt".
+ * }}}
+ */
+
+object CollationBenchmark extends SqlBasedBenchmark {
+  private val collationTypes = Seq("UTF8_BINARY_LCASE", "UNICODE", "UTF8_BINARY", "UNICODE_CI")
+
+  def generateSeqInput(n: Long): Seq[UTF8String] = {
+    val input = Seq("ABC", "ABC", "aBC", "aBC", "abc", "abc", "DEF", "DEF", "def", "def",
+      "GHI", "ghi", "JKL", "jkl", "MNO", "mno", "PQR", "pqr", "STU", "stu", "VWX", "vwx",
+      "ABC", "ABC", "aBC", "aBC", "abc", "abc", "DEF", "DEF", "def", "def", "GHI", "ghi",
+      "JKL", "jkl", "MNO", "mno", "PQR", "pqr", "STU", "stu", "VWX", "vwx", "YZ")
+      .map(UTF8String.fromString)
+    val inputLong: Seq[UTF8String] = (0L until n).map(i => input(i.toInt % input.size))
+    inputLong
+  }
+
+  private def getDataFrame(strings: Seq[String]): DataFrame = {
+    val asPairs = strings.sliding(2, 1).toSeq.map {
+      case Seq(s1, s2) => (s1, s2)
+    }
+    val d = spark.createDataFrame(asPairs).toDF("s1", "s2")
+    d
+  }
+
+  private def generateDataframeInput(l: Long): DataFrame = {
+    getDataFrame(generateSeqInput(l).map(_.toString))
+  }
+
+  def benchmarkUTFStringEquals(collationTypes: Seq[String], utf8Strings: Seq[UTF8String]): Unit = {
+    val sublistStrings = utf8Strings
+
+    val benchmark = new Benchmark(
+      "collation unit benchmarks - equalsFunction",
+      utf8Strings.size * 10,
+      warmupTime = 4.seconds,
+      output = output)
+    collationTypes.foreach(collationType => {
+      val collation = CollationFactory.fetchCollation(collationType)
+      benchmark.addCase(s"$collationType") { _ =>
+        sublistStrings.foreach(s1 =>
+          utf8Strings.foreach(s =>
+            (0 to 10).foreach(_ =>
+              collation.equalsFunction(s, s1).booleanValue())
+          )
+        )
+      }
+    }
+    )
+    benchmark.run()
+  }
+  def benchmarkUTFStringCompare(collationTypes: Seq[String], utf8Strings: Seq[UTF8String]): Unit = {
+    val sublistStrings = utf8Strings
+
+    val benchmark = new Benchmark(
+      "collation unit benchmarks - compareFunction",
+      utf8Strings.size * 10,
+      warmupTime = 4.seconds,
+      output = output)
+    collationTypes.foreach(collationType => {
+      val collation = CollationFactory.fetchCollation(collationType)
+      benchmark.addCase(s"$collationType") { _ =>
+        sublistStrings.foreach(s1 =>
+          utf8Strings.foreach(s =>
+            (0 to 10).foreach(_ =>
+              collation.comparator.compare(s, s1)
+            )
+          )
+        )
+      }
+    }
+    )
+    benchmark.run()
+  }
+
+  def benchmarkUTFStringHashFunction(
+      collationTypes: Seq[String],
+      utf8Strings: Seq[UTF8String]): Unit = {
+    val sublistStrings = utf8Strings
+
+    val benchmark = new Benchmark(
+      "collation unit benchmarks - hashFunction",
+      utf8Strings.size * 10,
+      warmupTime = 4.seconds,
+      output = output)
+    collationTypes.foreach(collationType => {
+      val collation = CollationFactory.fetchCollation(collationType)
+      benchmark.addCase(s"$collationType") { _ =>
+        sublistStrings.foreach(_ =>
+          utf8Strings.foreach(s =>
+            (0 to 10).foreach(_ =>
+              collation.hashFunction.applyAsLong(s)
+            )
+          )
+        )
+      }
+    }
+    )
+    benchmark.run()
+  }
+
+  def benchmarkFilterEqual(
+      collationTypes: Seq[String],
+      dfUncollated: DataFrame): Unit = {
+    val benchmark =
+      new Benchmark(
+        "filter df column with collation",
+        dfUncollated.count(),
+        warmupTime = 4.seconds,
+        output = output)
+    collationTypes.foreach(collationType => {
+      val dfCollated = dfUncollated.selectExpr(
+        s"collate(s2, '$collationType') as k2_$collationType",
+        s"collate(s1, '$collationType') as k1_$collationType")
+      benchmark.addCase(s"filter df column with collation - $collationType") { _ =>
+        dfCollated.where(col(s"k1_$collationType") === col(s"k2_$collationType"))

Review Comment:
   @cloud-fan Okay. I defer to you and @dbatomic, as both of you wanted it removed, so I removed the e2e testing.




Re: [PR] [SPARK-46840][SQL][TESTS] Add `CollationBenchmark` [spark]

Posted by "GideonPotok (via GitHub)" <gi...@apache.org>.

GideonPotok commented on PR #45453:
URL: https://github.com/apache/spark/pull/45453#issuecomment-2025233876

   @MaxGekk 



Re: [PR] [SPARK-46840][SQL][TESTS] Add `CollationBenchmark` [spark]

Posted by "GideonPotok (via GitHub)" <gi...@apache.org>.

GideonPotok commented on PR #45453:
URL: https://github.com/apache/spark/pull/45453#issuecomment-2018610421

   @MaxGekk / @cloud-fan - made suggested changes. Can you do another round of review on your side and merge if everything looks fine?



Re: [PR] [SPARK-46840][SQL][TESTS] Add `CollationBenchmark` [spark]

Posted by "GideonPotok (via GitHub)" <gi...@apache.org>.

GideonPotok commented on code in PR #45453:
URL: https://github.com/apache/spark/pull/45453#discussion_r1537987211


##########
sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/CollationBenchmark.scala:
##########
@@ -0,0 +1,117 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.sql.execution.benchmark
+
+import org.apache.spark.benchmark.Benchmark
+import org.apache.spark.sql.DataFrame
+import org.apache.spark.sql.catalyst.util.CollationFactory
+import org.apache.spark.sql.functions._
+import org.apache.spark.unsafe.types.UTF8String
+
+/**
+ * Benchmark to measure performance for joins. To run this benchmark:

Review Comment:
   ```suggestion
    * Benchmark to measure performance for operations on strings with the various collation types. To run this benchmark:
   ```




Re: [PR] [SPARK-46840][SQL][COLLATION] New Benchmarking Suite [spark]

Posted by "dbatomic (via GitHub)" <gi...@apache.org>.

dbatomic commented on code in PR #45453:
URL: https://github.com/apache/spark/pull/45453#discussion_r1536790760


##########
sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/CollationBenchmark.scala:
##########
@@ -0,0 +1,112 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.sql.execution.benchmark
+
+import org.apache.spark.benchmark.Benchmark
+import org.apache.spark.sql.DataFrame
+import org.apache.spark.sql.catalyst.util.CollationFactory
+import org.apache.spark.sql.functions._
+import org.apache.spark.unsafe.types.UTF8String
+
+/**
+ * Benchmark to measure performance for joins. To run this benchmark:
+ * {{{
+ *   1. without sbt:
+ *      bin/spark-submit --class <this class>
+ *        --jars <spark core test jar>,<spark catalyst test jar> <spark sql test jar>
+ *   2. build/sbt "sql/Test/runMain org.apache.spark.sql.execution.benchmark.CollationBenchmark"
+ *   3. generate result:
+ *      SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/Test/runMain <this class>"
+ *      Results will be written to "benchmarks/JoinBenchmark-results.txt".
+ * }}}
+ */
+
+object CollationBenchmark extends SqlBasedBenchmark {
+  private val collationTypes = Seq("UTF8_BINARY", "UTF8_BINARY_LCASE", "UNICODE", "UNICODE_CI")
+
+  def generateUTF8Strings(n: Int): Seq[UTF8String] = {
+    // Generate n UTF8Strings
+    Seq("ABC", "aBC", "abc", "DEF", "def", "GHI", "ghi", "JKL", "jkl",
+      "MNO", "mno", "PQR", "pqr", "STU", "stu", "VWX", "vwx", "YZ").map(UTF8String.fromString) ++
+    (18 to n).map(i => UTF8String.fromString(i.toOctalString))
+  }
+
+  def benchmarkUTFString(collationTypes: Seq[String], utf8Strings: Seq[UTF8String]): Unit = {
+    val sublistStrings = utf8Strings
+    val benchmark = collationTypes.foldLeft(
+      new Benchmark(s"collation unit benchmarks", utf8Strings.size, output = output)) {
+      (b, collationType) =>
+        val collation = CollationFactory.fetchCollation(collationType)
+        b.addCase(s"equalsFunction - $collationType") { _ =>
+          sublistStrings.foreach(s1 =>
+            utf8Strings.foreach(s =>
+            collation.equalsFunction(s, s1).booleanValue()
+            )
+          )
+        }
+        b.addCase(s"collator.compare - $collationType") { _ =>
+          sublistStrings.foreach(s1 =>
+            utf8Strings.foreach(s =>
+              collation.comparator.compare(s, s1)
+            )
+          )
+        }
+        b.addCase(s"hashFunction - $collationType") { _ =>
+          sublistStrings.foreach(_ =>
+            utf8Strings.foreach(s =>
+              collation.hashFunction.applyAsLong(s)
+            )
+          )
+        }
+        b
+    }
+    benchmark.run()
+  }
+
+  def df1: DataFrame = {
+    val d = spark.createDataFrame(Seq(
+      ("ABC", "ABC"), ("aBC", "abc"), ("abc", "ABC"), ("DEF", "DEF"), ("def", "DEF"),
+      ("GHI", "GHI"), ("ghi", "IG"), ("JKL", "NOP"), ("jkl", "LKJ"), ("mnO", "MNO"),
+      ("hello", "hola"))
+    ).toDF("s1", "s2")
+    d
+  }
+
+  def collationBenchmarkFilterEqual(collationTypes: Seq[String]): Unit = {
+    val benchmark = collationTypes.foldLeft(
+      new Benchmark(s"filter df column with collation", 11, output = output)) {

Review Comment:
   What does 11 stand for? Can you use `d.length` here?
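
One way to avoid the literal is to derive the count from the data itself; a later revision quoted elsewhere in this thread passes `dfUncollated.count()`. A minimal plain-Scala sketch of the same idea, where `BenchmarkInput`, `rows`, and `numRows` are hypothetical names and the pair list is shortened for illustration:

```scala
object BenchmarkInput {
  // Same shape as the (s1, s2) pairs in df1 above, shortened for illustration.
  val rows: Seq[(String, String)] = Seq(
    ("ABC", "ABC"), ("aBC", "abc"), ("hello", "hola"))

  // Row count derived from the data, so it cannot drift from a hard-coded literal.
  val numRows: Long = rows.length.toLong
}
```

Adding or removing a pair then updates the reported row count automatically.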




Re: [PR] [SPARK-46840][SQL][TESTS] Add `CollationBenchmark` [spark]

Posted by "GideonPotok (via GitHub)" <gi...@apache.org>.

GideonPotok commented on PR #45453:
URL: https://github.com/apache/spark/pull/45453#issuecomment-2029759042

   @cloud-fan @dbatomic @MaxGekk 



Re: [PR] [SPARK-46840][SQL][TESTS] Add `CollationBenchmark` [spark]

Posted by "stefankandic (via GitHub)" <gi...@apache.org>.

stefankandic commented on PR #45453:
URL: https://github.com/apache/spark/pull/45453#issuecomment-2030162287

   @dbatomic which ones are 100x? The biggest differences I'm seeing in the benchmark numbers are around 10x.



Re: [PR] [SPARK-46840][SQL][COLLATION] New Benchmarking Suite [spark]

Posted by "dbatomic (via GitHub)" <gi...@apache.org>.

dbatomic commented on code in PR #45453:
URL: https://github.com/apache/spark/pull/45453#discussion_r1531893691


##########
sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/CollationBenchmark.scala:
##########
@@ -0,0 +1,119 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.sql.execution.benchmark
+
+import scala.util.Random
+
+import org.apache.spark.benchmark.Benchmark
+import org.apache.spark.sql.{functions, Column}
+import org.apache.spark.sql.catalyst.util.CollationFactory
+import org.apache.spark.sql.functions._
+import org.apache.spark.unsafe.types.UTF8String
+
+/**
+ * Benchmark to measure performance for joins. To run this benchmark:
+ * {{{
+ *   1. without sbt:
+ *      bin/spark-submit --class <this class>
+ *        --jars <spark core test jar>,<spark catalyst test jar> <spark sql test jar>
+ *   2. build/sbt "sql/Test/runMain org.apache.spark.sql.execution.benchmark.CollationBenchmark"
+ *   3. generate result:
+ *      SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/Test/runMain <this class>"
+ *      Results will be written to "benchmarks/JoinBenchmark-results.txt".
+ * }}}
+ */
+
+object CollationBenchmark extends SqlBasedBenchmark {
+  private val collationTypes = Seq("UTF8_BINARY", "UTF8_BINARY_LCASE", "UNICODE", "UNICODE_CI")
+
+  def generateUTF8Strings(n: Int): Seq[UTF8String] = {
+    // Generate n UTF8Strings
+    Seq("ABC", "aBC", "abc", "DEF", "def", "GHI", "ghi", "JKL", "jkl",
+      "MNO", "mno", "PQR", "pqr", "STU", "stu", "VWX", "vwx", "YZ").map(UTF8String.fromString) ++
+    (18 to n).map(i => UTF8String.fromString(Random.nextString(i % 25))).sortBy(_.hashCode())
+  }
+
+  def benchmarkUTFString(collationTypes: Seq[String], utf8Strings: Seq[UTF8String]): Unit = {
+    val benchmark = collationTypes.foldLeft(
+      new Benchmark(s"collation unit benchmarks", utf8Strings.size, output = output)) {
+      (b, collationType) =>
+        b.addCase(s"equalsFunction - $collationType") { _ =>
+          val collation = CollationFactory.fetchCollation(collationType)
+          utf8Strings.slice(0, 20).foreach(s1 =>
+            utf8Strings.filter(s =>
+            collation.equalsFunction(s, s1).booleanValue()
+            )
+          )
+        }
+        b.addCase(s"collator.compare - $collationType") { _ =>
+          val collation = CollationFactory.fetchCollation(collationType)
+          utf8Strings.slice(0, 20).foreach(s1 =>
+            utf8Strings.sortBy(s =>
+              collation.comparator.compare(s, s1)
+            )
+          )
+        }
+        b.addCase(s"hashFunction - $collationType") { _ =>
+          val collation = CollationFactory.fetchCollation(collationType)
+          utf8Strings.map(s => collation.hashFunction.applyAsLong(s))
+        }
+        b
+    }
+    benchmark.run()
+  }
+
+  def collationBenchmarkFilterEqual(
+      collationTypes: Seq[String],
+      utf8Strings: Seq[UTF8String]): Unit = {
+    val N = 4 << 20
+    val benchmark = collationTypes.foldLeft(
+      new Benchmark(s"filter df column with collation", utf8Strings.size, output = output)) {
+      (b, collationType) =>
+        b.addCase(s"filter df column with collation - $collationType") { _ =>
+          val map: Map[String, Column] = utf8Strings.map(_.toString).zipWithIndex.map{
+            case (s, i) =>
+              (s"s${i.toString}", expr(s"collate('${s}', '$collationType')"))
+          }.toMap
+
+          val df = spark
+            .range(N)
+            .withColumn("id_s", expr("cast(id as string)"))
+            .selectExpr((Seq("id_s", "id") ++ collationTypes.map(t =>
+              s"collate(id_s, '$collationType') as k_$t")): _*)
+//            .withColumn("k_lower", expr("lower(id_s)"))
+//            .withColumn("k_upper", expr("upper(id_s)"))
+            .withColumn("s0",
+  try_element_at(array(utf8Strings.map(_.toString).map(lit): _*),
+  functions.try_add(lit(1), pmod(col("id"), lit(utf8Strings.size)).cast("int"))))
+            .withColumn("s0", expr(s"collate(s0, '$collationType')"))
+          df.where(col(s"k_$collationType") === col(s"s0"))
+                .queryExecution.executedPlan.executeCollect()
+            //          .write.mode("overwrite").format("noop").save()
+        }
+        b
+    }
+    benchmark.run()
+  }
+
+  // How to benchmark "without the rest of the spark stack"?
+
+  override def runBenchmarkSuite(mainArgs: Array[String]): Unit = {

Review Comment:
   Btw, thank you for doing this.
   I expect that we will need to do quite some work on collation perf polishing. Having data-driven benchmarks is a great way to get us to improve the perf!




Re: [PR] [SPARK-46840][SQL][COLLATION] New Benchmarking Suite [spark]

Posted by "dbatomic (via GitHub)" <gi...@apache.org>.

dbatomic commented on code in PR #45453:
URL: https://github.com/apache/spark/pull/45453#discussion_r1531892512


##########
sql/core/benchmarks/InsertTableWithDynamicPartitionsBenchmark-jdk21-results.txt:
##########
@@ -1,8 +0,0 @@
-OpenJDK 64-Bit Server VM 21.0.1+12-LTS on Linux 5.15.0-1053-azure

Review Comment:
   What is this file? These aren't results for CollationBenchmark?




Re: [PR] [WIP][SPARK 46840] Add sql.execution.benchmark.CollationBenchmark.scala Scaffolding [spark]

Posted by "GideonPotok (via GitHub)" <gi...@apache.org>.

GideonPotok commented on PR #45453:
URL: https://github.com/apache/spark/pull/45453#issuecomment-2000575009

   @dbatomic It is coming along. 
   
   So far, the benchmark results are:
   
   ```
   OpenJDK 64-Bit Server VM 21.0.2+13-LTS on Mac OS X 14.4
   Apple M3 Max
   filter - UCS_BASIC_LCASE:                 Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
   ------------------------------------------------------------------------------------------------------------------------
   filter - UCS_BASIC_LCASE                              0              0           0         70.4          14.2       1.0X
   
   OpenJDK 64-Bit Server VM 21.0.2+13-LTS on Mac OS X 14.4
   Apple M3 Max
   hashFunction - UCS_BASIC_LCASE:           Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
   ------------------------------------------------------------------------------------------------------------------------
   hashFunction - UCS_BASIC_LCASE                        0              0           0         92.3          10.8       1.0X
   ```
   There would be a few more sets of results, but the process may be thrashing. The run freezes after finishing `benchmarkHashFunction`, and I am working through that. The process is not running out of heap space, and CPU usage is effectively 0%, which I would think means the process is stuck on IO.
   
   Any ideas?
   
   I am seeing the same sort of hang when running `org.apache.spark.sql.execution.benchmark.InsertTableWithDynamicPartitionsBenchmark`, which I tried just to see whether the problem was specific to the newly created tests...
   
   ![image](https://github.com/apache/spark/assets/31429832/d366dc79-2d89-4deb-9c43-adf942460605)
   ![image](https://github.com/apache/spark/assets/31429832/eac3aedd-7295-47b2-9f18-3a3772b208bc)
   



Re: [PR] [SPARK-46840][SQL][TESTS] Add `CollationBenchmark` [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.

cloud-fan commented on code in PR #45453:
URL: https://github.com/apache/spark/pull/45453#discussion_r1538466180


##########
sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/CollationBenchmark.scala:
##########
@@ -0,0 +1,117 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.sql.execution.benchmark
+
+import org.apache.spark.benchmark.Benchmark
+import org.apache.spark.sql.DataFrame
+import org.apache.spark.sql.catalyst.util.CollationFactory
+import org.apache.spark.sql.functions._
+import org.apache.spark.unsafe.types.UTF8String
+
+/**
+ * Benchmark to measure performance for comparisons between collated strings. To run this benchmark:
+ * {{{
+ *   1. without sbt:
+ *      bin/spark-submit --class <this class>
+ *        --jars <spark core test jar>,<spark catalyst test jar> <spark sql test jar>
+ *   2. build/sbt "sql/Test/runMain org.apache.spark.sql.execution.benchmark.CollationBenchmark"
+ *   3. generate result:
+ *      SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/Test/runMain <this class>"
+ *      Results will be written to "benchmarks/CollationBenchmark-results.txt".
+ * }}}
+ */
+
+object CollationBenchmark extends SqlBasedBenchmark {
+  private val collationTypes = Seq("UTF8_BINARY_LCASE", "UNICODE", "UTF8_BINARY", "UNICODE_CI")
+
+  def generateSeqInput(n: Long): Seq[UTF8String] = {
+    val input = Seq("ABC", "ABC", "aBC", "aBC", "abc", "abc", "DEF", "DEF", "def",
+      "def", "GHI", "ghi",
+      "JKL", "jkl", "MNO", "mno", "PQR", "pqr", "STU", "stu", "VWX", "vwx", "YZ",
+      "ABC", "ABC", "aBC", "aBC", "abc", "abc", "DEF", "DEF", "def", "def", "GHI", "ghi",
+      "JKL", "jkl", "MNO", "mno", "PQR", "pqr", "STU", "stu", "VWX", "vwx", "YZ")
+      .map(UTF8String.fromString)
+    val inputLong: Seq[UTF8String] = (0L until n).map(i => input(i.toInt % input.size))
+    inputLong
+  }
+
+  private def getDataFrame(strings: Seq[String]): DataFrame = {
+    val asPairs = strings.sliding(2, 1).toSeq.map {
+      case Seq(s1, s2) => (s1, s2)
+    }
+    val d = spark.createDataFrame(asPairs).toDF("s1", "s2")
+    d
+  }
+
+  private def generateDataframeInput(l: Long): DataFrame = {
+    getDataFrame(generateSeqInput(l).map(_.toString))
+  }
+
+  def benchmarkUTFString(collationTypes: Seq[String], utf8Strings: Seq[UTF8String]): Unit = {
+    val sublistStrings = utf8Strings
+
+    val benchmark = new Benchmark("collation unit benchmarks", utf8Strings.size, output = output)
+    collationTypes.foreach(collationType => {
+      val collation = CollationFactory.fetchCollation(collationType)
+      benchmark.addCase(s"equalsFunction - $collationType") { _ =>
+        sublistStrings.foreach(s1 =>
+          utf8Strings.foreach(s =>
+            collation.equalsFunction(s, s1).booleanValue()
+          )
+        )
+      }
+      benchmark.addCase(s"collator.compare - $collationType") { _ =>
+        sublistStrings.foreach(s1 =>
+          utf8Strings.foreach(s =>
+            collation.comparator.compare(s, s1)
+          )
+        )
+      }
+      benchmark.addCase(s"hashFunction - $collationType") { _ =>
+        sublistStrings.foreach(_ =>
+          utf8Strings.foreach(s =>
+            collation.hashFunction.applyAsLong(s)
+          )
+        )
+      }
+    }
+    )
+    benchmark.run()
+  }
+
+  def benchmarkFilterEqual(collationTypes: Seq[String],
+                           dfUncollated: DataFrame): Unit = {
+    val benchmark =
+      new Benchmark("filter df column with collation", dfUncollated.count(), output = output)
+    collationTypes.foreach(collationType => {
+      val dfCollated = dfUncollated.selectExpr(
+        s"collate(s2, '$collationType') as k2_$collationType",
+        s"collate(s1, '$collationType') as k1_$collationType")
+      benchmark.addCase(s"filter df column with collation - $collationType") { _ =>
+        dfCollated.where(col(s"k1_$collationType") === col(s"k2_$collationType"))
+          .queryExecution.executedPlan.executeCollect()

Review Comment:
   `executeCollect` has extra overhead to materialize and buffer the rows on the driver side.
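   
   One possible alternative, sketched under the assumption that the goal is to time only the filter: write the result to Spark's built-in `noop` data source, which runs the query but discards the rows instead of collecting them on the driver. An earlier revision of this file had the same call commented out. The session setup, object name, and sample rows below are illustrative only and not part of the PR:
   
   ```scala
   import org.apache.spark.sql.SparkSession
   import org.apache.spark.sql.functions.col
   
   // Sketch only: NoopSinkSketch and its data are hypothetical.
   object NoopSinkSketch {
     def main(args: Array[String]): Unit = {
       val spark = SparkSession.builder()
         .master("local[1]")
         .appName("noop-sink-sketch")
         .getOrCreate()
       import spark.implicits._
   
       val df = Seq(("ABC", "ABC"), ("aBC", "abc"), ("hello", "hola")).toDF("s1", "s2")
   
       // Executes the full plan but drops every row, so no result
       // buffer is built on the driver.
       df.where(col("s1") === col("s2"))
         .write.mode("overwrite").format("noop").save()
   
       spark.stop()
     }
   }
   ```
   
   In the benchmark itself, the same `.write.mode("overwrite").format("noop").save()` chain would replace `.queryExecution.executedPlan.executeCollect()` on the filtered DataFrame.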




Re: [PR] [SPARK-46840][SQL][TESTS] Add `CollationBenchmark` [spark]

Posted by "GideonPotok (via GitHub)" <gi...@apache.org>.

GideonPotok commented on code in PR #45453:
URL: https://github.com/apache/spark/pull/45453#discussion_r1538037516


##########
sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/CollationBenchmark.scala:
##########
@@ -0,0 +1,117 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.sql.execution.benchmark
+
+import org.apache.spark.benchmark.Benchmark
+import org.apache.spark.sql.DataFrame
+import org.apache.spark.sql.catalyst.util.CollationFactory
+import org.apache.spark.sql.functions._
+import org.apache.spark.unsafe.types.UTF8String
+
+/**
+ * Benchmark to measure performance for joins. To run this benchmark:
+ * {{{
+ *   1. without sbt:
+ *      bin/spark-submit --class <this class>
+ *        --jars <spark core test jar>,<spark catalyst test jar> <spark sql test jar>
+ *   2. build/sbt "sql/Test/runMain org.apache.spark.sql.execution.benchmark.CollationBenchmark"
+ *   3. generate result:
+ *      SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/Test/runMain <this class>"
+ *      Results will be written to "benchmarks/CollationBenchmark-results.txt".
+ * }}}
+ */
+
+object CollationBenchmark extends SqlBasedBenchmark {
+  private val collationTypes = Seq("UTF8_BINARY_LCASE", "UNICODE", "UTF8_BINARY", "UNICODE_CI")
+
+  def generateSeqInput(n: Long): Seq[UTF8String] = {
+    val input = Seq("ABC", "ABC", "aBC", "aBC", "abc", "abc", "DEF", "DEF", "def",
+      "def", "GHI", "ghi",
+      "JKL", "jkl", "MNO", "mno", "PQR", "pqr", "STU", "stu", "VWX", "vwx", "YZ",
+      "ABC", "ABC", "aBC", "aBC", "abc", "abc", "DEF", "DEF", "def", "def", "GHI", "ghi",
+      "JKL", "jkl", "MNO", "mno", "PQR", "pqr", "STU", "stu", "VWX", "vwx", "YZ")
+      .map(UTF8String.fromString)
+    val inputLong: Seq[UTF8String] = (0L until n).map(i => input(i.toInt % input.size))
+    inputLong
+  }
+
+  private def getDataFrame(strings: Seq[String]): DataFrame = {
+    val asPairs = strings.sliding(2, 1).toSeq.map {
+      case Seq(s1, s2) => (s1, s2)
+    }
+    val d = spark.createDataFrame(asPairs).toDF("s1", "s2")
+    d
+  }
+
+  private def generateDataframeInput(l: Long): DataFrame = {
+    getDataFrame(generateSeqInput(l).map(_.toString))
+  }
+
+  def benchmarkUTFString(collationTypes: Seq[String], utf8Strings: Seq[UTF8String]): Unit = {
+    val sublistStrings = utf8Strings
+    val benchmark = collationTypes.foldLeft(
+      new Benchmark(s"collation unit benchmarks", utf8Strings.size, output = output)) {
+      (b, collationType) =>
+        val collation = CollationFactory.fetchCollation(collationType)
+        b.addCase(s"equalsFunction - $collationType") { _ =>
+          sublistStrings.foreach(s1 =>
+            utf8Strings.foreach(s =>
+              collation.equalsFunction(s, s1).booleanValue()
+            )
+          )
+        }
+        b.addCase(s"collator.compare - $collationType") { _ =>
+          sublistStrings.foreach(s1 =>
+            utf8Strings.foreach(s =>
+              collation.comparator.compare(s, s1)
+            )
+          )
+        }
+        b.addCase(s"hashFunction - $collationType") { _ =>
+          sublistStrings.foreach(_ =>
+            utf8Strings.foreach(s =>
+              collation.hashFunction.applyAsLong(s)
+            )
+          )
+        }
+        b
+    }
+    benchmark.run()
+  }
+
+  def benchmarkFilterEqual(collationTypes: Seq[String],
+                           dfUncollated: DataFrame): Unit = {
+    val benchmark = collationTypes.foldLeft(
+      new Benchmark(s"filter df column with collation", dfUncollated.count(), output = output)) {
+      (b, collationType) =>
+        val dfCollated = dfUncollated.selectExpr(
+          s"collate(s2, '$collationType') as k2_$collationType",
+          s"collate(s1, '$collationType') as k1_$collationType")
+        b.addCase(s"filter df column with collation - $collationType") { _ =>
+          dfCollated.where(col(s"k1_$collationType") === col(s"k2_$collationType"))
+            .queryExecution.executedPlan.executeCollect()
+        }
+        b
+    }
+    benchmark.run()
+  }
+
+  override def runBenchmarkSuite(mainArgs: Array[String]): Unit = {
+    benchmarkFilterEqual(collationTypes, generateDataframeInput(10000L))
+    benchmarkUTFString(collationTypes, generateSeqInput(10000L))

Review Comment:
   `DateTimeRebaseBenchmark`, `CharVarcharBenchmark`, and `ByteArrayBenchmark` are examples. 




Re: [PR] [SPARK-46840][SQL][TESTS] Add `CollationBenchmark` [spark]

Posted by "MaxGekk (via GitHub)" <gi...@apache.org>.

MaxGekk commented on code in PR #45453:
URL: https://github.com/apache/spark/pull/45453#discussion_r1537399607


##########
sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/CollationBenchmark.scala:
##########
@@ -0,0 +1,118 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.sql.execution.benchmark
+
+import org.apache.spark.benchmark.Benchmark
+import org.apache.spark.sql.DataFrame
+import org.apache.spark.sql.catalyst.util.CollationFactory
+import org.apache.spark.sql.functions._
+import org.apache.spark.unsafe.types.UTF8String
+
+/**
+ * Benchmark to measure performance for joins. To run this benchmark:
+ * {{{
+ *   1. without sbt:
+ *      bin/spark-submit --class <this class>
+ *        --jars <spark core test jar>,<spark catalyst test jar> <spark sql test jar>
+ *   2. build/sbt "sql/Test/runMain org.apache.spark.sql.execution.benchmark.CollationBenchmark"
+ *   3. generate result:
+ *      SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/Test/runMain <this class>"
+ *      Results will be written to "benchmarks/JoinBenchmark-results.txt".
+ * }}}
+ */
+
+object CollationBenchmark extends SqlBasedBenchmark {
+  private val collationTypes = Seq("UTF8_BINARY_LCASE", "UNICODE", "UTF8_BINARY", "UNICODE_CI")
+
+  def generateSeqInput(n: Long): Seq[UTF8String] = {
+    val input = Seq("ABC", "ABC", "aBC", "aBC", "abc", "abc", "DEF", "DEF", "def",
+      "def", "GHI", "ghi",
+      "JKL", "jkl", "MNO", "mno", "PQR", "pqr", "STU", "stu", "VWX", "vwx", "YZ",
+      "ABC", "ABC", "aBC", "aBC", "abc", "abc", "DEF", "DEF", "def", "def", "GHI", "ghi",
+      "JKL", "jkl", "MNO", "mno", "PQR", "pqr", "STU", "stu", "VWX", "vwx", "YZ")
+      .map(UTF8String.fromString)
+    val inputLong: Seq[UTF8String] = (0L until n).map(i => input(i.toInt % input.size))
+    inputLong
+  }
+
+  private def getDataFrame(strings: Seq[String]): DataFrame = {
+    val asPairs = strings.sliding(2, 1).toSeq.map {
+      case Seq(s1, s2) => (s1, s2)
+    }
+    val d = spark.createDataFrame(asPairs).toDF("s1", "s2")
+    d
+  }
+
+  private def generateDataframeInput(l: Long): DataFrame = {
+    getDataFrame(generateSeqInput(l).map(_.toString))
+  }
+
+  def benchmarkUTFString(collationTypes: Seq[String], utf8Strings: Seq[UTF8String]): Unit = {
+    val sublistStrings = utf8Strings
+    val benchmark = collationTypes.foldLeft(
+      new Benchmark(s"collation unit benchmarks", utf8Strings.size, output = output)) {
+      (b, collationType) =>
+        val collation = CollationFactory.fetchCollation(collationType)
+        b.addCase(s"equalsFunction - $collationType") { _ =>
+          sublistStrings.foreach(s1 =>
+            utf8Strings.foreach(s =>
+            collation.equalsFunction(s, s1).booleanValue()
+            )
+          )
+        }
+        b.addCase(s"collator.compare - $collationType") { _ =>
+          sublistStrings.foreach(s1 =>
+            utf8Strings.foreach(s =>
+              collation.comparator.compare(s, s1)
+            )
+          )
+        }
+        b.addCase(s"hashFunction - $collationType") { _ =>
+          sublistStrings.foreach(_ =>
+            utf8Strings.foreach(s =>
+              collation.hashFunction.applyAsLong(s)
+            )
+          )
+        }
+        b
+    }
+    benchmark.run()
+  }
+
+  def benchmarkFilterEqual(collationTypes: Seq[String],
+                           dfUncollated: DataFrame): Unit = {
+    val benchmark = collationTypes.foldLeft(
+      new Benchmark(s"filter df column with collation", dfUncollated.count(), output = output)) {
+      (b, collationType) =>
+          val dfCollated = dfUncollated.selectExpr(
+            s"collate(s2, '$collationType') as k2_$collationType",
+            s"collate(s1, '$collationType') as k1_$collationType")
+        b.addCase(s"filter df column with collation - $collationType") { _ =>
+          dfCollated.where(col(s"k1_$collationType") === col(s"k2_$collationType"))
+                .queryExecution.executedPlan.executeCollect()

Review Comment:
   fix indentation



##########
sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/CollationBenchmark.scala:
##########
@@ -0,0 +1,118 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.sql.execution.benchmark
+
+import org.apache.spark.benchmark.Benchmark
+import org.apache.spark.sql.DataFrame
+import org.apache.spark.sql.catalyst.util.CollationFactory
+import org.apache.spark.sql.functions._
+import org.apache.spark.unsafe.types.UTF8String
+
+/**
+ * Benchmark to measure performance for joins. To run this benchmark:
+ * {{{
+ *   1. without sbt:
+ *      bin/spark-submit --class <this class>
+ *        --jars <spark core test jar>,<spark catalyst test jar> <spark sql test jar>
+ *   2. build/sbt "sql/Test/runMain org.apache.spark.sql.execution.benchmark.CollationBenchmark"
+ *   3. generate result:
+ *      SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/Test/runMain <this class>"
+ *      Results will be written to "benchmarks/JoinBenchmark-results.txt".
+ * }}}
+ */
+
+object CollationBenchmark extends SqlBasedBenchmark {
+  private val collationTypes = Seq("UTF8_BINARY_LCASE", "UNICODE", "UTF8_BINARY", "UNICODE_CI")
+
+  def generateSeqInput(n: Long): Seq[UTF8String] = {
+    val input = Seq("ABC", "ABC", "aBC", "aBC", "abc", "abc", "DEF", "DEF", "def",
+      "def", "GHI", "ghi",
+      "JKL", "jkl", "MNO", "mno", "PQR", "pqr", "STU", "stu", "VWX", "vwx", "YZ",
+      "ABC", "ABC", "aBC", "aBC", "abc", "abc", "DEF", "DEF", "def", "def", "GHI", "ghi",
+      "JKL", "jkl", "MNO", "mno", "PQR", "pqr", "STU", "stu", "VWX", "vwx", "YZ")
+      .map(UTF8String.fromString)
+    val inputLong: Seq[UTF8String] = (0L until n).map(i => input(i.toInt % input.size))
+    inputLong
+  }
+
+  private def getDataFrame(strings: Seq[String]): DataFrame = {
+    val asPairs = strings.sliding(2, 1).toSeq.map {
+      case Seq(s1, s2) => (s1, s2)
+    }
+    val d = spark.createDataFrame(asPairs).toDF("s1", "s2")
+    d
+  }
+
+  private def generateDataframeInput(l: Long): DataFrame = {
+    getDataFrame(generateSeqInput(l).map(_.toString))
+  }
+
+  def benchmarkUTFString(collationTypes: Seq[String], utf8Strings: Seq[UTF8String]): Unit = {
+    val sublistStrings = utf8Strings
+    val benchmark = collationTypes.foldLeft(
+      new Benchmark(s"collation unit benchmarks", utf8Strings.size, output = output)) {
+      (b, collationType) =>
+        val collation = CollationFactory.fetchCollation(collationType)
+        b.addCase(s"equalsFunction - $collationType") { _ =>
+          sublistStrings.foreach(s1 =>
+            utf8Strings.foreach(s =>
+            collation.equalsFunction(s, s1).booleanValue()
+            )

Review Comment:
   fix indentation




Re: [PR] [SPARK-46840][SQL][TESTS] Add `CollationBenchmark` [spark]

Posted by "dbatomic (via GitHub)" <gi...@apache.org>.

dbatomic commented on PR #45453:
URL: https://github.com/apache/spark/pull/45453#issuecomment-2017715401

   LGTM (+one minor comment).
   
   @MaxGekk / @cloud-fan - could you review on your side and merge if everything looks fine?



Re: [PR] [SPARK-46840][SQL][COLLATION] New Benchmarking Suite [spark]

Posted by "dbatomic (via GitHub)" <gi...@apache.org>.

dbatomic commented on code in PR #45453:
URL: https://github.com/apache/spark/pull/45453#discussion_r1534715260


##########
sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/CollationBenchmark.scala:
##########
@@ -0,0 +1,121 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.sql.execution.benchmark
+
+import scala.util.Random
+
+import org.apache.spark.benchmark.Benchmark
+import org.apache.spark.sql.{DataFrame}
+import org.apache.spark.sql.catalyst.util.CollationFactory
+import org.apache.spark.sql.functions._
+import org.apache.spark.unsafe.types.UTF8String
+
+/**
+ * Benchmark to measure performance for joins. To run this benchmark:
+ * {{{
+ *   1. without sbt:
+ *      bin/spark-submit --class <this class>
+ *        --jars <spark core test jar>,<spark catalyst test jar> <spark sql test jar>
+ *   2. build/sbt "sql/Test/runMain org.apache.spark.sql.execution.benchmark.CollationBenchmark"
+ *   3. generate result:
+ *      SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/Test/runMain <this class>"
+ *      Results will be written to "benchmarks/JoinBenchmark-results.txt".
+ * }}}
+ */
+
+object CollationBenchmark extends SqlBasedBenchmark {
+  private val collationTypes = Seq("UTF8_BINARY", "UTF8_BINARY_LCASE", "UNICODE", "UNICODE_CI")
+
+  def generateUTF8Strings(n: Int): Seq[UTF8String] = {
+    // Generate n UTF8Strings
+    Seq("ABC", "aBC", "abc", "DEF", "def", "GHI", "ghi", "JKL", "jkl",
+      "MNO", "mno", "PQR", "pqr", "STU", "stu", "VWX", "vwx", "YZ").map(UTF8String.fromString) ++
+    (18 to n).map(i => UTF8String.fromString(Random.nextString(i % 25))).sortBy(_.hashCode())
+  }
+
+  def benchmarkUTFString(collationTypes: Seq[String], utf8Strings: Seq[UTF8String]): Unit = {
+    val sublistStrings = utf8Strings.slice(0, 200)
+    val benchmark = collationTypes.foldLeft(
+      new Benchmark(s"collation unit benchmarks", utf8Strings.size, output = output)) {
+      (b, collationType) =>
+        val collation = CollationFactory.fetchCollation(collationType)
+        b.addCase(s"equalsFunction - $collationType") { _ =>
+          sublistStrings.foreach(s1 =>
+            utf8Strings.foreach(s =>
+            collation.equalsFunction(s, s1).booleanValue()
+            )
+          )
+        }
+        b.addCase(s"collator.compare - $collationType") { _ =>
+          sublistStrings.foreach(s1 =>
+            utf8Strings.foreach(s =>
+              collation.comparator.compare(s, s1)
+            )
+          )
+        }
+        b.addCase(s"hashFunction - $collationType") { _ =>
+          sublistStrings.foreach(s1 =>
+            utf8Strings.foreach(s =>
+              collation.hashFunction.applyAsLong(s)
+            )
+          )
+        }
+        b
+    }
+    benchmark.run()
+  }
+
+  def df1: DataFrame = {
+    val d = spark.createDataFrame(Seq(
+      ("ABC", "ABC"), ("aBC", "abc"), ("abc", "ABC"), ("DEF", "DEF"), ("def", "DEF"),

Review Comment:
   Why are the inputs different between `collationBenchmarkFilterEqual` and `benchmarkUTFString`?
   
   I think it is fine to use the same input for both.
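
A minimal sketch of how one generated input could feed both benchmarks. It is plain Scala with no Spark dependency; `SharedInputSketch`, `generateInput`, and `asPairs` are hypothetical names, and the base values are only an illustration. The pairs could then back the `s1`/`s2` DataFrame columns, while the flat sequence feeds the unit benchmark.

```scala
object SharedInputSketch {
  // Hypothetical base values; any small set of mixed-case strings would do.
  private val base: Seq[String] = Seq("ABC", "aBC", "abc", "DEF", "def", "GHI", "ghi")

  // Cycle through the base values until n strings have been produced.
  def generateInput(n: Int): Seq[String] =
    (0 until n).map(i => base(i % base.size))

  // Pair each string with its successor. `collect` skips a trailing window
  // that has fewer than two elements, so a one-element input yields no pairs.
  def asPairs(strings: Seq[String]): Seq[(String, String)] =
    strings.sliding(2, 1).collect { case Seq(s1, s2) => (s1, s2) }.toSeq

  def main(args: Array[String]): Unit = {
    val input = generateInput(10)
    val pairs = asPairs(input)
    println(s"${input.size} strings, ${pairs.size} pairs, first pair = ${pairs.head}")
  }
}
```

With a shared generator, both benchmarks see the same distribution of equal and unequal strings, which makes their numbers easier to compare.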




Re: [PR] [SPARK-46840][SQL][COLLATION] New Benchmarking Suite [spark]

Posted by "GideonPotok (via GitHub)" <gi...@apache.org>.

GideonPotok commented on code in PR #45453:
URL: https://github.com/apache/spark/pull/45453#discussion_r1534987567


##########
sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/CollationBenchmark.scala:
##########
@@ -0,0 +1,121 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.sql.execution.benchmark
+
+import scala.util.Random
+
+import org.apache.spark.benchmark.Benchmark
+import org.apache.spark.sql.{DataFrame}
+import org.apache.spark.sql.catalyst.util.CollationFactory
+import org.apache.spark.sql.functions._
+import org.apache.spark.unsafe.types.UTF8String
+
+/**
+ * Benchmark to measure performance for joins. To run this benchmark:
+ * {{{
+ *   1. without sbt:
+ *      bin/spark-submit --class <this class>
+ *        --jars <spark core test jar>,<spark catalyst test jar> <spark sql test jar>
+ *   2. build/sbt "sql/Test/runMain org.apache.spark.sql.execution.benchmark.CollationBenchmark"
+ *   3. generate result:
+ *      SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/Test/runMain <this class>"
+ *      Results will be written to "benchmarks/JoinBenchmark-results.txt".
+ * }}}
+ */
+
+object CollationBenchmark extends SqlBasedBenchmark {
+  private val collationTypes = Seq("UTF8_BINARY", "UTF8_BINARY_LCASE", "UNICODE", "UNICODE_CI")
+
+  def generateUTF8Strings(n: Int): Seq[UTF8String] = {
+    // Generate n UTF8Strings
+    Seq("ABC", "aBC", "abc", "DEF", "def", "GHI", "ghi", "JKL", "jkl",

Review Comment:
   Yep, a good follow up :) 



##########
sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/CollationBenchmark.scala:
##########
@@ -0,0 +1,121 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.sql.execution.benchmark
+
+import scala.util.Random
+
+import org.apache.spark.benchmark.Benchmark
+import org.apache.spark.sql.{DataFrame}
+import org.apache.spark.sql.catalyst.util.CollationFactory
+import org.apache.spark.sql.functions._
+import org.apache.spark.unsafe.types.UTF8String
+
+/**
+ * Benchmark to measure performance for joins. To run this benchmark:
+ * {{{
+ *   1. without sbt:
+ *      bin/spark-submit --class <this class>
+ *        --jars <spark core test jar>,<spark catalyst test jar> <spark sql test jar>
+ *   2. build/sbt "sql/Test/runMain org.apache.spark.sql.execution.benchmark.CollationBenchmark"
+ *   3. generate result:
+ *      SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/Test/runMain <this class>"
+ *      Results will be written to "benchmarks/JoinBenchmark-results.txt".
+ * }}}
+ */
+
+object CollationBenchmark extends SqlBasedBenchmark {
+  private val collationTypes = Seq("UTF8_BINARY", "UTF8_BINARY_LCASE", "UNICODE", "UNICODE_CI")
+
+  def generateUTF8Strings(n: Int): Seq[UTF8String] = {
+    // Generate n UTF8Strings
+    Seq("ABC", "aBC", "abc", "DEF", "def", "GHI", "ghi", "JKL", "jkl",
+      "MNO", "mno", "PQR", "pqr", "STU", "stu", "VWX", "vwx", "YZ").map(UTF8String.fromString) ++
+    (18 to n).map(i => UTF8String.fromString(Random.nextString(i % 25))).sortBy(_.hashCode())
+  }
+
+  def benchmarkUTFString(collationTypes: Seq[String], utf8Strings: Seq[UTF8String]): Unit = {
+    val sublistStrings = utf8Strings.slice(0, 200)
+    val benchmark = collationTypes.foldLeft(
+      new Benchmark(s"collation unit benchmarks", utf8Strings.size, output = output)) {
+      (b, collationType) =>
+        val collation = CollationFactory.fetchCollation(collationType)
+        b.addCase(s"equalsFunction - $collationType") { _ =>
+          sublistStrings.foreach(s1 =>
+            utf8Strings.foreach(s =>
+            collation.equalsFunction(s, s1).booleanValue()
+            )
+          )
+        }
+        b.addCase(s"collator.compare - $collationType") { _ =>
+          sublistStrings.foreach(s1 =>
+            utf8Strings.foreach(s =>
+              collation.comparator.compare(s, s1)
+            )
+          )
+        }
+        b.addCase(s"hashFunction - $collationType") { _ =>
+          sublistStrings.foreach(s1 =>
+            utf8Strings.foreach(s =>
+              collation.hashFunction.applyAsLong(s)
+            )
+          )
+        }
+        b
+    }
+    benchmark.run()
+  }
+
+  def df1: DataFrame = {
+    val d = spark.createDataFrame(Seq(
+      ("ABC", "ABC"), ("aBC", "abc"), ("abc", "ABC"), ("DEF", "DEF"), ("def", "DEF"),
+      ("GHI", "GHI"), ("ghi", "IG"), ("JKL", "NOP"), ("jkl", "LKJ"), ("mnO", "MNO"),
+      ("hello", "hola"))
+    ).toDF("s1", "s2")
+    d
+  }
+
+  def collationBenchmarkFilterEqual(
+      collationTypes: Seq[String],
+      utf8Strings: Seq[UTF8String]): Unit = {
+    val N = 2 << 20
+
+    val benchmark = collationTypes.foldLeft(
+      new Benchmark(s"filter df column with collation", utf8Strings.size, output = output)) {
+      (b, collationType) =>
+        b.addCase(s"filter df column with collation - $collationType") { _ =>
+          val df = df1.selectExpr(
+            s"collate(s2, '$collationType') as k2_$collationType",
+            s"collate(s1, '$collationType') as k1_$collationType")
+
+          (0 to 10).foreach(_ =>
+          df.where(col(s"k1_$collationType") === col(s"k2_$collationType"))
+                .queryExecution.executedPlan.executeCollect()
+          )
+        }
+        b
+    }
+    benchmark.run()
+  }
+
+  // How to benchmark "without the rest of the spark stack"?

Review Comment:
   Done




Re: [PR] [SPARK-46840][SQL][COLLATION] New Benchmarking Suite [spark]

Posted by "dbatomic (via GitHub)" <gi...@apache.org>.

dbatomic commented on code in PR #45453:
URL: https://github.com/apache/spark/pull/45453#discussion_r1536790280


##########
sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/CollationBenchmark.scala:
##########
@@ -0,0 +1,112 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.sql.execution.benchmark
+
+import org.apache.spark.benchmark.Benchmark
+import org.apache.spark.sql.DataFrame
+import org.apache.spark.sql.catalyst.util.CollationFactory
+import org.apache.spark.sql.functions._
+import org.apache.spark.unsafe.types.UTF8String
+
+/**
+ * Benchmark to measure performance for joins. To run this benchmark:
+ * {{{
+ *   1. without sbt:
+ *      bin/spark-submit --class <this class>
+ *        --jars <spark core test jar>,<spark catalyst test jar> <spark sql test jar>
+ *   2. build/sbt "sql/Test/runMain org.apache.spark.sql.execution.benchmark.CollationBenchmark"
+ *   3. generate result:
+ *      SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/Test/runMain <this class>"
+ *      Results will be written to "benchmarks/JoinBenchmark-results.txt".
+ * }}}
+ */
+
+object CollationBenchmark extends SqlBasedBenchmark {
+  private val collationTypes = Seq("UTF8_BINARY", "UTF8_BINARY_LCASE", "UNICODE", "UNICODE_CI")
+
+  def generateUTF8Strings(n: Int): Seq[UTF8String] = {
+    // Generate n UTF8Strings
+    Seq("ABC", "aBC", "abc", "DEF", "def", "GHI", "ghi", "JKL", "jkl",
+      "MNO", "mno", "PQR", "pqr", "STU", "stu", "VWX", "vwx", "YZ").map(UTF8String.fromString) ++
+    (18 to n).map(i => UTF8String.fromString(i.toOctalString))
+  }
+
+  def benchmarkUTFString(collationTypes: Seq[String], utf8Strings: Seq[UTF8String]): Unit = {
+    val sublistStrings = utf8Strings
+    val benchmark = collationTypes.foldLeft(
+      new Benchmark(s"collation unit benchmarks", utf8Strings.size, output = output)) {
+      (b, collationType) =>
+        val collation = CollationFactory.fetchCollation(collationType)
+        b.addCase(s"equalsFunction - $collationType") { _ =>
+          sublistStrings.foreach(s1 =>
+            utf8Strings.foreach(s =>
+            collation.equalsFunction(s, s1).booleanValue()
+            )
+          )
+        }
+        b.addCase(s"collator.compare - $collationType") { _ =>
+          sublistStrings.foreach(s1 =>
+            utf8Strings.foreach(s =>
+              collation.comparator.compare(s, s1)
+            )
+          )
+        }
+        b.addCase(s"hashFunction - $collationType") { _ =>
+          sublistStrings.foreach(_ =>
+            utf8Strings.foreach(s =>
+              collation.hashFunction.applyAsLong(s)
+            )
+          )
+        }
+        b
+    }
+    benchmark.run()
+  }
+
+  def df1: DataFrame = {
+    val d = spark.createDataFrame(Seq(
+      ("ABC", "ABC"), ("aBC", "abc"), ("abc", "ABC"), ("DEF", "DEF"), ("def", "DEF"),
+      ("GHI", "GHI"), ("ghi", "IG"), ("JKL", "NOP"), ("jkl", "LKJ"), ("mnO", "MNO"),
+      ("hello", "hola"))
+    ).toDF("s1", "s2")
+    d
+  }
+
+  def collationBenchmarkFilterEqual(collationTypes: Seq[String]): Unit = {
+    val benchmark = collationTypes.foldLeft(
+      new Benchmark(s"filter df column with collation", 11, output = output)) {
+      (b, collationType) =>
+        b.addCase(s"filter df column with collation - $collationType") { _ =>
+          val df = df1.selectExpr(
+            s"collate(s2, '$collationType') as k2_$collationType",
+            s"collate(s1, '$collationType') as k1_$collationType")
+
+          df.where(col(s"k1_$collationType") === col(s"k2_$collationType"))
+                .queryExecution.executedPlan.executeCollect()
+

Review Comment:
   Please remove the extra blank line here.




Re: [PR] [SPARK-46840][SQL][COLLATION] New Benchmarking Suite [spark]

Posted by "dbatomic (via GitHub)" <gi...@apache.org>.

dbatomic commented on code in PR #45453:
URL: https://github.com/apache/spark/pull/45453#discussion_r1536790258


##########
sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/CollationBenchmark.scala:
##########
@@ -0,0 +1,121 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.sql.execution.benchmark
+
+import scala.util.Random
+
+import org.apache.spark.benchmark.Benchmark
+import org.apache.spark.sql.{DataFrame}
+import org.apache.spark.sql.catalyst.util.CollationFactory
+import org.apache.spark.sql.functions._
+import org.apache.spark.unsafe.types.UTF8String
+
+/**
+ * Benchmark to measure performance for joins. To run this benchmark:
+ * {{{
+ *   1. without sbt:
+ *      bin/spark-submit --class <this class>
+ *        --jars <spark core test jar>,<spark catalyst test jar> <spark sql test jar>
+ *   2. build/sbt "sql/Test/runMain org.apache.spark.sql.execution.benchmark.CollationBenchmark"
+ *   3. generate result:
+ *      SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/Test/runMain <this class>"
+ *      Results will be written to "benchmarks/JoinBenchmark-results.txt".
+ * }}}
+ */
+
+object CollationBenchmark extends SqlBasedBenchmark {
+  private val collationTypes = Seq("UTF8_BINARY", "UTF8_BINARY_LCASE", "UNICODE", "UNICODE_CI")
+
+  def generateUTF8Strings(n: Int): Seq[UTF8String] = {
+    // Generate n UTF8Strings
+    Seq("ABC", "aBC", "abc", "DEF", "def", "GHI", "ghi", "JKL", "jkl",
+      "MNO", "mno", "PQR", "pqr", "STU", "stu", "VWX", "vwx", "YZ").map(UTF8String.fromString) ++
+    (18 to n).map(i => UTF8String.fromString(Random.nextString(i % 25))).sortBy(_.hashCode())
+  }
+
+  def benchmarkUTFString(collationTypes: Seq[String], utf8Strings: Seq[UTF8String]): Unit = {
+    val sublistStrings = utf8Strings.slice(0, 200)
+    val benchmark = collationTypes.foldLeft(
+      new Benchmark(s"collation unit benchmarks", utf8Strings.size, output = output)) {
+      (b, collationType) =>
+        val collation = CollationFactory.fetchCollation(collationType)
+        b.addCase(s"equalsFunction - $collationType") { _ =>
+          sublistStrings.foreach(s1 =>
+            utf8Strings.foreach(s =>
+            collation.equalsFunction(s, s1).booleanValue()
+            )
+          )
+        }
+        b.addCase(s"collator.compare - $collationType") { _ =>
+          sublistStrings.foreach(s1 =>
+            utf8Strings.foreach(s =>
+              collation.comparator.compare(s, s1)
+            )
+          )
+        }
+        b.addCase(s"hashFunction - $collationType") { _ =>
+          sublistStrings.foreach(s1 =>
+            utf8Strings.foreach(s =>
+              collation.hashFunction.applyAsLong(s)
+            )
+          )
+        }
+        b
+    }
+    benchmark.run()
+  }
+
+  def df1: DataFrame = {
+    val d = spark.createDataFrame(Seq(
+      ("ABC", "ABC"), ("aBC", "abc"), ("abc", "ABC"), ("DEF", "DEF"), ("def", "DEF"),

Review Comment:
   Another option is to remove `collationBenchmarkFilterEqual` completely, since it is not providing much value at this point. We should concentrate on improving perf on `benchmarkUTFString`.
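
For context, a rough sketch of the loop shape that a unit-level benchmark like `benchmarkUTFString` times: every string compared against every other, with no query planning involved. This is plain Scala and does not use Spark's `CollationFactory`; `String.CASE_INSENSITIVE_ORDER` is only a stand-in for a case-insensitive collation, and `PairwiseCompareSketch` and `countEqualPairs` are hypothetical names.

```scala
object PairwiseCompareSketch {
  // Stand-in comparator (an assumption for illustration): the JDK's
  // case-insensitive String ordering, not a Spark collation.
  private val comparator: java.util.Comparator[String] = String.CASE_INSENSITIVE_ORDER

  // Compare every string against every string (including itself)
  // and count the pairs that the comparator treats as equal.
  def countEqualPairs(strings: Seq[String]): Int = {
    var equal = 0
    strings.foreach { s1 =>
      strings.foreach { s =>
        if (comparator.compare(s, s1) == 0) equal += 1
      }
    }
    equal
  }

  def main(args: Array[String]): Unit = {
    val input = Seq("ABC", "abc", "DEF")
    val start = System.nanoTime()
    val equal = countEqualPairs(input)
    val elapsedNs = System.nanoTime() - start
    println(s"equal pairs = $equal, elapsed = $elapsedNs ns")
  }
}
```

Because the loop touches only the comparison itself, the measured time is dominated by the collation logic rather than by DataFrame overhead.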




Re: [PR] [SPARK-46840][SQL][TESTS] Add `CollationBenchmark` [spark]

Posted by "GideonPotok (via GitHub)" <gi...@apache.org>.

GideonPotok commented on code in PR #45453:
URL: https://github.com/apache/spark/pull/45453#discussion_r1539275734


##########
sql/core/benchmarks/CollationBenchmark-results.txt:
##########
@@ -0,0 +1,26 @@
+OpenJDK 64-Bit Server VM 17.0.10+7-LTS on Linux 6.5.0-1016-azure
+AMD EPYC 7763 64-Core Processor
+filter df column with collation:                     Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
+-----------------------------------------------------------------------------------------------------------------------------------
+filter df column with collation - UNICODE_CI                   403            463          39          0.0    20147470.0       1.0X
+filter df column with collation - UNICODE                      187            223          37          0.0     9339586.0       2.2X
+filter df column with collation - UTF8_BINARY_LCASE            426            434           7          0.0    21300903.4       0.9X
+filter df column with collation - UTF8_BINARY                  188            199           5          0.0     9403169.1       2.1X
+
+OpenJDK 64-Bit Server VM 17.0.10+7-LTS on Linux 6.5.0-1016-azure
+AMD EPYC 7763 64-Core Processor
+collation unit benchmarks:                Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
+------------------------------------------------------------------------------------------------------------------------
+equalsFunction - UTF8_BINARY                          0              0           0         10.4          96.6       1.0X

Review Comment:
   Ah, I see, numIterations can be hardcoded. I misunderstood and thought you wanted a bigger data frame. I am making that change now.
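
A minimal sketch of what a hardcoded iteration count means, written as a standalone timing loop rather than with Spark's `Benchmark` class. `FixedIterationsSketch` and `timeRuns` are hypothetical names; as far as I know, Spark's `Benchmark.addCase` accepts a `numIters` argument that serves the same purpose, but treat that as an assumption to check against the source.

```scala
object FixedIterationsSketch {
  // Run `body` exactly numIters times and return the elapsed
  // nanoseconds of each run, in execution order.
  def timeRuns(numIters: Int)(body: => Unit): Seq[Long] =
    (0 until numIters).map { _ =>
      val start = System.nanoTime()
      body
      System.nanoTime() - start
    }

  def main(args: Array[String]): Unit = {
    val data = (0 until 1000).map(_.toString)
    // The iteration count is fixed up front; the input size is unchanged.
    val timings = timeRuns(numIters = 10) {
      data.foreach(_.hashCode)
    }
    println(s"runs = ${timings.size}, best = ${timings.min} ns")
  }
}
```

Fixing the number of runs keeps the result table comparable across machines without having to grow the data frame.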




Re: [PR] [SPARK-46840][SQL][TESTS] Add `CollationBenchmark` [spark]

Posted by "GideonPotok (via GitHub)" <gi...@apache.org>.

GideonPotok commented on code in PR #45453:
URL: https://github.com/apache/spark/pull/45453#discussion_r1539418002


##########
sql/core/benchmarks/CollationBenchmark-results.txt:
##########
@@ -0,0 +1,26 @@
+OpenJDK 64-Bit Server VM 17.0.10+7-LTS on Linux 6.5.0-1016-azure
+AMD EPYC 7763 64-Core Processor
+filter df column with collation:                     Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
+-----------------------------------------------------------------------------------------------------------------------------------
+filter df column with collation - UNICODE_CI                   403            463          39          0.0    20147470.0       1.0X
+filter df column with collation - UNICODE                      187            223          37          0.0     9339586.0       2.2X
+filter df column with collation - UTF8_BINARY_LCASE            426            434           7          0.0    21300903.4       0.9X
+filter df column with collation - UTF8_BINARY                  188            199           5          0.0     9403169.1       2.1X
+
+OpenJDK 64-Bit Server VM 17.0.10+7-LTS on Linux 6.5.0-1016-azure
+AMD EPYC 7763 64-Core Processor
+collation unit benchmarks:                Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
+------------------------------------------------------------------------------------------------------------------------
+equalsFunction - UTF8_BINARY                          0              0           0         10.4          96.6       1.0X
+collator.compare - UTF8_BINARY                        4              4           0          0.3        3717.9       0.0X
+hashFunction - UTF8_BINARY                            0              0           0         42.0          23.8       4.1X
+equalsFunction - UTF8_BINARY_LCASE                    5              5           0          0.2        5014.7       0.0X

Review Comment:
   no problem
   




Re: [PR] [SPARK-46840][SQL][COLLATION] New Benchmarking Suite [spark]

Posted by "dbatomic (via GitHub)" <gi...@apache.org>.

dbatomic commented on code in PR #45453:
URL: https://github.com/apache/spark/pull/45453#discussion_r1536790142


##########
sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/CollationBenchmark.scala:
##########
@@ -0,0 +1,121 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.sql.execution.benchmark
+
+import scala.util.Random
+
+import org.apache.spark.benchmark.Benchmark
+import org.apache.spark.sql.{DataFrame}
+import org.apache.spark.sql.catalyst.util.CollationFactory
+import org.apache.spark.sql.functions._
+import org.apache.spark.unsafe.types.UTF8String
+
+/**
+ * Benchmark to measure performance for joins. To run this benchmark:
+ * {{{
+ *   1. without sbt:
+ *      bin/spark-submit --class <this class>
+ *        --jars <spark core test jar>,<spark catalyst test jar> <spark sql test jar>
+ *   2. build/sbt "sql/Test/runMain org.apache.spark.sql.execution.benchmark.CollationBenchmark"
+ *   3. generate result:
+ *      SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/Test/runMain <this class>"
+ *      Results will be written to "benchmarks/JoinBenchmark-results.txt".
+ * }}}
+ */
+
+object CollationBenchmark extends SqlBasedBenchmark {
+  private val collationTypes = Seq("UTF8_BINARY", "UTF8_BINARY_LCASE", "UNICODE", "UNICODE_CI")
+
+  def generateUTF8Strings(n: Int): Seq[UTF8String] = {
+    // Generate n UTF8Strings
+    Seq("ABC", "aBC", "abc", "DEF", "def", "GHI", "ghi", "JKL", "jkl",
+      "MNO", "mno", "PQR", "pqr", "STU", "stu", "VWX", "vwx", "YZ").map(UTF8String.fromString) ++
+    (18 to n).map(i => UTF8String.fromString(Random.nextString(i % 25))).sortBy(_.hashCode())
+  }
+
+  def benchmarkUTFString(collationTypes: Seq[String], utf8Strings: Seq[UTF8String]): Unit = {
+    val sublistStrings = utf8Strings.slice(0, 200)
+    val benchmark = collationTypes.foldLeft(
+      new Benchmark(s"collation unit benchmarks", utf8Strings.size, output = output)) {
+      (b, collationType) =>
+        val collation = CollationFactory.fetchCollation(collationType)
+        b.addCase(s"equalsFunction - $collationType") { _ =>
+          sublistStrings.foreach(s1 =>
+            utf8Strings.foreach(s =>
+            collation.equalsFunction(s, s1).booleanValue()
+            )
+          )
+        }
+        b.addCase(s"collator.compare - $collationType") { _ =>
+          sublistStrings.foreach(s1 =>
+            utf8Strings.foreach(s =>
+              collation.comparator.compare(s, s1)
+            )
+          )
+        }
+        b.addCase(s"hashFunction - $collationType") { _ =>
+          sublistStrings.foreach(s1 =>
+            utf8Strings.foreach(s =>
+              collation.hashFunction.applyAsLong(s)
+            )
+          )
+        }
+        b
+    }
+    benchmark.run()
+  }
+
+  def df1: DataFrame = {
+    val d = spark.createDataFrame(Seq(
+      ("ABC", "ABC"), ("aBC", "abc"), ("abc", "ABC"), ("DEF", "DEF"), ("def", "DEF"),

Review Comment:
   I would prefer to do this in this PR. There is no reason to hurry with this change, given that we are still functionally stabilizing the collation space. So, let's avoid follow-ups.




Re: [PR] [SPARK-46840][SQL][COLLATION] New Benchmarking Suite [spark]

Posted by "GideonPotok (via GitHub)" <gi...@apache.org>.

GideonPotok commented on code in PR #45453:
URL: https://github.com/apache/spark/pull/45453#discussion_r1536889413


##########
sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/CollationBenchmark.scala:
##########
@@ -0,0 +1,121 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.sql.execution.benchmark
+
+import scala.util.Random
+
+import org.apache.spark.benchmark.Benchmark
+import org.apache.spark.sql.{DataFrame}
+import org.apache.spark.sql.catalyst.util.CollationFactory
+import org.apache.spark.sql.functions._
+import org.apache.spark.unsafe.types.UTF8String
+
+/**
+ * Benchmark to measure performance for joins. To run this benchmark:
+ * {{{
+ *   1. without sbt:
+ *      bin/spark-submit --class <this class>
+ *        --jars <spark core test jar>,<spark catalyst test jar> <spark sql test jar>
+ *   2. build/sbt "sql/Test/runMain org.apache.spark.sql.execution.benchmark.CollationBenchmark"
+ *   3. generate result:
+ *      SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/Test/runMain <this class>"
+ *      Results will be written to "benchmarks/JoinBenchmark-results.txt".
+ * }}}
+ */
+
+object CollationBenchmark extends SqlBasedBenchmark {
+  private val collationTypes = Seq("UTF8_BINARY", "UTF8_BINARY_LCASE", "UNICODE", "UNICODE_CI")
+
+  def generateUTF8Strings(n: Int): Seq[UTF8String] = {
+    // Generate n UTF8Strings
+    Seq("ABC", "aBC", "abc", "DEF", "def", "GHI", "ghi", "JKL", "jkl",
+      "MNO", "mno", "PQR", "pqr", "STU", "stu", "VWX", "vwx", "YZ").map(UTF8String.fromString) ++
+    (18 to n).map(i => UTF8String.fromString(Random.nextString(i % 25))).sortBy(_.hashCode())
+  }
+
+  def benchmarkUTFString(collationTypes: Seq[String], utf8Strings: Seq[UTF8String]): Unit = {
+    val sublistStrings = utf8Strings.slice(0, 200)
+    val benchmark = collationTypes.foldLeft(
+      new Benchmark(s"collation unit benchmarks", utf8Strings.size, output = output)) {
+      (b, collationType) =>
+        val collation = CollationFactory.fetchCollation(collationType)
+        b.addCase(s"equalsFunction - $collationType") { _ =>
+          sublistStrings.foreach(s1 =>
+            utf8Strings.foreach(s =>
+            collation.equalsFunction(s, s1).booleanValue()
+            )
+          )
+        }
+        b.addCase(s"collator.compare - $collationType") { _ =>
+          sublistStrings.foreach(s1 =>
+            utf8Strings.foreach(s =>
+              collation.comparator.compare(s, s1)
+            )
+          )
+        }
+        b.addCase(s"hashFunction - $collationType") { _ =>
+          sublistStrings.foreach(s1 =>
+            utf8Strings.foreach(s =>
+              collation.hashFunction.applyAsLong(s)
+            )
+          )
+        }
+        b
+    }
+    benchmark.run()
+  }
+
+  def df1: DataFrame = {
+    val d = spark.createDataFrame(Seq(
+      ("ABC", "ABC"), ("aBC", "abc"), ("abc", "ABC"), ("DEF", "DEF"), ("def", "DEF"),

Review Comment:
   Done, inputs are now the same. Let me know what you think.
   
   Local results are as follows. They show that the equals function for UTF8_BINARY and UNICODE is much faster than for the other collations:
   
   
   ```
   [info] filter df column with collation:                     Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
   [info] -----------------------------------------------------------------------------------------------------------------------------------
   [info] filter df column with collation - UTF8_BINARY_LCASE              5              6           1          1.9         528.3       1.0X
   [info] filter df column with collation - UNICODE                        4              5           1          2.3         427.4       1.2X
   [info] filter df column with collation - UTF8_BINARY                    5              5           1          2.2         464.2       1.1X
   [info] filter df column with collation - UNICODE_CI                     5              6           1          1.9         531.1       1.0X
   [info] Running benchmark: collation unit benchmarks
   [info]   Running case: equalsFunction - UTF8_BINARY_LCASE
   [info]   Stopped after 2 iterations, 3897 ms
   [info]   Running case: collator.compare - UTF8_BINARY_LCASE
   [info]   Stopped after 2 iterations, 4164 ms
   [info]   Running case: hashFunction - UTF8_BINARY_LCASE
   [info]   Stopped after 2 iterations, 3782 ms
   [info]   Running case: equalsFunction - UNICODE
   [info]   Stopped after 6 iterations, 2142 ms
   [info]   Running case: collator.compare - UNICODE
   [info]   Stopped after 2 iterations, 7718 ms
   [info]   Running case: hashFunction - UNICODE
   [info]   Stopped after 2 iterations, 16612 ms
   [info]   Running case: equalsFunction - UTF8_BINARY
   [info]   Stopped after 7 iterations, 2088 ms
   [info]   Running case: collator.compare - UTF8_BINARY
   [info]   Stopped after 4 iterations, 2112 ms
   [info]   Running case: hashFunction - UTF8_BINARY
   [info]   Stopped after 3 iterations, 2324 ms
   [info]   Running case: equalsFunction - UNICODE_CI
   [info]   Stopped after 2 iterations, 8298 ms
   [info]   Running case: collator.compare - UNICODE_CI
   [info]   Stopped after 2 iterations, 6933 ms
   [info]   Running case: hashFunction - UNICODE_CI
   [info]   Stopped after 2 iterations, 12882 ms
   [info] OpenJDK 64-Bit Server VM 17.0.10+7-LTS on Mac OS X 14.4
   [info] Apple M3 Max
   [info] collation unit benchmarks:                Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
   [info] ------------------------------------------------------------------------------------------------------------------------
   [info] equalsFunction - UTF8_BINARY_LCASE                 1948           1949           1          0.0      194813.5       1.0X
   [info] collator.compare - UTF8_BINARY_LCASE               2081           2082           1          0.0      208136.3       0.9X
   [info] hashFunction - UTF8_BINARY_LCASE                   1890           1891           2          0.0      189021.8       1.0X
   [info] equalsFunction - UNICODE                            357            357           0          0.0       35675.0       5.5X
   [info] collator.compare - UNICODE                         3848           3859          16          0.0      384793.0       0.5X
   [info] hashFunction - UNICODE                             8304           8306           3          0.0      830445.5       0.2X
   [info] equalsFunction - UTF8_BINARY                        296            298           2          0.0       29608.1       6.6X
   [info] collator.compare - UTF8_BINARY                      528            528           0          0.0       52779.0       3.7X
   [info] hashFunction - UTF8_BINARY                          773            775           1          0.0       77336.4       2.5X
   [info] equalsFunction - UNICODE_CI                        4141           4149          12          0.0      414060.1       0.5X
   [info] collator.compare - UNICODE_CI                      3461           3467           8          0.0      346055.8       0.6X
   [info] hashFunction - UNICODE_CI                          6418           6441          33          0.0      641794.3       0.3X
   ```
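
   As a rough cross-check of the `Relative` column, the sketch below recomputes it for the four `equalsFunction` rows. It assumes `Relative` is the best time of the first case in the table divided by the best time of each case, which is consistent with the rows above. The object and method names are illustrative only and are not part of the PR; the inputs are the rounded `Best Time(ms)` values, so the ratios are approximate.

   ```scala
   import java.util.Locale

   // Illustrative helper, not part of the PR: recomputes the "Relative" column
   // from the rounded Best Time(ms) values reported in the table above.
   object RelativeColumn {
     // Best times (ms) copied from the "collation unit benchmarks" table.
     val bestMs: Seq[(String, Double)] = Seq(
       "equalsFunction - UTF8_BINARY_LCASE" -> 1948.0,
       "equalsFunction - UNICODE" -> 357.0,
       "equalsFunction - UTF8_BINARY" -> 296.0,
       "equalsFunction - UNICODE_CI" -> 4141.0)

     // Assumption: Relative = best time of the first case / best time of this case.
     def relative(cases: Seq[(String, Double)]): Seq[(String, String)] = {
       val baseline = cases.head._2
       cases.map { case (name, ms) =>
         name -> "%.1fX".formatLocal(Locale.ROOT, baseline / ms)
       }
     }

     def main(args: Array[String]): Unit =
       relative(bestMs).foreach { case (name, rel) => println(s"$name: $rel") }
   }
   ```

   Under that assumption the four rows come out as 1.0X, 5.5X, 6.6X and 0.5X, matching the table: the binary and UNICODE equality checks are several times faster than UTF8_BINARY_LCASE, while UNICODE_CI is about half as fast.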




Re: [PR] [SPARK-46840][SQL][TESTS] Add `CollationBenchmark` [spark]

Posted by "GideonPotok (via GitHub)" <gi...@apache.org>.

GideonPotok commented on code in PR #45453:
URL: https://github.com/apache/spark/pull/45453#discussion_r1540047223


##########
sql/core/benchmarks/CollationBenchmark-results.txt:
##########
@@ -0,0 +1,26 @@
+OpenJDK 64-Bit Server VM 17.0.10+7-LTS on Linux 6.5.0-1016-azure
+AMD EPYC 7763 64-Core Processor
+filter df column with collation:                     Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
+-----------------------------------------------------------------------------------------------------------------------------------
+filter df column with collation - UNICODE_CI                   403            463          39          0.0    20147470.0       1.0X
+filter df column with collation - UNICODE                      187            223          37          0.0     9339586.0       2.2X
+filter df column with collation - UTF8_BINARY_LCASE            426            434           7          0.0    21300903.4       0.9X
+filter df column with collation - UTF8_BINARY                  188            199           5          0.0     9403169.1       2.1X
+
+OpenJDK 64-Bit Server VM 17.0.10+7-LTS on Linux 6.5.0-1016-azure
+AMD EPYC 7763 64-Core Processor
+collation unit benchmarks:                Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
+------------------------------------------------------------------------------------------------------------------------
+equalsFunction - UTF8_BINARY                          0              0           0         10.4          96.6       1.0X
+collator.compare - UTF8_BINARY                        4              4           0          0.3        3717.9       0.0X
+hashFunction - UTF8_BINARY                            0              0           0         42.0          23.8       4.1X
+equalsFunction - UTF8_BINARY_LCASE                    5              5           0          0.2        5014.7       0.0X

Review Comment:
   Done!
   



##########
sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/CollationBenchmark.scala:
##########
@@ -0,0 +1,117 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.sql.execution.benchmark
+
+import org.apache.spark.benchmark.Benchmark
+import org.apache.spark.sql.DataFrame
+import org.apache.spark.sql.catalyst.util.CollationFactory
+import org.apache.spark.sql.functions._
+import org.apache.spark.unsafe.types.UTF8String
+
+/**
+ * Benchmark to measure performance for comparisons between collated strings. To run this benchmark:
+ * {{{
+ *   1. without sbt:
+ *      bin/spark-submit --class <this class>
+ *        --jars <spark core test jar>,<spark catalyst test jar> <spark sql test jar>
+ *   2. build/sbt "sql/Test/runMain org.apache.spark.sql.execution.benchmark.CollationBenchmark"
+ *   3. generate result:
+ *      SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/Test/runMain <this class>"
+ *      Results will be written to "benchmarks/CollationBenchmark-results.txt".
+ * }}}
+ */
+
+object CollationBenchmark extends SqlBasedBenchmark {
+  private val collationTypes = Seq("UTF8_BINARY_LCASE", "UNICODE", "UTF8_BINARY", "UNICODE_CI")
+
+  def generateSeqInput(n: Long): Seq[UTF8String] = {
+    val input = Seq("ABC", "ABC", "aBC", "aBC", "abc", "abc", "DEF", "DEF", "def",
+      "def", "GHI", "ghi",
+      "JKL", "jkl", "MNO", "mno", "PQR", "pqr", "STU", "stu", "VWX", "vwx", "YZ",
+      "ABC", "ABC", "aBC", "aBC", "abc", "abc", "DEF", "DEF", "def", "def", "GHI", "ghi",
+      "JKL", "jkl", "MNO", "mno", "PQR", "pqr", "STU", "stu", "VWX", "vwx", "YZ")
+      .map(UTF8String.fromString)
+    val inputLong: Seq[UTF8String] = (0L until n).map(i => input(i.toInt % input.size))
+    inputLong
+  }
+
+  private def getDataFrame(strings: Seq[String]): DataFrame = {
+    val asPairs = strings.sliding(2, 1).toSeq.map {
+      case Seq(s1, s2) => (s1, s2)
+    }
+    val d = spark.createDataFrame(asPairs).toDF("s1", "s2")
+    d
+  }
+
+  private def generateDataframeInput(l: Long): DataFrame = {
+    getDataFrame(generateSeqInput(l).map(_.toString))
+  }
+
+  def benchmarkUTFString(collationTypes: Seq[String], utf8Strings: Seq[UTF8String]): Unit = {
+    val sublistStrings = utf8Strings
+
+    val benchmark = new Benchmark("collation unit benchmarks", utf8Strings.size, output = output)
+    collationTypes.foreach(collationType => {
+      val collation = CollationFactory.fetchCollation(collationType)
+      benchmark.addCase(s"equalsFunction - $collationType") { _ =>
+        sublistStrings.foreach(s1 =>
+          utf8Strings.foreach(s =>
+            collation.equalsFunction(s, s1).booleanValue()
+          )
+        )
+      }
+      benchmark.addCase(s"collator.compare - $collationType") { _ =>
+        sublistStrings.foreach(s1 =>
+          utf8Strings.foreach(s =>
+            collation.comparator.compare(s, s1)
+          )
+        )
+      }
+      benchmark.addCase(s"hashFunction - $collationType") { _ =>
+        sublistStrings.foreach(_ =>
+          utf8Strings.foreach(s =>
+            collation.hashFunction.applyAsLong(s)
+          )
+        )
+      }
+    }
+    )
+    benchmark.run()
+  }
+
+  def benchmarkFilterEqual(collationTypes: Seq[String],
+                           dfUncollated: DataFrame): Unit = {
+    val benchmark =
+      new Benchmark("filter df column with collation", dfUncollated.count(), output = output)
+    collationTypes.foreach(collationType => {
+      val dfCollated = dfUncollated.selectExpr(
+        s"collate(s2, '$collationType') as k2_$collationType",
+        s"collate(s1, '$collationType') as k1_$collationType")
+      benchmark.addCase(s"filter df column with collation - $collationType") { _ =>
+        dfCollated.where(col(s"k1_$collationType") === col(s"k2_$collationType"))
+          .queryExecution.executedPlan.executeCollect()

Review Comment:
   K. Done.




Re: [PR] [SPARK-46840][SQL][TESTS] Add `CollationBenchmark` [spark]

Posted by "GideonPotok (via GitHub)" <gi...@apache.org>.

GideonPotok commented on code in PR #45453:
URL: https://github.com/apache/spark/pull/45453#discussion_r1540048010


##########
sql/core/benchmarks/CollationBenchmark-results.txt:
##########
@@ -0,0 +1,26 @@
+OpenJDK 64-Bit Server VM 17.0.10+7-LTS on Linux 6.5.0-1016-azure
+AMD EPYC 7763 64-Core Processor
+filter df column with collation:                     Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
+-----------------------------------------------------------------------------------------------------------------------------------
+filter df column with collation - UNICODE_CI                   403            463          39          0.0    20147470.0       1.0X
+filter df column with collation - UNICODE                      187            223          37          0.0     9339586.0       2.2X
+filter df column with collation - UTF8_BINARY_LCASE            426            434           7          0.0    21300903.4       0.9X
+filter df column with collation - UTF8_BINARY                  188            199           5          0.0     9403169.1       2.1X
+
+OpenJDK 64-Bit Server VM 17.0.10+7-LTS on Linux 6.5.0-1016-azure
+AMD EPYC 7763 64-Core Processor
+collation unit benchmarks:                Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
+------------------------------------------------------------------------------------------------------------------------
+equalsFunction - UTF8_BINARY                          0              0           0         10.4          96.6       1.0X

Review Comment:
   Done. 




Re: [PR] [SPARK-46840][SQL][COLLATION] New Benchmarking Suite [spark]

Posted by "GideonPotok (via GitHub)" <gi...@apache.org>.

GideonPotok commented on code in PR #45453:
URL: https://github.com/apache/spark/pull/45453#discussion_r1536886024


##########
sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/CollationBenchmark.scala:
##########
@@ -0,0 +1,112 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.sql.execution.benchmark
+
+import org.apache.spark.benchmark.Benchmark
+import org.apache.spark.sql.DataFrame
+import org.apache.spark.sql.catalyst.util.CollationFactory
+import org.apache.spark.sql.functions._
+import org.apache.spark.unsafe.types.UTF8String
+
+/**
+ * Benchmark to measure performance for joins. To run this benchmark:
+ * {{{
+ *   1. without sbt:
+ *      bin/spark-submit --class <this class>
+ *        --jars <spark core test jar>,<spark catalyst test jar> <spark sql test jar>
+ *   2. build/sbt "sql/Test/runMain org.apache.spark.sql.execution.benchmark.CollationBenchmark"
+ *   3. generate result:
+ *      SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/Test/runMain <this class>"
+ *      Results will be written to "benchmarks/JoinBenchmark-results.txt".
+ * }}}
+ */
+
+object CollationBenchmark extends SqlBasedBenchmark {
+  private val collationTypes = Seq("UTF8_BINARY", "UTF8_BINARY_LCASE", "UNICODE", "UNICODE_CI")
+
+  def generateUTF8Strings(n: Int): Seq[UTF8String] = {
+    // Generate n UTF8Strings
+    Seq("ABC", "aBC", "abc", "DEF", "def", "GHI", "ghi", "JKL", "jkl",
+      "MNO", "mno", "PQR", "pqr", "STU", "stu", "VWX", "vwx", "YZ").map(UTF8String.fromString) ++
+    (18 to n).map(i => UTF8String.fromString(i.toOctalString))
+  }
+
+  def benchmarkUTFString(collationTypes: Seq[String], utf8Strings: Seq[UTF8String]): Unit = {
+    val sublistStrings = utf8Strings
+    val benchmark = collationTypes.foldLeft(
+      new Benchmark(s"collation unit benchmarks", utf8Strings.size, output = output)) {
+      (b, collationType) =>
+        val collation = CollationFactory.fetchCollation(collationType)
+        b.addCase(s"equalsFunction - $collationType") { _ =>
+          sublistStrings.foreach(s1 =>
+            utf8Strings.foreach(s =>
+            collation.equalsFunction(s, s1).booleanValue()
+            )
+          )
+        }
+        b.addCase(s"collator.compare - $collationType") { _ =>
+          sublistStrings.foreach(s1 =>
+            utf8Strings.foreach(s =>
+              collation.comparator.compare(s, s1)
+            )
+          )
+        }
+        b.addCase(s"hashFunction - $collationType") { _ =>
+          sublistStrings.foreach(_ =>
+            utf8Strings.foreach(s =>
+              collation.hashFunction.applyAsLong(s)
+            )
+          )
+        }
+        b
+    }
+    benchmark.run()
+  }
+
+  def df1: DataFrame = {
+    val d = spark.createDataFrame(Seq(
+      ("ABC", "ABC"), ("aBC", "abc"), ("abc", "ABC"), ("DEF", "DEF"), ("def", "DEF"),
+      ("GHI", "GHI"), ("ghi", "IG"), ("JKL", "NOP"), ("jkl", "LKJ"), ("mnO", "MNO"),
+      ("hello", "hola"))
+    ).toDF("s1", "s2")
+    d
+  }
+
+  def collationBenchmarkFilterEqual(collationTypes: Seq[String]): Unit = {
+    val benchmark = collationTypes.foldLeft(
+      new Benchmark(s"filter df column with collation", 11, output = output)) {
+      (b, collationType) =>
+        b.addCase(s"filter df column with collation - $collationType") { _ =>
+          val df = df1.selectExpr(
+            s"collate(s2, '$collationType') as k2_$collationType",
+            s"collate(s1, '$collationType') as k1_$collationType")
+
+          df.where(col(s"k1_$collationType") === col(s"k2_$collationType"))
+                .queryExecution.executedPlan.executeCollect()
+
+        }
+        b
+    }
+    benchmark.run()
+  }
+
+
+  override def runBenchmarkSuite(mainArgs: Array[String]): Unit = {
+    collationBenchmarkFilterEqual(collationTypes.reverse)

Review Comment:
   Good points. Let me know if you are happy with my changes, which I believe incorporated this feedback.




Re: [PR] [SPARK-46840][SQL][TESTS] Add `CollationBenchmark` [spark]

Posted by "GideonPotok (via GitHub)" <gi...@apache.org>.

GideonPotok commented on code in PR #45453:
URL: https://github.com/apache/spark/pull/45453#discussion_r1537987211


##########
sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/CollationBenchmark.scala:
##########
@@ -0,0 +1,117 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.sql.execution.benchmark
+
+import org.apache.spark.benchmark.Benchmark
+import org.apache.spark.sql.DataFrame
+import org.apache.spark.sql.catalyst.util.CollationFactory
+import org.apache.spark.sql.functions._
+import org.apache.spark.unsafe.types.UTF8String
+
+/**
+ * Benchmark to measure performance for joins. To run this benchmark:

Review Comment:
   @cloud-fan how about this?
   ```suggestion
    * Benchmark to measure performance for operations on strings with the various collation types. To run this benchmark:
   ```




Re: [PR] [SPARK-46840][SQL][TESTS] Add `CollationBenchmark` [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.

cloud-fan commented on code in PR #45453:
URL: https://github.com/apache/spark/pull/45453#discussion_r1537855722


##########
sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/CollationBenchmark.scala:
##########
@@ -0,0 +1,117 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.sql.execution.benchmark
+
+import org.apache.spark.benchmark.Benchmark
+import org.apache.spark.sql.DataFrame
+import org.apache.spark.sql.catalyst.util.CollationFactory
+import org.apache.spark.sql.functions._
+import org.apache.spark.unsafe.types.UTF8String
+
+/**
+ * Benchmark to measure performance for joins. To run this benchmark:
+ * {{{
+ *   1. without sbt:
+ *      bin/spark-submit --class <this class>
+ *        --jars <spark core test jar>,<spark catalyst test jar> <spark sql test jar>
+ *   2. build/sbt "sql/Test/runMain org.apache.spark.sql.execution.benchmark.CollationBenchmark"
+ *   3. generate result:
+ *      SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/Test/runMain <this class>"
+ *      Results will be written to "benchmarks/CollationBenchmark-results.txt".
+ * }}}
+ */
+
+object CollationBenchmark extends SqlBasedBenchmark {
+  private val collationTypes = Seq("UTF8_BINARY_LCASE", "UNICODE", "UTF8_BINARY", "UNICODE_CI")
+
+  def generateSeqInput(n: Long): Seq[UTF8String] = {
+    val input = Seq("ABC", "ABC", "aBC", "aBC", "abc", "abc", "DEF", "DEF", "def",
+      "def", "GHI", "ghi",
+      "JKL", "jkl", "MNO", "mno", "PQR", "pqr", "STU", "stu", "VWX", "vwx", "YZ",
+      "ABC", "ABC", "aBC", "aBC", "abc", "abc", "DEF", "DEF", "def", "def", "GHI", "ghi",
+      "JKL", "jkl", "MNO", "mno", "PQR", "pqr", "STU", "stu", "VWX", "vwx", "YZ")
+      .map(UTF8String.fromString)
+    val inputLong: Seq[UTF8String] = (0L until n).map(i => input(i.toInt % input.size))
+    inputLong
+  }
+
+  private def getDataFrame(strings: Seq[String]): DataFrame = {
+    val asPairs = strings.sliding(2, 1).toSeq.map {
+      case Seq(s1, s2) => (s1, s2)
+    }
+    val d = spark.createDataFrame(asPairs).toDF("s1", "s2")
+    d
+  }
+
+  private def generateDataframeInput(l: Long): DataFrame = {
+    getDataFrame(generateSeqInput(l).map(_.toString))
+  }
+
+  def benchmarkUTFString(collationTypes: Seq[String], utf8Strings: Seq[UTF8String]): Unit = {
+    val sublistStrings = utf8Strings
+    val benchmark = collationTypes.foldLeft(
+      new Benchmark(s"collation unit benchmarks", utf8Strings.size, output = output)) {

Review Comment:
   This looks weird, as `Benchmark.addCase` returns Unit and we are using `foldLeft` here. I think it's better to have
   ```
   val benchmark = new Benchmark...
   collationTypes.foreach ...
   ```
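
   A minimal, self-contained sketch of that shape follows. `MiniBenchmark` is a hypothetical stand-in for Spark's `Benchmark` (only the fact that `addCase` returns `Unit` is taken from the comment above), and the case bodies are empty placeholders rather than real collation calls.

   ```scala
   import scala.collection.mutable.ArrayBuffer

   // Hypothetical stand-in for Spark's Benchmark: addCase returns Unit and
   // registers the case by mutating the instance.
   final class MiniBenchmark(val name: String) {
     private val cases = ArrayBuffer.empty[(String, Int => Unit)]

     def addCase(caseName: String)(f: Int => Unit): Unit = {
       cases += ((caseName, f))
     }

     def caseNames: Seq[String] = cases.map(_._1).toSeq

     // Runs every registered case once, passing its index as the iteration number.
     def run(): Unit = cases.zipWithIndex.foreach { case ((_, f), i) => f(i) }
   }

   object ForeachRegistration {
     def build(collationTypes: Seq[String]): MiniBenchmark = {
       val benchmark = new MiniBenchmark("collation unit benchmarks")
       // addCase is called only for its side effect, so foreach states the
       // intent directly; there is no accumulator to thread through a fold.
       collationTypes.foreach { collationType =>
         benchmark.addCase(s"equalsFunction - $collationType") { _ => () }
       }
       benchmark
     }
   }
   ```

   With `foldLeft`, each step has to end in a bare `b` just to hand the same mutable object back to the fold; `foreach` removes that extra line and makes the mutation explicit.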




Re: [PR] [SPARK-46840][SQL][TESTS] Add `CollationBenchmark` [spark]

Posted by "GideonPotok (via GitHub)" <gi...@apache.org>.

GideonPotok commented on code in PR #45453:
URL: https://github.com/apache/spark/pull/45453#discussion_r1538163900


##########
sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/CollationBenchmark.scala:
##########
@@ -0,0 +1,117 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.sql.execution.benchmark
+
+import org.apache.spark.benchmark.Benchmark
+import org.apache.spark.sql.DataFrame
+import org.apache.spark.sql.catalyst.util.CollationFactory
+import org.apache.spark.sql.functions._
+import org.apache.spark.unsafe.types.UTF8String
+
+/**
+ * Benchmark to measure performance for comparisons between collated strings. To run this benchmark:
+ * {{{
+ *   1. without sbt:
+ *      bin/spark-submit --class <this class>
+ *        --jars <spark core test jar>,<spark catalyst test jar> <spark sql test jar>
+ *   2. build/sbt "sql/Test/runMain org.apache.spark.sql.execution.benchmark.CollationBenchmark"
+ *   3. generate result:
+ *      SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/Test/runMain <this class>"
+ *      Results will be written to "benchmarks/CollationBenchmark-results.txt".
+ * }}}
+ */
+
+object CollationBenchmark extends SqlBasedBenchmark {
+  private val collationTypes = Seq("UTF8_BINARY_LCASE", "UNICODE", "UTF8_BINARY", "UNICODE_CI")
+
+  def generateSeqInput(n: Long): Seq[UTF8String] = {
+    val input = Seq("ABC", "ABC", "aBC", "aBC", "abc", "abc", "DEF", "DEF", "def",
+      "def", "GHI", "ghi",
+      "JKL", "jkl", "MNO", "mno", "PQR", "pqr", "STU", "stu", "VWX", "vwx", "YZ",
+      "ABC", "ABC", "aBC", "aBC", "abc", "abc", "DEF", "DEF", "def", "def", "GHI", "ghi",
+      "JKL", "jkl", "MNO", "mno", "PQR", "pqr", "STU", "stu", "VWX", "vwx", "YZ")
+      .map(UTF8String.fromString)
+    val inputLong: Seq[UTF8String] = (0L until n).map(i => input(i.toInt % input.size))
+    inputLong
+  }
+
+  private def getDataFrame(strings: Seq[String]): DataFrame = {
+    val asPairs = strings.sliding(2, 1).toSeq.map {
+      case Seq(s1, s2) => (s1, s2)
+    }
+    val d = spark.createDataFrame(asPairs).toDF("s1", "s2")
+    d
+  }
+
+  private def generateDataframeInput(l: Long): DataFrame = {
+    getDataFrame(generateSeqInput(l).map(_.toString))
+  }
+
+  def benchmarkUTFString(collationTypes: Seq[String], utf8Strings: Seq[UTF8String]): Unit = {
+    val sublistStrings = utf8Strings
+
+    val benchmark = new Benchmark("collation unit benchmarks", utf8Strings.size, output = output)
+    collationTypes.foreach(collationType => {
+      val collation = CollationFactory.fetchCollation(collationType)
+      benchmark.addCase(s"equalsFunction - $collationType") { _ =>
+        sublistStrings.foreach(s1 =>
+          utf8Strings.foreach(s =>
+            collation.equalsFunction(s, s1).booleanValue()
+          )
+        )
+      }
+      benchmark.addCase(s"collator.compare - $collationType") { _ =>
+        sublistStrings.foreach(s1 =>
+          utf8Strings.foreach(s =>
+            collation.comparator.compare(s, s1)
+          )
+        )
+      }
+      benchmark.addCase(s"hashFunction - $collationType") { _ =>
+        sublistStrings.foreach(_ =>
+          utf8Strings.foreach(s =>
+            collation.hashFunction.applyAsLong(s)
+          )
+        )
+      }
+    }
+    )
+    benchmark.run()
+  }
+
+  def benchmarkFilterEqual(collationTypes: Seq[String],
+                           dfUncollated: DataFrame): Unit = {

Review Comment:
   Done.
   
   Do you happen to have a scalafmt.conf file or an IntelliJ codeStyleConfig.xml file corresponding to the full Databricks Scala style guide? I would prefer to have the formatting applied automatically.




Re: [PR] [WIP][SPARK 46840] Add sql.execution.benchmark.CollationBenchmark.scala Scaffolding [spark]

Posted by "GideonPotok (via GitHub)" <gi...@apache.org>.

GideonPotok commented on PR #45453:
URL: https://github.com/apache/spark/pull/45453#issuecomment-1989191094

   > @GideonPotok - I think a better approach for the collation benchmarking track is to start with the basics, e.g. unit benchmarks against `CollationFactory` + `UTF8String`. For example, what is the perf difference for a simple filter, without the rest of the Spark stack, between UTF8_BINARY, UTF8_BINARY_LCASE, UNICODE and UNICODE_CI? After filter we can do the same for hashFunction. You should be able to just generate a bunch of UTF8Strings and guide them through the `comparator`/`hashFunction` of the `Collations` in `CollationFactory`.
   > 
   > That way benchmarking will be actionable. Starting immediately with joins is too high up and I think that we will not be able to do much with the results.
   
   @dbatomic That is extremely helpful, thank you. Will do that.
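
A minimal sketch of the kind of unit-level benchmark suggested above, written without any Spark dependency so that it runs on a plain JVM. Everything here is an assumption made for illustration: the object name `CollationUnitSketch`, the four comparator labels, and the use of the JDK's `java.text.Collator` as a stand-in for the collations in `CollationFactory`. The absolute numbers it prints are therefore not comparable to Spark's; it only shows the shape of the measurement (generate strings, push every pair through a comparator, time it).

```scala
import java.text.Collator
import java.util.Locale

object CollationUnitSketch {
  type Cmp = (String, String) => Int

  // Hypothetical stand-in: a JDK collator at a given strength, not Spark's CollationFactory.
  private def jdkCollator(strength: Int): Cmp = {
    val c = Collator.getInstance(Locale.ROOT)
    c.setStrength(strength)
    (a, b) => c.compare(a, b)
  }

  // Plain code-unit comparison, and the same comparison after lowercasing both sides.
  val binary: Cmp = (a, b) => a.compareTo(b)
  val binaryLcase: Cmp =
    (a, b) => a.toLowerCase(Locale.ROOT).compareTo(b.toLowerCase(Locale.ROOT))

  // Labels are illustrative only; TERTIARY is case-sensitive, SECONDARY ignores case.
  val comparators: Seq[(String, Cmp)] = Seq(
    "binary" -> binary,
    "binary_lcase" -> binaryLcase,
    "collator" -> jdkCollator(Collator.TERTIARY),
    "collator_ci" -> jdkCollator(Collator.SECONDARY))

  // Times one pass over all pairs of inputs and returns the elapsed nanoseconds.
  def timeAllPairs(cmp: Cmp, inputs: Seq[String]): Long = {
    var sink = 0 // accumulate results so the calls cannot be optimized away
    val start = System.nanoTime()
    for (a <- inputs; b <- inputs) sink += cmp(a, b)
    val elapsed = System.nanoTime() - start
    if (sink == Int.MinValue) println(sink) // keeps sink observable
    elapsed
  }

  def main(args: Array[String]): Unit = {
    val base = Seq("ABC", "aBC", "abc", "DEF", "def", "GHI", "ghi")
    val inputs = Seq.fill(100)(base).flatten // 700 strings, 490,000 pairs per pass
    comparators.foreach { case (name, cmp) =>
      timeAllPairs(cmp, inputs) // warm-up pass, result discarded
      val ns = timeAllPairs(cmp, inputs)
      println(f"$name%-14s ${ns / 1e6}%.1f ms")
    }
  }
}
```

The same loop could be pointed at a hash function instead of a comparator; a real version would use Spark's `Benchmark` harness rather than hand-rolled `System.nanoTime` timing.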



Re: [PR] [SPARK-46840][SQL][COLLATION] New Benchmarking Suite [spark]

Posted by "GideonPotok (via GitHub)" <gi...@apache.org>.

GideonPotok commented on code in PR #45453:
URL: https://github.com/apache/spark/pull/45453#discussion_r1534694307


##########
sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/CollationBenchmark.scala:
##########
@@ -0,0 +1,119 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.sql.execution.benchmark
+
+import scala.util.Random
+
+import org.apache.spark.benchmark.Benchmark
+import org.apache.spark.sql.{functions, Column}
+import org.apache.spark.sql.catalyst.util.CollationFactory
+import org.apache.spark.sql.functions._
+import org.apache.spark.unsafe.types.UTF8String
+
+/**
+ * Benchmark to measure performance for joins. To run this benchmark:
+ * {{{
+ *   1. without sbt:
+ *      bin/spark-submit --class <this class>
+ *        --jars <spark core test jar>,<spark catalyst test jar> <spark sql test jar>
+ *   2. build/sbt "sql/Test/runMain org.apache.spark.sql.execution.benchmark.CollationBenchmark"
+ *   3. generate result:
+ *      SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/Test/runMain <this class>"
+ *      Results will be written to "benchmarks/JoinBenchmark-results.txt".
+ * }}}
+ */
+
+object CollationBenchmark extends SqlBasedBenchmark {
+  private val collationTypes = Seq("UTF8_BINARY", "UTF8_BINARY_LCASE", "UNICODE", "UNICODE_CI")
+
+  def generateUTF8Strings(n: Int): Seq[UTF8String] = {
+    // Generate n UTF8Strings
+    Seq("ABC", "aBC", "abc", "DEF", "def", "GHI", "ghi", "JKL", "jkl",
+      "MNO", "mno", "PQR", "pqr", "STU", "stu", "VWX", "vwx", "YZ").map(UTF8String.fromString) ++
+    (18 to n).map(i => UTF8String.fromString(Random.nextString(i % 25))).sortBy(_.hashCode())
+  }
+
+  def benchmarkUTFString(collationTypes: Seq[String], utf8Strings: Seq[UTF8String]): Unit = {
+    val benchmark = collationTypes.foldLeft(
+      new Benchmark(s"collation unit benchmarks", utf8Strings.size, output = output)) {
+      (b, collationType) =>
+        b.addCase(s"equalsFunction - $collationType") { _ =>
+          val collation = CollationFactory.fetchCollation(collationType)

Review Comment:
   @dbatomic done




Re: [PR] [SPARK-46840][SQL][COLLATION] New Benchmarking Suite [spark]

Posted by "GideonPotok (via GitHub)" <gi...@apache.org>.

GideonPotok commented on code in PR #45453:
URL: https://github.com/apache/spark/pull/45453#discussion_r1534987486


##########
sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/CollationBenchmark.scala:
##########
@@ -0,0 +1,121 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.sql.execution.benchmark
+
+import scala.util.Random
+
+import org.apache.spark.benchmark.Benchmark
+import org.apache.spark.sql.{DataFrame}
+import org.apache.spark.sql.catalyst.util.CollationFactory
+import org.apache.spark.sql.functions._
+import org.apache.spark.unsafe.types.UTF8String
+
+/**
+ * Benchmark to measure performance for joins. To run this benchmark:
+ * {{{
+ *   1. without sbt:
+ *      bin/spark-submit --class <this class>
+ *        --jars <spark core test jar>,<spark catalyst test jar> <spark sql test jar>
+ *   2. build/sbt "sql/Test/runMain org.apache.spark.sql.execution.benchmark.CollationBenchmark"
+ *   3. generate result:
+ *      SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/Test/runMain <this class>"
+ *      Results will be written to "benchmarks/JoinBenchmark-results.txt".
+ * }}}
+ */
+
+object CollationBenchmark extends SqlBasedBenchmark {
+  private val collationTypes = Seq("UTF8_BINARY", "UTF8_BINARY_LCASE", "UNICODE", "UNICODE_CI")
+
+  def generateUTF8Strings(n: Int): Seq[UTF8String] = {
+    // Generate n UTF8Strings
+    Seq("ABC", "aBC", "abc", "DEF", "def", "GHI", "ghi", "JKL", "jkl",
+      "MNO", "mno", "PQR", "pqr", "STU", "stu", "VWX", "vwx", "YZ").map(UTF8String.fromString) ++
+    (18 to n).map(i => UTF8String.fromString(Random.nextString(i % 25))).sortBy(_.hashCode())
+  }
+
+  def benchmarkUTFString(collationTypes: Seq[String], utf8Strings: Seq[UTF8String]): Unit = {
+    val sublistStrings = utf8Strings.slice(0, 200)
+    val benchmark = collationTypes.foldLeft(
+      new Benchmark(s"collation unit benchmarks", utf8Strings.size, output = output)) {
+      (b, collationType) =>
+        val collation = CollationFactory.fetchCollation(collationType)
+        b.addCase(s"equalsFunction - $collationType") { _ =>
+          sublistStrings.foreach(s1 =>
+            utf8Strings.foreach(s =>
+            collation.equalsFunction(s, s1).booleanValue()
+            )
+          )
+        }
+        b.addCase(s"collator.compare - $collationType") { _ =>
+          sublistStrings.foreach(s1 =>
+            utf8Strings.foreach(s =>
+              collation.comparator.compare(s, s1)
+            )
+          )
+        }
+        b.addCase(s"hashFunction - $collationType") { _ =>
+          sublistStrings.foreach(s1 =>
+            utf8Strings.foreach(s =>
+              collation.hashFunction.applyAsLong(s)
+            )
+          )
+        }
+        b
+    }
+    benchmark.run()
+  }
+
+  def df1: DataFrame = {
+    val d = spark.createDataFrame(Seq(
+      ("ABC", "ABC"), ("aBC", "abc"), ("abc", "ABC"), ("DEF", "DEF"), ("def", "DEF"),

Review Comment:
   I missed this comment. Do you mind if we merge as-is? I have already rerun the benchmarks and updated the result files. I can make the inputs the same in a follow-up, if that sounds good.




Re: [PR] [SPARK-46840][SQL][COLLATION] New Benchmarking Suite [spark]

Posted by "GideonPotok (via GitHub)" <gi...@apache.org>.

GideonPotok commented on PR #45453:
URL: https://github.com/apache/spark/pull/45453#issuecomment-2015575754

   @dbatomic @stefanbuk-db Please approve.



Re: [PR] [SPARK-46840][SQL][TESTS] Add `CollationBenchmark` [spark]

Posted by "yaooqinn (via GitHub)" <gi...@apache.org>.

yaooqinn commented on PR #45453:
URL: https://github.com/apache/spark/pull/45453#issuecomment-2082071977

   Hey guys,
   
   I am currently regenerating the complete benchmark results with 20 jobs running simultaneously. Each job usually takes around 10 to 30 minutes to complete. However, the job that includes the new benchmark is taking exceptionally long: it had been running for 2.5 hours as of 2024-04-29 15:38:56 (Shanghai time) and has still not finished. I am wondering if we should reduce the cardinality to speed it up.
   
   https://github.com/yaooqinn/spark/actions/runs/8872875176/job/24357926521
   
   
   ```log
   Running org.apache.spark.sql.execution.benchmark.CollationBenchmark:
   Running benchmark: collation unit benchmarks - equalsFunction
     Running case: UTF8_BINARY_LCASE
     Stopped after 2 iterations, 13865 ms
     Running case: UNICODE
     Stopped after 2 iterations, 8754 ms
     Running case: UTF8_BINARY
     Stopped after 2 iterations, 8723 ms
     Running case: UNICODE_CI
     Stopped after 2 iterations, 92410 ms
   OpenJDK 64-Bit Server VM 21.0.3+9-LTS on Linux 6.5.0-1018-azure
   AMD EPYC 7763 64-Core Processor
   collation unit benchmarks - equalsFunction:  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
   --------------------------------------------------------------------------------------------------------------------------
   UTF8_BINARY_LCASE                                    6931           6933           2          0.0       69310.8       1.0X
   UNICODE                                              4355           4377          32          0.0       43547.4       1.6X
   UTF8_BINARY                                          4359           4362           3          0.0       43592.6       1.6X
   UNICODE_CI                                          46188          46205          24          0.0      461878.5       0.2X
   Running benchmark: collation unit benchmarks - compareFunction
     Running case: UTF8_BINARY_LCASE
     Stopped after 2 iterations, 13838 ms
     Running case: UNICODE
     Stopped after 2 iterations, 92542 ms
     Running case: UTF8_BINARY
     Stopped after 2 iterations, 16151 ms
     Running case: UNICODE_CI
     Stopped after 2 iterations, 97297 ms
   OpenJDK 64-Bit Server VM 21.0.3+9-LTS on Linux 6.5.0-1018-azure
   AMD EPYC 7763 64-Core Processor
   collation unit benchmarks - compareFunction:  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
   ---------------------------------------------------------------------------------------------------------------------------
   UTF8_BINARY_LCASE                                     6912           6919          10          0.0       69119.6       1.0X
   UNICODE                                              46242          46271          42          0.0      462416.1       0.1X
   UTF8_BINARY                                           8071           8076           6          0.0       80713.3       0.9X
   UNICODE_CI                                           48626          48649          32          0.0      486262.5       0.1X
   Running benchmark: collation unit benchmarks - hashFunction
     Running case: UTF8_BINARY_LCASE
     Stopped after 2 iterations, 23280 ms
     Running case: UNICODE
     Stopped after 2 iterations, 373689 ms
     Running case: UTF8_BINARY
     Stopped after 2 iterations, 19852 ms
     Running case: UNICODE_CI
     Stopped after 2 iterations, 318401 ms
   OpenJDK 64-Bit Server VM 21.0.3+9-LTS on Linux 6.5.0-1018-azure
   AMD EPYC 7763 64-Core Processor
   collation unit benchmarks - hashFunction:  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
   ------------------------------------------------------------------------------------------------------------------------
   UTF8_BINARY_LCASE                                 11611          11640          41          0.0      116108.1       1.0X
   UNICODE                                          186807         186845          53          0.0     1868069.7       0.1X
   UTF8_BINARY                                        9896           9926          43          0.0       98959.6       1.2X
   UNICODE_CI                                       159154         159201          66          0.0     1591543.6       0.1X
   Running benchmark: collation unit benchmarks - contains
     Running case: UTF8_BINARY_LCASE
     Stopped after 2 iterations, 66470 ms
     Running case: UNICODE
     Stopped after 2 iterations, 36300 ms
     Running case: UTF8_BINARY
     Stopped after 2 iterations, 40600 ms
     Running case: UNICODE_CI
     Stopped after 2 iterations, 1821495 ms
   OpenJDK 64-Bit Server VM 21.0.3+9-LTS on Linux 6.5.0-1018-azure
   AMD EPYC 7763 64-Core Processor
   collation unit benchmarks - contains:     Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
   ------------------------------------------------------------------------------------------------------------------------
   UTF8_BINARY_LCASE                                 33224          33235          17          0.0      332236.8       1.0X
   UNICODE                                           18132          18150          26          0.0      181316.5       1.8X
   UTF8_BINARY                                       20296          20300           6          0.0      202959.6       1.6X
   UNICODE_CI                                       905591         910748        7293          0.0     9055905.0       0.0X
   Running benchmark: collation unit benchmarks - startsWith
     Running case: UTF8_BINARY_LCASE
     Stopped after 2 iterations, 65119 ms
     Running case: UNICODE
     Stopped after 2 iterations, 34889 ms
     Running case: UTF8_BINARY
     Stopped after 2 iterations, 39591 ms
     Running case: UNICODE_CI
     Stopped after 2 iterations, 1777884 ms
   OpenJDK 64-Bit Server VM 21.0.3+9-LTS on Linux 6.5.0-1018-azure
   AMD EPYC 7763 64-Core Processor
   collation unit benchmarks - startsWith:   Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
   ------------------------------------------------------------------------------------------------------------------------
   UTF8_BINARY_LCASE                                 32524          32560          51          0.0      325239.2       1.0X
   UNICODE                                           17439          17445           7          0.0      174393.7       1.9X
   UTF8_BINARY                                       19756          19796          57          0.0      197556.5       1.6X
   UNICODE_CI                                       888925         888942          24          0.0     8889250.7       0.0X
   Running benchmark: collation unit benchmarks - endsWith
     Running case: UTF8_BINARY_LCASE
     Stopped after 2 iterations, 65636 ms
     Running case: UNICODE
     Stopped after 2 iterations, 35204 ms
     Running case: UTF8_BINARY
     Stopped after 2 iterations, 39829 ms
     Running case: UNICODE_CI
   ```
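
To put the log above in perspective, the "Stopped after 2 iterations" times can simply be added up. The sketch below does that arithmetic; the object name `CollationLogSummary` is made up for illustration, and the figures are copied from the log. It counts only the measured iterations that the log reports, so any warm-up or setup time is not included, and the endsWith group is incomplete because the log ends before its UNICODE_CI case finishes.

```scala
object CollationLogSummary {
  // "Stopped after 2 iterations" times in ms, copied from the log above, in the
  // order UTF8_BINARY_LCASE, UNICODE, UTF8_BINARY, UNICODE_CI.
  // endsWith has no UNICODE_CI entry because the log ends before that case finished.
  val measuredMs: Seq[(String, Seq[Long])] = Seq(
    "equalsFunction" -> Seq(13865L, 8754L, 8723L, 92410L),
    "compareFunction" -> Seq(13838L, 92542L, 16151L, 97297L),
    "hashFunction" -> Seq(23280L, 373689L, 19852L, 318401L),
    "contains" -> Seq(66470L, 36300L, 40600L, 1821495L),
    "startsWith" -> Seq(65119L, 34889L, 39591L, 1777884L),
    "endsWith" -> Seq(65636L, 35204L, 39829L))

  // Sum over every case that appears in the log.
  val totalMs: Long = measuredMs.map(_._2.sum).sum

  // UNICODE_CI is the fourth entry of each group, where present.
  val unicodeCiMs: Long = measuredMs.flatMap(_._2.lift(3)).sum

  def main(args: Array[String]): Unit = {
    println(f"measured total: ${totalMs / 60000.0}%.1f min")
    println(f"UNICODE_CI:     ${unicodeCiMs / 60000.0}%.1f min (${100.0 * unicodeCiMs / totalMs}%.1f%%)")
  }
}
```

On these figures the measured iterations add up to roughly 85 minutes, of which about 68 minutes (around 80%) is spent in the UNICODE_CI cases, mostly contains and startsWith.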

