Posted to reviews@spark.apache.org by "GideonPotok (via GitHub)" <gi...@apache.org> on 2024/03/27 14:30:58 UTC

[PR] [SPARK-47413][SQL] [WIP - DONT REVIEW] - add support to substr for collations [spark]

GideonPotok opened a new pull request, #45738:
URL: https://github.com/apache/spark/pull/45738

   <!--
   Thanks for sending a pull request!  Here are some tips for you:
     1. If this is your first time, please read our contributor guidelines: https://spark.apache.org/contributing.html
     2. Ensure you have added or run the appropriate tests for your PR: https://spark.apache.org/developer-tools.html
     3. If the PR is unfinished, add '[WIP]' in your PR title, e.g., '[WIP][SPARK-XXXX] Your PR title ...'.
     4. Be sure to keep the PR description updated to reflect all changes.
     5. Please write your PR title to summarize what this PR proposes.
     6. If possible, provide a concise example to reproduce the issue for a faster review.
     7. If you want to add a new configuration, please read the guideline first for naming configurations in
        'core/src/main/scala/org/apache/spark/internal/config/ConfigEntry.scala'.
     8. If you want to add or modify an error type or message, please read the guideline first in
        'common/utils/src/main/resources/error/README.md'.
   -->
   
   ### What changes were proposed in this pull request?
   <!--
   Please clarify what changes you are proposing. The purpose of this section is to outline the changes and how this PR fixes the issue. 
   If possible, please consider writing useful notes for better and faster reviews in your PR. See the examples below.
     1. If you refactor some codes with changing classes, showing the class hierarchy will help reviewers.
     2. If you fix some SQL features, you can provide some references of other DBMSes.
     3. If there is design documentation, please add the link.
     4. If there is a discussion in the mailing list, please add the link.
   -->
   
   
   ### Why are the changes needed?
   <!--
   Please clarify why the changes are needed. For instance,
     1. If you propose a new API, clarify the use case for a new API.
     2. If you fix a bug, you can clarify why it is a bug.
   -->
   
   
   ### Does this PR introduce _any_ user-facing change?
   <!--
   Note that it means *any* user-facing change including all aspects such as the documentation fix.
   If yes, please clarify the previous behavior and the change this PR proposes - provide the console output, description and/or an example to show the behavior difference if possible.
   If possible, please also clarify if this is a user-facing change compared to the released Spark versions or within the unreleased branches such as master.
   If no, write 'No'.
   -->
   
   
   ### How was this patch tested?
   <!--
   If tests were added, say they were added here. Please make sure to add some test cases that check the changes thoroughly including negative and positive cases if possible.
   If it was tested in a way different from regular unit tests, please clarify how you tested step by step, ideally copy and paste-able, so that other reviewers can test and check, and descendants can verify in the future.
   If tests were not added, please describe why they were not added and/or why it was difficult to add.
   If benchmark tests were added, please run the benchmarks in GitHub Actions for the consistent environment, and the instructions could accord to: https://spark.apache.org/developer-tools.html#github-workflow-benchmarks.
   -->
   
   
   ### Was this patch authored or co-authored using generative AI tooling?
   <!--
   If generative AI tooling has been used in the process of authoring this patch, please include the
   phrase: 'Generated-by: ' followed by the name of the tool and its version.
   If no, write 'No'.
   Please refer to the [ASF Generative Tooling Guidance](https://www.apache.org/legal/generative-tooling.html) for details.
   -->
   



Re: [PR] [SPARK-47413][SQL] - add support to substr/left/right for collations [spark]

Posted by "uros-db (via GitHub)" <gi...@apache.org>.
uros-db commented on PR #45738:
URL: https://github.com/apache/spark/pull/45738#issuecomment-2048962946

   @GideonPotok You are correct, this refactor should not greatly affect your current PR in particular - I expect you'll only need to rework the tests a bit (shouldn't be too much work)
   
   Feel free to move over to the next ticket, and I'll let you know when and how to consolidate your code/design so we can move to final review and sign off on this PR



Re: [PR] [SPARK-47413][SQL] [WIP - DONT REVIEW] - add support to substr/left/right for collations [spark]

Posted by "GideonPotok (via GitHub)" <gi...@apache.org>.
GideonPotok commented on code in PR #45738:
URL: https://github.com/apache/spark/pull/45738#discussion_r1555192610


##########
sql/core/src/test/scala/org/apache/spark/sql/CollationStringExpressionsSuite.scala:
##########
@@ -86,6 +89,43 @@ class CollationStringExpressionsSuite extends QueryTest
     testRepeat("UNICODE_CI", 3, "abc", 2)
   }
 
+  test("substring check output type on explicitly collated string") {

Review Comment:
   Sorry @uros-db, I didn't realize you wanted me to add the additional edge cases to this suite specifically. Are you sure you want me to move _all_ the new tests to `CollationStringExpressionsSuite`?     




Re: [PR] [SPARK-47413][SQL] - add support to substr/left/right for collations [spark]

Posted by "GideonPotok (via GitHub)" <gi...@apache.org>.
GideonPotok commented on PR #45738:
URL: https://github.com/apache/spark/pull/45738#issuecomment-2046822456

   > did you try to run this suite locally and investigate any potential issues?
   
   @uros-db I would love to. First, though, I need to fix my setup: I have been putting off an issue where any unit tests in the Spark codebase that touch the local file system seem to hang indefinitely when run locally - any tests that use a `.noop` write, that use `CREATE TABLE... USING PARQUET`, or this suite in particular (`org.apache.spark.sql.execution.datasources.csv.CSVLegacyTimeParserSuite`). How do you recommend I investigate the root cause so I can fix my local setup?
   
    I usually run tests from sbt directly. I can provide additional setup details, e.g. my jvmopts and sbt JVM options. I have also done some cursory JVM performance profiling and can share some of those measurements.
   
   Let me know what to look into first. Thanks so much.



Re: [PR] [SPARK-47413][SQL] - add support to substr/left/right for collations [spark]

Posted by "GideonPotok (via GitHub)" <gi...@apache.org>.
GideonPotok commented on PR #45738:
URL: https://github.com/apache/spark/pull/45738#issuecomment-2053959041

   @ur.  



Re: [PR] [SPARK-47413][SQL] [WIP - DONT REVIEW] - add support to substr for collations [spark]

Posted by "uros-db (via GitHub)" <gi...@apache.org>.
uros-db commented on PR #45738:
URL: https://github.com/apache/spark/pull/45738#issuecomment-2039952629

   also, use `CollationStringExpressionsSuite` instead of `CollationSuite` for these tests



Re: [PR] [SPARK-47413][SQL] - add support to substr/left/right for collations [spark]

Posted by "GideonPotok (via GitHub)" <gi...@apache.org>.
GideonPotok commented on PR #45738:
URL: https://github.com/apache/spark/pull/45738#issuecomment-2048919774

   @uros-db this is ready for review.
   



Re: [PR] [SPARK-47413][SQL] [WIP - DONT REVIEW] - add support to substr/left/right for collations [spark]

Posted by "uros-db (via GitHub)" <gi...@apache.org>.
uros-db commented on code in PR #45738:
URL: https://github.com/apache/spark/pull/45738#discussion_r1554838890


##########
sql/core/src/test/scala/org/apache/spark/sql/CollationStringExpressionsSuite.scala:
##########
@@ -86,6 +89,43 @@ class CollationStringExpressionsSuite extends QueryTest
     testRepeat("UNICODE_CI", 3, "abc", 2)
   }
 
+  test("substring check output type on explicitly collated string") {

Review Comment:
   I think this will do it, thanks! Now just move all the tests over to `CollationStringExpressionsSuite`, re-run any failing CI checks, and update the PR title so we can open this up for review by Spark committers




Re: [PR] [SPARK-47413][SQL] [WIP - DONT REVIEW] - add support to substr for collations [spark]

Posted by "GideonPotok (via GitHub)" <gi...@apache.org>.
GideonPotok commented on PR #45738:
URL: https://github.com/apache/spark/pull/45738#issuecomment-2032029394

   @uros-db I think that did the trick! I will let you know when to review again. Thanks!



Re: [PR] [SPARK-47413][SQL] - add support to substr/left/right for collations [spark]

Posted by "GideonPotok (via GitHub)" <gi...@apache.org>.
GideonPotok commented on PR #45738:
URL: https://github.com/apache/spark/pull/45738#issuecomment-2048941620

   @uros-db No problem at all.
   
    If I understand your refactor correctly, my changes will basically either stay in the same place or move to the new `common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationSupport.java`, right?
   
   In the meantime I will take care of implementing https://issues.apache.org/jira/browse/SPARK-47412, and then put that on hold as well until after the refactor is merged. Does that sound good?



Re: [PR] [SPARK-47413][SQL] - add support to substr/left/right for collations [spark]

Posted by "GideonPotok (via GitHub)" <gi...@apache.org>.
GideonPotok closed pull request #45738: [SPARK-47413][SQL] - add support to substr/left/right for collations  
URL: https://github.com/apache/spark/pull/45738



Re: [PR] [SPARK-47413][SQL] [WIP - DONT REVIEW] - add support to substr for collations [spark]

Posted by "uros-db (via GitHub)" <gi...@apache.org>.
uros-db commented on PR #45738:
URL: https://github.com/apache/spark/pull/45738#issuecomment-2030699757

   since the "COMPLEX_EXPRESSION_UNSUPPORTED_INPUT.MISMATCHED_TYPES" problem seems to be in `Right.replacement` expression when doing `Literal(value, dataType)`, did you try something like `Literal(UTF8String.EMPTY_UTF8, str.dataType)` to match these datatypes?



Re: [PR] [SPARK-47413][SQL] - add support to substr/left/right for collations [spark]

Posted by "GideonPotok (via GitHub)" <gi...@apache.org>.
GideonPotok commented on PR #45738:
URL: https://github.com/apache/spark/pull/45738#issuecomment-2046941212

   > I don't think `CSVLegacyTimeParserSuite` is related to your changes, but it would probably be a very good idea to set up Maven so that you can run/debug all tests locally in general
   
   Same issue running tests locally with Maven. I always build with Maven, and when running tests with it (e.g. `build/mvn -Dtest=none -DwildcardSuites=org.apache.spark.sql.execution.datasources.csv.CSVLegacyTimeParserSuite test`), I encounter the same hang on anything that writes to the file system. I can keep looking into it.
   
   I agree that my changes are probably not what is causing `org.apache.spark.sql.execution.datasources.csv.CSVLegacyTimeParserSuite` to fail in GHA.
   



Re: [PR] [SPARK-47413][SQL] - add support to substr/left/right for collations [spark]

Posted by "uros-db (via GitHub)" <gi...@apache.org>.
uros-db commented on PR #45738:
URL: https://github.com/apache/spark/pull/45738#issuecomment-2046875237

   I don't think `CSVLegacyTimeParserSuite` is related to your changes, but it would probably be a very good idea to set up Maven so that you can run/debug all tests locally in general



Re: [PR] [SPARK-47413][SQL] - add support to substr/left/right for collations [spark]

Posted by "GideonPotok (via GitHub)" <gi...@apache.org>.
GideonPotok commented on PR #45738:
URL: https://github.com/apache/spark/pull/45738#issuecomment-2047328959

   @uros-db I got the file-writing tests to work locally once I simply ran `export SPARK_HOME=/Users/gideon/repos/spark` before running my Maven tests.
   
   More importantly, all GHA checks are passing, so this is ready for reviewers.
   



Re: [PR] [SPARK-47413][SQL] [WIP - DONT REVIEW] - add support to substr for collations [spark]

Posted by "uros-db (via GitHub)" <gi...@apache.org>.
uros-db commented on code in PR #45738:
URL: https://github.com/apache/spark/pull/45738#discussion_r1549911287


##########
sql/core/src/test/scala/org/apache/spark/sql/CollationStringExpressionsSuite.scala:
##########
@@ -86,6 +89,43 @@ class CollationStringExpressionsSuite extends QueryTest
     testRepeat("UNICODE_CI", 3, "abc", 2)
   }
 
+  test("substring check output type on explicitly collated string") {

Review Comment:
   try covering some more edge cases in these tests, such as empty strings, a mix of uppercase and lowercase, characters with different byte lengths, etc.
   
   a good example would be this PR: https://github.com/apache/spark/pull/45749
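   
   For illustration only (not code from this PR), a sketch of the kind of extra cases meant here, written in the style of the existing tests in this suite (`sql`, `checkAnswer`, and `Row` come from `QueryTest`; the collation names are just examples):
   
   ```
   // Hypothetical extra substr cases: empty input, upper/lowercase mix,
   // and characters with different UTF-8 byte lengths. The expected values
   // are what substr should return regardless of the collation used.
   Seq(
     ("", "UTF8_BINARY_LCASE", 1, 1, ""),
     ("ABCdef", "UNICODE_CI", 2, 3, "BCd"),
     ("äbćd€", "UNICODE", 2, 3, "bćd")
   ).foreach { case (input, collation, pos, len, expected) =>
     checkAnswer(
       sql(s"SELECT substr(collate('$input', '$collation'), $pos, $len)"),
       Row(expected))
   }
   ```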




Re: [PR] [SPARK-47413][SQL] [WIP - DONT REVIEW] - add support to substr for collations [spark]

Posted by "GideonPotok (via GitHub)" <gi...@apache.org>.
GideonPotok commented on code in PR #45738:
URL: https://github.com/apache/spark/pull/45738#discussion_r1553981850


##########
sql/core/src/test/scala/org/apache/spark/sql/CollationStringExpressionsSuite.scala:
##########
@@ -86,6 +89,43 @@ class CollationStringExpressionsSuite extends QueryTest
     testRepeat("UNICODE_CI", 3, "abc", 2)
   }
 
+  test("substring check output type on explicitly collated string") {

Review Comment:
   @uros-db Please see what I have since added and let me know whether to keep going, or whether we have enough!




Re: [PR] [SPARK-47413][SQL] - add support to substr/left/right for collations [spark]

Posted by "GideonPotok (via GitHub)" <gi...@apache.org>.
GideonPotok commented on PR #45738:
URL: https://github.com/apache/spark/pull/45738#issuecomment-2045605623

   @uros-db this would be ready for review, but the Spark SQL unit tests repeatedly fail in `CSVLegacyTimeParserSuite`, which I did not modify. I have rerun it a total of four times. Any idea how to get this check to pass?



Re: [PR] [SPARK-47413][SQL] [WIP - DONT REVIEW] - add support to substr for collations [spark]

Posted by "GideonPotok (via GitHub)" <gi...@apache.org>.
GideonPotok commented on PR #45738:
URL: https://github.com/apache/spark/pull/45738#issuecomment-2030513193

   @uros-db  
   
   I have some tests, and the substring implementation is basically passing them.
   However, my redefined implementations of `Left` and `Right` are failing my tests. They currently throw the following exception in my new test cases in CollationSuite.scala:
   
   ```
   [COMPLEX_EXPRESSION_UNSUPPORTED_INPUT.MISMATCHED_TYPES] Cannot process input data types for the expression: "(IF((1 <= 0), , substring(collate(klm), (- 1), 2147483647)))". All input types must be the same except nullable, containsNull, valueContainsNull flags, but found the input types ["STRING", "STRING COLLATE UTF8_BINARY_LCASE"]. SQLSTATE: 42K09
   ```
   
   I think it is caused by the overridden implementation of `replacement` within `Left` and `Right` in `stringExpressions.scala` needing additional work. Because the second parameter to `Literal` in the replacement is a concrete `DataType` rather than an `AbstractDataType`, the empty-string literal keeps plain `StringType` while the substring branch keeps the collated type of its input, and I think that mismatch is what is causing these tests to fail. What do you think? Please advise on how to debug this issue.
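   
   For reference, a rough, simplified sketch of the expression shape I mean (not the exact Spark code; `str`/`len` and the structure are assumed from `stringExpressions.scala`):
   
   ```
   import org.apache.spark.sql.catalyst.expressions._
   import org.apache.spark.sql.types.StringType
   import org.apache.spark.unsafe.types.UTF8String
   
   // Simplified sketch of (roughly) Right.replacement: the empty-string branch
   // is built with plain StringType, while the Substring branch carries the
   // collated type of str, so the If's branches disagree
   // ("STRING" vs "STRING COLLATE UTF8_BINARY_LCASE").
   def currentShape(str: Expression, len: Expression): Expression =
     If(
       LessThanOrEqual(len, Literal(0)),
       Literal(UTF8String.EMPTY_UTF8, StringType), // plain STRING
       new Substring(str, UnaryMinus(len)))        // keeps str's collation
   ```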



Re: [PR] [SPARK-47413][SQL] - add support to substr/left/right for collations [spark]

Posted by "GideonPotok (via GitHub)" <gi...@apache.org>.
GideonPotok commented on PR #45738:
URL: https://github.com/apache/spark/pull/45738#issuecomment-2048944931

   PS: Do you think changes such as these, which only touch the `inputTypes` and `replacement` implementations and do not rely on calling `UTF8String` or `CollationFactory`, will need to be modified? Even if they won't, it makes sense for me to hold off, but I would still like to know how you see this changing.
   
   Thank you, kind sir :) 



Re: [PR] [SPARK-47413][SQL] [WIP - DONT REVIEW] - add support to substr for collations [spark]

Posted by "GideonPotok (via GitHub)" <gi...@apache.org>.
GideonPotok commented on PR #45738:
URL: https://github.com/apache/spark/pull/45738#issuecomment-2034795190

   @uros-db what do you think of it? Anything else you want me to test for? I also did some ad-hoc testing of these functions in the Spark shell and PySpark shell; all looks good:
   
   ```
         ____              __
        / __/__  ___ _____/ /__
       _\ \/ _ \/ _ `/ __/  '_/
      /___/ .__/\_,_/_/ /_/\_\   version 4.0.0-SNAPSHOT
         /_/
            
   Using Scala version 2.13.13 (OpenJDK 64-Bit Server VM, Java 17.0.10)
   >....                                                                                                                                                                                                                                                                                                              
        |  import org.apache.spark.sql._
        |  import org.apache.spark.sql.functions._
        |  import org.apache.spark.unsafe.types.UTF8String
        |  val df = spark.range(1000 * 1000).toDF("id")
        | 
        |  val dff = df.withColumn("random_string", lit(UTF8String.fromString(Random.nextString(25))))
        |  dff.show(1, 200, true)
        |  val dfff = dff.withColumn("my_substring", substring(col("random_string"), 5, 5))
        |  .withColumn("my_leftstring", left(col("random_string"), lit(5)))
        |  .withColumn("my_rightstring", right(col("random_string"), lit(5)))
        |     .withColumn("my_concat", concat(col("my_substring"), col("my_leftstring"), col("my_rightstring")))
        | 
        |     dfff.show(1, 50, true)
   -RECORD 0----------------------------------------------------
    id            | 0                                           
    random_string | ᯻箲䨂䡶栣䌳벣ሰ‸ꠂ昊룃쉳믠䭐沂各䬙j穯೑顆첤鞅ల 
   only showing top 1 rows
   
   -RECORD 0-----------------------------------------------------
    id             | 0                                           
    random_string  | ᯻箲䨂䡶栣䌳벣ሰ‸ꠂ昊룃쉳믠䭐沂各䬙j穯೑顆첤鞅ల 
    my_substring   | 栣䌳벣ሰ‸                                    
    my_leftstring  | ᯻箲䨂䡶栣                                   
    my_rightstring | ೑顆첤鞅ల                                    
    my_concat      | 栣䌳벣ሰ‸᯻箲䨂䡶栣೑顆첤鞅ల                                
   only showing top 1 rows
   
   import scala.util.Random
   import org.apache.spark.sql._
   import org.apache.spark.sql.functions._
   import org.apache.spark.unsafe.types.UTF8String
   val df: org.apache.spark.sql.DataFrame = [id: bigint]
   val dff: org.apache.spark.sql.DataFrame = [id: bigint, random_string: string]
   val dfff: org.apache.spark.sql.DataFrame = [id: bigint, random_string: string ... 4 more fields]
   
   scala>    dfff.show(1, 50, true)
        |     dfff.registerTempTable("mytable")
        |     spark.sql("select * from mytable").show(1)
        |     spark.sqlContext.setConf("spark.sql.collation.enabled", "true")
        |     spark.sql("select COLLATION( random_string), COLLATION( my_substring), COLLATION( my_leftstring), COLLATION( my_rightstring), COLLATION( my_concat) from mytable").show(1)
   warning: 1 deprecation (since 2.0.0); for details, enable `:setting -deprecation` or `:replay -deprecation`
   -RECORD 0-----------------------------------------------------
    id             | 0                                           
    random_string  | ᯻箲䨂䡶栣䌳벣ሰ‸ꠂ昊룃쉳믠䭐沂各䬙j穯೑顆첤鞅ల 
    my_substring   | 栣䌳벣ሰ‸                                    
    my_leftstring  | ᯻箲䨂䡶栣                                   
    my_rightstring | ೑顆첤鞅ల                                    
    my_concat      | 栣䌳벣ሰ‸᯻箲䨂䡶栣೑顆첤鞅ల                                    
   only showing top 1 rows
   
   +---+---------------------------------+------------+-------------+--------------+-------------------------+
   | id|                    random_string|my_substring|my_leftstring|my_rightstring|                my_concat|
   +---+---------------------------------+------------+-------------+--------------+-------------------------+
   |  0|᯻箲䨂䡶栣䌳벣ሰ‸ꠂ昊룃쉳믠䭐沂各...|    栣䌳벣ሰ‸|    ᯻箲䨂䡶栣|      ೑顆첤鞅ల|栣䌳벣ሰ‸᯻箲䨂䡶栣೑顆첤鞅ల|
   +---+---------------------------------+------------+-------------+--------------+-------------------------+
   only showing top 1 rows
   
   +------------------------+-----------------------+------------------------+-------------------------+--------------------+
   |collation(random_string)|collation(my_substring)|collation(my_leftstring)|collation(my_rightstring)|collation(my_concat)|
   +------------------------+-----------------------+------------------------+-------------------------+--------------------+
   |             UTF8_BINARY|            UTF8_BINARY|             UTF8_BINARY|              UTF8_BINARY|         UTF8_BINARY|
   +------------------------+-----------------------+------------------------+-------------------------+--------------------+
   only showing top 1 rows
   
   
   scala>     spark.sql("select collate(my_concat, 'unicode') as unicode_myconcat, collate(my_concat, 'utf8_binary') as utf8_binary_myconcat, collate(my_concat, 'utf8_binary_lcase') as utf8_binary_lcase_myconcat, collate(my_concat, 'unicode_ci') as unicode_ci_myconcat from mytable").show(1)
        | 
   +-------------------------+-------------------------+--------------------------+-------------------------+
   |         unicode_myconcat|     utf8_binary_myconcat|utf8_binary_lcase_myconcat|      unicode_ci_myconcat|
   +-------------------------+-------------------------+--------------------------+-------------------------+
   |栣䌳벣ሰ‸᯻箲䨂䡶栣೑顆첤鞅ల|栣䌳벣ሰ‸᯻箲䨂䡶栣೑顆첤鞅ల| 栣䌳벣ሰ‸᯻箲䨂䡶栣೑顆첤鞅ల|栣䌳벣ሰ‸᯻箲䨂䡶栣೑顆첤鞅ల|
   +-------------------------+-------------------------+--------------------------+-------------------------+
   only showing top 1 rows
   
   
   scala>     spark.sql("select left(collate(my_concat, 'unicode'), 5) as unicode_myconcat, right(collate(my_concat, 'utf8_binary'), 5) as utf8_binary_myconcat, substr(collate(my_concat, 'utf8_binary_lcase'), 5, 5) as utf8_binary_lcase_myconcat, left(collate(my_concat, 'unicode_ci'), 1) as unicode_ci_myconcat
    from mytable").show(1)
        | 
   +----------------+--------------------+--------------------------+-------------------+
   |unicode_myconcat|utf8_binary_myconcat|utf8_binary_lcase_myconcat|unicode_ci_myconcat|
   +----------------+--------------------+--------------------------+-------------------+
   |        栣䌳벣ሰ‸|            ೑顆첤鞅ల|                  ‸᯻箲䨂䡶|                 栣|
   +----------------+--------------------+--------------------------+-------------------+
   only showing top 1 rows
   
   
   scala>      spark.sql("select COLLATION(left(collate(my_concat, 'unicode'), 5)) as unicode_myconcat, COLLATION(right(collate(my_concat, 'utf8_binary'), 5)) as utf8_binary_myconcat, COLLATION(substr(collate(my_concat, 'utf8_binary_lcase'), 5, 5)) as utf8_binary_lcase_myconcat, COLLATION(left(collate(my_conc
   at, 'unicode_ci'), 1)) as unicode_ci_myconcat from mytable").show(1)
        | 
   +----------------+--------------------+--------------------------+-------------------+
   |unicode_myconcat|utf8_binary_myconcat|utf8_binary_lcase_myconcat|unicode_ci_myconcat|
   +----------------+--------------------+--------------------------+-------------------+
   |         UNICODE|         UTF8_BINARY|         UTF8_BINARY_LCASE|         UNICODE_CI|
   +----------------+--------------------+--------------------------+-------------------+
   only showing top 1 rows
   
   
   scala>     spark.sql("select COLLATION(left(collate(my_concat, 'unicode'), 5)) as unicode_myconcat, COLLATION(right(collate(my_concat, 'utf8_binary'), 5)) as utf8_binary_myconcat, COLLATION(substr(collate(my_concat, 'utf8_binary_lcase'), 5, 5)) as utf8_binary_lcase_myconcat, COLLATION(left(collate(my_conca
   t, 'unicode_ci'), 1)) as unicode_ci_myconcat from mytable").write 
   val res5: org.apache.spark.sql.DataFrameWriter[org.apache.spark.sql.Row] = org.apache.spark.sql.DataFrameWriter@6dd31bab
   
   scala>     spark.sql("select left(collate(my_concat, 'unicode'), 5) as unicode_myconcat, right(collate(my_concat, 'utf8_binary'), 5) as utf8_binary_myconcat, substr(collate(my_concat, 'utf8_binary_lcase'), 5, 5) as utf8_binary_lcase_myconcat, left(collate(my_concat, 'unicode_ci'), 1) as unicode_ci_myconcat
    from mytable")
        |     .write.mode("overwrite").parquet("mytable.parquet")
   
   24/04/02 10:10:24 WARN MemoryManager: Total allocation exceeds 95.00% (1,020,054,720 bytes) of heap memory 
   scala> 
   
   scala>     val readdf = spark.read.parquet("mytable.parquet")
        | 
   val readdf: org.apache.spark.sql.DataFrame = [unicode_myconcat: string collate UNICODE, utf8_binary_myconcat: string ... 2 more fields]
   
   scala>     readdf.show(1, 50, true)
        | 
   -RECORD 0------------------------------
    unicode_myconcat           | 栣䌳벣ሰ‸ 
    utf8_binary_myconcat       | ೑顆첤鞅ల 
    utf8_binary_lcase_myconcat | ‸᯻箲䨂䡶 
    unicode_ci_myconcat        | 栣    
   only showing top 1 rows
   
   
   scala>     readdf.createOrReplaceTempView("mytable2")
        | 
   
   scala>         spark.sql("select COLLATION(unicode_myconcat), COLLATION(utf8_binary_myconcat), COLLATION(utf8_binary_lcase_myconcat), COLLATION(unicode_ci_myconcat) from mytable2").show(1)
        | 
   +---------------------------+-------------------------------+-------------------------------------+------------------------------+
   |collation(unicode_myconcat)|collation(utf8_binary_myconcat)|collation(utf8_binary_lcase_myconcat)|collation(unicode_ci_myconcat)|
   +---------------------------+-------------------------------+-------------------------------------+------------------------------+
   |                    UNICODE|                    UTF8_BINARY|                    UTF8_BINARY_LCASE|                    UNICODE_CI|
   +---------------------------+-------------------------------+-------------------------------------+------------------------------+
   only showing top 1 rows
   
   
   scala> 
   :quit
   gideon@Gideon's MacBook Pro spark % ./bin/pyspark 
   Python 3.9.6 (default, Feb  3 2024, 15:58:27) 
   [Clang 15.0.0 (clang-1500.3.9.4)] on darwin
   >>> df = spark.range(1000 * 1000).toDF("id")
   >>> import pyspark.sql.functions as F
   >>> import random
   >>> import string
   >>> dff = df.withColumn("random_string", F.lit("".join([random.choice(string.ascii_letters) for _ in range(25)])))
   >>> dff.show(1, 100, True)
   -RECORD 0----------------------------------
    id            | 0                         
    random_string | fWuOlqjSthNLbHQUuxGizXuKX 
   only showing top 1 rows
   
   >>> dfff = dff.withColumn("my_substring", F.substring(F.col("random_string"), 5, 5)) \
   ...     .withColumn("my_leftstring", F.left(F.col("random_string"),  F.lit(5))) \
   ...     .withColumn("my_rightstring", F.right(F.col("random_string"), F.lit(5))) \
   ...     .withColumn("my_concat", F.concat(F.col("my_substring"), F.col("my_leftstring"), F.col("my_rightstring")))
   >>> 
   >>> dfff.show(10, 100, True)
   -RECORD 0-----------------------------------
    id             | 0                         
    random_string  | fWuOlqjSthNLbHQUuxGizXuKX 
    my_substring   | lqjSt                     
    my_leftstring  | fWuOl                     
    my_rightstring | zXuKX                     
    my_concat      | lqjStfWuOlzXuKX           
   only showing top 1 rows
   
   >>> dfff.createOrReplaceTempView("mytable")
   >>> spark.sql("select * from mytable").show(1)
   +---+--------------------+------------+-------------+--------------+---------------+
   | id|       random_string|my_substring|my_leftstring|my_rightstring|      my_concat|
   +---+--------------------+------------+-------------+--------------+---------------+
   |  0|fWuOlqjSthNLbHQUu...|       lqjSt|        fWuOl|         zXuKX|lqjStfWuOlzXuKX|
   +---+--------------------+------------+-------------+--------------+---------------+
   only showing top 1 rows
   
   >>> spark.sql("select COLLATION( random_string), COLLATION( my_substring), COLLATION( my_leftstring), COLLATION( my_rightstring), COLLATION( my_concat) from mytable").show()
   Traceback (most recent call last):
     File "<stdin>", line 1, in <module>
     File "/Users/gideon/repos/spark/python/pyspark/sql/session.py", line 1711, in sql
       return DataFrame(self._jsparkSession.sql(sqlQuery, litArgs), self)
     File "/Users/gideon/repos/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py", line 1322, in __call__
     File "/Users/gideon/repos/spark/python/pyspark/errors/exceptions/captured.py", line 221, in deco
       raise converted from None
   pyspark.errors.exceptions.captured.AnalysisException: [UNSUPPORTED_FEATURE.COLLATION] The feature is not supported: Collation is not yet supported. SQLSTATE: 0A000;
   Project [collation(random_string#4) AS collation(random_string)#107, collation(my_substring#29) AS collation(my_substring)#108, collation(my_leftstring#33) AS collation(my_leftstring)#109, collation(my_rightstring#38) AS collation(my_rightstring)#110, collation(my_concat#44) AS collation(my_concat)#111]
   +- SubqueryAlias mytable
      +- View (`mytable`, [id#2L, random_string#4, my_substring#29, my_leftstring#33, my_rightstring#38, my_concat#44])
         +- Project [id#2L, random_string#4, my_substring#29, my_leftstring#33, my_rightstring#38, concat(my_substring#29, my_leftstring#33, my_rightstring#38) AS my_concat#44]
            +- Project [id#2L, random_string#4, my_substring#29, my_leftstring#33, right(random_string#4, 5) AS my_rightstring#38]
               +- Project [id#2L, random_string#4, my_substring#29, left(random_string#4, 5) AS my_leftstring#33]
                  +- Project [id#2L, random_string#4, substring(random_string#4, 5, 5) AS my_substring#29]
                     +- Project [id#2L, fWuOlqjSthNLbHQUuxGizXuKX AS random_string#4]
                        +- Project [id#0L AS id#2L]
                           +- Range (0, 1000000, step=1, splits=Some(16))    
   >>> spark.conf.set("spark.sql.collation.enabled", "true")
   >>> spark.sql("select COLLATION( random_string), COLLATION( my_substring), COLLATION( my_leftstring), COLLATION( my_rightstring), COLLATION( my_concat) from mytable").show(1)
   +------------------------+-----------------------+------------------------+-------------------------+--------------------+
   |collation(random_string)|collation(my_substring)|collation(my_leftstring)|collation(my_rightstring)|collation(my_concat)|
   +------------------------+-----------------------+------------------------+-------------------------+--------------------+
   |             UTF8_BINARY|            UTF8_BINARY|             UTF8_BINARY|              UTF8_BINARY|         UTF8_BINARY|
   +------------------------+-----------------------+------------------------+-------------------------+--------------------+
   only showing top 1 rows
   ``` 



Re: [PR] [SPARK-47413][SQL] [WIP - DONT REVIEW] - add support to substr/left/right for collations [spark]

Posted by "uros-db (via GitHub)" <gi...@apache.org>.
uros-db commented on code in PR #45738:
URL: https://github.com/apache/spark/pull/45738#discussion_r1555227674


##########
sql/core/src/test/scala/org/apache/spark/sql/CollationStringExpressionsSuite.scala:
##########
@@ -86,6 +89,43 @@ class CollationStringExpressionsSuite extends QueryTest
     testRepeat("UNICODE_CI", 3, "abc", 2)
   }
 
+  test("substring check output type on explicitly collated string") {

Review Comment:
   yeah, let's put all of these tests that are specific to string functions into the separate suite `CollationStringExpressionsSuite`
   
   this way, `CollationSuite` won't get cluttered as we continue adding support for more and more string expressions




Re: [PR] [SPARK-47413][SQL] - add support to substr/left/right for collations [spark]

Posted by "uros-db (via GitHub)" <gi...@apache.org>.
uros-db commented on PR #45738:
URL: https://github.com/apache/spark/pull/45738#issuecomment-2048923852

   @GideonPotok nice work, thanks!
   
   Heads up though: we will soon be finishing some code refactoring related to collation-aware string expression support ([SPARK-47410](https://issues.apache.org/jira/browse/SPARK-47410)), and this PR will likely need to be reworked a bit to comply with the new design before proceeding to final review
   
   let's put this PR on hold for now, and I'll ping you when we're ready to move on



Re: [PR] [SPARK-47413][SQL] - add support to substr/left/right for collations [spark]

Posted by "uros-db (via GitHub)" <gi...@apache.org>.
uros-db commented on PR #45738:
URL: https://github.com/apache/spark/pull/45738#issuecomment-2046317362

   did you try to run this suite locally and investigate any potential issues?

