Posted to reviews@spark.apache.org by "dbatomic (via GitHub)" <gi...@apache.org> on 2024/03/06 11:38:41 UTC

[PR] [SPARK-47302][SQL][Collation] Collate key word as identifier [spark]

dbatomic opened a new pull request, #45405:
URL: https://github.com/apache/spark/pull/45405

### What changes were proposed in this pull request?

With this change we move away from using collation names as string literals and start treating them as identifiers, since that is part of the SQL standard.

Collation names are marked as multipart identifiers because, in the future, we will want to introduce user-defined collations, which can be part of nested namespaces in a catalog.
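
To illustrate the syntax change, here is a minimal sketch that rewrites the old string-literal form into the new identifier form. The helper name `to_identifier_syntax` and its regular expression are assumptions made for this illustration; they are not part of the patch.

```python
import re

# Hypothetical helper (not part of Spark): rewrite COLLATE '<name>' to COLLATE <name>.
_OLD_COLLATE = re.compile(r"COLLATE\s*'(\w+)'")


def to_identifier_syntax(type_string: str) -> str:
    """Replace a quoted collation name with the bare identifier form."""
    return _OLD_COLLATE.sub(r"COLLATE \1", type_string)


print(to_identifier_syntax("string COLLATE 'UCS_BASIC'"))  # string COLLATE UCS_BASIC
print(to_identifier_syntax("string COLLATE'UNICODE_CI'"))   # string COLLATE UNICODE_CI
print(to_identifier_syntax("string"))                       # string
```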

### Why are the changes needed?

Aligning with the SQL standard on collation syntax.

### Does this PR introduce _any_ user-facing change?

Yes. Collations are still not a released feature.

### How was this patch tested?

Existing tests are used.

### Was this patch authored or co-authored using generative AI tooling?

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47302][SQL][Collation] Collate key word as identifier [spark]

Posted by "uros-db (via GitHub)" <gi...@apache.org>.

uros-db commented on code in PR #45405:
URL: https://github.com/apache/spark/pull/45405#discussion_r1515603380


##########
sql/api/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBaseParser.g4:
##########
@@ -1096,7 +1096,7 @@ colPosition
     ;
 
 collateClause
-    : COLLATE collationName=stringLit
+    : COLLATE collationName=multipartIdentifier

Review Comment:
   @cloud-fan related to your comment, I'm just wondering what would be a better rule for this?




Re: [PR] [SPARK-47302][SQL][Collation] Collate key word as identifier [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.

cloud-fan commented on code in PR #45405:
URL: https://github.com/apache/spark/pull/45405#discussion_r1514563440


##########
sql/api/src/main/scala/org/apache/spark/sql/catalyst/parser/DataTypeAstBuilder.scala:
##########
@@ -218,6 +218,6 @@ class DataTypeAstBuilder extends SqlBaseParserBaseVisitor[AnyRef] {
    * Returns a collation name.
    */
   override def visitCollateClause(ctx: CollateClauseContext): String = withOrigin(ctx) {
-    string(visitStringLit(ctx.stringLit))
+    ctx.multipartIdentifier().getText

Review Comment:
   This is a bit confusing. How do we turn a multipart identifier into a single string? By joining the parts with dots? Where is the implementation?




Re: [PR] [SPARK-47302][SQL][Collation] Collate keyword as identifier [spark]

Posted by "dbatomic (via GitHub)" <gi...@apache.org>.

dbatomic commented on code in PR #45405:
URL: https://github.com/apache/spark/pull/45405#discussion_r1515983493


##########
sql/api/src/main/scala/org/apache/spark/sql/catalyst/parser/DataTypeAstBuilder.scala:
##########
@@ -218,6 +218,6 @@ class DataTypeAstBuilder extends SqlBaseParserBaseVisitor[AnyRef] {
    * Returns a collation name.
    */
   override def visitCollateClause(ctx: CollateClauseContext): String = withOrigin(ctx) {
-    string(visitStringLit(ctx.stringLit))
+    ctx.multipartIdentifier().getText

Review Comment:
   Yeah, let's keep this as an identifier for now. In the future, the collation name may become a multipart identifier, e.g. once we introduce support for user-defined collations that are part of a catalog.
   
   But for now, the collation name is just an identifier.




Re: [PR] [SPARK-47302][SQL] Collate keyword as identifier [spark]

Posted by "MaxGekk (via GitHub)" <gi...@apache.org>.

MaxGekk commented on code in PR #45405:
URL: https://github.com/apache/spark/pull/45405#discussion_r1517702464


##########
sql/api/src/main/scala/org/apache/spark/sql/catalyst/parser/DataTypeAstBuilder.scala:
##########
@@ -218,6 +218,6 @@ class DataTypeAstBuilder extends SqlBaseParserBaseVisitor[AnyRef] {
    * Returns a collation name.
    */
   override def visitCollateClause(ctx: CollateClauseContext): String = withOrigin(ctx) {
-    string(visitStringLit(ctx.stringLit))
+    ctx.identifier.getText

Review Comment:
   @cloud-fan the identifier was changed to a single-part one. I guess the parser catches just `a`.




Re: [PR] [SPARK-47302][SQL][Collation] Collate keyword as identifier [spark]

Posted by "MaxGekk (via GitHub)" <gi...@apache.org>.

MaxGekk commented on code in PR #45405:
URL: https://github.com/apache/spark/pull/45405#discussion_r1516378011


##########
sql/api/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBaseParser.g4:
##########
@@ -1096,7 +1096,7 @@ colPosition
     ;
 
 collateClause
-    : COLLATE collationName=stringLit
+    : COLLATE collationName=identifier

Review Comment:
   BTW, I would improve the error message to:
   ```
   [COLLATION_INVALID_NAME] `test_collation` does not represent a correct collation name. Suggested valid collation name: UCS_BASIC. SQLSTATE: 42704
   
   ```




Re: [PR] [SPARK-47302][SQL] Collate keyword as identifier [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.

cloud-fan commented on code in PR #45405:
URL: https://github.com/apache/spark/pull/45405#discussion_r1517662569


##########
sql/api/src/main/scala/org/apache/spark/sql/catalyst/parser/DataTypeAstBuilder.scala:
##########
@@ -218,6 +218,6 @@ class DataTypeAstBuilder extends SqlBaseParserBaseVisitor[AnyRef] {
    * Returns a collation name.
    */
   override def visitCollateClause(ctx: CollateClauseContext): String = withOrigin(ctx) {
-    string(visitStringLit(ctx.stringLit))
+    ctx.identifier.getText

Review Comment:
   For `COLLATE a . b`, do we return `a . b` or `a.b`?
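
   For reference, ANTLR's `getText()` on a rule context joins only the tokens that are part of the parse tree, so whitespace routed to a hidden channel does not appear in the result. A toy model of that behaviour, assuming whitespace is the only hidden token (the function `get_text` is made up for illustration and is not ANTLR itself):

```python
import re

# Toy model only: split the input into whitespace, dot, and word tokens.
_TOKEN = re.compile(r"\s+|\.|\w+")


def get_text(source: str) -> str:
    """Join the visible tokens, skipping whitespace as a hidden channel would."""
    tokens = _TOKEN.findall(source)
    return "".join(t for t in tokens if not t.isspace())


print(get_text("a . b"))  # a.b
print(get_text("a.b"))    # a.b
```

   Under that assumption, `COLLATE a . b` with a multipart rule would yield `a.b`.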




Re: [PR] [SPARK-47302][SQL][Collation] Collate keyword as identifier [spark]

Posted by "stefankandic (via GitHub)" <gi...@apache.org>.

stefankandic commented on code in PR #45405:
URL: https://github.com/apache/spark/pull/45405#discussion_r1516357243


##########
sql/api/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBaseParser.g4:
##########
@@ -1096,7 +1096,7 @@ colPosition
     ;
 
 collateClause
-    : COLLATE collationName=stringLit
+    : COLLATE collationName=identifier

Review Comment:
   Why would we want to capture `BLA_BLA-1`? AFAIK the symbol that immediately follows can be an expression, and thus `errorCapturingIdentifier` should not be used, right?




Re: [PR] [SPARK-47302][SQL][Collation] Collate key word as identifier [spark]

Posted by "uros-db (via GitHub)" <gi...@apache.org>.

uros-db commented on code in PR #45405:
URL: https://github.com/apache/spark/pull/45405#discussion_r1515587070


##########
python/pyspark/sql/tests/test_types.py:
##########
@@ -862,15 +862,13 @@ def test_parse_datatype_string(self):
             if k != "varchar" and k != "char":
                 self.assertEqual(t(), _parse_datatype_string(k))
         self.assertEqual(IntegerType(), _parse_datatype_string("int"))
-        self.assertEqual(StringType(), _parse_datatype_string("string COLLATE 'UCS_BASIC'"))
+        self.assertEqual(StringType(), _parse_datatype_string("string COLLATE UCS_BASIC"))
         self.assertEqual(StringType(0), _parse_datatype_string("string"))
-        self.assertEqual(StringType(0), _parse_datatype_string("string COLLATE 'UCS_BASIC'"))
-        self.assertEqual(StringType(0), _parse_datatype_string("string   COLLATE 'UCS_BASIC'"))
-        self.assertEqual(StringType(0), _parse_datatype_string("string COLLATE'UCS_BASIC'"))
-        self.assertEqual(StringType(1), _parse_datatype_string("string COLLATE 'UCS_BASIC_LCASE'"))
-        self.assertEqual(StringType(1), _parse_datatype_string("string COLLATE 'UCS_BASIC_LCASE'"))
-        self.assertEqual(StringType(2), _parse_datatype_string("string COLLATE 'UNICODE'"))
-        self.assertEqual(StringType(3), _parse_datatype_string("string COLLATE 'UNICODE_CI'"))
+        self.assertEqual(StringType(0), _parse_datatype_string("string COLLATE UCS_BASIC"))

Review Comment:
   perhaps that would be best as a separate change? this one seems already scattered enough




Re: [PR] [SPARK-47302][SQL][Collation] Collate keyword as identifier [spark]

Posted by "MaxGekk (via GitHub)" <gi...@apache.org>.

MaxGekk commented on code in PR #45405:
URL: https://github.com/apache/spark/pull/45405#discussion_r1516378011


##########
sql/api/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBaseParser.g4:
##########
@@ -1096,7 +1096,7 @@ colPosition
     ;
 
 collateClause
-    : COLLATE collationName=stringLit
+    : COLLATE collationName=identifier

Review Comment:
   BTW, I would improve the error message to:
   ```
   [COLLATION_INVALID_NAME] `test_collation` does not represent a correct collation name. Suggested valid collation name: UCS_BASIC. SQLSTATE: 42704
   ```
   and attach a query context like in:
   ```
   spark-sql (default)> select 'aaa' from test-table;
   
   [INVALID_IDENTIFIER] The identifier test-table is invalid. Please, consider quoting it with back-quotes as `test-table`. SQLSTATE: 42602 (line 1, pos 22)
   
   == SQL ==
   select 'aaa' from test-table
   ----------------------^^^
   ```
   @dbatomic Could you open a JIRA for that, please?




Re: [PR] [SPARK-47302][SQL] Collate keyword as identifier [spark]

Posted by "MaxGekk (via GitHub)" <gi...@apache.org>.

MaxGekk commented on PR #45405:
URL: https://github.com/apache/spark/pull/45405#issuecomment-1985446548

   +1, LGTM. Merging to master.
   Thank you, @dbatomic and @stefankandic @srielau @cloud-fan @uros-db for review.



Re: [PR] [SPARK-47302][SQL][Collation] Collate keyword as identifier [spark]

Posted by "stefankandic (via GitHub)" <gi...@apache.org>.

stefankandic commented on code in PR #45405:
URL: https://github.com/apache/spark/pull/45405#discussion_r1516396458


##########
sql/api/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBaseParser.g4:
##########
@@ -1096,7 +1096,7 @@ colPosition
     ;
 
 collateClause
-    : COLLATE collationName=stringLit
+    : COLLATE collationName=identifier

Review Comment:
   what about this case:
   ```code
   select 'a' collate UNICODE-columnB
   ```
   
   shouldn't we report that the types are incompatible for minus operation?




Re: [PR] [SPARK-47302][SQL][Collation] Collate key word as identifier [spark]

Posted by "uros-db (via GitHub)" <gi...@apache.org>.

uros-db commented on code in PR #45405:
URL: https://github.com/apache/spark/pull/45405#discussion_r1515603380


##########
sql/api/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBaseParser.g4:
##########
@@ -1096,7 +1096,7 @@ colPosition
     ;
 
 collateClause
-    : COLLATE collationName=stringLit
+    : COLLATE collationName=multipartIdentifier

Review Comment:
   @cloud-fan related to your comment, I'm just wondering what would be a better rule for this (if any)?




Re: [PR] [SPARK-47302][SQL][Collation] Collate keyword as identifier [spark]

Posted by "MaxGekk (via GitHub)" <gi...@apache.org>.

MaxGekk commented on code in PR #45405:
URL: https://github.com/apache/spark/pull/45405#discussion_r1516062859


##########
sql/api/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBaseParser.g4:
##########
@@ -1096,7 +1096,7 @@ colPosition
     ;
 
 collateClause
-    : COLLATE collationName=stringLit
+    : COLLATE collationName=identifier

Review Comment:
   BTW, why not `errorCapturingIdentifier`? Do you have a test for `... COLLATE BLA_BLA-1`?




Re: [PR] [SPARK-47302][SQL] Collate keyword as identifier [spark]

Posted by "MaxGekk (via GitHub)" <gi...@apache.org>.

MaxGekk commented on code in PR #45405:
URL: https://github.com/apache/spark/pull/45405#discussion_r1517702464


##########
sql/api/src/main/scala/org/apache/spark/sql/catalyst/parser/DataTypeAstBuilder.scala:
##########
@@ -218,6 +218,6 @@ class DataTypeAstBuilder extends SqlBaseParserBaseVisitor[AnyRef] {
    * Returns a collation name.
    */
   override def visitCollateClause(ctx: CollateClauseContext): String = withOrigin(ctx) {
-    string(visitStringLit(ctx.stringLit))
+    ctx.identifier.getText

Review Comment:
   @cloud-fan the identifier was changed to a single-part one. I guess the parser catches just `a`.




Re: [PR] [SPARK-47302][SQL][Collation] Collate key word as identifier [spark]

Posted by "uros-db (via GitHub)" <gi...@apache.org>.

uros-db commented on code in PR #45405:
URL: https://github.com/apache/spark/pull/45405#discussion_r1515587070


##########
python/pyspark/sql/tests/test_types.py:
##########
@@ -862,15 +862,13 @@ def test_parse_datatype_string(self):
             if k != "varchar" and k != "char":
                 self.assertEqual(t(), _parse_datatype_string(k))
         self.assertEqual(IntegerType(), _parse_datatype_string("int"))
-        self.assertEqual(StringType(), _parse_datatype_string("string COLLATE 'UCS_BASIC'"))
+        self.assertEqual(StringType(), _parse_datatype_string("string COLLATE UCS_BASIC"))
         self.assertEqual(StringType(0), _parse_datatype_string("string"))
-        self.assertEqual(StringType(0), _parse_datatype_string("string COLLATE 'UCS_BASIC'"))
-        self.assertEqual(StringType(0), _parse_datatype_string("string   COLLATE 'UCS_BASIC'"))
-        self.assertEqual(StringType(0), _parse_datatype_string("string COLLATE'UCS_BASIC'"))
-        self.assertEqual(StringType(1), _parse_datatype_string("string COLLATE 'UCS_BASIC_LCASE'"))
-        self.assertEqual(StringType(1), _parse_datatype_string("string COLLATE 'UCS_BASIC_LCASE'"))
-        self.assertEqual(StringType(2), _parse_datatype_string("string COLLATE 'UNICODE'"))
-        self.assertEqual(StringType(3), _parse_datatype_string("string COLLATE 'UNICODE_CI'"))
+        self.assertEqual(StringType(0), _parse_datatype_string("string COLLATE UCS_BASIC"))

Review Comment:
   perhaps that would be best as a separate change? this one seems already scattered enough, and I think there's plenty of other places that may require changes w/ respect to naming




Re: [PR] [SPARK-47302][SQL][Collation] Collate key word as identifier [spark]

Posted by "srielau (via GitHub)" <gi...@apache.org>.

srielau commented on code in PR #45405:
URL: https://github.com/apache/spark/pull/45405#discussion_r1514694683


##########
python/pyspark/sql/tests/test_types.py:
##########
@@ -862,15 +862,13 @@ def test_parse_datatype_string(self):
             if k != "varchar" and k != "char":
                 self.assertEqual(t(), _parse_datatype_string(k))
         self.assertEqual(IntegerType(), _parse_datatype_string("int"))
-        self.assertEqual(StringType(), _parse_datatype_string("string COLLATE 'UCS_BASIC'"))
+        self.assertEqual(StringType(), _parse_datatype_string("string COLLATE UCS_BASIC"))
         self.assertEqual(StringType(0), _parse_datatype_string("string"))
-        self.assertEqual(StringType(0), _parse_datatype_string("string COLLATE 'UCS_BASIC'"))
-        self.assertEqual(StringType(0), _parse_datatype_string("string   COLLATE 'UCS_BASIC'"))
-        self.assertEqual(StringType(0), _parse_datatype_string("string COLLATE'UCS_BASIC'"))
-        self.assertEqual(StringType(1), _parse_datatype_string("string COLLATE 'UCS_BASIC_LCASE'"))
-        self.assertEqual(StringType(1), _parse_datatype_string("string COLLATE 'UCS_BASIC_LCASE'"))
-        self.assertEqual(StringType(2), _parse_datatype_string("string COLLATE 'UNICODE'"))
-        self.assertEqual(StringType(3), _parse_datatype_string("string COLLATE 'UNICODE_CI'"))
+        self.assertEqual(StringType(0), _parse_datatype_string("string COLLATE UCS_BASIC"))

Review Comment:
   Silly question, isn't this a good time to switch from UCS_BASIC to UTF8_BINARY?




Re: [PR] [SPARK-47302][SQL][Collation] Collate keyword as identifier [spark]

Posted by "stefankandic (via GitHub)" <gi...@apache.org>.

stefankandic commented on code in PR #45405:
URL: https://github.com/apache/spark/pull/45405#discussion_r1516396458


##########
sql/api/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBaseParser.g4:
##########
@@ -1096,7 +1096,7 @@ colPosition
     ;
 
 collateClause
-    : COLLATE collationName=stringLit
+    : COLLATE collationName=identifier

Review Comment:
   what about this case:
   ```code
   select 'a' collate UNICODE-columnB
   ```
   
   then the error would be:
   ```code
   [INVALID_IDENTIFIER] The identifier UNICODE-columnB is invalid. Please, consider quoting it with back-quotes as `UNICODE-columnB`. SQLSTATE: 42602
   ```
   
   instead of giving the error that types are incompatible for minus operation




Re: [PR] [SPARK-47302][SQL][Collation] Collate keyword as identifier [spark]

Posted by "MaxGekk (via GitHub)" <gi...@apache.org>.

MaxGekk commented on code in PR #45405:
URL: https://github.com/apache/spark/pull/45405#discussion_r1516070311


##########
sql/api/src/main/scala/org/apache/spark/sql/types/DataType.scala:
##########
@@ -117,7 +117,7 @@ object DataType {
   private val FIXED_DECIMAL = """decimal\(\s*(\d+)\s*,\s*(\-?\d+)\s*\)""".r
   private val CHAR_TYPE = """char\(\s*(\d+)\s*\)""".r
   private val VARCHAR_TYPE = """varchar\(\s*(\d+)\s*\)""".r
-  private val COLLATED_STRING_TYPE = """string\s+COLLATE\s+'([\w_]+)'""".r
+  private val COLLATED_STRING_TYPE = """string\s+COLLATE\s+([\w_]+)""".r

Review Comment:
   When you define `collationName=identifier` where `identifier` can be:
   ```
   quotedIdentifier
       : BACKQUOTED_IDENTIFIER
       | {double_quoted_identifiers}? DOUBLEQUOTED_STRING
       ;
   ``` 
   but `\w` captures only `[a-zA-Z_0-9]`. Is it on purpose, and what happens if a user passes the collation name as
   ```
   `UCS_BASIC_LCASE` or
   "UCS_BASIC_LCASE"
   ``` 
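
   A minimal sketch of the concern, using a Python port of the new pattern from the diff above. It assumes the pattern is applied as a whole-string match, as Scala regex extractors do by default; the function `collation_name` is hypothetical and only serves the illustration.

```python
import re

# Python port of the new Scala pattern (unquoted collation name).
COLLATED_STRING_TYPE = re.compile(r"string\s+COLLATE\s+([\w_]+)")


def collation_name(type_string):
    """Return the captured collation name, or None if the whole string does not match."""
    m = COLLATED_STRING_TYPE.fullmatch(type_string)
    return m.group(1) if m else None


print(collation_name("string COLLATE UCS_BASIC_LCASE"))    # UCS_BASIC_LCASE
print(collation_name("string COLLATE `UCS_BASIC_LCASE`"))  # None
print(collation_name('string COLLATE "UCS_BASIC_LCASE"'))  # None
```

   Under these assumptions, the back-quoted and double-quoted forms are not recognized by the regex even though the grammar rule accepts them.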




Re: [PR] [SPARK-47302][SQL] Collate keyword as identifier [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.

cloud-fan commented on code in PR #45405:
URL: https://github.com/apache/spark/pull/45405#discussion_r1517829360


##########
sql/api/src/main/scala/org/apache/spark/sql/catalyst/parser/DataTypeAstBuilder.scala:
##########
@@ -218,6 +218,6 @@ class DataTypeAstBuilder extends SqlBaseParserBaseVisitor[AnyRef] {
    * Returns a collation name.
    */
   override def visitCollateClause(ctx: CollateClauseContext): String = withOrigin(ctx) {
-    string(visitStringLit(ctx.stringLit))
+    ctx.identifier.getText

Review Comment:
   oh I missed the new update to change to single part. nvm




Re: [PR] [SPARK-47302][SQL][Collation] Collate keyword as identifier [spark]

Posted by "dbatomic (via GitHub)" <gi...@apache.org>.

dbatomic commented on code in PR #45405:
URL: https://github.com/apache/spark/pull/45405#discussion_r1515984970


##########
python/pyspark/sql/tests/test_types.py:
##########
@@ -862,15 +862,13 @@ def test_parse_datatype_string(self):
             if k != "varchar" and k != "char":
                 self.assertEqual(t(), _parse_datatype_string(k))
         self.assertEqual(IntegerType(), _parse_datatype_string("int"))
-        self.assertEqual(StringType(), _parse_datatype_string("string COLLATE 'UCS_BASIC'"))
+        self.assertEqual(StringType(), _parse_datatype_string("string COLLATE UCS_BASIC"))
         self.assertEqual(StringType(0), _parse_datatype_string("string"))
-        self.assertEqual(StringType(0), _parse_datatype_string("string COLLATE 'UCS_BASIC'"))
-        self.assertEqual(StringType(0), _parse_datatype_string("string   COLLATE 'UCS_BASIC'"))
-        self.assertEqual(StringType(0), _parse_datatype_string("string COLLATE'UCS_BASIC'"))
-        self.assertEqual(StringType(1), _parse_datatype_string("string COLLATE 'UCS_BASIC_LCASE'"))
-        self.assertEqual(StringType(1), _parse_datatype_string("string COLLATE 'UCS_BASIC_LCASE'"))
-        self.assertEqual(StringType(2), _parse_datatype_string("string COLLATE 'UNICODE'"))
-        self.assertEqual(StringType(3), _parse_datatype_string("string COLLATE 'UNICODE_CI'"))
+        self.assertEqual(StringType(0), _parse_datatype_string("string COLLATE UCS_BASIC"))

Review Comment:
   Yeah, let's split the changes into two PRs. The name switch will require additional changes on the Delta side as well.




Re: [PR] [SPARK-47302][SQL][Collation] Collate keyword as identifier [spark]

Posted by "MaxGekk (via GitHub)" <gi...@apache.org>.

MaxGekk commented on code in PR #45405:
URL: https://github.com/apache/spark/pull/45405#discussion_r1516415830


##########
sql/api/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBaseParser.g4:
##########
@@ -1096,7 +1096,7 @@ colPosition
     ;
 
 collateClause
-    : COLLATE collationName=stringLit
+    : COLLATE collationName=identifier

Review Comment:
   I agree, it would be better to report incompatible types rather than an invalid identifier.




Re: [PR] [SPARK-47302][SQL][Collation] Collate keyword as identifier [spark]

Posted by "stefankandic (via GitHub)" <gi...@apache.org>.

stefankandic commented on code in PR #45405:
URL: https://github.com/apache/spark/pull/45405#discussion_r1516535896


##########
sql/api/src/main/scala/org/apache/spark/sql/types/DataType.scala:
##########
@@ -117,7 +117,7 @@ object DataType {
   private val FIXED_DECIMAL = """decimal\(\s*(\d+)\s*,\s*(\-?\d+)\s*\)""".r
   private val CHAR_TYPE = """char\(\s*(\d+)\s*\)""".r
   private val VARCHAR_TYPE = """varchar\(\s*(\d+)\s*\)""".r
-  private val COLLATED_STRING_TYPE = """string\s+COLLATE\s+'([\w_]+)'""".r
+  private val COLLATED_STRING_TYPE = """string\s+COLLATE\s+([\w_]+)""".r

Review Comment:
   Nice catch! Since `double_quoted_identifiers` is false, I guess we only need to support backticks. I added tests for them.




Re: [PR] [SPARK-47302][SQL][Collation] Collate keyword as identifier [spark]

Posted by "MaxGekk (via GitHub)" <gi...@apache.org>.

MaxGekk commented on code in PR #45405:
URL: https://github.com/apache/spark/pull/45405#discussion_r1516373801


##########
sql/api/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBaseParser.g4:
##########
@@ -1096,7 +1096,7 @@ colPosition
     ;
 
 collateClause
-    : COLLATE collationName=stringLit
+    : COLLATE collationName=identifier

Review Comment:
   Let's look at the examples:
   ```
   spark-sql (default)> select 'aaa' collate test_collation;
   [COLLATION_INVALID_NAME] The value test_collation does not represent a correct collation name. Suggested valid collation name: [UCS_BASIC]. SQLSTATE: 42704
   ```
   The error message mentions `test_collation`, which is correct, but:
   ```
   spark-sql (default)> select 'aaa' collate test-collation;
   [COLLATION_INVALID_NAME] The value test does not represent a correct collation name. Suggested valid collation name: [UNICODE]. SQLSTATE: 42704
   ```
   here it points at `test` only.
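
   One way to picture why the second message names only `test`: a plain `identifier` rule consumes the longest run of identifier characters and stops at `-`, leaving the rest to be parsed as an expression. A minimal sketch, assuming a simplified identifier pattern of letters, digits, and underscores (the function `split_collation_name` is hypothetical, not Spark code):

```python
import re

# Simplified assumption: an unquoted identifier is a run of letters, digits, underscores.
_IDENT = re.compile(r"[A-Za-z0-9_]+")


def split_collation_name(text):
    """Return (identifier prefix, unconsumed remainder) for the text after COLLATE."""
    m = _IDENT.match(text)
    if m is None:
        return "", text
    return m.group(0), text[m.end():]


print(split_collation_name("test_collation"))  # ('test_collation', '')
print(split_collation_name("test-collation"))  # ('test', '-collation')
```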




Re: [PR] [SPARK-47302][SQL] Collate keyword as identifier [spark]

Posted by "MaxGekk (via GitHub)" <gi...@apache.org>.

MaxGekk closed pull request #45405: [SPARK-47302][SQL] Collate keyword as identifier
URL: https://github.com/apache/spark/pull/45405

