You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by "dbatomic (via GitHub)" <gi...@apache.org> on 2024/03/06 11:38:41 UTC

[PR] [SPARK-47302][SQL][Collation] Collate key word as identifier [spark]

dbatomic opened a new pull request, #45405:
URL: https://github.com/apache/spark/pull/45405

   <!--
   Thanks for sending a pull request!  Here are some tips for you:
     1. If this is your first time, please read our contributor guidelines: https://spark.apache.org/contributing.html
     2. Ensure you have added or run the appropriate tests for your PR: https://spark.apache.org/developer-tools.html
     3. If the PR is unfinished, add '[WIP]' in your PR title, e.g., '[WIP][SPARK-XXXX] Your PR title ...'.
     4. Be sure to keep the PR description updated to reflect all changes.
     5. Please write your PR title to summarize what this PR proposes.
     6. If possible, provide a concise example to reproduce the issue for a faster review.
     7. If you want to add a new configuration, please read the guideline first for naming configurations in
        'core/src/main/scala/org/apache/spark/internal/config/ConfigEntry.scala'.
     8. If you want to add or modify an error type or message, please read the guideline first in
        'common/utils/src/main/resources/error/README.md'.
   -->
   
   ### What changes were proposed in this pull request?
   <!--
   Please clarify what changes you are proposing. The purpose of this section is to outline the changes and how this PR fixes the issue. 
   If possible, please consider writing useful notes for better and faster reviews in your PR. See the examples below.
     1. If you refactor some codes with changing classes, showing the class hierarchy will help reviewers.
     2. If you fix some SQL features, you can provide some references of other DBMSes.
     3. If there is design documentation, please add the link.
     4. If there is a discussion in the mailing list, please add the link.
   -->
   With this change we move away from using collation names as string literals and start treating them as identifiers, since that is the part of sql standard.
   
   Collation names are marked as multi part identifiers, since, in future, we will want to introduce user defined collations which can be part of nested namespaces in catalog.
   
   ### Why are the changes needed?
   <!--
   Please clarify why the changes are needed. For instance,
     1. If you propose a new API, clarify the use case for a new API.
     2. If you fix a bug, you can clarify why it is a bug.
   -->
   
   Aligning with sql standard on collation syntax.
   
   
   ### Does this PR introduce _any_ user-facing change?
   <!--
   Note that it means *any* user-facing change including all aspects such as the documentation fix.
   If yes, please clarify the previous behavior and the change this PR proposes - provide the console output, description and/or an example to show the behavior difference if possible.
   If possible, please also clarify if this is a user-facing change compared to the released Spark versions or within the unreleased branches such as master.
   If no, write 'No'.
   -->
   Yes. Collations are still not a released feature.
   
   ### How was this patch tested?
   <!--
   If tests were added, say they were added here. Please make sure to add some test cases that check the changes thoroughly including negative and positive cases if possible.
   If it was tested in a way different from regular unit tests, please clarify how you tested step by step, ideally copy and paste-able, so that other reviewers can test and check, and descendants can verify in the future.
   If tests were not added, please describe why they were not added and/or why it was difficult to add.
   If benchmark tests were added, please run the benchmarks in GitHub Actions for the consistent environment, and the instructions could accord to: https://spark.apache.org/developer-tools.html#github-workflow-benchmarks.
   -->
   Existing tests are used.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   <!--
   If generative AI tooling has been used in the process of authoring this patch, please include the
   phrase: 'Generated-by: ' followed by the name of the tool and its version.
   If no, write 'No'.
   Please refer to the [ASF Generative Tooling Guidance](https://www.apache.org/legal/generative-tooling.html) for details.
   -->
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47302][SQL][Collation] Collate key word as identifier [spark]

Posted by "uros-db (via GitHub)" <gi...@apache.org>.
uros-db commented on code in PR #45405:
URL: https://github.com/apache/spark/pull/45405#discussion_r1515603380


##########
sql/api/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBaseParser.g4:
##########
@@ -1096,7 +1096,7 @@ colPosition
     ;
 
 collateClause
-    : COLLATE collationName=stringLit
+    : COLLATE collationName=multipartIdentifier

Review Comment:
   @cloud-fan related to your comment, I'm just wondering what would be a better rule for this?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47302][SQL][Collation] Collate key word as identifier [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #45405:
URL: https://github.com/apache/spark/pull/45405#discussion_r1514563440


##########
sql/api/src/main/scala/org/apache/spark/sql/catalyst/parser/DataTypeAstBuilder.scala:
##########
@@ -218,6 +218,6 @@ class DataTypeAstBuilder extends SqlBaseParserBaseVisitor[AnyRef] {
    * Returns a collation name.
    */
   override def visitCollateClause(ctx: CollateClauseContext): String = withOrigin(ctx) {
-    string(visitStringLit(ctx.stringLit))
+    ctx.multipartIdentifier().getText

Review Comment:
   This is a bit confusing. How do we turn a multi part identifier into a single string? using dot to connect? where is the implementation?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47302][SQL][Collation] Collate keyword as identifier [spark]

Posted by "dbatomic (via GitHub)" <gi...@apache.org>.
dbatomic commented on code in PR #45405:
URL: https://github.com/apache/spark/pull/45405#discussion_r1515983493


##########
sql/api/src/main/scala/org/apache/spark/sql/catalyst/parser/DataTypeAstBuilder.scala:
##########
@@ -218,6 +218,6 @@ class DataTypeAstBuilder extends SqlBaseParserBaseVisitor[AnyRef] {
    * Returns a collation name.
    */
   override def visitCollateClause(ctx: CollateClauseContext): String = withOrigin(ctx) {
-    string(visitStringLit(ctx.stringLit))
+    ctx.multipartIdentifier().getText

Review Comment:
   Yeah, let's keep this as an identifier for now. In future, collation name may become multipart identifier, e.g. once we introduce support for user defined collations that are part of a catalog.
   
   But for now, collation name is just an identifier.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47302][SQL] Collate keyword as identifier [spark]

Posted by "MaxGekk (via GitHub)" <gi...@apache.org>.
MaxGekk commented on code in PR #45405:
URL: https://github.com/apache/spark/pull/45405#discussion_r1517702464


##########
sql/api/src/main/scala/org/apache/spark/sql/catalyst/parser/DataTypeAstBuilder.scala:
##########
@@ -218,6 +218,6 @@ class DataTypeAstBuilder extends SqlBaseParserBaseVisitor[AnyRef] {
    * Returns a collation name.
    */
   override def visitCollateClause(ctx: CollateClauseContext): String = withOrigin(ctx) {
-    string(visitStringLit(ctx.stringLit))
+    ctx.identifier.getText

Review Comment:
   @cloud-fan the identifier was changes to the single part one. I guess the parser catches just `a`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47302][SQL][Collation] Collate keyword as identifier [spark]

Posted by "MaxGekk (via GitHub)" <gi...@apache.org>.
MaxGekk commented on code in PR #45405:
URL: https://github.com/apache/spark/pull/45405#discussion_r1516378011


##########
sql/api/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBaseParser.g4:
##########
@@ -1096,7 +1096,7 @@ colPosition
     ;
 
 collateClause
-    : COLLATE collationName=stringLit
+    : COLLATE collationName=identifier

Review Comment:
   BTW, I would improve the error message to:
   ```
   [COLLATION_INVALID_NAME] `test_collation` does not represent a correct collation name. Suggested valid collation name: UCS_BASIC. SQLSTATE: 42704
   
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47302][SQL] Collate keyword as identifier [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #45405:
URL: https://github.com/apache/spark/pull/45405#discussion_r1517662569


##########
sql/api/src/main/scala/org/apache/spark/sql/catalyst/parser/DataTypeAstBuilder.scala:
##########
@@ -218,6 +218,6 @@ class DataTypeAstBuilder extends SqlBaseParserBaseVisitor[AnyRef] {
    * Returns a collation name.
    */
   override def visitCollateClause(ctx: CollateClauseContext): String = withOrigin(ctx) {
-    string(visitStringLit(ctx.stringLit))
+    ctx.identifier.getText

Review Comment:
   For `COLLATE a . b`, do we return `a . b` or `a.b`?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47302][SQL][Collation] Collate keyword as identifier [spark]

Posted by "stefankandic (via GitHub)" <gi...@apache.org>.
stefankandic commented on code in PR #45405:
URL: https://github.com/apache/spark/pull/45405#discussion_r1516357243


##########
sql/api/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBaseParser.g4:
##########
@@ -1096,7 +1096,7 @@ colPosition
     ;
 
 collateClause
-    : COLLATE collationName=stringLit
+    : COLLATE collationName=identifier

Review Comment:
   why would we want to capture `BLA_BLA-1`? afaik immediate follow symbol can be an expression and thus `errorCapturingIdentifier` should not be used right?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47302][SQL][Collation] Collate key word as identifier [spark]

Posted by "uros-db (via GitHub)" <gi...@apache.org>.
uros-db commented on code in PR #45405:
URL: https://github.com/apache/spark/pull/45405#discussion_r1515587070


##########
python/pyspark/sql/tests/test_types.py:
##########
@@ -862,15 +862,13 @@ def test_parse_datatype_string(self):
             if k != "varchar" and k != "char":
                 self.assertEqual(t(), _parse_datatype_string(k))
         self.assertEqual(IntegerType(), _parse_datatype_string("int"))
-        self.assertEqual(StringType(), _parse_datatype_string("string COLLATE 'UCS_BASIC'"))
+        self.assertEqual(StringType(), _parse_datatype_string("string COLLATE UCS_BASIC"))
         self.assertEqual(StringType(0), _parse_datatype_string("string"))
-        self.assertEqual(StringType(0), _parse_datatype_string("string COLLATE 'UCS_BASIC'"))
-        self.assertEqual(StringType(0), _parse_datatype_string("string   COLLATE 'UCS_BASIC'"))
-        self.assertEqual(StringType(0), _parse_datatype_string("string COLLATE'UCS_BASIC'"))
-        self.assertEqual(StringType(1), _parse_datatype_string("string COLLATE 'UCS_BASIC_LCASE'"))
-        self.assertEqual(StringType(1), _parse_datatype_string("string COLLATE 'UCS_BASIC_LCASE'"))
-        self.assertEqual(StringType(2), _parse_datatype_string("string COLLATE 'UNICODE'"))
-        self.assertEqual(StringType(3), _parse_datatype_string("string COLLATE 'UNICODE_CI'"))
+        self.assertEqual(StringType(0), _parse_datatype_string("string COLLATE UCS_BASIC"))

Review Comment:
   perhaps that would be best as a separate change? this one seems already scattered enough



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47302][SQL][Collation] Collate keyword as identifier [spark]

Posted by "MaxGekk (via GitHub)" <gi...@apache.org>.
MaxGekk commented on code in PR #45405:
URL: https://github.com/apache/spark/pull/45405#discussion_r1516378011


##########
sql/api/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBaseParser.g4:
##########
@@ -1096,7 +1096,7 @@ colPosition
     ;
 
 collateClause
-    : COLLATE collationName=stringLit
+    : COLLATE collationName=identifier

Review Comment:
   BTW, I would improve the error message to:
   ```
   [COLLATION_INVALID_NAME] `test_collation` does not represent a correct collation name. Suggested valid collation name: UCS_BASIC. SQLSTATE: 42704
   ```
   and attach a query context like in:
   ```
   spark-sql (default)> select 'aaa' from test-table;
   
   [INVALID_IDENTIFIER] The identifier test-table is invalid. Please, consider quoting it with back-quotes as `test-table`. SQLSTATE: 42602 (line 1, pos 22)
   
   == SQL ==
   select 'aaa' from test-table
   ----------------------^^^
   ```
   @dbatomic Could you open an JIRA for that, please.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47302][SQL] Collate keyword as identifier [spark]

Posted by "MaxGekk (via GitHub)" <gi...@apache.org>.
MaxGekk commented on PR #45405:
URL: https://github.com/apache/spark/pull/45405#issuecomment-1985446548

   +1, LGTM. Merging to master.
   Thank you, @dbatomic and @stefankandic @srielau @cloud-fan @uros-db for review.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47302][SQL][Collation] Collate keyword as identifier [spark]

Posted by "stefankandic (via GitHub)" <gi...@apache.org>.
stefankandic commented on code in PR #45405:
URL: https://github.com/apache/spark/pull/45405#discussion_r1516396458


##########
sql/api/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBaseParser.g4:
##########
@@ -1096,7 +1096,7 @@ colPosition
     ;
 
 collateClause
-    : COLLATE collationName=stringLit
+    : COLLATE collationName=identifier

Review Comment:
   what about this case:
   ```code
   select 'a' collate UNICODE-columnB
   ```
   
   shouldn't we report that the types are incompatible for minus operation?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47302][SQL][Collation] Collate key word as identifier [spark]

Posted by "uros-db (via GitHub)" <gi...@apache.org>.
uros-db commented on code in PR #45405:
URL: https://github.com/apache/spark/pull/45405#discussion_r1515603380


##########
sql/api/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBaseParser.g4:
##########
@@ -1096,7 +1096,7 @@ colPosition
     ;
 
 collateClause
-    : COLLATE collationName=stringLit
+    : COLLATE collationName=multipartIdentifier

Review Comment:
   @cloud-fan related to your comment, I'm just wondering what would be a better rule for this (if any)?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47302][SQL][Collation] Collate keyword as identifier [spark]

Posted by "MaxGekk (via GitHub)" <gi...@apache.org>.
MaxGekk commented on code in PR #45405:
URL: https://github.com/apache/spark/pull/45405#discussion_r1516062859


##########
sql/api/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBaseParser.g4:
##########
@@ -1096,7 +1096,7 @@ colPosition
     ;
 
 collateClause
-    : COLLATE collationName=stringLit
+    : COLLATE collationName=identifier

Review Comment:
   BTW, not `errorCapturingIdentifier`? Do you have a test for `... COLLATE BLA_BLA-1`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47302][SQL] Collate keyword as identifier [spark]

Posted by "MaxGekk (via GitHub)" <gi...@apache.org>.
MaxGekk commented on code in PR #45405:
URL: https://github.com/apache/spark/pull/45405#discussion_r1517702464


##########
sql/api/src/main/scala/org/apache/spark/sql/catalyst/parser/DataTypeAstBuilder.scala:
##########
@@ -218,6 +218,6 @@ class DataTypeAstBuilder extends SqlBaseParserBaseVisitor[AnyRef] {
    * Returns a collation name.
    */
   override def visitCollateClause(ctx: CollateClauseContext): String = withOrigin(ctx) {
-    string(visitStringLit(ctx.stringLit))
+    ctx.identifier.getText

Review Comment:
   @cloud-fan the identifier was changes to the single part. I guess the parser just catches just `a`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47302][SQL][Collation] Collate key word as identifier [spark]

Posted by "uros-db (via GitHub)" <gi...@apache.org>.
uros-db commented on code in PR #45405:
URL: https://github.com/apache/spark/pull/45405#discussion_r1515587070


##########
python/pyspark/sql/tests/test_types.py:
##########
@@ -862,15 +862,13 @@ def test_parse_datatype_string(self):
             if k != "varchar" and k != "char":
                 self.assertEqual(t(), _parse_datatype_string(k))
         self.assertEqual(IntegerType(), _parse_datatype_string("int"))
-        self.assertEqual(StringType(), _parse_datatype_string("string COLLATE 'UCS_BASIC'"))
+        self.assertEqual(StringType(), _parse_datatype_string("string COLLATE UCS_BASIC"))
         self.assertEqual(StringType(0), _parse_datatype_string("string"))
-        self.assertEqual(StringType(0), _parse_datatype_string("string COLLATE 'UCS_BASIC'"))
-        self.assertEqual(StringType(0), _parse_datatype_string("string   COLLATE 'UCS_BASIC'"))
-        self.assertEqual(StringType(0), _parse_datatype_string("string COLLATE'UCS_BASIC'"))
-        self.assertEqual(StringType(1), _parse_datatype_string("string COLLATE 'UCS_BASIC_LCASE'"))
-        self.assertEqual(StringType(1), _parse_datatype_string("string COLLATE 'UCS_BASIC_LCASE'"))
-        self.assertEqual(StringType(2), _parse_datatype_string("string COLLATE 'UNICODE'"))
-        self.assertEqual(StringType(3), _parse_datatype_string("string COLLATE 'UNICODE_CI'"))
+        self.assertEqual(StringType(0), _parse_datatype_string("string COLLATE UCS_BASIC"))

Review Comment:
   perhaps that would be best as a separate change? this one seems already scattered enough, and I think there's plenty of other places that may require changes w/ respect to naming



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47302][SQL][Collation] Collate key word as identifier [spark]

Posted by "srielau (via GitHub)" <gi...@apache.org>.
srielau commented on code in PR #45405:
URL: https://github.com/apache/spark/pull/45405#discussion_r1514694683


##########
python/pyspark/sql/tests/test_types.py:
##########
@@ -862,15 +862,13 @@ def test_parse_datatype_string(self):
             if k != "varchar" and k != "char":
                 self.assertEqual(t(), _parse_datatype_string(k))
         self.assertEqual(IntegerType(), _parse_datatype_string("int"))
-        self.assertEqual(StringType(), _parse_datatype_string("string COLLATE 'UCS_BASIC'"))
+        self.assertEqual(StringType(), _parse_datatype_string("string COLLATE UCS_BASIC"))
         self.assertEqual(StringType(0), _parse_datatype_string("string"))
-        self.assertEqual(StringType(0), _parse_datatype_string("string COLLATE 'UCS_BASIC'"))
-        self.assertEqual(StringType(0), _parse_datatype_string("string   COLLATE 'UCS_BASIC'"))
-        self.assertEqual(StringType(0), _parse_datatype_string("string COLLATE'UCS_BASIC'"))
-        self.assertEqual(StringType(1), _parse_datatype_string("string COLLATE 'UCS_BASIC_LCASE'"))
-        self.assertEqual(StringType(1), _parse_datatype_string("string COLLATE 'UCS_BASIC_LCASE'"))
-        self.assertEqual(StringType(2), _parse_datatype_string("string COLLATE 'UNICODE'"))
-        self.assertEqual(StringType(3), _parse_datatype_string("string COLLATE 'UNICODE_CI'"))
+        self.assertEqual(StringType(0), _parse_datatype_string("string COLLATE UCS_BASIC"))

Review Comment:
   Silly question, isn't this a good time to switch from UCS_BASIC to UTF8_BINARY?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47302][SQL][Collation] Collate keyword as identifier [spark]

Posted by "stefankandic (via GitHub)" <gi...@apache.org>.
stefankandic commented on code in PR #45405:
URL: https://github.com/apache/spark/pull/45405#discussion_r1516396458


##########
sql/api/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBaseParser.g4:
##########
@@ -1096,7 +1096,7 @@ colPosition
     ;
 
 collateClause
-    : COLLATE collationName=stringLit
+    : COLLATE collationName=identifier

Review Comment:
   what about this case:
   ```code
   select 'a' collate UNICODE-columnB
   ```
   
   then the error would be:
   ```code
   [INVALID_IDENTIFIER] The identifier UNICODE-columnB is invalid. Please, consider quoting it with back-quotes as `UNICODE-columnB`. SQLSTATE: 42602
   ```
   
   instead of giving the error that types are incompatible for minus operation



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47302][SQL][Collation] Collate keyword as identifier [spark]

Posted by "MaxGekk (via GitHub)" <gi...@apache.org>.
MaxGekk commented on code in PR #45405:
URL: https://github.com/apache/spark/pull/45405#discussion_r1516070311


##########
sql/api/src/main/scala/org/apache/spark/sql/types/DataType.scala:
##########
@@ -117,7 +117,7 @@ object DataType {
   private val FIXED_DECIMAL = """decimal\(\s*(\d+)\s*,\s*(\-?\d+)\s*\)""".r
   private val CHAR_TYPE = """char\(\s*(\d+)\s*\)""".r
   private val VARCHAR_TYPE = """varchar\(\s*(\d+)\s*\)""".r
-  private val COLLATED_STRING_TYPE = """string\s+COLLATE\s+'([\w_]+)'""".r
+  private val COLLATED_STRING_TYPE = """string\s+COLLATE\s+([\w_]+)""".r

Review Comment:
   When you define `collationName=identifier` where `identifier` can be:
   ```
   quotedIdentifier
       : BACKQUOTED_IDENTIFIER
       | {double_quoted_identifiers}? DOUBLEQUOTED_STRING
       ;
   ``` 
   but `\w` captures only `[a-zA-Z_0-9]`. Is it in purpose, and what happens if an user pass the collation name as
   ```
   `UCS_BASIC_LCASE` or
   "UCS_BASIC_LCASE"
   ``` 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47302][SQL] Collate keyword as identifier [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #45405:
URL: https://github.com/apache/spark/pull/45405#discussion_r1517829360


##########
sql/api/src/main/scala/org/apache/spark/sql/catalyst/parser/DataTypeAstBuilder.scala:
##########
@@ -218,6 +218,6 @@ class DataTypeAstBuilder extends SqlBaseParserBaseVisitor[AnyRef] {
    * Returns a collation name.
    */
   override def visitCollateClause(ctx: CollateClauseContext): String = withOrigin(ctx) {
-    string(visitStringLit(ctx.stringLit))
+    ctx.identifier.getText

Review Comment:
   oh I missed the new update to change to single part. nvm



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47302][SQL][Collation] Collate keyword as identifier [spark]

Posted by "dbatomic (via GitHub)" <gi...@apache.org>.
dbatomic commented on code in PR #45405:
URL: https://github.com/apache/spark/pull/45405#discussion_r1515984970


##########
python/pyspark/sql/tests/test_types.py:
##########
@@ -862,15 +862,13 @@ def test_parse_datatype_string(self):
             if k != "varchar" and k != "char":
                 self.assertEqual(t(), _parse_datatype_string(k))
         self.assertEqual(IntegerType(), _parse_datatype_string("int"))
-        self.assertEqual(StringType(), _parse_datatype_string("string COLLATE 'UCS_BASIC'"))
+        self.assertEqual(StringType(), _parse_datatype_string("string COLLATE UCS_BASIC"))
         self.assertEqual(StringType(0), _parse_datatype_string("string"))
-        self.assertEqual(StringType(0), _parse_datatype_string("string COLLATE 'UCS_BASIC'"))
-        self.assertEqual(StringType(0), _parse_datatype_string("string   COLLATE 'UCS_BASIC'"))
-        self.assertEqual(StringType(0), _parse_datatype_string("string COLLATE'UCS_BASIC'"))
-        self.assertEqual(StringType(1), _parse_datatype_string("string COLLATE 'UCS_BASIC_LCASE'"))
-        self.assertEqual(StringType(1), _parse_datatype_string("string COLLATE 'UCS_BASIC_LCASE'"))
-        self.assertEqual(StringType(2), _parse_datatype_string("string COLLATE 'UNICODE'"))
-        self.assertEqual(StringType(3), _parse_datatype_string("string COLLATE 'UNICODE_CI'"))
+        self.assertEqual(StringType(0), _parse_datatype_string("string COLLATE UCS_BASIC"))

Review Comment:
   yeah, let's split the changes in two PRs. Name switch will require additional changes on delta side as well.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47302][SQL][Collation] Collate keyword as identifier [spark]

Posted by "MaxGekk (via GitHub)" <gi...@apache.org>.
MaxGekk commented on code in PR #45405:
URL: https://github.com/apache/spark/pull/45405#discussion_r1516415830


##########
sql/api/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBaseParser.g4:
##########
@@ -1096,7 +1096,7 @@ colPosition
     ;
 
 collateClause
-    : COLLATE collationName=stringLit
+    : COLLATE collationName=identifier

Review Comment:
   I agree, it would be better to report about incompatible types rather invalid identifier.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47302][SQL][Collation] Collate keyword as identifier [spark]

Posted by "stefankandic (via GitHub)" <gi...@apache.org>.
stefankandic commented on code in PR #45405:
URL: https://github.com/apache/spark/pull/45405#discussion_r1516535896


##########
sql/api/src/main/scala/org/apache/spark/sql/types/DataType.scala:
##########
@@ -117,7 +117,7 @@ object DataType {
   private val FIXED_DECIMAL = """decimal\(\s*(\d+)\s*,\s*(\-?\d+)\s*\)""".r
   private val CHAR_TYPE = """char\(\s*(\d+)\s*\)""".r
   private val VARCHAR_TYPE = """varchar\(\s*(\d+)\s*\)""".r
-  private val COLLATED_STRING_TYPE = """string\s+COLLATE\s+'([\w_]+)'""".r
+  private val COLLATED_STRING_TYPE = """string\s+COLLATE\s+([\w_]+)""".r

Review Comment:
   Nice catch! Since `collate_key_word_as_identifier` is false I guess we only need to support backticks. I added the tests for them.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47302][SQL][Collation] Collate keyword as identifier [spark]

Posted by "MaxGekk (via GitHub)" <gi...@apache.org>.
MaxGekk commented on code in PR #45405:
URL: https://github.com/apache/spark/pull/45405#discussion_r1516373801


##########
sql/api/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBaseParser.g4:
##########
@@ -1096,7 +1096,7 @@ colPosition
     ;
 
 collateClause
-    : COLLATE collationName=stringLit
+    : COLLATE collationName=identifier

Review Comment:
   Let's look at the examples:
   ```
   spark-sql (default)> select 'aaa' collate test_collation;
   [COLLATION_INVALID_NAME] The value test_collation does not represent a correct collation name. Suggested valid collation name: [UCS_BASIC]. SQLSTATE: 42704
   ```
   The error message says about `test_collation`, that's correct but:
   ```
   spark-sql (default)> select 'aaa' collate test-collation;
   [COLLATION_INVALID_NAME] The value test does not represent a correct collation name. Suggested valid collation name: [UNICODE]. SQLSTATE: 42704
   ```
   it points out `test`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47302][SQL] Collate keyword as identifier [spark]

Posted by "MaxGekk (via GitHub)" <gi...@apache.org>.
MaxGekk closed pull request #45405: [SPARK-47302][SQL] Collate keyword as identifier
URL: https://github.com/apache/spark/pull/45405


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org