
Posted to reviews@spark.apache.org by "anchovYu (via GitHub)" <gi...@apache.org> on 2023/12/13 06:17:57 UTC

[PR] [WIP] Fix spark.catalog.listDatabases() issues on schemas with special characters [spark]

anchovYu opened a new pull request, #44328:
URL: https://github.com/apache/spark/pull/44328

   ### What changes were proposed in this pull request?
   When the SQL conf `spark.sql.legacy.keepCommandOutputSchema` is set to true:
   Before:
   ```
   // suppose there is a xyyu-db-with-hyphen schema in the catalog
   spark.catalog.listDatabases()
   
   [INVALID_IDENTIFIER] The identifier xyyu-db-with-hyphen is invalid. Please, consider quoting it with back-quotes as `xyyu-db-with-hyphen`. SQLSTATE: 42602 (line 1, pos 4)
   ```
   
   After:
   ```
   spark.catalog.listDatabases()
   
   .. `xyyu-db-with-hyphen` ..
   ```
   
   This PR fixes the issue by forcing the problematic step in `listDatabases` to execute in non-legacy mode, so that the returned value is guaranteed to be quoted.
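
   For readers unfamiliar with the quoting involved, here is a minimal sketch of what "quoted" means for the names in this PR. The helper `quoteIfNeeded` below is hypothetical and written only for illustration; it is not Spark's API, and its rule (quote anything that is not purely letters, digits and underscores) is an assumption chosen to agree with the names used in this PR's description and test.

   ```scala
   object QuotingSketch {
     // Hypothetical helper, not Spark's API: wrap a name in back-quotes unless it
     // consists only of letters, digits and underscores (an assumed rule).
     def quoteIfNeeded(part: String): String = {
       if (part.matches("[a-zA-Z0-9_]+")) {
         part
       } else {
         // A back-quote inside the name is escaped by doubling it.
         val escaped = part.replace("`", "``")
         "`" + escaped + "`"
       }
     }

     def main(args: Array[String]): Unit = {
       Seq("default", "xyyu-db-with-hyphen", "my`db2").foreach { name =>
         println(name + " -> " + quoteIfNeeded(name))
       }
     }
   }
   ```

   Under this assumed rule, `default` is left alone, while a name with a hyphen or an embedded back-quote comes back wrapped in back-quotes, which is the form an identifier parser can accept.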
   
   
   ### Why are the changes needed?
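   To fix the bug: with `spark.sql.legacy.keepCommandOutputSchema` set to true, `spark.catalog.listDatabases()` fails with `INVALID_IDENTIFIER` when a schema name contains special characters such as a hyphen.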
   
   
   ### Does this PR introduce _any_ user-facing change?
   No.
   
   
   ### How was this patch tested?
   Newly added tests.
   
   
   ### Was this patch authored or co-authored using generative AI tooling?
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-46394][SQL] Fix spark.catalog.listDatabases() issues on schemas with special characters when `spark.sql.legacy.keepCommandOutputSchema` set to true [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.

cloud-fan commented on code in PR #44328:
URL: https://github.com/apache/spark/pull/44328#discussion_r1425836204


##########
sql/core/src/test/scala/org/apache/spark/sql/internal/CatalogSuite.scala:
##########
@@ -167,6 +167,44 @@ class CatalogSuite extends SharedSparkSession with AnalysisTest with BeforeAndAf
       Set("default", "my_db2"))
   }
 
+  test("list databases with special character") {
+    Seq(true, false).foreach { legacy =>
+      withSQLConf(SQLConf.LEGACY_KEEP_COMMAND_OUTPUT_SCHEMA.key -> legacy.toString) {
+        spark.catalog.setCurrentCatalog(CatalogManager.SESSION_CATALOG_NAME)
+        assert(spark.catalog.listDatabases().collect().map(_.name).toSet == Set("default"))
+        // use externalCatalog to bypass the database name validation in SessionCatalog
+        spark.sharedState.externalCatalog.createDatabase(utils.newDb("my-db1"), false)
+        spark.sharedState.externalCatalog.createDatabase(utils.newDb("my`db2"), false)
+        assert(spark.catalog.listDatabases().collect().map(_.name).toSet ==
+          Set("default", "`my-db1`", "`my``db2`"))
+        if (legacy) {
+          assert(
+            spark.catalog.listDatabases("my*").collect().map(_.name).toSet ==
+              Set("`my-db1`", "`my``db2`")
+          )
+          assert(spark.catalog.listDatabases("`my*`").collect().map(_.name).toSet == Set.empty)
+        } else {
+          assert(spark.catalog.listDatabases("my*").collect().map(_.name).toSet == Set.empty)

Review Comment:
   why is there a behavior difference here?




Re: [PR] [SPARK-46394][SQL] Fix spark.catalog.listDatabases() issues on schemas with special characters when `spark.sql.legacy.keepCommandOutputSchema` set to true [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.

cloud-fan commented on code in PR #44328:
URL: https://github.com/apache/spark/pull/44328#discussion_r1425834404


##########
sql/core/src/main/scala/org/apache/spark/sql/internal/CatalogImpl.scala:
##########
@@ -100,7 +109,9 @@ class CatalogImpl(sparkSession: SparkSession) extends Catalog {
       case ShowNamespaces(r: ResolvedNamespace, _, _) => r.catalog
     }.get
     val databases = qe.toRdd.collect().map { row =>
-      makeDatabase(Some(catalog.name()), row.getString(0))
+      // dbName can either be a quoted identifier (single or multi part) or an unquoted single part
+      val dbName = row.getString(0)
+      makeDatabase(Some(catalog.name()), dbName)

Review Comment:
   seems unnecessary change?




Re: [PR] [SPARK-46394][SQL] Fix spark.catalog.listDatabases() issues on schemas with special characters when `spark.sql.legacy.keepCommandOutputSchema` set to true [spark]

Posted by "anchovYu (via GitHub)" <gi...@apache.org>.

anchovYu commented on code in PR #44328:
URL: https://github.com/apache/spark/pull/44328#discussion_r1425951074


##########
sql/core/src/test/scala/org/apache/spark/sql/internal/CatalogSuite.scala:
##########
@@ -167,6 +167,44 @@ class CatalogSuite extends SharedSparkSession with AnalysisTest with BeforeAndAf
       Set("default", "my_db2"))
   }
 
+  test("list databases with special character") {
+    Seq(true, false).foreach { legacy =>
+      withSQLConf(SQLConf.LEGACY_KEEP_COMMAND_OUTPUT_SCHEMA.key -> legacy.toString) {
+        spark.catalog.setCurrentCatalog(CatalogManager.SESSION_CATALOG_NAME)
+        assert(spark.catalog.listDatabases().collect().map(_.name).toSet == Set("default"))
+        // use externalCatalog to bypass the database name validation in SessionCatalog
+        spark.sharedState.externalCatalog.createDatabase(utils.newDb("my-db1"), false)
+        spark.sharedState.externalCatalog.createDatabase(utils.newDb("my`db2"), false)
+        assert(spark.catalog.listDatabases().collect().map(_.name).toSet ==
+          Set("default", "`my-db1`", "`my``db2`"))
+        if (legacy) {
+          assert(
+            spark.catalog.listDatabases("my*").collect().map(_.name).toSet ==
+              Set("`my-db1`", "`my``db2`")
+          )
+          assert(spark.catalog.listDatabases("`my*`").collect().map(_.name).toSet == Set.empty)
+        } else {
+          assert(spark.catalog.listDatabases("my*").collect().map(_.name).toSet == Set.empty)

Review Comment:
   Added a todo.




Re: [PR] [SPARK-46394][SQL] Fix spark.catalog.listDatabases() issues on schemas with special characters when `spark.sql.legacy.keepCommandOutputSchema` set to true [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.

cloud-fan commented on PR #44328:
URL: https://github.com/apache/spark/pull/44328#issuecomment-1855208319

   thanks, merging to master/3.5!



Re: [PR] [SPARK-46394][SQL] Fix spark.catalog.listDatabases() issues on schemas with special characters when `spark.sql.legacy.keepCommandOutputSchema` set to true [spark]

Posted by "anchovYu (via GitHub)" <gi...@apache.org>.

anchovYu commented on code in PR #44328:
URL: https://github.com/apache/spark/pull/44328#discussion_r1425888806


##########
sql/core/src/test/scala/org/apache/spark/sql/internal/CatalogSuite.scala:
##########
@@ -167,6 +167,44 @@ class CatalogSuite extends SharedSparkSession with AnalysisTest with BeforeAndAf
       Set("default", "my_db2"))
   }
 
+  test("list databases with special character") {
+    Seq(true, false).foreach { legacy =>
+      withSQLConf(SQLConf.LEGACY_KEEP_COMMAND_OUTPUT_SCHEMA.key -> legacy.toString) {
+        spark.catalog.setCurrentCatalog(CatalogManager.SESSION_CATALOG_NAME)
+        assert(spark.catalog.listDatabases().collect().map(_.name).toSet == Set("default"))
+        // use externalCatalog to bypass the database name validation in SessionCatalog
+        spark.sharedState.externalCatalog.createDatabase(utils.newDb("my-db1"), false)
+        spark.sharedState.externalCatalog.createDatabase(utils.newDb("my`db2"), false)
+        assert(spark.catalog.listDatabases().collect().map(_.name).toSet ==
+          Set("default", "`my-db1`", "`my``db2`"))
+        if (legacy) {
+          assert(
+            spark.catalog.listDatabases("my*").collect().map(_.name).toSet ==
+              Set("`my-db1`", "`my``db2`")
+          )
+          assert(spark.catalog.listDatabases("`my*`").collect().map(_.name).toSet == Set.empty)
+        } else {
+          assert(spark.catalog.listDatabases("my*").collect().map(_.name).toSet == Set.empty)

Review Comment:
   It's the existing behavior in SQL : ).
   With this conf flipped, the SQL query `SHOW SCHEMAS LIKE` actually requires a different pattern. This is because `ShowNamespacesExec` does the quoting before the pattern matching.
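
   To illustrate why the order matters, here is a simplified sketch, not Spark's implementation. The helpers `quote`, `like`, `matchThenQuote` and `quoteThenMatch` are hypothetical names, and the glob handling (only `*` is a wildcard) is an assumption. It shows that a pattern such as `my*` can match the raw names but not the quoted ones, because a quoted name starts with a back-quote.

   ```scala
   import java.util.regex.Pattern

   object PatternOrderSketch {
     // Hypothetical quoting helper (assumed rule): quote unless the name is only
     // letters, digits and underscores; double any embedded back-quote.
     def quote(name: String): String =
       if (name.matches("[a-zA-Z0-9_]+")) name
       else "`" + name.replace("`", "``") + "`"

     // Simplified glob filter: `*` matches any run of characters, everything
     // else is matched literally. This is an assumption, not Spark's matcher.
     def like(names: Seq[String], pattern: String): Seq[String] = {
       val regex = Pattern.compile(
         pattern.split("\\*", -1).map(p => Pattern.quote(p)).mkString(".*"))
       names.filter(n => regex.matcher(n).matches())
     }

     // Filter the raw names first, quote the survivors afterwards.
     def matchThenQuote(raw: Seq[String], pattern: String): Seq[String] =
       like(raw, pattern).map(quote)

     // Quote every name first, then filter the quoted strings.
     def quoteThenMatch(raw: Seq[String], pattern: String): Seq[String] =
       like(raw.map(quote), pattern)

     def main(args: Array[String]): Unit = {
       val raw = Seq("default", "my-db1", "my`db2")
       println(matchThenQuote(raw, "my*"))
       println(quoteThenMatch(raw, "my*"))
     }
   }
   ```

   In this sketch, matching before quoting lets `my*` find both `my-db1` and ``my`db2``, which lines up with the legacy branch of the test above. Quoting before matching leaves `my*` with nothing to match, which lines up with the non-legacy branch.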




Re: [PR] [SPARK-46394][SQL] Fix spark.catalog.listDatabases() issues on schemas with special characters when `spark.sql.legacy.keepCommandOutputSchema` set to true [spark]

Posted by "anchovYu (via GitHub)" <gi...@apache.org>.

anchovYu commented on code in PR #44328:
URL: https://github.com/apache/spark/pull/44328#discussion_r1425886427


##########
sql/core/src/main/scala/org/apache/spark/sql/internal/CatalogImpl.scala:
##########
@@ -100,7 +109,9 @@ class CatalogImpl(sparkSession: SparkSession) extends Catalog {
       case ShowNamespaces(r: ResolvedNamespace, _, _) => r.catalog
     }.get
     val databases = qe.toRdd.collect().map { row =>
-      makeDatabase(Some(catalog.name()), row.getString(0))
+      // dbName can either be a quoted identifier (single or multi part) or an unquoted single part
+      val dbName = row.getString(0)
+      makeDatabase(Some(catalog.name()), dbName)

Review Comment:
   Yes, but it's easier to reference in the comment (`dbName` instead of something like "returned results" or `row.getString(0)`).




Re: [PR] [SPARK-46394][SQL] Fix spark.catalog.listDatabases() issues on schemas with special characters when `spark.sql.legacy.keepCommandOutputSchema` set to true [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.

cloud-fan closed pull request #44328: [SPARK-46394][SQL] Fix spark.catalog.listDatabases() issues on schemas with special characters when `spark.sql.legacy.keepCommandOutputSchema` set to true 
URL: https://github.com/apache/spark/pull/44328

