Posted to commits@kyuubi.apache.org by ch...@apache.org on 2024/01/29 06:21:31 UTC

(kyuubi) branch branch-1.8 updated: [KYUUBI #6018] Speed up GetTables operation for Spark session catalog

This is an automated email from the ASF dual-hosted git repository.

chengpan pushed a commit to branch branch-1.8
in repository https://gitbox.apache.org/repos/asf/kyuubi.git


The following commit(s) were added to refs/heads/branch-1.8 by this push:
     new f86628f3c [KYUUBI #6018] Speed up GetTables operation for Spark session catalog
f86628f3c is described below

commit f86628f3cc856265ff8214599836205d2cef2f5c
Author: Cheng Pan <ch...@apache.org>
AuthorDate: Mon Jan 29 14:21:09 2024 +0800

    [KYUUBI #6018] Speed up GetTables operation for Spark session catalog
    
    # :mag: Description
    ## Issue References 🔗
    
    This pull request aims to speed up the GetTables operation for the Spark session catalog.
    As reported in https://github.com/apache/kyuubi/discussions/4956 and https://github.com/apache/kyuubi/discussions/5949, the GetTables operation is quite slow in some cases. https://github.com/apache/kyuubi/pull/4444 introduced `kyuubi.operation.getTables.ignoreTableProperties` to speed up V2 catalogs, but it does not cover the session catalog.
    
    ## Describe Your Solution 🔧
    
    Extend the scope of `kyuubi.operation.getTables.ignoreTableProperties` to cover the GetTables operation for the Spark session catalog.
    
    Currently, the basic steps of GetTables in the Spark engine are:
    ```
    val catalog: CatalogPlugin = getCatalog(spark, catalogName)
    val sessionCatalog: SessionCatalog = spark.sessionState.catalog
    val databases: Seq[String] = sessionCatalog.listDatabases(schemaPattern)
    // for each `db` in `databases`:
    val identifiers: Seq[TableIdentifier] = sessionCatalog.listTables(db, tablePattern, includeLocalTempViews = false)
    val tableObjects: Seq[CatalogTable] = sessionCatalog.getTablesByName(identifiers)
    ```
    then filter `tableObjects` with `tableTypes: Set[String]`.
    
    The cost of `catalog.getTablesByName(identifiers)` is quite high when the number of tables is large, e.g. tens of thousands.
    
    In some cases, tables are listed only to display their names. There it is worth speeding up the operation at the cost of ignoring some properties (e.g. table comments) and query criteria. Specifically, when `kyuubi.operation.getTables.ignoreTableProperties=true`, the `tableTypes` criteria is ignored, and all tables and views are returned as TABLE.
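    
    The trade-off between the two paths can be sketched without any Spark dependency. This is only an illustration: `TableId`, `TableMeta`, `TableRow`, and `GetTablesSketch` are hypothetical stand-ins (not Spark or Kyuubi classes), and the type filter is a simplified assumption rather than the engine's actual matching logic.
    
    ```scala
    // Hypothetical stand-ins for Spark's TableIdentifier and CatalogTable.
    final case class TableId(database: Option[String], table: String)
    final case class TableMeta(id: TableId, tableType: String, comment: Option[String])
    final case class TableRow(db: String, name: String, tableType: String, remarks: String)
    
    object GetTablesSketch {
      val Table = "TABLE"
      val View = "VIEW"
    
      // Fast path: rows are built from identifiers alone, so no per-table
      // metadata lookup is needed. Every entry is reported as TABLE with
      // an empty comment, and the tableTypes criteria is not consulted.
      def fastRows(ids: Seq[TableId]): Seq[TableRow] =
        ids.map(id => TableRow(id.database.getOrElse("default"), id.table, Table, ""))
    
      // Full path: requires already-loaded metadata to distinguish views
      // from tables, apply the tableTypes filter, and fill in comments.
      // Assumption: a plain set-membership check stands in for the real filter.
      def fullRows(tables: Seq[TableMeta], tableTypes: Set[String]): Seq[TableRow] =
        tables.flatMap { t =>
          val typ = if (t.tableType.equalsIgnoreCase(View)) View else Table
          if (tableTypes.contains(typ)) {
            Some(TableRow(
              t.id.database.getOrElse("default"),
              t.id.table,
              typ,
              t.comment.getOrElse("")))
          } else {
            None
          }
        }
    }
    ```
    
    Because the fast path reports views as TABLE and drops comments, clients that depend on `tableTypes` filtering or on table remarks should leave the option at its default of `false`.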
    
    ## Types of changes :bookmark:
    
    - [ ] Bugfix (non-breaking change which fixes an issue)
    - [x] New feature (non-breaking change which adds functionality)
    - [ ] Breaking change (fix or feature that would cause existing functionality to change)
    
    ## Test Plan 🧪
    
    Pass GA
    
    ---
    
    # Checklist 📝
    
    - [x] This patch was not authored or co-authored using [Generative Tooling](https://www.apache.org/legal/generative-tooling.html)
    
    **Be nice. Be informative.**
    
    Closes #6018 from pan3793/fast-get-table.
    
    Closes #6018
    
    058001c6f [Cheng Pan] fix
    405b12484 [Cheng Pan] fix
    615b7470f [Cheng Pan] Speed up GetTables operation
    
    Authored-by: Cheng Pan <ch...@apache.org>
    Signed-off-by: Cheng Pan <ch...@apache.org>
    (cherry picked from commit d474768d97e85218183c880528a992a2dc258229)
    Signed-off-by: Cheng Pan <ch...@apache.org>
---
 docs/configuration/settings.md                     |  2 +-
 .../engine/spark/util/SparkCatalogUtils.scala      | 37 ++++++++++++++++------
 .../org/apache/kyuubi/config/KyuubiConf.scala      |  3 +-
 3 files changed, 30 insertions(+), 12 deletions(-)

diff --git a/docs/configuration/settings.md b/docs/configuration/settings.md
index 265d89795..69d220337 100644
--- a/docs/configuration/settings.md
+++ b/docs/configuration/settings.md
@@ -376,7 +376,7 @@ You can configure the Kyuubi properties in `$KYUUBI_HOME/conf/kyuubi-defaults.co
 
 |                       Key                        |                                     Default                                     |                                                                                                                                                                                                                                                 Meaning                                                                                                               [...]
 |--------------------------------------------------|---------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- [...]
-| kyuubi.operation.getTables.ignoreTableProperties | false                                                                           | Speed up the `GetTables` operation by returning table identities only.                                                                                                                                                                                                                                                                                                [...]
+| kyuubi.operation.getTables.ignoreTableProperties | false                                                                           | Speed up the `GetTables` operation by ignoring `tableTypes` query criteria, and returning table identities only.                                                                                                                                                                                                                                                      [...]
 | kyuubi.operation.idle.timeout                    | PT3H                                                                            | Operation will be closed when it's not accessed for this duration of time                                                                                                                                                                                                                                                                                             [...]
 | kyuubi.operation.interrupt.on.cancel             | true                                                                            | When true, all running tasks will be interrupted if one cancels a query. When false, all running tasks will remain until finished.                                                                                                                                                                                                                                    [...]
 | kyuubi.operation.language                        | SQL                                                                             | Choose a programing language for the following inputs<ul><li>SQL: (Default) Run all following statements as SQL queries.</li><li>SCALA: Run all following input as scala codes</li><li>PYTHON: (Experimental) Run all following input as Python codes with Spark engine</li></ul>                                                                                     [...]
diff --git a/externals/kyuubi-spark-sql-engine/src/main/scala/org/apache/kyuubi/engine/spark/util/SparkCatalogUtils.scala b/externals/kyuubi-spark-sql-engine/src/main/scala/org/apache/kyuubi/engine/spark/util/SparkCatalogUtils.scala
index 18a14494e..b55319830 100644
--- a/externals/kyuubi-spark-sql-engine/src/main/scala/org/apache/kyuubi/engine/spark/util/SparkCatalogUtils.scala
+++ b/externals/kyuubi-spark-sql-engine/src/main/scala/org/apache/kyuubi/engine/spark/util/SparkCatalogUtils.scala
@@ -163,8 +163,8 @@ object SparkCatalogUtils extends Logging {
     val namespaces = listNamespacesWithPattern(catalog, schemaPattern)
     catalog match {
       case builtin if builtin.name() == SESSION_CATALOG =>
-        val catalog = spark.sessionState.catalog
-        val databases = catalog.listDatabases(schemaPattern)
+        val sessionCatalog = spark.sessionState.catalog
+        val databases = sessionCatalog.listDatabases(schemaPattern)
 
         def isMatchedTableType(tableTypes: Set[String], tableType: String): Boolean = {
           val typ = if (tableType.equalsIgnoreCase(VIEW)) VIEW else TABLE
@@ -172,22 +172,39 @@ object SparkCatalogUtils extends Logging {
         }
 
         databases.flatMap { db =>
-          val identifiers = catalog.listTables(db, tablePattern, includeLocalTempViews = false)
-          catalog.getTablesByName(identifiers)
-            .filter(t => isMatchedTableType(tableTypes, t.tableType.name)).map { t =>
-              val typ = if (t.tableType.name == VIEW) VIEW else TABLE
+          val identifiers =
+            sessionCatalog.listTables(db, tablePattern, includeLocalTempViews = false)
+          if (ignoreTableProperties) {
+            identifiers.map { ti: TableIdentifier =>
               Row(
                 catalogName,
-                t.database,
-                t.identifier.table,
-                typ,
-                t.comment.getOrElse(""),
+                ti.database.getOrElse("default"),
+                ti.table,
+                TABLE, // ignore tableTypes criteria and simply treat all table type as TABLE
+                "",
                 null,
                 null,
                 null,
                 null,
                 null)
             }
+          } else {
+            sessionCatalog.getTablesByName(identifiers)
+              .filter(t => isMatchedTableType(tableTypes, t.tableType.name)).map { t =>
+                val typ = if (t.tableType.name == VIEW) VIEW else TABLE
+                Row(
+                  catalogName,
+                  t.database,
+                  t.identifier.table,
+                  typ,
+                  t.comment.getOrElse(""),
+                  null,
+                  null,
+                  null,
+                  null,
+                  null)
+              }
+          }
         }
       case tc: TableCatalog =>
         val tp = tablePattern.r.pattern
diff --git a/kyuubi-common/src/main/scala/org/apache/kyuubi/config/KyuubiConf.scala b/kyuubi-common/src/main/scala/org/apache/kyuubi/config/KyuubiConf.scala
index 70a7e65c4..b22a5131f 100644
--- a/kyuubi-common/src/main/scala/org/apache/kyuubi/config/KyuubiConf.scala
+++ b/kyuubi-common/src/main/scala/org/apache/kyuubi/config/KyuubiConf.scala
@@ -3226,7 +3226,8 @@ object KyuubiConf {
 
   val OPERATION_GET_TABLES_IGNORE_TABLE_PROPERTIES: ConfigEntry[Boolean] =
     buildConf("kyuubi.operation.getTables.ignoreTableProperties")
-      .doc("Speed up the `GetTables` operation by returning table identities only.")
+      .doc("Speed up the `GetTables` operation by ignoring `tableTypes` query criteria, " +
+        "and returning table identities only.")
       .version("1.8.0")
       .booleanConf
       .createWithDefault(false)