You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@spark.apache.org by "Wenchen Fan (Jira)" <ji...@apache.org> on 2019/11/21 14:21:00 UTC

[jira] [Commented] (SPARK-29966) Add version method in TableCatalog to avoid load table twice

    [ https://issues.apache.org/jira/browse/SPARK-29966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16979312#comment-16979312 ] 

Wenchen Fan commented on SPARK-29966:
-------------------------------------

I do agree that this is a perf regression in Spark 3.0. Now every time we need to look up a v1 table, we need to access Hive metastore twice.

We can cache the table from Hive metastore, but I don't think it's a good idea to introduce cache to fix perf regression. We can follow DDL/DML command resolution, to have a rule only resolve tables from v2 catalogs, and a rule only resolve tables from session catalog.

That said, `ResolveTables` should only lookup table from v2 session catalogs. `ResolveRelations` should lookup table from session catalog and return a v2 relation if the table provider is v2. This will introduce some duplicated code to create v2 relations, but should be fine.

What do you think? [~rdblue] [~brkyvz] [~imback82]

> Add version method in TableCatalog to avoid load table twice
> ------------------------------------------------------------
>
>                 Key: SPARK-29966
>                 URL: https://issues.apache.org/jira/browse/SPARK-29966
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.0.0
>            Reporter: ulysses you
>            Priority: Minor
>
> Now resolve logic plan will load table twice which are in ResolveTables and ResolveRelations. The ResolveRelations is old code path, and ResolveTables is v2 code path, and the reason why load table twice is that ResolveTables will load table and rollback v1 table to ResolveRelations code path.
> The same scene also exists in ResolveSessionCatalog.
> It affect that execute command will cost double time than spark 2.4.
> Here is the idea that add a table version method in TableCatalog, and rules should always get table version firstly without load table.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org