Posted to issues-all@impala.apache.org by "Quanlong Huang (Jira)" <ji...@apache.org> on 2021/06/11 10:26:00 UTC

[jira] [Commented] (IMPALA-7533) Optimize fetch-from-catalog by caching partitions across table versions

    [ https://issues.apache.org/jira/browse/IMPALA-7533?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17361595#comment-17361595 ] 

Quanlong Huang commented on IMPALA-7533:
----------------------------------------

IMPALA-7533: Cache partitions across table versions in LocalCatalog

In the LocalCatalog cache, partition metadata is cached under a composite
key of table name, table version and partition id. Whenever the table
version bumps, e.g. because a comment changes, all the cached partition
values become unreachable. Subsequent queries have to reload the
partitions and cache them under the new table version. The partition id
is actually unique across the whole catalog, so it is sufficient to
identify the partition. However, there were no partition-level
invalidations when a partition was modified in place in catalogd, so the
key had to include the table version and depend on it.

After IMPALA-9778, partition metadata is no longer modified in place, so
we can safely reuse partition metadata across table versions in the
LocalCatalog cache. This patch removes the table name and version from
the partition cache key, so metadata of unchanged partitions can be
reused when the table version bumps.
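
The effect of the key change can be illustrated with a toy model. This
is only a sketch, not Impala's LocalCatalog code: the PartitionCache
class, the old_key/new_key helpers, the table name and the partition ids
are all hypothetical. It assumes that a modified partition is replaced
by a new partition with a fresh id rather than being changed in place.

```python
class PartitionCache:
    """Toy cache that counts hits and misses (not Impala's implementation)."""

    def __init__(self, make_key):
        self._make_key = make_key  # function building the cache key
        self._entries = {}
        self.hits = 0
        self.misses = 0

    def load(self, table, version, partition_ids):
        for pid in partition_ids:
            key = self._make_key(table, version, pid)
            if key in self._entries:
                self.hits += 1
            else:
                # Stand-in for fetching the partition from catalogd.
                self.misses += 1
                self._entries[key] = f"metadata-for-partition-{pid}"


def old_key(table, version, pid):
    # Hypothetical model of the old scheme: the table version is in the key.
    return (table, version, pid)


def new_key(table, version, pid):
    # Hypothetical model of the new scheme: the partition id alone.
    return pid


def simulate(make_key):
    cache = PartitionCache(make_key)
    cache.load("db.tbl", 1, [10, 11, 12])  # cold cache: everything misses
    cache.hits = cache.misses = 0          # only count the second load
    # The table version bumps to 2; partition 12 was replaced by partition 13.
    cache.load("db.tbl", 2, [10, 11, 13])
    return cache.hits, cache.misses


print(simulate(old_key))  # (0, 3): every partition is reloaded
print(simulate(new_key))  # (2, 1): only the new partition is loaded
```

With the version in the key, the second load misses on all three
partitions. With the id-only key, the two unchanged partitions are
served from the cache.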

Tests:
 - Add tests in test_local_catalog.py that use profile metrics to verify
   that partition metadata is reused.

Change-Id: I512f735b596bc51d553e6d395d108f49727619ed
Reviewed-on: http://gerrit.cloudera.org:8080/16081
Tested-by: Impala Public Jenkins <im...@cloudera.com>
Reviewed-by: Vihang Karajgaonkar <vi...@cloudera.com>

> Optimize fetch-from-catalog by caching partitions across table versions
> -----------------------------------------------------------------------
>
>                 Key: IMPALA-7533
>                 URL: https://issues.apache.org/jira/browse/IMPALA-7533
>             Project: IMPALA
>          Issue Type: Sub-task
>            Reporter: Todd Lipcon
>            Assignee: Quanlong Huang
>            Priority: Major
>              Labels: catalog-v2
>             Fix For: Impala 4.0
>
>
> Currently, the cached partition-level information in CatalogdMetaProvider is tied to a particular version number of its containing table. This means that if the table is modified in any way (e.g. even a comment changes), all of the partitions are effectively invalidated and need to be reloaded from catalogd.
> We could avoid this invalidation-and-refetch in a couple of ways:
> 1) Make partitions immutable given an ID. Instead of modifying partitions in place, we could drop the partition and add a new one with a new ID. This is already done in several code paths, but not all. If we did this, then we'd just need to invalidate the partition _list_ for a table, and when we fetched the new list, we'd see which partitions changed and need to be reloaded.
> 2) Add a partition-level version/sequence number which is modified whenever the partition is mutated in place. If we fetched that as part of the partition list, and used it as part of the cache key, we could avoid invalidating partitions when nothing changed. This would have the cost of 4 or 8 bytes per partition (perhaps manageable considering the hundreds of bytes saved by recent patches).
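
Option 2 in the description above could look roughly like the following
sketch. It is an illustration only, not code from Impala: the
fetch_partitions function, the loader callback and the (id, version)
pairs are hypothetical, and eviction of stale entries is left out.

```python
def fetch_partitions(cache, partition_list, loader):
    """Load partitions whose (id, version) key is not cached yet.

    partition_list holds (partition_id, partition_version) pairs, as if
    the version had been fetched together with the partition list.
    Returns the ids that had to be reloaded.
    """
    reloaded = []
    for pid, pversion in partition_list:
        key = (pid, pversion)
        if key not in cache:
            cache[key] = loader(pid)  # stand-in for a fetch from catalogd
            reloaded.append(pid)
    return reloaded


def loader(pid):
    return f"metadata-for-partition-{pid}"


cache = {}
first = fetch_partitions(cache, [(10, 1), (11, 1)], loader)
# Partition 11 was mutated in place, so its version moved from 1 to 2.
second = fetch_partitions(cache, [(10, 1), (11, 2)], loader)
print(first, second)  # [10, 11] [11]
```

Only the partition whose version changed is reloaded; the entry for the
old version stays in the cache until something evicts it.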


