Posted to issues-all@impala.apache.org by "Quanlong Huang (Jira)" <ji...@apache.org> on 2021/04/08 12:05:00 UTC

[jira] [Commented] (IMPALA-7501) Slim down metastore Partition objects in LocalCatalog cache

    [ https://issues.apache.org/jira/browse/IMPALA-7501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17317134#comment-17317134 ] 

Quanlong Huang commented on IMPALA-7501:
----------------------------------------

For the unused fields, I think we should null them out in catalogd when generating TGetPartialCatalogObjectResponse. This reduces the memory pressure on both sides.
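
To make the idea concrete, here is a minimal sketch of nulling out a field before a partition is put into a response. It is not Impala or Hive code: the classes `FieldSchema`, `StorageDescriptor` and `Partition` are simplified, hypothetical stand-ins for the metastore objects, the helper `slim_partition` is made up for illustration, and it assumes the per-partition column list (`cols`) is among the unused fields.

```python
from dataclasses import dataclass, replace
from typing import List, Optional


# Hypothetical, simplified stand-ins for the metastore objects.
@dataclass(frozen=True)
class FieldSchema:
    name: str
    type: str
    comment: Optional[str] = None


@dataclass(frozen=True)
class StorageDescriptor:
    location: str
    input_format: str
    cols: Optional[List[FieldSchema]] = None  # per-partition column list


@dataclass(frozen=True)
class Partition:
    values: List[str]
    sd: StorageDescriptor


def slim_partition(part: Partition) -> Partition:
    """Return a copy of `part` with the per-partition column list set to None.

    Assumption: the column list is one of the fields the reader never uses.
    """
    return replace(part, sd=replace(part.sd, cols=None))


# 478 columns, as in the experiment; the location is a made-up example.
cols = [FieldSchema(f"c{i}", "string") for i in range(478)]
full = Partition(
    ["2021-04-08"],
    StorageDescriptor("hdfs://nn/warehouse/t/p=2021-04-08", "TextInputFormat", cols),
)
slim = slim_partition(full)
```

Since the copy is made before serialization, the sender never builds the large payload and the receiver never caches it, while the original object is left unchanged.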

I did an experiment on a table with 478 columns and 87320 partitions (1 file per partition). When fetching all partitions in one GetPartialCatalogObject() call, the serialized response size is 1823012484 bytes (1.7GB). However, in the legacy catalog mode, when executing REFRESH on the table, the serialized size of the TResetMetadataResponse, which contains the whole table object, is just 71390662 bytes (68MB).
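
For a sense of scale, the two sizes can be put side by side with some quick arithmetic. The snippet only uses the numbers quoted above; the variable names are mine, and the per-partition figures are plain averages (the legacy figure also includes the table-level metadata, so it slightly overstates the per-partition cost).

```python
PARTITIONS = 87320
LOCAL_CATALOG_BYTES = 1823012484  # GetPartialCatalogObject() response
LEGACY_BYTES = 71390662           # TResetMetadataResponse after REFRESH

# How many times larger the LocalCatalog response is.
ratio = LOCAL_CATALOG_BYTES / LEGACY_BYTES

# Average serialized bytes per partition in each mode.
per_part_local = LOCAL_CATALOG_BYTES / PARTITIONS
per_part_legacy = LEGACY_BYTES / PARTITIONS

print(f"{LOCAL_CATALOG_BYTES / 2**30:.2f} GiB vs {LEGACY_BYTES / 2**20:.1f} MiB")
print(f"ratio: {ratio:.1f}x")
print(f"per partition: {per_part_local:.0f} B vs {per_part_legacy:.0f} B")
```

That works out to roughly 25.5x, or about 20877 bytes per partition versus about 818 bytes per partition.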

One factor is these unused string fields in HMS partitions. The other factor is that the partition locations in legacy catalog mode are prefix-compressed, whereas in HMS partitions the locations are all full URIs.
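
The effect of prefix compression on locations can be sketched as below. This is a simplified illustration with a single shared prefix, not the scheme the legacy catalog actually uses; the functions `compress_locations` and `decompress_locations`, the namenode address and the table path are all hypothetical.

```python
import os
from typing import List, Tuple


def compress_locations(locations: List[str]) -> Tuple[str, List[str]]:
    """Split locations into one shared prefix plus per-partition suffixes."""
    prefix = os.path.commonprefix(locations)
    # Cut back to the last '/' so the prefix ends on a directory boundary.
    prefix = prefix[: prefix.rfind("/") + 1]
    return prefix, [loc[len(prefix):] for loc in locations]


def decompress_locations(prefix: str, suffixes: List[str]) -> List[str]:
    """Rebuild the full URIs from the shared prefix and the suffixes."""
    return [prefix + s for s in suffixes]


# Made-up table layout with the same partition count as the experiment.
base = "hdfs://namenode:8020/warehouse/db.db/tbl/"
locations = [f"{base}day={d:05d}" for d in range(87320)]
prefix, suffixes = compress_locations(locations)

# Character counts before and after factoring out the shared prefix.
raw_chars = sum(len(loc) for loc in locations)
packed_chars = len(prefix) + sum(len(s) for s in suffixes)
```

With full URIs the table directory is repeated once per partition; with the prefix factored out it is stored once, so `packed_chars` is only a fraction of `raw_chars` for a layout like this.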

cc [~vihangk1]

> Slim down metastore Partition objects in LocalCatalog cache
> -----------------------------------------------------------
>
>                 Key: IMPALA-7501
>                 URL: https://issues.apache.org/jira/browse/IMPALA-7501
>             Project: IMPALA
>          Issue Type: Sub-task
>          Components: Catalog
>            Reporter: Todd Lipcon
>            Assignee: Quanlong Huang
>            Priority: Minor
>              Labels: catalog-v2
>
> I took a heap dump of an impalad running in LocalCatalog mode with a 2G limit after running a production workload simulation for a couple of hours. It had 38.5M objects and a 2.02GB heap (the vast majority of the heap is, as expected, in the LocalCatalog cache). Of this total footprint, 1.78GB and 34.6M objects are retained by 'Partition' objects. Drilling into those, 1.29GB and 33.6M objects are retained by FieldSchema, which, as far as I remember, is ignored at the partition level by the Impala planner. So, with a bit of slimming down of these objects, we could make a huge dent in effective cache capacity given a fixed budget. Reducing the object count should also improve GC performance (old gen GC is more closely tied to object count than size).


