Posted to issues-all@impala.apache.org by "Quanlong Huang (Jira)" <ji...@apache.org> on 2020/04/27 09:02:00 UTC

[jira] [Created] (IMPALA-9695) Support incomplete partition spec in REFRESH statement

Quanlong Huang created IMPALA-9695:
--------------------------------------

             Summary: Support incomplete partition spec in REFRESH statement
                 Key: IMPALA-9695
                 URL: https://issues.apache.org/jira/browse/IMPALA-9695
             Project: IMPALA
          Issue Type: New Feature
          Components: Catalog
            Reporter: Quanlong Huang


We support explicitly specifying a partition in the REFRESH statement. When users have several partitions to refresh, they have to issue several REFRESH statements. Each REFRESH statement requires the table lock, so catalogd executes them one by one. What's worse, the table is updated (catalog version bumped) several times, which may cause catalogd to propagate it to the coordinators several times. This is bad for huge tables that contain a large number of partitions: their catalog objects are huge, and since catalogd can't send incremental updates for only the changed partitions, the whole object is sent each time.

A possible scenario is an hourly partitioned table that has more than one level of partition keys:
{code:sql}
create table hourly_part_tbl (id int, msg string)
partitioned by (hour_id bigint, event_type bigint);
{code}
Let's say there are 20 event_types. Every hour, 20 partitions (one per event_type) are generated with a new hour_id. If the retention time for this table is 2 years, the total number of partitions will be 2 * 365 * 24 * 20 = 350,400. The catalog object size for this table will be huge, especially since in practice there will be many columns and hence incremental stats.
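
The partition count can be checked with quick arithmetic. The following is only a sketch, assuming one new partition per event_type per hour and 365-day years; the function and constant names are made up for illustration.

```python
# Back-of-the-envelope partition count for the hourly table described above.
# Assumption: every event_type yields exactly one partition per hour.
HOURS_PER_DAY = 24
DAYS_PER_YEAR = 365  # leap days ignored


def total_partitions(retention_years: int, event_types: int) -> int:
    """Partitions retained when each hour adds one partition per event_type."""
    return retention_years * DAYS_PER_YEAR * HOURS_PER_DAY * event_types


if __name__ == "__main__":
    # 2 years of retention, 20 event types: 2 * 365 * 24 * 20
    print(total_partitions(retention_years=2, event_types=20))  # 350400
```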

Every hour, users have to run 20 REFRESH statements one by one on this table, and the catalog server will send 20 updates for this table to the coordinators. In a busy cluster (with many other tables), it's possible that catalogd is always busy loading metadata for this table.
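
To make the current workload concrete, here is a small sketch that builds the statements a user has to submit for one hour. The hour_id value (438000) and the event_type values 0..19 are placeholders chosen for illustration, and the helper name is hypothetical; only the full-spec `REFRESH ... PARTITION (...)` form itself is existing syntax.

```python
from typing import Iterable, List


def refresh_statements(table: str, hour_id: int, event_types: Iterable[int]) -> List[str]:
    """Build one fully specified REFRESH statement per (hour_id, event_type) partition."""
    return [
        f"REFRESH {table} PARTITION (hour_id={hour_id}, event_type={event_type});"
        for event_type in event_types
    ]


if __name__ == "__main__":
    # Placeholder values: hour_id 438000 and event types 0..19 are illustrative only.
    statements = refresh_statements("hourly_part_tbl", 438000, range(20))
    # Each statement takes the table lock and bumps the catalog version separately.
    for statement in statements:
        print(statement)
```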

One possible solution is to use REFRESH without the partition spec. Unfortunately, that still loads the FileStatus of every loaded partition, so this single statement may not finish within an hour.

Another solution is to support REFRESH statements with an incomplete partition spec, so users can use one statement:
{code:sql}
REFRESH hourly_part_tbl PARTITION(hour_id=xxx);
{code}
Then catalogd only needs to acquire the table lock once and send its catalog update once.
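
As a rough illustration of the intended semantics, an incomplete spec would select every partition whose values agree with the keys that are given. The sketch below models partitions as plain dicts; it is not catalogd code, and all names and values in it are hypothetical.

```python
from typing import Dict, List

Partition = Dict[str, int]  # partition key name -> value (toy model, not catalogd's type)


def matches(partition: Partition, partial_spec: Dict[str, int]) -> bool:
    """True if the partition agrees with every key given in the (possibly incomplete) spec."""
    return all(key in partition and partition[key] == value
               for key, value in partial_spec.items())


def select_partitions(partitions: List[Partition], partial_spec: Dict[str, int]) -> List[Partition]:
    """All partitions one REFRESH with an incomplete spec would cover."""
    return [p for p in partitions if matches(p, partial_spec)]


if __name__ == "__main__":
    # Toy table: two hours, three event types (illustrative values only).
    table = [{"hour_id": h, "event_type": e} for h in (100, 101) for e in range(3)]
    # Only hour_id is given, so all event_types of that hour are selected.
    print(select_partitions(table, {"hour_id": 101}))
```

With the selection done once per statement, the lock acquisition and the catalog update happen once for the whole group instead of once per partition.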

It'd also be useful if we supported non-equality predicates in the partition spec:
{code:sql}
REFRESH hourly_part_tbl PARTITION(hour_id >= xxx);
{code}
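
One way to think about non-equality predicates is as a list of (key, operator, value) conditions evaluated against each partition's key values. This is a sketch of that idea only, under the assumption that all conditions are ANDed together; the operator table and function names are hypothetical and do not describe how Impala would implement it.

```python
import operator
from typing import Callable, Dict, List, Tuple

Partition = Dict[str, int]
Predicate = Tuple[str, str, int]  # (partition key, comparison operator, value)

# Comparison operators the toy spec understands (assumed set, for illustration).
OPS: Dict[str, Callable[[int, int], bool]] = {
    "=": operator.eq,
    "!=": operator.ne,
    "<": operator.lt,
    "<=": operator.le,
    ">": operator.gt,
    ">=": operator.ge,
}


def select_partitions(partitions: List[Partition], predicates: List[Predicate]) -> List[Partition]:
    """Partitions satisfying every predicate (conditions are ANDed)."""
    return [
        p for p in partitions
        if all(OPS[op](p[key], value) for key, op, value in predicates)
    ]


if __name__ == "__main__":
    # Toy table: three hours, two event types (illustrative values only).
    table = [{"hour_id": h, "event_type": e} for h in (100, 101, 102) for e in range(2)]
    # Equivalent in spirit to PARTITION(hour_id >= 101): hours 101 and 102 are selected.
    print(select_partitions(table, [("hour_id", ">=", 101)]))
```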


