Posted to issues-all@impala.apache.org by "ASF subversion and git services (JIRA)" <ji...@apache.org> on 2019/03/12 17:50:00 UTC

[jira] [Commented] (IMPALA-7973) Add support for fine-grained updates at partition level

    [ https://issues.apache.org/jira/browse/IMPALA-7973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16790811#comment-16790811 ] 

ASF subversion and git services commented on IMPALA-7973:
---------------------------------------------------------

Commit 9d67aafaea5ba8673d130dfc7c1e7a0b4f58303b in impala's branch refs/heads/master from Vihang Karajgaonkar
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=9d67aaf ]

IMPALA-7972 Detect self-events to avoid unnecessary invalidates

This patch adds support for detecting events that the catalog itself
generated (self-events). This is used to avoid unnecessary invalidation
of tables by such self-events. Currently, the alter_table,
alter_partition, add_partition and drop_partition event types can
invalidate the table metadata.

Originally, we planned to rely on global version number support from
the metastore (see HIVE-21115). But since that is still not complete, we
rely on a combination of other identifiers to determine whether an event
is self-generated. These self-event identifiers consist of two values
from the table/partition parameters: a catalog service UUID and the
catalog version number. The UUID is generated for each catalog service
when it comes up, and the service adds it to the table/partition
parameters with the key "impala.CatalogServiceId". The catalog version
number is added with the key "impala.CatalogVersion".
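
As a rough illustration of the stamping described above, the sketch
below adds both identifiers to a parameter map. Only the two key strings
come from this commit message; the class SelfEventStamp and its methods
are hypothetical names chosen for the example and are not the code in
the patch.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

// Hypothetical helper for illustration; not the class used in the patch.
final class SelfEventStamp {
    // Parameter keys as named in the commit message.
    static final String SERVICE_ID_KEY = "impala.CatalogServiceId";
    static final String VERSION_KEY = "impala.CatalogVersion";

    // Generated once, when this (simulated) catalog service comes up.
    private final String serviceId = UUID.randomUUID().toString();

    String serviceId() {
        return serviceId;
    }

    // Returns a copy of the table/partition parameters with the
    // self-event identifiers added; the input map is left untouched.
    Map<String, String> stamp(Map<String, String> params, long catalogVersion) {
        Map<String, String> stamped = new HashMap<>(params);
        stamped.put(SERVICE_ID_KEY, serviceId);
        stamped.put(VERSION_KEY, Long.toString(catalogVersion));
        return stamped;
    }
}
```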

When the catalog executes a DDL operation, it appends the current
catalog version to the list of version numbers for the in-flight events
for the table. The events processor clears this version when an event
carrying the matching serviceId and version number is received. This is
needed since it is possible that an external non-Impala system which
generates the event presents the same serviceId and version number later
on. The algorithm to detect a self-event is as follows.

1. Add the service id and expected catalog version to table/partition
parameters when executing the DDL operation. When the HMS operation is
successful, add the version number to the list of versions for in-flight
events at table level.
2. When the event is received, the first time the event processor sees
the combination of serviceId and version number, it clears the version
number from the table's list and treats the event as self-generated (and
hence ignores it).
3. If the event data presents an unknown serviceId or if the version
number is not present in the list of in-flight versions, the event is
not a self-event and needs to be processed.
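
The three steps can be sketched roughly as below. This is a minimal
illustration under stated assumptions, not the patch itself: the class
InFlightVersions and its method names are hypothetical, and only the two
parameter keys are taken from this commit message.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical per-table tracker for illustration; not the patch's class.
final class InFlightVersions {
    private final String serviceId;
    private final List<Long> versions = new ArrayList<>();

    InFlightVersions(String serviceId) {
        this.serviceId = serviceId;
    }

    // Step 1: called after the HMS operation for a DDL has succeeded.
    synchronized void addVersion(long catalogVersion) {
        versions.add(catalogVersion);
    }

    // Steps 2 and 3: returns true only the first time a matching
    // (serviceId, version) pair is seen, and clears that version so a
    // later event presenting the same pair is processed normally.
    synchronized boolean isSelfEvent(Map<String, String> eventParams) {
        String id = eventParams.get("impala.CatalogServiceId");
        String ver = eventParams.get("impala.CatalogVersion");
        if (id == null || ver == null || !serviceId.equals(id)) {
            return false; // unknown or missing service id
        }
        long version;
        try {
            version = Long.parseLong(ver);
        } catch (NumberFormatException e) {
            return false; // malformed version: treat as an external event
        }
        // List.remove(Object) returns true only if the version was in flight.
        return versions.remove(Long.valueOf(version));
    }
}
```

In this sketch an event that repeats an already-cleared pair returns
false, which corresponds to the case of an external system presenting
the same serviceId and version number later on.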

In order to limit the total memory footprint, only 10 version numbers
are stored per table. Since the event processor is expected to poll
every few seconds, this should be a reasonable bound which satisfies
most use-cases. When the bound is exceeded, the event processor may
wrongly process a self-event and invalidate the table. In such a case,
it's a performance penalty, not a correctness issue.
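
One way such a bound could behave is sketched below. The limit of 10
comes from this commit message; the eviction policy (drop the oldest
version when full) is an assumption made for the example, since the
message does not say which entry is discarded, and BoundedVersionList is
a hypothetical name.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical bounded list for illustration; not the patch's class.
final class BoundedVersionList {
    // Bound stated in the commit message.
    static final int MAX_VERSIONS = 10;

    private final Deque<Long> versions = new ArrayDeque<>();

    // Assumption: when the list is full, the oldest version is evicted.
    void add(long catalogVersion) {
        if (versions.size() == MAX_VERSIONS) {
            versions.removeFirst();
        }
        versions.addLast(catalogVersion);
    }

    // Returns true and clears the version if it is still in flight.
    boolean removeIfPresent(long catalogVersion) {
        return versions.removeFirstOccurrence(Long.valueOf(catalogVersion));
    }

    int size() {
        return versions.size();
    }
}
```

Under this assumption, the self-event for an evicted version is no
longer recognized and is processed like an external event, which costs
an extra invalidate but does not produce wrong metadata.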

In case of a drop_partition event, the partition object is not available
in the event. Hence we cannot determine whether it is a self-event. In
such cases we currently always issue an invalidate command. This is a
known limitation and will be improved in IMPALA-7973.

The patch adds new tests that trigger alter table/partition DDLs from
Impala and make sure that the table is not invalidated.

Change-Id: I6db0d7f7fe465158fc8cb9d6b6b57a321827b353
Reviewed-on: http://gerrit.cloudera.org:8080/12591
Reviewed-by: Impala Public Jenkins <im...@cloudera.com>
Tested-by: Impala Public Jenkins <im...@cloudera.com>


> Add support for fine-grained updates at partition level
> -------------------------------------------------------
>
>                 Key: IMPALA-7973
>                 URL: https://issues.apache.org/jira/browse/IMPALA-7973
>             Project: IMPALA
>          Issue Type: Sub-task
>            Reporter: Vihang Karajgaonkar
>            Assignee: Vihang Karajgaonkar
>            Priority: Major
>
> When data is inserted into a partition or a new partition is created in a large table, we should not be invalidating the whole table. Instead it should be possible to refresh/add/drop certain partitions on the table directly based on the event information. This would help with the performance of subsequent access to the table by avoiding reloading the large table.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
